I have not had the time to do much work or ANY testing, so please treat this
with caution.
What I am proposing is that the contents of a Word document be converted
into an ArrayList. That ArrayList will contain instances of the DocumentPart
class, and these will facilitate the comparison operation. I have not given
the comparison itself a great deal of thought yet, but I believe that we
should check for any paragraphs - not tables yet - being inserted, deleted,
modified (I am not sure how to proceed with this one yet) or moved. As you
can see, I have provided constants in the DocumentPart class to support
these different results. The comparison status flag is there to prevent a
paragraph from being checked again once a match has been found, but I am
thinking of another use for it if the logic holds.
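To make the idea a little more concrete, here is a rough sketch of how the inserted, deleted and moved checks might work. It is only an illustration on plain Strings, not POI objects, and it is untested against real documents. The class name ParagraphDiffSketch and the classify() method are made up for this example, and the "moved" test is a simple greedy heuristic I am assuming for now (a match that falls behind an earlier match is reported as moved). MODIFIED is not attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: classifies paragraphs held as plain Strings.
class ParagraphDiffSketch {

    /**
     * Compares two lists of paragraph text and returns one report line per
     * paragraph, labelled with the name of the matching DocumentPart constant.
     */
    static List<String> classify(List<String> original, List<String> revised) {
        // Plays the role of the comparison status flag: once an original
        // paragraph has been matched it is never checked again.
        boolean[] matched = new boolean[original.size()];
        List<String> report = new ArrayList<String>();
        int lastMatched = -1;

        for (String text : revised) {
            int found = -1;
            for (int i = 0; i < original.size(); i++) {
                if (!matched[i] && original.get(i).equals(text)) {
                    found = i;
                    break;
                }
            }
            if (found < 0) {
                // No counterpart in the original document.
                report.add("INSERTED: " + text);
            } else {
                matched[found] = true;
                if (found > lastMatched) {
                    // Still in the same relative order as the original.
                    report.add("UN_MODIFIED: " + text);
                    lastMatched = found;
                } else {
                    // Matched, but behind an earlier match (assumed heuristic).
                    report.add("MOVED: " + text);
                }
            }
        }
        // Anything left unmatched in the original has been deleted.
        for (int i = 0; i < original.size(); i++) {
            if (!matched[i]) {
                report.add("DELETED: " + original.get(i));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("Intro", "Body", "Summary");
        List<String> revised = Arrays.asList("Intro", "Summary", "New text", "Body");
        for (String line : classify(original, revised)) {
            System.out.println(line);
        }
    }
}
```

With the sample data in main(), "Body" is reported as moved because it now follows "Summary". A longest-common-subsequence approach would probably give better answers about which paragraph moved, but the greedy version is enough to test whether the status flag idea holds.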
As yet, I have not coded the compare methods or the save results method, as I
think it is wise to thoroughly test the loading method first. We need to be
certain that the ArrayList of DocumentPart(s) accurately describes the
documents. I think that you are in a 'better' time-zone and that you may
have the opportunity to test the code before me. If you look at the main
method of the DocumentComparator class, you will see how to run the code.
All you need to do for now is make sure that the first two parameters to the
compareDocuments() method point to Word files and then run the code. To
check the results, you can either modify DocumentPart to add a toString()
method that outputs the instance's contents or simply call the
getParagraphText() and getCellContents() methods from the compareDocuments()
method.
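As an illustration of the toString() idea, here is a sketch of what such a method might build. So that it runs without POI on the classpath, it works against a made-up PartView interface that simply mirrors the accessor names of DocumentPart; PartDescriber, PartView and describe() are all hypothetical names for this example only.

```java
// Hypothetical sketch: formats a document part as text for checking results.
class PartDescriber {

    /** Stand-in that mirrors the accessor methods of DocumentPart. */
    interface PartView {
        boolean isTable();
        int getNumRows();
        int getNumColumns(int rowNum);
        String getCellContents(int rowNum, int colNum);
        String getParagraphText();
    }

    /** Builds a printable description of a paragraph or a table. */
    static String describe(PartView part) {
        if (!part.isTable()) {
            return "Paragraph: " + part.getParagraphText();
        }
        StringBuilder sb = new StringBuilder("Table:");
        for (int r = 0; r < part.getNumRows(); r++) {
            sb.append("\n  row ").append(r).append(':');
            // Ask each row for its own cell count; rows may differ in length.
            for (int c = 0; c < part.getNumColumns(r); c++) {
                sb.append(" [").append(part.getCellContents(r, c)).append(']');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PartView para = new PartView() {
            public boolean isTable() { return false; }
            public int getNumRows() { throw new UnsupportedOperationException(); }
            public int getNumColumns(int rowNum) { throw new UnsupportedOperationException(); }
            public String getCellContents(int rowNum, int colNum) {
                throw new UnsupportedOperationException();
            }
            public String getParagraphText() { return "Some paragraph text"; }
        };
        System.out.println(describe(para));
    }
}
```

The body of describe() could be moved more or less unchanged into a toString() method on DocumentPart, replacing the part parameter with this.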
Anyway, here is the code so far. Have a look and see if it is the way you
want to go - or if you think it makes sense. Do not feel that you cannot
criticise or alter the code or the approach as, for now, we are not committed
to any particular strategy, just exploring what is possible.
package comparedocuments;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Paragraph;
/**
* An instance of this class can be used to perform a comparison between two
* binary (OLE2CDF) Microsoft Word documents.
*
* @author Mark B
* @version 1.00 27th July 2009
*/
public class DocumentComparator {
/**
* Called to compare the two documents and output the results of the
* comparison to a third Microsoft Word document.
*
* @param originalDoc The path to and name of the original document, the
* document that is the basis for the comparison.
* @param compareToDoc The path to and name of the document that should
* be compared with the original for any modifications.
* @param resultDoc The path to and name of the document that should
* contain the results of the comparison process.
* @param docTemplate The path to and name of the empty Word document that
* should be used as the basis for the results document.
* @throws java.io.IOException Thrown to signal that some sort of I/O
* Exception has occurred.
* @throws java.io.FileNotFoundException Thrown to signal that a file
* could not be located.
*/
public void compareDocuments(String originalDoc, String compareToDoc,
String resultDoc, String docTemplate)
throws IOException, FileNotFoundException {
ArrayList<DocumentPart> originalDocParts =
this.loadDocument(originalDoc);
ArrayList<DocumentPart> compareToDocParts =
this.loadDocument(compareToDoc);
this.compareDocs(originalDocParts, compareToDocParts);
this.saveResults(originalDocParts, compareToDocParts, resultDoc);
}
/**
* Opens a named binary (OLE2CDF) Microsoft Word document and converts that
* document's contents into an ArrayList of instances of the DocumentPart
* class.
* @param docName The path to and name of a Microsoft Word document file.
* @return An instance of the ArrayList class encapsulating instances
* of the DocumentPart class. Each DocumentPart will encapsulate
* information about a paragraph of text or a table recovered from
* the Microsoft Word document.
* @throws java.io.IOException If an I/O Exception occurs
* @throws java.io.FileNotFoundException Thrown to indicate that the
* named Microsoft Word file could
* not be located.
*/
public ArrayList<DocumentPart> loadDocument(String docName)
throws IOException,
FileNotFoundException {
File file = null;
FileInputStream fis = null;
HWPFDocument document = null;
Range overallRange = null;
Paragraph para = null;
int numParas = 0;
boolean inTable = false;
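// The list must be created here; leaving it null would cause a
// NullPointerException on the first call to add().
ArrayList<DocumentPart> docParts = new ArrayList<DocumentPart>();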
try {
// Open the Word file.
file = new File(docName);
fis = new FileInputStream(file);
document = new HWPFDocument(fis);
// Get the overall Range for the document and the number
// of paragraphs from this Range.
overallRange = document.getOverallRange();
numParas = overallRange.numParagraphs();
for(int i = 0; i < numParas; i++) {
para = overallRange.getParagraph(i);
// Is the paragraph 'in' a table? If so, it is possible to
// recover a reference to that Table from the first paragraph
// only. If calls are made to the getTable() method using
// subsequent paragraphs then an exception will be thrown. So,
// after getting the Table, a flag is set to prevent further
// calls to the getTable() method.
if(para.isInTable()) {
if(!inTable) {
// Get a reference to the Table and pass it to the
// constructor of the DocumentPart class. Add the
// DocumentPart instance to the ArrayList.
docParts.add(new DocumentPart(
overallRange.getTable(para)));
inTable = true;
}
}
// The paragraph is not in a table so simply add a new instance
// to the ArrayList that encapsulates the paragraph of text.
else {
docParts.add(new DocumentPart(para));
inTable = false;
}
}
return(docParts);
}
finally {
if(fis != null) {
try {
fis.close();
}
catch(IOException ioEx) {
// I G N O R E
}
}
}
}
public void compareDocs(ArrayList<DocumentPart> originalDocParts,
ArrayList<DocumentPart> compareToDocParts) {
// TO DO: Code comparison
}
public void saveResults(ArrayList<DocumentPart> originalDocParts,
ArrayList<DocumentPart> compareToDocParts,
String resultDoc)
throws IOException,
FileNotFoundException {
// TO DO: Code saving of results.
}
/**
* Main entry point to the program.
*
* @param args
*/
public static void main(String[] args) {
try {
DocumentComparator docComp = new DocumentComparator();
docComp.compareDocuments("original document",
"compare to document",
"results document",
"results document template");
}
catch(FileNotFoundException fnfEx) {
// TO DO: Code exception handling.
}
catch(IOException ioEx) {
// TO DO: Code exception handling.
}
}
}
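A note on the table handling in loadDocument(): the inTable flag collapses each run of consecutive in-table paragraphs into a single part. The sketch below shows just that grouping logic on an array of booleans, so it can be checked without any Word files. TableGroupingSketch and group() are made-up names, and each boolean simply stands in for the result of para.isInTable().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the inTable flag logic used when loading a document.
class TableGroupingSketch {

    /**
     * inTableFlags[i] stands in for isInTable() on paragraph i. Returns one
     * entry per document part: a single TABLE entry for each run of in-table
     * paragraphs and a PARA entry for every other paragraph.
     */
    static List<String> group(boolean[] inTableFlags) {
        List<String> parts = new ArrayList<String>();
        boolean inTable = false;
        for (int i = 0; i < inTableFlags.length; i++) {
            if (inTableFlags[i]) {
                if (!inTable) {
                    // First paragraph of the table: record the table once.
                    parts.add("TABLE@" + i);
                    inTable = true;
                }
            } else {
                // An ordinary paragraph ends any table that was open.
                parts.add("PARA@" + i);
                inTable = false;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        boolean[] flags = { false, true, true, true, false };
        System.out.println(group(flags));
    }
}
```

One thing the sketch makes visible: two tables with no ordinary paragraph between them would be treated as a single run, so that case may need checking against real documents.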
package comparedocuments;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Table;
import org.apache.poi.hwpf.usermodel.TableRow;
/**
* Encapsulates a 'part' of a Microsoft Word document. Currently, that part can
* either be a Table or a paragraph of text.
*
* @author Mark B
* @version 1.00 27th July 2009.
*/
public class DocumentPart {
private Range docPart = null;
private boolean comparisonStatus = false;
private int comparisonResult = 0;
public static final int INSERTED = 0;
public static final int DELETED = 1;
public static final int MODIFIED = 2;
public static final int UN_MODIFIED = 3;
public static final int MOVED = 4;
/**
* Create a new instance of the DocumentPart class using the following
* parameter.
*
* @param docPart An instance of the org.apache.poi.hwpf.usermodel.Range
* class that will encapsulate an instance of the
* org.apache.poi.hwpf.usermodel.Paragraph or an instance
* of the org.apache.poi.hwpf.usermodel.Table class.
*/
public DocumentPart(Range docPart) {
this.docPart = docPart;
// Note that as the part has not been successfully compared to another
// part the status is false.
this.comparisonStatus = false;
// and that the type is set to un-modified. Any parts that have not been
// checked or that are not un-modified will be written away to the
// results document.
this.comparisonResult = DocumentPart.UN_MODIFIED;
}
/**
* Has a match been found for this document part?
*
* @return A boolean value that indicates whether a match was found between
* two document parts.
*/
public boolean isMatched() {
return(this.comparisonStatus);
}
/**
* Get the result of the comparison.
*
* @return A primitive int value that indicates the result of comparing
* this document part to others. The following constants have been
* declared:
* DocumentPart.INSERTED = 0;
* DocumentPart.DELETED = 1;
* DocumentPart.MODIFIED = 2;
* DocumentPart.UN_MODIFIED = 3;
* DocumentPart.MOVED = 4;
*
*/
public int getComparisonResult() {
return(this.comparisonResult);
}
/**
* Store the result of the comparison between document parts.
*
* @param comparisonResult A primitive int whose value indicates the result
* of comparing one document part with others.
*/
public void setComparisonResult(int comparisonResult) {
this.comparisonResult = comparisonResult;
}
/**
* Does a DocumentPart encapsulate a table?
* @return A primitive boolean value; true if the DocumentPart encapsulates
* a Table, false otherwise.
*/
public boolean isTable() {
return(this.docPart instanceof Table);
}
/**
* If the DocumentPart encapsulates a Table, get the number of rows in the
* table.
*
* @return A primitive int whose value indicates how many rows there are in
* the table.
* @throws java.lang.UnsupportedOperationException Thrown if this method is
* called for a DocumentPart instance that encapsulates a Paragraph.
*/
public int getNumRows() throws UnsupportedOperationException {
int numRows = 0;
if(this.isTable()) {
Table table = (Table)this.docPart;
numRows = table.numRows();
}
else {
throw new UnsupportedOperationException("The DocumentPart does "
+
"not encapsulate a Table.");
}
return(numRows);
}
/**
* How many columns are there in the Table? This method assumes that the
* table is 'square', i.e. that each row of the Table holds the same number
* of columns.
*
* @return A primitive int whose value indicates how many columns there are
* in the Table.
* @throws java.lang.UnsupportedOperationException Thrown if this method is
* called for a DocumentPart instance that encapsulates a Paragraph.
*/
public int getNumColumns() throws UnsupportedOperationException {
return(this.getNumColumns(0));
}
/**
* How many columns are there in a specific row of the Table?
*
* @param rowNum A primitive int that indicates the row to inspect. Remember
* that row indices are zero based.
* @return A primitive int whose value indicates how many columns there are
* in the Table row.
* @throws java.lang.UnsupportedOperationException Thrown if this method is
* called for a DocumentPart instance that encapsulates a Paragraph.
*/
public int getNumColumns(int rowNum) throws
UnsupportedOperationException {
int numColumns = 0;
if(this.isTable()) {
Table table = (Table)this.docPart;
TableRow row = table.getRow(rowNum);
numColumns = row.numCells();
}
else {
throw new UnsupportedOperationException("The DocumentPart does "
+
"not encapsulate a Table.");
}
return(numColumns);
}
/**
* Return the contents of a specific cell.
*
* @param rowNum A primitive int that indicates the row the cell is on.
* Remember that row indices are zero based.
* @param colNum A primitive int that indicates the column the cell is in.
* Remember that column indices are zero based.
* @return An instance of the String class that encapsulates the cell's
* contents.
* @throws java.lang.UnsupportedOperationException Thrown if this method is
* called for a DocumentPart instance that encapsulates a Paragraph.
*/
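public String getCellContents(int rowNum, int colNum)
throws UnsupportedOperationException {
if(!this.isTable()) {
throw new UnsupportedOperationException("The DocumentPart does " +
"not encapsulate a Table.");
}
// Assumed implementation; the method was originally a stub that
// returned null. Untested.
Table table = (Table)this.docPart;
return(table.getRow(rowNum).getCell(colNum).text());
}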
/**
* Return the text of the Paragraph.
*
* @return An instance of the String class that encapsulates the text
* the Paragraph contained. Note that this will be stripped of
* all fields.
* @throws java.lang.UnsupportedOperationException Thrown if this method is
* called for a DocumentPart instance that encapsulates a Table.
*/
public String getParagraphText() throws UnsupportedOperationException {
String returnValue = null;
if(!this.isTable()) {
Paragraph para = (Paragraph)this.docPart;
returnValue = Range.stripFields(para.text());
}
else {
throw new UnsupportedOperationException("The DocumentPart does " +
"not encapsulate a Paragraph.");
}
return(returnValue);
}
}
bihag wrote:
>
> Hi All,
>
> We want to compare two documents, and whatever is not common to both
> we have to highlight with some color or in some other way ... So I think we
> have to merge the documents or create a new document which has the content of
> both documents, and show the differences with some color, like deleted in red,
> newly added in blue ...
>
> Mainly we are looking for an OLE2CDF doc compare solution ...
>
> please provide some code snippet if possible ...
>
> Thanking you in advance ...
>
--
View this message in context:
http://www.nabble.com/How-to-compare-2-word-doc-%28OLE2CDF-or-OpenXML%29.-tp24673506p24687061.html
Sent from the POI - Dev mailing list archive at Nabble.com.