I had the chance to re-think the work last night and would like to propose a
few changes.

Firstly, the DocumentPart class should be refactored IMO; it is trying to do
two things at once (holding a part of the document and recording the result
of a comparison) and that is rarely a good idea. I propose removing the
following:

public static final int INSERTED = 0;
public static final int DELETED = 1;
public static final int MODIFIED = 2;
public static final int UN_MODIFIED = 3;
public static final int MOVED = 4;

and

    /**
     * Get the result of the comparison.
     *
     * @return A primitive int value that indicates the result of comparing
     *         this document part to others. The following constants have
     *         been declared;
     *             DocumentPart.INSERTED = 0;
     *             DocumentPart.DELETED = 1;
     *             DocumentPart.MODIFIED = 2;
     *             DocumentPart.UN_MODIFIED = 3;
     *             DocumentPart.MOVED = 4;
     *
     */
    public int getComparisonResult() {
        return(this.comparisonResult);
    }

    /**
     * Store the result of the comparison between document parts.
     *
     * @param comparisonResult A primitive int whose value indicates the
     *                         result of comparing one document part with
     *                         others.
     */
    public void setComparisonResult(int comparisonResult) {
        this.comparisonResult = comparisonResult;
    }

 
Next, I propose adding a new class called something like ComparisonResult or
ReportableComparison. It will encapsulate the information removed from the
DocumentPart class along with a Range object. Its purpose is to track a
reportable comparison: an insertion, deletion, modification or
transformation. I propose that as we detect one of these when comparing the
two documents, an instance of the new class is created, the affected Range is
copied over and the appropriate comparison result is set. Following the
comparison between the two documents, an ArrayList of comparison result
objects can be used to create the report.
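
To make the idea concrete, here is a rough, untested sketch of what such a
class might look like. Everything in it is an assumption rather than settled
design: the class name, the nested Type enum (used here in place of the int
constants) and the generic parameter R, which merely stands in for HWPF's
Range so that the sketch compiles without POI on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch only. R stands in for org.apache.poi.hwpf.usermodel.Range so that
 * the example compiles without POI on the classpath.
 */
class ComparisonResult<R> {

    /** Assumed replacement for the int constants removed from DocumentPart. */
    enum Type { INSERTED, DELETED, MODIFIED, UN_MODIFIED, MOVED }

    private final R range;
    private final Type type;

    ComparisonResult(R range, Type type) {
        if (range == null || type == null) {
            throw new IllegalArgumentException("range and type are required");
        }
        this.range = range;
        this.type = type;
    }

    /** The affected part of the document. */
    R getRange() {
        return range;
    }

    /** What the comparison found for that part. */
    Type getType() {
        return type;
    }

    @Override
    public String toString() {
        return type + ": " + range;
    }

    public static void main(String[] args) {
        // Plain Strings stand in for Range objects in this demonstration.
        List<ComparisonResult<String>> report =
                new ArrayList<ComparisonResult<String>>();
        report.add(new ComparisonResult<String>("Second paragraph.", Type.DELETED));
        report.add(new ComparisonResult<String>("A new paragraph.", Type.INSERTED));
        for (ComparisonResult<String> result : report) {
            System.out.println(result);
        }
    }
}
```

In the real code R would simply be Range, and the list built during the
comparison would be handed to whatever writes the report.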

Thirdly, there is something you may wish to discuss with your colleagues; it
is a fall back position for us in case HWPF cannot successfully create the
comparison report. The Rich Text Format (RTF) is an open, relatively simple
file format developed by Microsoft. It is possible to create an RTF file
with a .doc extension such that when the user double clicks on the file,
Word is used to open it. A rather good API exists - called iText - that
makes it possible to create RTF files using Java code, so it should be
possible for this application to output its results as an RTF file with a
.doc extension. The users may never know any different.
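
To show how little is needed for the fall back, here is a rough sketch that
writes RTF by hand using only the JDK. Note that it does NOT use iText; the
class and method names are hypothetical, and the colours (red for deletions,
blue for insertions) are simply taken from the original request in this
thread. Treat it as an illustration of the file format, not as the way the
report must be produced.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RtfReportSketch {

    // RTF header with a colour table: \cf1 selects red, \cf2 selects blue.
    private static final String HEADER = "{\\rtf1\\ansi"
            + "{\\colortbl;\\red255\\green0\\blue0;\\red0\\green0\\blue255;}\n";

    /** Escapes backslash and braces, and writes non-ASCII as \\uN? escapes. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);
            } else if (c > 127) {
                // RTF expects a signed 16-bit decimal value here.
                sb.append("\\u").append((int) (short) c).append('?');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one paragraph in the colour at the given colour table index. */
    static String paragraph(int colourIndex, String text) {
        return "{\\cf" + colourIndex + " " + escape(text) + "}\\par\n";
    }

    /** Builds a complete RTF document listing deletions, then insertions. */
    static String buildReport(String[] deleted, String[] inserted) {
        StringBuilder rtf = new StringBuilder(HEADER);
        for (String text : deleted) {
            rtf.append(paragraph(1, text));
        }
        for (String text : inserted) {
            rtf.append(paragraph(2, text));
        }
        return rtf.append('}').toString();
    }

    public static void main(String[] args) throws IOException {
        String rtf = buildReport(
                new String[] {"A paragraph that was deleted."},
                new String[] {"A paragraph that was inserted."});
        // The file holds RTF but carries a .doc extension.
        Path out = Files.createTempFile("comparison-report", ".doc");
        Files.write(out, rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println("Wrote " + out);
    }
}
```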

Anyway, if I have the time today, I will make the changes suggested above
and write the easy comparisons. I think the easy comparisons are
paragraphs that have been added to the 'new' document, those that have been
deleted from the original document and those that have been moved. The basic
approach I am going to take is:

1. Get a paragraph from the original document.
2. Try to match it with the paragraph at the same position in the new
document. If a match is found here then no further action is necessary.
3. If a match cannot be found at the same location in the new document,
start from the first paragraph and check every currently un-matched
paragraph. If a match is found then mark this as a paragraph that has been
moved. If a match cannot be found then mark it as a paragraph that has been
deleted.
4. Once all of the paragraphs in the original document have been checked,
any un-matched paragraphs that remain in the new document can be marked as
new paragraphs or insertions.

When I say 'mark' in the above, I mean to create an instance of the new
ComparisonResult class, copy over the reference to the Range and set the
result of the comparison.
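
The four steps above could be sketched roughly as follows. This is an
untested illustration under some assumptions of mine: paragraphs are reduced
to plain Strings and matched with equals(), the class and method names are
hypothetical, and the report is just a list of labelled Strings rather than
ComparisonResult instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphMatchSketch {

    /** Returns one labelled entry per moved, deleted or inserted paragraph. */
    static List<String> compare(List<String> original, List<String> revised) {
        // Tracks which paragraphs of the new document have been matched.
        boolean[] matched = new boolean[revised.size()];
        List<String> report = new ArrayList<String>();

        for (int i = 0; i < original.size(); i++) {
            // Step 1: get a paragraph from the original document.
            String para = original.get(i);

            // Step 2: try the paragraph at the same position.
            if (i < revised.size() && !matched[i]
                    && para.equals(revised.get(i))) {
                matched[i] = true;
                continue;
            }

            // Step 3: check every currently un-matched paragraph.
            int found = -1;
            for (int j = 0; j < revised.size() && found < 0; j++) {
                if (!matched[j] && para.equals(revised.get(j))) {
                    found = j;
                }
            }
            if (found >= 0) {
                matched[found] = true;
                report.add("MOVED: " + para);
            } else {
                report.add("DELETED: " + para);
            }
        }

        // Step 4: whatever is still un-matched must be an insertion.
        for (int j = 0; j < revised.size(); j++) {
            if (!matched[j]) {
                report.add("INSERTED: " + revised.get(j));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("A", "B", "C", "D");
        List<String> revised = Arrays.asList("A", "C", "E", "B");
        for (String line : compare(original, revised)) {
            System.out.println(line);
        }
    }
}
```

One thing the sketch makes visible: a single insertion near the top of the
new document shifts every later paragraph, so all of them would be reported
as moved. That may or may not be acceptable for the report.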

For now, I will not spend time looking at tables and will try to ignore the
possible complexities of deciding if a paragraph has been modified. It may
prove a step too far but I would also like to try adding the result
producing step just to see if HWPF can produce a suitable report for us.

As before, will post again if I make any progress. Do not feel you cannot
say 'stop' or 'wait I want to think about something' or that you cannot
suggest changes, modifications or a completely different approach yourself
if the current solution is veering away from your original requirement.
Ideally, we should produce the solution together and I ought not to 'force'
you into something; I am only too well aware of how easy it is to be swept
along as a project gathers momentum.

Yours

Mark B


bihag wrote:
> 
> Hi Mark,
> 
> I would sincerely like to convey my thanks to you.
> The tips you have given are really helpful.
> I appreciate your time and efforts.
> 
> Regards,
> Bihag 
> 
> 
> 
> MSB wrote:
>> 
>> Have not had the time to do much work or ANY testing so please treat this
>> with caution.
>> 
>> What I am proposing is that the contents of a Word document be converted
>> into an ArrayList. That ArrayList will contain instances of the
>> DocumentPart class and these will facilitate the comparison operation. I
>> have not given those a great deal of thought yet but believe that we
>> should check for any paragraphs - not tables yet - being inserted,
>> deleted, modified (not sure how to proceed with this one yet) or moved.
>> As you can see, I have provided constants in the DocumentPart class to
>> support these different results. The comparison status flag is there to
>> prevent a paragraph being checked again once a match has been found but I
>> am thinking of another use if the logic holds.
>> 
>> As yet, I have not coded the compare methods or the save results method
>> as I think it is wise to thoroughly test the loading method first. We
>> need to be certain that the ArrayList of DocumentPart(s) accurately
>> describes the documents. I think that you are in a 'better' time-zone and
>> that you may have the opportunity to test the code before me. If you look
>> at the main method of the DocumentComparator class, you will see how to
>> run the code. All you need to do for now is make sure that the first two
>> parameters to the compareDocuments() method point to Word files and then
>> run the code. To check the results, you can either modify DocumentPart to
>> add a toString() method that outputs the instance's contents or simply
>> call the getParagraphText() and getCellContents() methods from the
>> compareDocuments() method.
>> 
>> Anyway, here is the code so far. Have a look and see if it is the way you
>> want to go - or think makes sense. Do not feel that you cannot criticise
>> or alter the code or the approach as, for now we are not committed to any
>> particular strategy, just exploring what is possible.
>> 
>> package comparedocuments;
>> 
>> import java.io.File;
>> import java.io.FileInputStream;
>> import java.util.ArrayList;
>> import java.io.FileNotFoundException;
>> import java.io.IOException;
>> 
>> import org.apache.poi.hwpf.HWPFDocument;
>> import org.apache.poi.hwpf.usermodel.Range;
>> import org.apache.poi.hwpf.usermodel.Paragraph;
>> 
>> /**
>>  * An instance of this class can be used to perform a comparison
>>  * between two binary (OLE2CDF) Microsoft Word documents.
>>  *
>>  * @author Mark B
>>  * @version 1.00 27th July 2009
>>  */
>> public class DocumentComparator {
>>     
>>     /**
>>      * Called to compare the two documents and output the results of the
>>      * comparison to a third Microsoft Word document.
>>      *
>>      * @param originalDoc The path to and name of the original document,
>>      *                    the document that is the basis for the
>>      *                    comparison.
>>      * @param compareToDoc The path to and name of the document that
>>      *                     should be compared with the original for any
>>      *                     modifications.
>>      * @param resultDoc The path to and name of the document that should
>>      *                  contain the results of the comparison process.
>>      * @param docTemplate The path to and name of the empty Word document
>>      *                    that should be used as the basis for the
>>      *                    results document.
>>      * @throws java.io.IOException Thrown to signal that some sort of I/O
>>      *                             Exception has occurred.
>>      * @throws java.io.FileNotFoundException Thrown to signal that a file
>>      *                                       could not be located.
>>      */
>>     public void compareDocuments(String originalDoc, String compareToDoc,
>>                                  String resultDoc, String docTemplate)
>>                                  throws IOException, FileNotFoundException {
>>         ArrayList<DocumentPart> originalDocParts =
>>                 this.loadDocument(originalDoc);
>>         ArrayList<DocumentPart> compareToDocParts =
>>                 this.loadDocument(compareToDoc);
>>         this.compareDocs(originalDocParts, compareToDocParts);
>>         this.saveResults(originalDocParts, compareToDocParts, resultDoc);
>>     }
>>     
>>     /**
>>      * Opens a named binary (OLE2CDF) Microsoft Word document and converts
>>      * that document's contents into an ArrayList of instances of the
>>      * DocumentPart class.
>>      *
>>      * @param docName The path to and name of a Microsoft Word document
>>      *                file.
>>      * @return An instance of the ArrayList class encapsulating instances
>>      *         of the DocumentPart class. Each DocumentPart will
>>      *         encapsulate information about a paragraph of text or a
>>      *         table recovered from the Microsoft Word document.
>>      * @throws java.io.IOException If an I/O Exception occurs
>>      * @throws java.io.FileNotFoundException Thrown to indicate that the
>>      *                                       named Microsoft Word file
>>      *                                       could not be located.
>>      */
>>     public ArrayList<DocumentPart> loadDocument(String docName)
>>                                      throws IOException, FileNotFoundException {
>>         File file = null;
>>         FileInputStream fis = null;
>>         HWPFDocument document = null;
>>         Range overallRange = null;
>>         Paragraph para = null;
>>         int numParas = 0;
>>         boolean inTable = false;
>>         // Create the list up front; leaving it null would cause a
>>         // NullPointerException on the first call to add().
>>         ArrayList<DocumentPart> docParts = new ArrayList<DocumentPart>();
>>         try {
>>             // Open the Word file.
>>             file = new File(docName);
>>             fis = new FileInputStream(file);
>>             document = new HWPFDocument(fis);
>>             // Get the overall Range for the document and the number
>>             // of paragraphs from this Range.
>>             overallRange = document.getOverallRange();
>>             numParas = overallRange.numParagraphs();
>>             for(int i = 0; i < numParas; i++) {
>>                 para = overallRange.getParagraph(i);
>>                 // Is the paragraph 'in' a table? If so, it is possible to
>>                 // recover a reference to that Table from the first
>>                 // paragraph only. If calls are made to the getTable()
>>                 // method using subsequent paragraphs then an exception
>>                 // will be thrown. So, after getting the Table, a flag is
>>                 // set to prevent further calls to the getTable() method.
>>                 if(para.isInTable()) {
>>                     if(!inTable) {
>>                         // Get a reference to the Table and pass it to the
>>                         // constructor of the DocumentPart class. Add the
>>                         // DocumentPart instance to the ArrayList.
>>                         docParts.add(new DocumentPart(
>>                                 overallRange.getTable(para)));
>>                         inTable = true;
>>                     }
>>                 }
>>                 // The paragraph is not in a table so simply add a new
>>                 // instance to the ArrayList that encapsulates the
>>                 // paragraph of text.
>>                 else {
>>                     docParts.add(new DocumentPart(para));
>>                     inTable = false;
>>                 }
>>             }
>>             return(docParts);
>>         }
>>         finally {
>>             if(fis != null) {
>>                 try {
>>                   fis.close();  
>>                 }
>>                 catch(IOException ioEx) {
>>                     // I G N O R E
>>                 }
>>             }
>>         }
>>     }
>>     
>>     public void compareDocs(ArrayList<DocumentPart> originalDocParts,
>>                             ArrayList<DocumentPart> compareToDocParts) {
>>         // TO DO: Code comparison
>>     }
>>     
>>     public void saveResults(ArrayList<DocumentPart> originalDocParts,
>>                             ArrayList<DocumentPart> compareToDocParts,
>>                             String resultDoc)
>>                                      throws IOException, FileNotFoundException {
>>         // TO DO: Code saving of results.
>>     }
>> 
>>     /**
>>      * Main entry point to the program.
>>      *
>>      * @param args
>>      */
>>     public static void main(String[] args) {
>>         try {
>>             DocumentComparator docComp = new DocumentComparator();
>>             docComp.compareDocuments("original document",
>>                                      "compare to document",
>>                                      "results document",
>>                                      "results document template");
>>         }
>>         catch(FileNotFoundException fnfEx) {
>>             // TO DO: Code exception handling.
>>         }
>>         catch(IOException ioEx) {
>>             // TO DO: Code exception handling.
>>         }
>>     }
>> }
>> 
>> package comparedocuments;
>> 
>> import org.apache.poi.hwpf.usermodel.Range;
>> import org.apache.poi.hwpf.usermodel.Paragraph;
>> import org.apache.poi.hwpf.usermodel.Table;
>> import org.apache.poi.hwpf.usermodel.TableRow;
>> 
>> /**
>>  * Encapsulates a 'part' of a Microsoft Word document. Currently, that
>>  * part can either be a Table or a paragraph of text.
>>  *
>>  * @author Mark B
>>  * @version 1.00 27th July 2009.
>>  */
>> public class DocumentPart {
>> 
>>     private Range docPart = null;
>>     private boolean comparisonStatus = false;
>>     private int comparisonResult = 0;
>> 
>>     public static final int INSERTED = 0;
>>     public static final int DELETED = 1;
>>     public static final int MODIFIED = 2;
>>     public static final int UN_MODIFIED = 3;
>>     public static final int MOVED = 4;
>> 
>>     /**
>>      * Create a new instance of the DocumentPart class using the
>>      * following parameter.
>>      *
>>      * @param docPart An instance of the
>>      *                org.apache.poi.hwpf.usermodel.Range class that will
>>      *                encapsulate an instance of the
>>      *                org.apache.poi.hwpf.usermodel.Paragraph or an
>>      *                instance of the
>>      *                org.apache.poi.hwpf.usermodel.Table class.
>>      */
>>     public DocumentPart(Range docPart) {
>>         this.docPart = docPart;
>>         // Note that as the part has not been successfully compared to
>>         // another part the status is false.
>>         this.comparisonStatus = false;
>>         // and that the type is set to un-modified. Any parts that have
>>         // not been checked or that are not un-modified will be written
>>         // away to the results document.
>>         this.comparisonResult = DocumentPart.UN_MODIFIED;
>>     }
>> 
>>     /**
>>      * Has a match been found for this document part?
>>      *
>>      * @return A boolean value that indicates whether a match was found
>>      *         between two document parts.
>>      */
>>     public boolean isMatched() {
>>         return(this.comparisonStatus);
>>     }
>> 
>>     /**
>>      * Get the result of the comparison.
>>      *
>>      * @return A primitive int value that indicates the result of
>>      *         comparing this document part to others. The following
>>      *         constants have been declared;
>>      *             DocumentPart.INSERTED = 0;
>>      *             DocumentPart.DELETED = 1;
>>      *             DocumentPart.MODIFIED = 2;
>>      *             DocumentPart.UN_MODIFIED = 3;
>>      *             DocumentPart.MOVED = 4;
>>      *
>>      */
>>     public int getComparisonResult() {
>>         return(this.comparisonResult);
>>     }
>> 
>>     /**
>>      * Store the result of the comparison between document parts.
>>      *
>>      * @param comparisonResult A primitive int whose value indicates the
>>      *                         result of comparing one document part with
>>      *                         others.
>>      */
>>     public void setComparisonResult(int comparisonResult) {
>>         this.comparisonResult = comparisonResult;
>>     }
>> 
>>     /**
>>      * Does a DocumentPart encapsulate a table?
>>      * @return A primitive boolean value; true if the DocumentPart
>>      *         encapsulates a Table, false otherwise.
>>      */
>>     public boolean isTable() {
>>         return(this.docPart instanceof Table);
>>     }
>> 
>>     /**
>>      * If the DocumentPart encapsulates a Table, get the number of rows
>>      * in the table.
>>      *
>>      * @return A primitive int whose value indicates how many rows there
>>      *         are in the table.
>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>      *         method is called for a DocumentPart instance that
>>      *         encapsulates a Paragraph.
>>      */
>>     public int getNumRows() throws UnsupportedOperationException {
>>         int numRows = 0;
>>         if(this.isTable()) {
>>             Table table = (Table)this.docPart;
>>             numRows = table.numRows();
>>         }
>>         else {
>>             throw new UnsupportedOperationException("The DocumentPart " +
>>                     "does not encapsulate a Table.");
>>         }
>>         return(numRows);
>>     }
>> 
>>     /**
>>      * How many columns are there in the Table. This method assumes that
>>      * the table is 'square', i.e. that each row of the Table holds the
>>      * same number of columns.
>>      *
>>      * @return A primitive int whose value indicates how many columns
>>      *         there are in the Table.
>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>      *         method is called for a DocumentPart instance that
>>      *         encapsulates a Paragraph.
>>      */
>>     public int getNumColumns() throws UnsupportedOperationException {
>>         return(this.getNumColumns(0));
>>     }
>> 
>>     /**
>>      * How many columns are there in a specific row of the Table.
>>      *
>>      * @param rowNum A primitive int that indicates the row to check.
>>      *               Remember that row indices are zero based.
>>      * @return A primitive int whose value indicates how many columns
>>      *         there are in the Table row.
>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>      *         method is called for a DocumentPart instance that
>>      *         encapsulates a Paragraph.
>>      */
>>     public int getNumColumns(int rowNum)
>>                                  throws UnsupportedOperationException {
>>         int numColumns = 0;
>>         if(this.isTable()) {
>>             Table table = (Table)this.docPart;
>>             TableRow row = table.getRow(rowNum);
>>             numColumns = row.numCells();
>>         }
>>         else {
>>             throw new UnsupportedOperationException("The DocumentPart " +
>>                     "does not encapsulate a Table.");
>>         }
>>         return(numColumns);
>>     }
>> 
>>     /**
>>      * Return the contents of a specific cell.
>>      *
>>      * @param rowNum A primitive int that indicates the row the cell is
>>      *               on. Remember that row indices are zero based.
>>      * @param colNum A primitive int that indicates the column the cell
>>      *               is in. Remember that column indices are zero based.
>>      * @return An instance of the String class that encapsulates the
>>      *         cell's contents
>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>      *         method is called for a DocumentPart instance that
>>      *         encapsulates a Paragraph.
>>      */
>>     public String getCellContents(int rowNum, int colNum)
>>                                  throws UnsupportedOperationException {
>>         // TO DO: Code recovery of the cell's contents.
>>         return(null);
>>     }
>> 
>>     /**
>>      * Return the text of the Paragraph.
>>      *
>>      * @return An instance of the String class that encapsulates the text
>>      *         the Paragraph contained. Note that this will be stripped
>>      *         of all fields.
>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>      *         method is called for a DocumentPart instance that
>>      *         encapsulates a Table.
>>      */
>>     public String getParagraphText() throws UnsupportedOperationException {
>>         String returnValue = null;
>>         if(!this.isTable()) {
>>             Paragraph para = (Paragraph)this.docPart;
>>             returnValue = Range.stripFields(para.text());
>>         }
>>         else {
>>             // Throw the exception type that the Javadoc and the throws
>>             // clause promise, matching the Table accessors above.
>>             throw new UnsupportedOperationException("The DocumentPart " +
>>                     "does not encapsulate a Paragraph.");
>>         }
>>         return(returnValue);
>>     }
>> }
>> 
>> 
>> 
>> bihag wrote:
>>> 
>>> Hi All,
>>> 
>>> We want to compare two documents and highlight whatever is not common
>>> to both, with some color or in some other way. So I think we have to
>>> merge the documents, or create a new document that has the content of
>>> both, and show the differences with some color, like deleted text in
>>> red and newly added text in blue.
>>> 
>>> Mainly we are looking for an OLE2CDF doc compare solution.
>>> 
>>> Please provide a code snippet if possible.
>>> 
>>> Thanking you in advance.
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-compare-2-word-doc-%28OLE2CDF-or-OpenXML%29.-tp24673506p24693429.html
Sent from the POI - Dev mailing list archive at Nabble.com.

