Hi Mark, 

Sorry, but I haven't started working on it yet ... and won't be able to start
for the next 2 days, as we have our product demo tomorrow ...

Anyway, thanks a lot for your time and effort.

Regards,
Bihag Raval.



MSB wrote:
> 
> Had the chance to re-think the work last night and would like to propose a
> few changes.
> 
> Firstly, the DocumentPart class should be refactored IMO; it is trying to
> do two things at once and this is never a good idea. I propose removing
> the following;
> 
> public static final int INSERTED = 0;
> public static final int DELETED = 1;
> public static final int MODIFIED = 2;
> public static final int UN_MODIFIED = 3;
> public static final int MOVED = 4;
> 
> and
> 
>     /**
>      * Get the result of the comparison.
>      *
>      * @return A primitive int value that indicates the result of
> comparing
>      *         this document part to others. The following constants have
> been
>      *         declared;
>      *             DocumentPart.INSERTED = 0;
>      *             DocumentPart.DELETED = 1;
>      *             DocumentPart.MODIFIED = 2;
>      *             DocumentPart.UN_MODIFIED = 3;
>      *             DocumentPart.MOVED = 4;
>      *
>      */
>     public int getComparisonResult() {
>         return(this.comparisonResult);
>     }
> 
>     /**
>      * Store the result of the comparison between document parts.
>      *
>      * @param comparisonResult A primitive int whose value indicates the
> result
>      *                         of comparing one document part with others.
>      */
>     public void setComparisonResult(int comparisonResult) {
>         this.comparisonResult = comparisonResult;
>     }
> 
>  
> Next, I propose adding a new class called something like ComparisonResult or
> ReportableComparison. It will encapsulate the information removed from the
> DocumentPart class along with a Range object. Its purpose is to track a
> reportable comparison: an insertion, deletion, modification or
> transformation. I propose that as we detect one of these when comparing
> the two documents, an instance of the new class is created, the affected
> Range is copied over and the appropriate comparison result is set.
> Following the comparison between the two documents, an ArrayList of
> comparison result objects can be used to create the report.
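
A minimal sketch of what such a result class might look like. Only the class name comes from the proposal above; the enum (in place of the int constants), the generic parameter R (standing in for the HWPF Range so that the sketch runs without POI) and the method names are all assumptions, not agreed code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed result class.
final class ComparisonResult<R> {

    // Assumed replacement for the int constants removed from DocumentPart.
    // UN_MODIFIED is left out because it is not a reportable result.
    enum Type { INSERTED, DELETED, MODIFIED, MOVED }

    private final R range;    // in the real code this would be an HWPF Range
    private final Type type;

    ComparisonResult(R range, Type type) {
        this.range = range;
        this.type = type;
    }

    R getRange() { return range; }

    Type getType() { return type; }

    @Override
    public String toString() { return type + ": " + range; }
}

// Shows how results could be collected into a list for the report step.
final class ComparisonResultDemo {
    static List<ComparisonResult<String>> sampleReport() {
        List<ComparisonResult<String>> report = new ArrayList<>();
        report.add(new ComparisonResult<>("Old closing paragraph",
                ComparisonResult.Type.DELETED));
        report.add(new ComparisonResult<>("New opening paragraph",
                ComparisonResult.Type.INSERTED));
        return report;
    }
}
```

Using an enum rather than int constants is only one option; it stops an arbitrary int being passed as a result, but the constants would work just as well.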
> 
> Thirdly, there is something you may wish to discuss with your colleagues;
> it is a fall back position for us in case HWPF cannot successfully create
> the comparison report. The Rich Text Format (RTF) is an open, relatively
> simple file format developed by Microsoft. It is possible to create an RTF
> file with a .doc extension such that when the user double clicks on the
> file, Word is used to open it. A rather good API exists, called iText,
> that makes it possible to create RTF files using Java code, so it should
> be possible for this application to output its results as an RTF file
> with a .doc extension. The users may never know any different.
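
To give a feel for how simple RTF is, here is a rough sketch that writes the markup by hand using only the JDK. It does not use iText, and the class and method names are made up for illustration. It colours deleted text red and inserted text blue, as asked for in the original question, and only handles plain ASCII text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical sketch: hand-written RTF, not the iText API.
final class RtfReportSketch {

    // Escape the three characters that have special meaning in RTF.
    // The backslash must be handled first so later escapes are not doubled.
    static String escape(String text) {
        return text.replace("\\", "\\\\")
                   .replace("{", "\\{")
                   .replace("}", "\\}");
    }

    // One paragraph drawn in the given colour table entry
    // (1 = red, 2 = blue, 0 = default colour).
    static String paragraph(int colourIndex, String text) {
        return "\\cf" + colourIndex + " " + escape(text) + "\\cf0\\par\n";
    }

    // Build a tiny report: one deleted paragraph and one inserted paragraph.
    static String buildReport(String deleted, String inserted) {
        StringBuilder rtf = new StringBuilder();
        rtf.append("{\\rtf1\\ansi\\deff0\n");
        rtf.append("{\\fonttbl{\\f0 Arial;}}\n");
        rtf.append("{\\colortbl;\\red255\\green0\\blue0;\\red0\\green0\\blue255;}\n");
        rtf.append(paragraph(1, deleted));
        rtf.append(paragraph(2, inserted));
        rtf.append("}");
        return rtf.toString();
    }

    // Write the markup to disk; the file name may end in .doc.
    static void write(String fileName, String rtf) throws IOException {
        Files.write(Paths.get(fileName),
                rtf.getBytes(StandardCharsets.US_ASCII));
    }
}
```

Calling RtfReportSketch.write("results.doc", RtfReportSketch.buildReport("old text", "new text")) should give a file that Word opens directly. A real report would need more than this (non-ASCII text, fonts, tables), which is where a library such as iText would earn its keep.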
> 
> Anyway, if I have the time today, I will make the changes suggested above
> and write the easy comparisons. I think the easy comparisons are
> paragraphs that have been added to the 'new' document, those that have
> been deleted from the original document and those that have been moved.
> The basic approach I am going to take is:
> 
> 1. Get a paragraph from the original document.
> 2. Try to match it with the paragraph at the same position in the new
> document. If a match is found here then no further action is necessary.
> 3. If a match cannot be found at the same location in the new document,
> start from the first paragraph and check every currently un-matched
> paragraph. If a match is found then mark this as a paragraph that has been
> moved. If a match cannot be found then mark it as a paragraph that has been
> deleted.
> 4. Once all of the paragraphs in the original document have been checked,
> any un-matched paragraphs that remain in the new document can be marked as
> new paragraphs or insertions.
> 
> When I say 'mark' in the above, I mean to create an instance of the new
> ComparisonResult class, copy over the reference to the Range and set the
> result of the comparison.
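
The four steps above can be sketched as follows. This is a simplified illustration, not the HWPF code: paragraphs are reduced to plain Strings compared with equals(), and the ParagraphMatcher, Change and Difference names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the matching steps, using Strings for paragraphs.
final class ParagraphMatcher {

    enum Change { MOVED, DELETED, INSERTED }

    // One reportable difference: what happened and to which paragraph.
    static final class Difference {
        final Change change;
        final String text;

        Difference(Change change, String text) {
            this.change = change;
            this.text = text;
        }

        @Override
        public String toString() { return change + ": " + text; }
    }

    static List<Difference> compare(List<String> original, List<String> revised) {
        // Tracks which paragraphs of the new document are already matched.
        boolean[] matched = new boolean[revised.size()];
        List<Difference> report = new ArrayList<>();

        for (int i = 0; i < original.size(); i++) {
            String para = original.get(i);   // step 1
            // Step 2: same text at the same position, nothing to report.
            if (i < revised.size() && !matched[i] && para.equals(revised.get(i))) {
                matched[i] = true;
                continue;
            }
            // Step 3: scan every currently un-matched paragraph from the start.
            int found = -1;
            for (int j = 0; j < revised.size(); j++) {
                if (!matched[j] && para.equals(revised.get(j))) {
                    found = j;
                    break;
                }
            }
            if (found >= 0) {
                matched[found] = true;
                report.add(new Difference(Change.MOVED, para));
            } else {
                report.add(new Difference(Change.DELETED, para));
            }
        }
        // Step 4: anything still un-matched in the new document is an insertion.
        for (int j = 0; j < revised.size(); j++) {
            if (!matched[j]) {
                report.add(new Difference(Change.INSERTED, revised.get(j)));
            }
        }
        return report;
    }
}
```

One thing the sketch makes visible: because step 2 compares by position, a single paragraph inserted near the top of the new document causes every paragraph after it to be reported as moved. That may or may not be acceptable for the report.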
> 
> For now, I will not spend time looking at tables and will try to ignore
> the possible complexities of deciding if a paragraph has been modified.
> It may prove a step too far but I would also like to try adding the result
> producing step just to see if HWPF can produce a suitable report for us.
> 
> As before, will post again if I make any progress. Do not feel you cannot
> say 'stop' or 'wait I want to think about something' or that you cannot
> suggest changes, modifications or a completely different approach yourself
> if the current solution is veering away from your original requirement.
> Ideally, we should produce the solution together and I ought not to
> 'force' you into something; I am only too well aware of how easy it is to
> be swept along as a project gathers momentum.
> 
> Yours
> 
> Mark B
> 
> 
> bihag wrote:
>> 
>> Hi Mark,
>> 
>> I would sincerely like to convey my thanks to you.
>> The tips you have given are really helpful.
>> I appreciate your time and effort.
>> 
>> Regards,
>> Bihag 
>> 
>> 
>> 
>> MSB wrote:
>>> 
>>> Have not had the time to do much work or ANY testing so please treat
>>> this with caution.
>>> 
>>> What I am proposing is that the contents of a Word document be converted
>>> into an ArrayList. That ArrayList will contain instances of the
>>> DocumentPart class and these will facilitate the comparison operation. I
>>> have not given those a great deal of thought yet but believe that we
>>> should check for any paragraphs - not tables yet - being inserted,
>>> deleted, modified (not sure how to proceed with this one yet) or moved.
>>> As you can see, I have provided constants in the DocumentPart class to
>>> support these different results. The comparison status flag is there to
>>> prevent a paragraph being checked again once a match has been found but
>>> I am thinking of another use if the logic holds.
>>> 
>>> As yet, I have not coded the compare methods or the save results method
>>> as I think it is wise to thoroughly test the loading method first. We
>>> need to be certain that the ArrayList of DocumentPart(s) accurately
>>> describes the documents. I think that you are in a 'better' time-zone
>>> and that you may have the opportunity to test the code before me. If you
>>> look at the main method of the DocumentComparator class, you will see
>>> how to run the code. All you need to do for now is make sure that the
>>> first two parameters to the compareDocuments() method point to Word
>>> files and then run the code. To check the results, you can either modify
>>> DocumentPart to add a toString() method that outputs the instance's
>>> contents or simply call the getParagraphText() and getCellContents()
>>> methods from the compareDocuments() method.
>>> 
>>> Anyway, here is the code so far. Have a look and see if it is the way
>>> you want to go - or think makes sense. Do not feel that you cannot
>>> criticise or alter the code or the approach as, for now we are not
>>> committed to any particular strategy, just exploring what is possible.
>>> 
>>> package comparedocuments;
>>> 
>>> import java.io.File;
>>> import java.io.FileInputStream;
>>> import java.util.ArrayList;
>>> import java.io.FileNotFoundException;
>>> import java.io.IOException;
>>> 
>>> import org.apache.poi.hwpf.HWPFDocument;
>>> import org.apache.poi.hwpf.usermodel.Range;
>>> import org.apache.poi.hwpf.usermodel.Paragraph;
>>> 
>>> /**
>>>  * An instance of this class can be used to perform a comparison between
>>>  * two binary (OLE2CDF) Microsoft Word documents.
>>>  *
>>>  * @author Mark B
>>>  * @version 1.00 27th July 2009
>>>  */
>>> public class DocumentComparator {
>>>     
>>>     /**
>>>      * Called to compare the two documents and output the results of the
>>>      * comparison to a third Microsoft Word document.
>>>      * 
>>>      * @param originalDoc The path to and name of the original document,
>>> the
>>>      *                    document that is the basis for the comparison.
>>>      * @param compareToDoc The path to and name of the document that
>>> should
>>>      *                     be compared with the original for any
>>> modifications.
>>>      * @param resultDoc The path to and name of the document that should
>>> contain
>>>      *                  the results of the comparison process.
>>>      * @param docTemplate The path to and name of the empty Word
>>>      *                    document that should be used as the basis for
>>>      *                    the results document.
>>>      * @throws java.io.IOException Thrown to signal that some sort of
>>> I/O
>>>      *                             Exception has occurred.
>>>      * @throws java.io.FileNotFoundException Thrown to signal that a
>>> file
>>>      *                                       could not be located.
>>>      */
>>>     public void compareDocuments(String originalDoc, String
>>> compareToDoc,
>>>                                  String resultDoc, String docTemplate)
>>>                                  throws IOException,
>>> FileNotFoundException {
>>>         ArrayList<DocumentPart> originalDocParts =
>>> this.loadDocument(originalDoc);
>>>         ArrayList<DocumentPart> compareToDocParts =
>>> this.loadDocument(compareToDoc);
>>>         this.compareDocs(originalDocParts, compareToDocParts);
>>>         this.saveResults(originalDocParts, compareToDocParts,
>>> resultDoc);
>>>     }
>>>     
>>>     /**
>>>      * Opens a named binary (OLE2CDF) Microsoft Word document and
>>>      * converts that document's contents into an ArrayList of instances
>>>      * of the DocumentPart class.
>>>      * @param docName The path to and name of a Microsoft Word document
>>> file.
>>>      * @return An instance of the ArrayList class encapsulating
>>> instances
>>>      *         of the DocumentPart class. Each DocumentPart will
>>> encapsulate
>>>      *         information about a paragraph of text or a table
>>> recovered from
>>>      *         the Microsoft Word document.
>>>      * @throws java.io.IOException If an I/O Exception occurs
>>>      * @throws java.io.FileNotFoundException Thrown to indicate that the
>>>      *                                       named Microsoft Word file
>>> could
>>>      *                                       not be located.
>>>      */
>>>     public ArrayList<DocumentPart> loadDocument(String docName)
>>>                                      throws IOException,
>>> FileNotFoundException {
>>>         File file = null;
>>>         FileInputStream fis = null;
>>>         HWPFDocument document = null;
>>>         Range overallRange = null;
>>>         Paragraph para = null;
>>>         int numParas = 0;
>>>         boolean inTable = false;
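>>>         // Create the list up front so that docParts.add() below cannot
>>>         // fail on a null reference.
>>>         ArrayList<DocumentPart> docParts = new ArrayList<DocumentPart>();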
>>>         try {
>>>             // Open the Word file.
>>>             file = new File(docName);
>>>             fis = new FileInputStream(file);
>>>             document = new HWPFDocument(fis);
>>>             // Get the overall Range for the document and the number
>>>             // of paragraphs from this Range.
>>>             overallRange = document.getOverallRange();
>>>             numParas = overallRange.numParagraphs();
>>>             for(int i = 0; i < numParas; i++) {
>>>                 para = overallRange.getParagraph(i);
>>>                 // Is the paragraph 'in' a table? If so, it is possible
>>>                 // to recover a reference to that Table from the first
>>>                 // paragraph only. If calls are made to the getTable()
>>>                 // method using subsequent paragraphs then an exception
>>>                 // will be thrown. So, after getting the Table, a flag is
>>>                 // set to prevent further calls to the getTable() method.
>>>                 if(para.isInTable()) {
>>>                     if(!inTable) {
>>>                         // Get a reference to the Table and pass it to
>>>                         // the constructor of the DocumentPart class. Add
>>>                         // the DocumentPart instance to the ArrayList.
>>>                         docParts.add(new DocumentPart(
>>>                                 overallRange.getTable(para)));
>>>                         inTable = true;
>>>                     }
>>>                 }
>>>                 // The paragraph is not in a table so simply add a new
>>>                 // instance to the ArrayList that encapsulates the
>>>                 // paragraph of text.
>>>                 else {
>>>                     docParts.add(new DocumentPart(para));
>>>                     inTable = false;
>>>                 }
>>>             }
>>>             return(docParts);
>>>         }
>>>         finally {
>>>             if(fis != null) {
>>>                 try {
>>>                   fis.close();  
>>>                 }
>>>                 catch(IOException ioEx) {
>>>                     // I G N O R E
>>>                 }
>>>             }
>>>         }
>>>     }
>>>     
>>>     public void compareDocs(ArrayList<DocumentPart> originalDocParts,
>>>                             ArrayList<DocumentPart> compareToDocParts) {
>>>         // TO DO: Code comparison
>>>     }
>>>     
>>>     public void saveResults(ArrayList<DocumentPart> originalDocParts,
>>>                             ArrayList<DocumentPart> compareToDocParts,
>>>                             String resultDoc)
>>>                                      throws IOException,
>>> FileNotFoundException {
>>>         // TO DO: Code saving of results.
>>>     }
>>> 
>>>     /**
>>>      * Main entry point to the program.
>>>      *
>>>      * @param args
>>>      */
>>>     public static void main(String[] args) {
>>>         try {
>>>             DocumentComparator docComp = new DocumentComparator();
>>>             docComp.compareDocuments("original document",
>>>                                      "compare to document",
>>>                                      "results document",
>>>                                      "results document template");
>>>         }
>>>         catch(FileNotFoundException fnfEx) {
>>>             // TO DO: Code exception handling.
>>>         }
>>>         catch(IOException ioEx) {
>>>             // TO DO: Code exception handling.
>>>         }
>>>     }
>>> }
>>> 
>>> package comparedocuments;
>>> 
>>> import org.apache.poi.hwpf.usermodel.Range;
>>> import org.apache.poi.hwpf.usermodel.Paragraph;
>>> import org.apache.poi.hwpf.usermodel.Table;
>>> import org.apache.poi.hwpf.usermodel.TableRow;
>>> 
>>> /**
>>>  * Encapsulates a 'part' of a Microsoft Word document. Currently, that
>>> part can
>>>  * either be a Table or a paragraph of text.
>>>  *
>>>  * @author Mark B
>>>  * @version 1.00 27th July 2009.
>>>  */
>>> public class DocumentPart {
>>> 
>>>     private Range docPart = null;
>>>     private boolean comparisonStatus = false;
>>>     private int comparisonResult = 0;
>>> 
>>>     public static final int INSERTED = 0;
>>>     public static final int DELETED = 1;
>>>     public static final int MODIFIED = 2;
>>>     public static final int UN_MODIFIED = 3;
>>>     public static final int MOVED = 4;
>>> 
>>>     /**
>>>      * Create a new instance of the DocumentPart class using the
>>>      * following parameter.
>>>      *
>>>      * @param docPart An instance of the
>>> org.apache.poi.hwpf.usermodel.Range
>>>      *                class that will encapsulate an instance of the
>>>      *                org.apache.poi.hwpf.usermodel.Paragraph or an
>>> instance
>>>      *                of the org.apache.poi.hwpf.usermodel.Table class.
>>>      */
>>>     public DocumentPart(Range docPart) {
>>>         this.docPart = docPart;
>>>         // Note that as the part has not been successfully compared to
>>>         // another part the status is false.
>>>         this.comparisonStatus = false;
>>>         // and that the type is set to un-modified. Any parts that have
>>>         // not been checked or that are not un-modified will be written
>>>         // away to the results document.
>>>         this.comparisonResult = DocumentPart.UN_MODIFIED;
>>>     }
>>> 
>>>     /**
>>>      * Has a match been found for this document part?
>>>      *
>>>      * @return A boolean value that indicates whether a match was found
>>> between
>>>      *         two document parts.
>>>      */
>>>     public boolean isMatched() {
>>>         return(this.comparisonStatus);
>>>     }
>>> 
>>>     /**
>>>      * Get the result of the comparison.
>>>      *
>>>      * @return A primitive int value that indicates the result of
>>> comparing
>>>      *         this document part to others. The following constants
>>> have been
>>>      *         declared;
>>>      *             DocumentPart.INSERTED = 0;
>>>      *             DocumentPart.DELETED = 1;
>>>      *             DocumentPart.MODIFIED = 2;
>>>      *             DocumentPart.UN_MODIFIED = 3;
>>>      *             DocumentPart.MOVED = 4;
>>>      *
>>>      */
>>>     public int getComparisonResult() {
>>>         return(this.comparisonResult);
>>>     }
>>> 
>>>     /**
>>>      * Store the result of the comparison between document parts.
>>>      *
>>>      * @param comparisonResult A primitive int whose value indicates the
>>> result
>>>      *                         of comparing one document part with
>>> others.
>>>      */
>>>     public void setComparisonResult(int comparisonResult) {
>>>         this.comparisonResult = comparisonResult;
>>>     }
>>> 
>>>     /**
>>>      * Does a DocumentPart encapsulate a table?
>>>      * @return A primitive boolean value; true if the DocumentPart
>>> encapsulates
>>>      *         a Table, false otherwise.
>>>      */
>>>     public boolean isTable() {
>>>         return(this.docPart instanceof Table);
>>>     }
>>> 
>>>     /**
>>>      * If the DocumentPart encapsulates a Table, get the number of rows
>>>      * in the table.
>>>      *
>>>      * @return A primitive int whose value indicates how many rows there
>>> are in
>>>      *         the table.
>>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>>      *         called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>>      */
>>>     public int getNumRows() throws UnsupportedOperationException {
>>>         int numRows = 0;
>>>         if(this.isTable()) {
>>>             Table table = (Table)this.docPart;
>>>             numRows = table.numRows();
>>>         }
>>>         else {
>>>             throw new UnsupportedOperationException(
>>>                     "The DocumentPart does not encapsulate a Table.");
>>>         }
>>>         return(numRows);
>>>     }
>>> 
>>>     /**
>>>      * How many columns are there in the Table. This method assumes that
>>> the
>>>      * table is 'square', i.e. that each row of the Table holds the same
>>> number
>>>      * of columns.
>>>      *
>>>      * @return A primitive int whose value indicates how many columns
>>> there are
>>>      *         in the Table.
>>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>>      *         called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>>      */
>>>     public int getNumColumns() throws UnsupportedOperationException {
>>>         return(this.getNumColumns(0));
>>>     }
>>> 
>>>     /**
>>>      * How many columns are there in a specific row of the Table.
>>>      *
>>>      * @return A primitive int whose value indicates how many columns
>>> there are
>>>      *         in the Table row.
>>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>>      *         called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>>      */
>>>     public int getNumColumns(int rowNum) throws
>>> UnsupportedOperationException {
>>>         int numColumns = 0;
>>>         if(this.isTable()) {
>>>             Table table = (Table)this.docPart;
>>>             TableRow row = table.getRow(rowNum);
>>>             numColumns = row.numCells();
>>>         }
>>>         else {
>>>             throw new UnsupportedOperationException(
>>>                     "The DocumentPart does not encapsulate a Table.");
>>>         }
>>>         return(numColumns);
>>>     }
>>> 
>>>     /**
>>>      * Return the contents of a specific cell.
>>>      *
>>>      * @param rowNum A primitive int that indicates the row the cell is
>>> on.
>>>      *               Remember that row indices are zero based.
>>>      * @param colNum A primitive int that indicates the column the cell
>>> is in.
>>>      *               Remember that column indices are zero based.
>>>      * @return An instance of the String class that encapsulates the
>>> cells
>>>      *         contents
>>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>>      *         called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>>      */
>>>     public String getCellContents(int rowNum, int colNum)
>>>                                           throws
>>> UnsupportedOperationException {
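>>>         if(!this.isTable()) {
>>>             throw new UnsupportedOperationException(
>>>                     "The DocumentPart does not encapsulate a Table.");
>>>         }
>>>         Table table = (Table)this.docPart;
>>>         return(table.getRow(rowNum).getCell(colNum).text());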
>>>     }
>>> 
>>>     /**
>>>      * Return the text of the Paragraph.
>>>      *
>>>      * @return An instance of the String class that encapsulates the
>>> text
>>>      *         the Paragraph contained. Note that this will be stripped
>>> of
>>>      *         all fields.
>>>      * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>>      *         called for a DocumentPart instance that encapsulates a
>>> Table.
>>>      */
>>>     public String getParagraphText() throws
>>> UnsupportedOperationException {
>>>         String returnValue = null;
>>>         if(!this.isTable()) {
>>>             Paragraph para = (Paragraph)this.docPart;
>>>             returnValue = Range.stripFields(para.text());
>>>         }
>>>         else {
>>>             throw new UnsupportedOperationException(
>>>                     "The DocumentPart does not encapsulate a Paragraph.");
>>>         }
>>>         return(returnValue);
>>>     }
>>> }
>>> 
>>> 
>>> 
>>> bihag wrote:
>>>> 
>>>> Hi All,
>>>> 
>>>> We want to compare two documents and highlight whatever is not common
>>>> to both, with some color or in some other way ... So I think we have
>>>> to merge the documents or create a new document which has the content
>>>> of both, and show the differences with some color, like
>>>> deleted in red, newly added in blue ...
>>>> 
>>>> Mainly we are looking for an OLE2CDF doc compare solution ...
>>>> 
>>>> Please provide a code snippet if possible ...
>>>> 
>>>> Thanking you in advance ...
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-compare-2-word-doc-%28OLE2CDF-or-OpenXML%29.-tp24673506p24695694.html
Sent from the POI - Dev mailing list archive at Nabble.com.

