Hi Mark,
Sorry, but I haven't started working on it yet ... and won't be able to start
for the next 2 days as we have our product demo tomorrow ...
Anyway, thanks a lot for your time and effort.
Regards,
Bihag Raval.
MSB wrote:
>
> Had the chance to re-think the work last night and would like to propose a
> few changes.
>
> Firstly, the DocumentPart class should be refactored IMO; it is trying to
> do two things at once and this is never a good idea. I propose removing
> the following;
>
> public static final int INSERTED = 0;
> public static final int DELETED = 1;
> public static final int MODIFIED = 2;
> public static final int UN_MODIFIED = 3;
> public static final int MOVED = 4;
>
> and
>
> /**
> * Get the result of the comparison.
> *
> * @return A primitive int value that indicates the result of
> comparing
> * this document part to others. The following constants have
> been
> * declared;
> * DocumentPart.INSERTED = 0;
> * DocumentPart.DELETED = 1;
> * DocumentPart.MODIFIED = 2;
> * DocumentPart.UN_MODIFIED = 3;
> * DocumentPart.MOVED = 4;
> *
> */
> public int getComparisonResult() {
> return(this.comparisonResult);
> }
>
> /**
> * Store the result of the comparison between document parts.
> *
> * @param comparisonResult A primitive int whose value indicates the
> result
> * of comparing one document part with others.
> */
> public void setComparisonResult(int comparisonResult) {
> this.comparisonResult = comparisonResult;
> }
>
>
> Next, I propose adding a new class called something like ComparisonResult or
> ReportableComparison. It will encapsulate the information removed from the
> DocumentPart class along with a Range object. Its purpose is to track a
> reportable comparison: an insertion, deletion, modification or
> transformation. I propose that as we detect one of these when comparing
> the two documents an instance of the new class is created, the affected
> Range is copied over and the appropriate comparison result setting made.
> Following the comparison between the two documents, an ArrayList of
> comparison result objects can be used to create the report.
>
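A minimal sketch of what the proposed ComparisonResult class might look like. The nested Type enum, the generic parameter R (standing in for HWPF's Range so that the sketch compiles without POI on the classpath) and the demo class are assumptions made for illustration; none of them appear in the code posted so far. UN_MODIFIED is left out on the assumption that unmodified parts are not reportable.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the proposed ComparisonResult class. R is a stand-in for
 * org.apache.poi.hwpf.usermodel.Range (an assumption of this sketch).
 */
final class ComparisonResult<R> {

    /** Replaces the int constants that would be removed from DocumentPart. */
    enum Type { INSERTED, DELETED, MODIFIED, MOVED }

    private final Type type;
    private final R range;

    ComparisonResult(Type type, R range) {
        if (type == null || range == null) {
            throw new IllegalArgumentException("type and range are required");
        }
        this.type = type;
        this.range = range;
    }

    Type getType() {
        return type;
    }

    R getRange() {
        return range;
    }

    @Override
    public String toString() {
        return type + ": " + range;
    }
}

/** Hypothetical demo: collect results in a list, then walk it to report. */
class ComparisonResultDemo {
    public static void main(String[] args) {
        List<ComparisonResult<String>> report =
                new ArrayList<ComparisonResult<String>>();
        report.add(new ComparisonResult<String>(
                ComparisonResult.Type.DELETED, "A paragraph that was removed"));
        report.add(new ComparisonResult<String>(
                ComparisonResult.Type.INSERTED, "A paragraph that was added"));
        for (ComparisonResult<String> result : report) {
            System.out.println(result);
        }
    }
}
```

In the real application R would be Range, and the list would be handed to whatever method writes the results document.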
> Thirdly, there is something you may wish to discuss with your colleagues;
> it is a fall back position for us in case HWPF cannot successfully create
> the comparison report. The Rich Text File Format is an open, relatively
> simple file format developed by Microsoft. It is possible to create an rtf
> file with a .doc extension such that when the user double clicks on the
> file, Word is used to open it. A rather good API exists - called iText -
> that makes it possible to create rtf files using Java code, so it should
> be possible for this application to output its results as an rtf file,
> with a .doc extension. The users may never know any different.
>
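To make the fall-back concrete, here is a rough sketch that writes the RTF markup by hand using only the JDK, rather than through iText, so it can be tried without any extra library. The class name RtfReportSketch, the output file name and the colour choices (red for deleted, blue for inserted, as in the original request) are assumptions for illustration only.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class RtfReportSketch {

    /** Escapes RTF control characters; non-ASCII characters become numeric escapes. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                out.append('\\').append(c);
            } else if (c > 127) {
                // RTF expects a signed 16-bit decimal value followed by a
                // fallback character for readers that ignore the escape.
                out.append("\\u").append((int) (short) c).append('?');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds a small RTF document: deleted paragraphs in red, inserted in blue. */
    static String build(String[] deleted, String[] inserted) {
        StringBuilder rtf = new StringBuilder();
        rtf.append("{\\rtf1\\ansi\\deff0");
        rtf.append("{\\fonttbl{\\f0 Times New Roman;}}");
        // Colour table: index 0 = default, 1 = red, 2 = blue.
        rtf.append("{\\colortbl;\\red255\\green0\\blue0;\\red0\\green0\\blue255;}");
        for (String para : deleted) {
            rtf.append("\\cf1 ").append(escape(para)).append("\\par\n");
        }
        for (String para : inserted) {
            rtf.append("\\cf2 ").append(escape(para)).append("\\par\n");
        }
        rtf.append("}");
        return rtf.toString();
    }

    public static void main(String[] args) throws IOException {
        String rtf = build(new String[] {"This paragraph was deleted."},
                           new String[] {"This paragraph was inserted."});
        // The .doc extension is what makes Word open the file on double click.
        try (OutputStream out = new FileOutputStream("comparison-report.doc")) {
            out.write(rtf.getBytes("US-ASCII"));
        }
    }
}
```

If iText is preferred, the same two loops would presumably drive its document-building calls instead of appending raw markup.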
> Anyway, if I have the time today, I will make the changes suggested above
> and write the easy comparisons. I think the easy comparisons are
> paragraphs that have been added to the 'new' document, those that have
> been deleted from the original document and those that have been moved.
> The basic approach I am going to take is:
>
> 1. Get a paragraph from the original document.
> 2. Try to match it with the paragraph at the same position in the new
> document. If a match is found here then no further action is necessary.
> 3. If a match cannot be found at the same location in the new document,
> start from the first paragraph and check every currently un-matched
> paragraph. If a match is found then mark this as a paragraph that has been
> moved. If a match cannot be found then mark it as a paragraph that has been
> deleted.
> 4. Once all of the paragraphs in the original document have been checked,
> any un-matched paragraphs that remain in the new document can be marked as
> new paragraphs or insertions.
>
> When I say 'mark' in the above, I mean to create an instance of the new
> ComparisonResult class, copy over the reference to the Range and set the
> result of the comparison.
>
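The four steps above can be sketched on plain strings, one string per paragraph, which keeps the example runnable without POI. The class name ParagraphMatcherSketch and the "TYPE: text" report format are assumptions for illustration; in the real code each report line would be a comparison result object holding a Range. One deliberate deviation: same-position matches are claimed in a first pass, so that a paragraph found by the search in step 3 cannot take a slot that a later paragraph would have matched in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphMatcherSketch {

    static List<String> compare(List<String> original, List<String> revised) {
        boolean[] originalMatched = new boolean[original.size()];
        boolean[] revisedMatched = new boolean[revised.size()];
        List<String> report = new ArrayList<String>();

        // Steps 1-2: a match at the same position needs no further action.
        int common = Math.min(original.size(), revised.size());
        for (int i = 0; i < common; i++) {
            if (original.get(i).equals(revised.get(i))) {
                originalMatched[i] = true;
                revisedMatched[i] = true;
            }
        }

        // Step 3: search the still un-matched paragraphs of the new document.
        for (int i = 0; i < original.size(); i++) {
            if (originalMatched[i]) {
                continue;
            }
            String para = original.get(i);
            int found = -1;
            for (int j = 0; j < revised.size(); j++) {
                if (!revisedMatched[j] && para.equals(revised.get(j))) {
                    found = j;
                    break;
                }
            }
            if (found >= 0) {
                revisedMatched[found] = true;
                report.add("MOVED: " + para);
            } else {
                report.add("DELETED: " + para);
            }
        }

        // Step 4: whatever is left un-matched in the new document is an insertion.
        for (int j = 0; j < revised.size(); j++) {
            if (!revisedMatched[j]) {
                report.add("INSERTED: " + revised.get(j));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("Intro", "Body", "Summary");
        List<String> revised = Arrays.asList("Intro", "Summary", "Appendix");
        for (String line : compare(original, revised)) {
            System.out.println(line);
        }
        // Prints:
        // DELETED: Body
        // MOVED: Summary
        // INSERTED: Appendix
    }
}
```

Exact string equality is used here, so a paragraph with even a one-character edit shows up as a deletion plus an insertion; detecting modifications is left out, as in the plan above.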
> For now, I will not spend time looking at tables and will try to ignore
> the possible complexities of deciding if a paragraph has been modified.
> It may prove a step too far but I would also like to try adding the result
> producing step just to see if HWPF can produce a suitable report for us.
>
> As before, will post again if I make any progress. Do not feel you cannot
> say 'stop' or 'wait I want to think about something' or that you cannot
> suggest changes, modifications or a completely different approach yourself
> if the current solution is veering away from your original requirement.
> Ideally, we should produce the solution together and I ought not to
> 'force' you into something; I am only too well aware of how easy it is to
> be swept along as a project gathers momentum.
>
> Yours
>
> Mark B
>
>
> bihag wrote:
>>
>> Hi Mark,
>>
>> I would sincerely like to convey my thanks to you.
>> The tips you have given are really helpful.
>> I appreciate your time and effort.
>>
>> Regards,
>> Bihag
>>
>>
>>
>> MSB wrote:
>>>
>>> Have not had the time to do much work or ANY testing so please treat
>>> this with caution.
>>>
>>> What I am proposing is that the contents of a Word document be converted
>>> into an ArrayList. That ArrayList will contain instances of the
>>> DocumentPart class and these will facilitate the comparison operation. I
>>> have not given those a great deal of thought yet but believe that we
>>> should check for any paragraphs - not tables yet - being inserted,
>>> deleted, modified (not sure how to proceed with this one yet) or moved.
>>> As you can see, I have provided constants in the DocumentPart class to
>>> support these different results. The comparison status flag is there to
>>> prevent a paragraph being checked again once a match has been found but
>>> I am thinking of another use if the logic holds.
>>>
>>> As yet, I have not coded the compare methods or the save results method
>>> as I think it is wise to thoroughly test the loading method first. We
>>> need to be certain that the ArrayList of DocumentPart(s) accurately
>>> describes the documents. I think that you are in a 'better' time-zone
>>> and that you may have the opportunity to test the code before me. If you
>>> look at the main method of the DocumentComparator class, you will see
>>> how to run the code. All you need to do for now is make sure that the
>>> first two parameters to the compareDocuments() method point to Word
>>> files and then run the code. To check the results, you can either modify
>>> DocumentPart to add a toString() method that outputs the instance's
>>> contents or simply call the getParagraphText() and getCellContents()
>>> methods from the compareDocuments() method.
>>>
>>> Anyway, here is the code so far. Have a look and see if it is the way
>>> you want to go - or think makes sense. Do not feel that you cannot
>>> criticise or alter the code or the approach as, for now we are not
>>> committed to any particular strategy, just exploring what is possible.
>>>
>>> package comparedocuments;
>>>
>>> import java.io.File;
>>> import java.io.FileInputStream;
>>> import java.util.ArrayList;
>>> import java.io.FileNotFoundException;
>>> import java.io.IOException;
>>>
>>> import org.apache.poi.hwpf.HWPFDocument;
>>> import org.apache.poi.hwpf.usermodel.Range;
>>> import org.apache.poi.hwpf.usermodel.Paragraph;
>>>
>>> /**
>>> * An instance of this class can be used to perform a comparison between
>>> two
>>> * binary (OLE2CDF) Microsoft Word documents.
>>> *
>>> * @author Mark B
>>> * @version 1.00 27th July 2009
>>> */
>>> public class DocumentComparator {
>>>
>>> /**
>>> * Called to compare the two documents and output the results of the
>>> * comparison to a third Microsoft Word document.
>>> *
>>> * @param originalDoc The path to and name of the original document,
>>> the
>>> * document that is the basis for the comparison.
>>> * @param compareToDoc The path to and name of the document that
>>> should
>>> * be compared with the original for any
>>> modifications.
>>> * @param resultDoc The path to and name of the document that should
>>> contain
>>> * the results of the comparison process.
>>> * @param docTemplate The path to and name of the empty Word
>>> document that
>>> * should be used as the basis for the results
>>> document.
>>> * @throws java.io.IOException Thrown to signal that some sort of
>>> I/O
>>> * Exception has occurred.
>>> * @throws java.io.FileNotFoundException Thrown to signal that a
>>> file
>>> * could not be located.
>>> */
>>> public void compareDocuments(String originalDoc, String
>>> compareToDoc,
>>> String resultDoc, String docTemplate)
>>> throws IOException,
>>> FileNotFoundException {
>>> ArrayList<DocumentPart> originalDocParts =
>>> this.loadDocument(originalDoc);
>>> ArrayList<DocumentPart> compareToDocParts =
>>> this.loadDocument(compareToDoc);
>>> this.compareDocs(originalDocParts, compareToDocParts);
>>> this.saveResults(originalDocParts, compareToDocParts,
>>> resultDoc);
>>> }
>>>
>>> /**
>>> * Opens a named binary (OLE2CDF) Microsoft Word document and
>>> converts that
>>> * document's contents into an ArrayList of instances of the
>>> DocumentPart
>>> * class.
>>> * @param docName The path to and name of a Microsoft Word document
>>> file.
>>> * @return An instance of the ArrayList class encapsulating
>>> instances
>>> * of the DocumentPart class. Each DocumentPart will
>>> encapsulate
>>> * information about a paragraph of text or a table
>>> recovered from
>>> * the Microsoft Word document.
>>> * @throws java.io.IOException If an I/O Exception occurs
>>> * @throws java.io.FileNotFoundException Thrown to indicate that the
>>> * named Microsoft Word file
>>> could
>>> * not be located.
>>> */
>>> public ArrayList<DocumentPart> loadDocument(String docName)
>>> throws IOException,
>>> FileNotFoundException {
>>> File file = null;
>>> FileInputStream fis = null;
>>> HWPFDocument document = null;
>>> Range overallRange = null;
>>> Paragraph para = null;
>>> int numParas = 0;
>>> boolean inTable = false;
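>>> // Create the list up front; calling add() on a null reference would
>>> // throw a NullPointerException for the first paragraph.
>>> ArrayList<DocumentPart> docParts = new ArrayList<DocumentPart>();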
>>> try {
>>> // Open the Word file.
>>> file = new File(docName);
>>> fis = new FileInputStream(file);
>>> document = new HWPFDocument(fis);
>>> // Get the overall Range for the document and the number
>>> // of paragraphs from this Range.
>>> overallRange = document.getOverallRange();
>>> numParas = overallRange.numParagraphs();
>>> for(int i = 0; i < numParas; i++) {
>>> para = overallRange.getParagraph(i);
>>> // Is the paragraph 'in' a table? If so, it is possible
>>> to
>>> // recover a reference to that Table from the first
>>> paragraph
>>> // only. If calls are made to the getTable() method
>>> using
>>> // subsequent paragraphs then an exception will be
>>> thrown. So,
>>> // after getting the Table, a flag is set to prevent
>>> further
>>> // calls to the getTable() method.
>>> if(para.isInTable()) {
>>> if(!inTable) {
>>> // Get a reference to the Table and pass it to
>>> the
>>> // constructor of the DocumentPart class. Add
>>> the
>>> // DocumentPart instance to the ArrayList.
>>> docParts.add(new DocumentPart(
>>> overallRange.getTable(para)));
>>> inTable = true;
>>> }
>>> }
>>> // The paragraph is not in a table so simply add a new
>>> instance
>>> // to the ArrayList that encapsulates the paragraph of
>>> text.
>>> else {
>>> docParts.add(new DocumentPart(para));
>>> inTable = false;
>>> }
>>> }
>>> return(docParts);
>>> }
>>> finally {
>>> if(fis != null) {
>>> try {
>>> fis.close();
>>> }
>>> catch(IOException ioEx) {
>>> // I G N O R E
>>> }
>>> }
>>> }
>>> }
>>>
>>> public void compareDocs(ArrayList<DocumentPart> originalDocParts,
>>> ArrayList<DocumentPart> compareToDocParts) {
>>> // TO DO: Code comparison
>>> }
>>>
>>> public void saveResults(ArrayList<DocumentPart> originalDocParts,
>>> ArrayList<DocumentPart> compareToDocParts,
>>> String resultDoc)
>>> throws IOException,
>>> FileNotFoundException {
>>> // TO DO: Code saving of results.
>>> }
>>>
>>> /**
>>> * Main entry point to the program.
>>> *
>>> * @param args
>>> */
>>> public static void main(String[] args) {
>>> try {
>>> DocumentComparator docComp = new DocumentComparator();
>>> docComp.compareDocuments("original document",
>>> "compare to document",
>>> "results document",
>>> "results document template");
>>> }
>>> catch(FileNotFoundException fnfEx) {
>>> // TO DO: Code exception handling.
>>> }
>>> catch(IOException ioEx) {
>>> // TO DO: Code exception handling.
>>> }
>>> }
>>> }
>>>
>>> package comparedocuments;
>>>
>>> import org.apache.poi.hwpf.usermodel.Range;
>>> import org.apache.poi.hwpf.usermodel.Paragraph;
>>> import org.apache.poi.hwpf.usermodel.Table;
>>> import org.apache.poi.hwpf.usermodel.TableRow;
>>>
>>> /**
>>> * Encapsulates a 'part' of a Microsoft Word document. Currently, that
>>> part can
>>> * either be a Table or a paragraph of text.
>>> *
>>> * @author Mark B
>>> * @version 1.00 27th July 2009.
>>> */
>>> public class DocumentPart {
>>>
>>> private Range docPart = null;
>>> private boolean comparisonStatus = false;
>>> private int comparisonResult = 0;
>>>
>>> public static final int INSERTED = 0;
>>> public static final int DELETED = 1;
>>> public static final int MODIFIED = 2;
>>> public static final int UN_MODIFIED = 3;
>>> public static final int MOVED = 4;
>>>
>>> /**
>>> * Create a new instance of the DocumentPart class using the
>>> following
>>> * parameter.
>>> *
>>> * @param docPart An instance of the
>>> org.apache.poi.hwpf.usermodel.Range
>>> * class that will encapsulate an instance of the
>>> * org.apache.poi.hwpf.usermodel.Paragraph or an
>>> instance
>>> * of the org.apache.poi.hwpf.usermodel.Table class.
>>> */
>>> public DocumentPart(Range docPart) {
>>> this.docPart = docPart;
>>> // Note that as the part has not been successfully compared to
>>> another
>>> // part the status is false.
>>> this.comparisonStatus = false;
>>> // and that the type is set to un-modified. Any parts that have
>>> not been
>>> // checked or that are not un-modified will be written away to
>>> the
>>> // results document.
>>> this.comparisonResult = DocumentPart.UN_MODIFIED;
>>> }
>>>
>>> /**
>>> * Has a match been found for this document part?
>>> *
>>> * @return A boolean value that indicates whether a match was found
>>> between
>>> * two document parts.
>>> */
>>> public boolean isMatched() {
>>> return(this.comparisonStatus);
>>> }
>>>
>>> /**
>>> * Get the result of the comparison.
>>> *
>>> * @return A primitive int value that indicates the result of
>>> comparing
>>> * this document part to others. The following constants
>>> have been
>>> * declared;
>>> * DocumentPart.INSERTED = 0;
>>> * DocumentPart.DELETED = 1;
>>> * DocumentPart.MODIFIED = 2;
>>> * DocumentPart.UN_MODIFIED = 3;
>>> * DocumentPart.MOVED = 4;
>>> *
>>> */
>>> public int getComparisonResult() {
>>> return(this.comparisonResult);
>>> }
>>>
>>> /**
>>> * Store the result of the comparison between document parts.
>>> *
>>> * @param comparisonResult A primitive int whose value indicates the
>>> result
>>> * of comparing one document part with
>>> others.
>>> */
>>> public void setComparisonResult(int comparisonResult) {
>>> this.comparisonResult = comparisonResult;
>>> }
>>>
>>> /**
>>> * Does a DocumentPart encapsulate a table?
>>> * @return A primitive boolean value; true if the DocumentPart
>>> encapsulates
>>> * a Table, false otherwise.
>>> */
>>> public boolean isTable() {
>>> return(this.docPart instanceof Table);
>>> }
>>>
>>> /**
>>> * If the DocumentPart encapsulates a Table, get the number of rows
>>> in the
>>> * table.
>>> *
>>> * @return A primitive int whose value indicates how many rows there
>>> are in
>>> * the table.
>>> * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>> * called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>> */
>>> public int getNumRows() throws UnsupportedOperationException {
>>> int numRows = 0;
>>> if(this.isTable()) {
>>> Table table = (Table)this.docPart;
>>> numRows = table.numRows();
>>> }
>>> else {
>>> throw new UnsupportedOperationException("The DocumentPart
>>> does " +
>>> "not encapsulate a Table.");
>>> }
>>> return(numRows);
>>> }
>>>
>>> /**
>>> * How many columns are there in the Table. This method assumes that
>>> the
>>> * table is 'square', i.e. that each row of the Table holds the same
>>> number
>>> * of columns.
>>> *
>>> * @return A primitive int whose value indicates how many columns
>>> there are
>>> * in the Table.
>>> * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>> * called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>> */
>>> public int getNumColumns() throws UnsupportedOperationException {
>>> return(this.getNumColumns(0));
>>> }
>>>
>>> /**
>>> * How many columns are there in a specific row of the Table.
>>> *
>>> * @return A primitive int whose value indicates how many columns
>>> there are
>>> * in the Table row.
>>> * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>> * called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>> */
>>> public int getNumColumns(int rowNum) throws
>>> UnsupportedOperationException {
>>> int numColumns = 0;
>>> if(this.isTable()) {
>>> Table table = (Table)this.docPart;
>>> TableRow row = table.getRow(rowNum);
>>> numColumns = row.numCells();
>>> }
>>> else {
>>> throw new UnsupportedOperationException("The DocumentPart
>>> does " +
>>> "not encapsulate a Table.");
>>> }
>>> return(numColumns);
>>> }
>>>
>>> /**
>>> * Return the contents of a specific cell.
>>> *
>>> * @param rowNum A primitive int that indicates the row the cell is
>>> on.
>>> * Remember that row indices are zero based.
>>> * @param colNum A primitive int that indicates the column the cell
>>> is in.
>>> * Remember that column indices are zero based.
>>> * @return An instance of the String class that encapsulates the
>>> cell's
>>> * contents
>>> * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>> * called for a DocumentPart instance that encapsulates a
>>> Paragraph.
>>> */
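>>> public String getCellContents(int rowNum, int colNum)
>>> throws UnsupportedOperationException {
>>> if(!this.isTable()) {
>>> throw new UnsupportedOperationException(
>>> "The DocumentPart does not encapsulate a Table.");
>>> }
>>> // TableCell extends Range, so text() returns the cell's raw text.
>>> Table table = (Table)this.docPart;
>>> return(table.getRow(rowNum).getCell(colNum).text());
>>> }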
>>>
>>> /**
>>> * Return the text of the Paragraph.
>>> *
>>> * @return An instance of the String class that encapsulates the
>>> text
>>> * the Paragraph contained. Note that this will be stripped
>>> of
>>> * all fields.
>>> * @throws java.lang.UnsupportedOperationException Thrown if this
>>> method is
>>> * called for a DocumentPart instance that encapsulates a
>>> Table.
>>> */
>>> public String getParagraphText() throws
>>> UnsupportedOperationException {
>>> String returnValue = null;
>>> if(!this.isTable()) {
>>> Paragraph para = (Paragraph)this.docPart;
>>> returnValue = Range.stripFields(para.text());
>>> }
>>> else {
>>> throw new UnsupportedOperationException("The DocumentPart does not "
>>> +
>>> "encapsulate a Paragraph.");
>>> }
>>> return(returnValue);
>>> }
>>> }
>>>
>>>
>>>
>>> bihag wrote:
>>>>
>>>> Hi All,
>>>>
>>>> We want to compare two documents and highlight whatever is not common
>>>> to both with some color or in some other way ... So I think we have to
>>>> merge the documents or create a new document which has the content of
>>>> both documents, and show the differences with some color, like
>>>> deleted in red, newly added in blue ...
>>>>
>>>> Mainly we are looking for an OLE2CDF doc compare solution ...
>>>>
>>>> Please provide a code snippet if possible ...
>>>>
>>>> Thanking you in advance ...
>>>>
>>>
>>>
>>
>>
>
>
--
View this message in context:
http://www.nabble.com/How-to-compare-2-word-doc-%28OLE2CDF-or-OpenXML%29.-tp24673506p24695694.html
Sent from the POI - Dev mailing list archive at Nabble.com.