Yes, actually determining what has changed and how is a challenge but I think
that the first step is to 'convert' the two documents into arrays - most
likely ArrayList(s) as they can grow dynamically and we do not need to worry
about setting their capacities. Then, it will be possible to step through
each ArrayList, grab elements and compare them. The best type for the
ArrayList(s) is the org.apache.poi.hwpf.usermodel.Range class in my opinion
because both Paragraph and Table are subclasses of it. If you can write code
that opens a Word document, gets the overall Range object, recovers the
number of Paragraphs from that and then recovers each Paragraph in turn,
then each one can be stored into the ArrayList. Once both ArrayList(s) are
full, it will be possible to step through each and compare the elements.
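
As a rough sketch of that stepping-through, and nothing more than that, here
is a position-by-position comparison of two lists. It assumes the paragraph
text has already been pulled out of each document as Strings (with HWPF that
would presumably be done through Range.numParagraphs(), Range.getParagraph()
and text()); the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: compares two documents that have already been
// reduced to lists of paragraph text.
class PositionalCompare {

    /**
     * Returns the indices at which the two lists disagree, including
     * positions that exist in only one of the lists.
     */
    static List<Integer> differingPositions(List<String> docOne, List<String> docTwo) {
        List<Integer> differences = new ArrayList<>();
        int longest = Math.max(docOne.size(), docTwo.size());
        for (int i = 0; i < longest; i++) {
            // null stands for "no element at this position"
            String a = i < docOne.size() ? docOne.get(i) : null;
            String b = i < docTwo.size() ? docTwo.get(i) : null;
            if (a == null || !a.equals(b)) {
                differences.add(i);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        List<String> one = List.of("Intro", "Body", "Summary");
        List<String> two = List.of("Intro", "Body changed", "Summary", "Appendix");
        System.out.println(differingPositions(one, two)); // prints [1, 3]
    }
}
```

Note that a purely positional walk like this reports every following element
as different once something has been inserted or removed, so it is only a
starting point.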

One of the bits of code you need to find - and I will search for it too, as I
am certain it is somewhere on my PC - is able to recover the tables from a
document in-line. By this, I mean that it is able to recover them in their
correct location with regard to the other paragraphs. That way, you will be
able to test whether a table has moved, whether a paragraph has been inserted
before the table, and so on. The code, if I remember correctly, simply steps
through all of the paragraphs recovered from the document and calls the
isInTable() method on each. If this method returns 'true', then it is possible
to get the table that contains the paragraph. Once you have the table, it can
be stored in the array as well.
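
To illustrate the idea without depending on the code I have yet to find, the
sketch below uses a stand-in Para class (hypothetical, not part of POI) that
carries the paragraph text and the value isInTable() would have returned. Runs
of in-table paragraphs are collapsed into a single element so that the table
keeps its place among the ordinary paragraphs. With HWPF I would expect the
real table to come from Range.getTable(Paragraph) rather than from grouping
flags, so treat the grouping rule as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class InlineTables {

    /** Hypothetical stand-in for an HWPF Paragraph: text plus the isInTable() result. */
    static final class Para {
        final String text;
        final boolean inTable;

        Para(String text, boolean inTable) {
            this.text = text;
            this.inTable = inTable;
        }
    }

    /**
     * Walks the paragraphs in document order. Ordinary paragraphs are copied
     * across as they are; each run of consecutive in-table paragraphs becomes
     * one "TABLE[...]" element at the position where the run starts.
     */
    static List<String> toElements(List<Para> paragraphs) {
        List<String> elements = new ArrayList<>();
        int i = 0;
        while (i < paragraphs.size()) {
            Para p = paragraphs.get(i);
            if (!p.inTable) {
                elements.add(p.text);
                i++;
                continue;
            }
            // Gather every paragraph belonging to this run of table content.
            List<String> cells = new ArrayList<>();
            while (i < paragraphs.size() && paragraphs.get(i).inTable) {
                cells.add(paragraphs.get(i).text);
                i++;
            }
            elements.add("TABLE" + cells);
        }
        return elements;
    }

    public static void main(String[] args) {
        List<Para> doc = List.of(
                new Para("Intro", false),
                new Para("A1", true),
                new Para("B1", true),
                new Para("Outro", false));
        System.out.println(toElements(doc)); // prints [Intro, TABLE[A1, B1], Outro]
    }
}
```

One limitation of grouping by flag alone is that two tables with no paragraph
between them would be merged into one element.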

I should have the time to put something together this evening; that is, some
code that creates these arrays from Word documents. I will post to the list
later if I have any success. Also, I will try to put together some code that
creates the 'output' document. It should be quite straightforward to do once
we have decided what needs to be recorded; by this I mean insertions,
deletions, modifications, translations (moving the same element to a new
location within the document), changes to the number of rows in a table,
changes to the number of columns in a table, new values added to new cells
and existing values changed in existing cells.
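
For what it is worth, here is one possible way of sorting elements into
insertions, deletions and translations once both lists exist. It is only a
sketch under my own assumptions: the elements are plain Strings, the matching
uses a longest common subsequence (my choice of technique, nothing POI
provides), and the report labels are invented. A modified paragraph is not
recognised as such; it shows up as a deletion plus an insertion, and repeated
identical paragraphs are not handled carefully.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ChangeClassifier {

    static List<String> classify(List<String> oldDoc, List<String> newDoc) {
        int n = oldDoc.size();
        int m = newDoc.size();

        // lcs[i][j] = length of the longest common subsequence of
        // oldDoc[i..] and newDoc[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = oldDoc.get(i).equals(newDoc.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Elements on the common subsequence are unchanged; collect the rest.
        List<String> removed = new ArrayList<>();
        List<String> added = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (oldDoc.get(i).equals(newDoc.get(j))) {
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                removed.add(oldDoc.get(i++));
            } else {
                added.add(newDoc.get(j++));
            }
        }
        while (i < n) {
            removed.add(oldDoc.get(i++));
        }
        while (j < m) {
            added.add(newDoc.get(j++));
        }

        // An element that left one place and turned up in another has moved.
        Set<String> addedSet = new HashSet<>(added);
        Set<String> removedSet = new HashSet<>(removed);
        List<String> report = new ArrayList<>();
        for (String s : removed) {
            report.add((addedSet.contains(s) ? "MOVED: " : "DELETED: ") + s);
        }
        for (String s : added) {
            if (!removedSet.contains(s)) {
                report.add("INSERTED: " + s);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> before = List.of("A", "B", "C", "D");
        List<String> after = List.of("B", "C", "A", "E");
        // prints [MOVED: A, DELETED: D, INSERTED: E]
        System.out.println(classify(before, after));
    }
}
```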

Good luck.

Yours

Mark B


bihag wrote:
> 
> Hi,
> 
> Thanks Mark, I think first we have to create some algorithm ... I already
> tried some code that will compare two paragraphs, but the problem is as you
> wrote ... paragraph one can become paragraph three in the other document,
> and then it creates a problem ...
> 
> I will go through the API ... let's see if I find any clue ...
> 
> Anyway thanks and take care ...
> 
> Regards,
> Bihag Raval.
> 
> 
> MSB wrote:
>> 
>> That should be do-able, with one caveat: images are the only thing I
>> am unsure about at this point. The complicating factor will be the depth
>> of the comparison. Your first task IMO would be to decide exactly how the
>> comparison should proceed; to sketch out an algorithm that will determine
>> 'differentness'. Imagine that we have the first paragraph from document
>> one, what should we compare it to in document two? Should we only compare
>> corresponding paragraphs, i.e. only compare paragraph one in document one
>> with paragraph one in document two? What happens if a new paragraph was
>> inserted into document two so that now paragraph one in document one
>> matches paragraph two in document two?
>> 
>> If you have a good search around the list, there is code that
>> demonstrates how to extract the text from a document along with the
>> tables. I am guessing that you will need to get at the tables 'in line'
>> so to speak as a change in the position of the table within the document
>> will be a change as far as your algorithm is concerned. At this time, I
>> cannot offer to help any further as I am about to leave for the 'office'
>> - a damp, rainy nature reserve in actuality. If I have the time tonight,
>> I will try to put something together but would suggest that you search
>> through the posts to the list to track down some code that will allow you
>> to get at the document's contents as a starting point; I am confident that
>> there is code there that demonstrates how to get at the tables' contents
>> in-line. As always though, I cannot promise anything - I am grappling
>> with other Word 'issues' that are absorbing quite a bit of time - but
>> will help out where I can. Finally, XSSF is still a bit of a mystery to
>> me; I have not done any 'real' work with the API.
>> 
>> Yours
>> 
>> Mark B
>> 
>> 
>> bihag wrote:
>>> 
>>> Hi Mark,
>>> 
>>> Thanks for the reply ... 
>>> 
>>> What I want is to compare two versions of the same document and note any
>>> changes that have been made.
>>> If I can get image and table changes, that's really great ... but if I can
>>> only get text changes, that's more than enough for the current requirement ... 
>>> 
>>> What I will do is pass the 2 documents to a function; that function
>>> should create a new document with the content of both files and the
>>> changes, like MS Word does with the compare option in its menu.
>>> 
>>> ex. 
>>> 
>>> File A.doc contains:- The brown fox jumps from lazy dog.
>>> File B.doc contains:- The fox jumps from lazy donkey.
>>> 
>>> File generated after compare A.doc and B.doc contains :- this image file
>>> 
>>>  http://www.nabble.com/file/p24674962/compare.jpg 
>>> 
>>> 
>>> MSB wrote:
>>>> 
>>>> This could very well be possible; I have certainly had some success
>>>> creating new Word documents using the API. Merging one document into
>>>> another is more tricky and not something I would try to do myself with
>>>> the API just yet. The first thing to be clear on is exactly how you
>>>> wish to compare the documents. Are you saying that you want to compare
>>>> two versions of the same document and note any changes that have been
>>>> made? Are you looking just at the text and not at any formatting
>>>> applied to the text?
>>>> 
>>>> If so, then you could use the WordExtractor class to get at the text of
>>>> the two documents. This class can return an array of String(s) where
>>>> each element maps to a paragraph (I think) in the source document.
>>>> Next, you could compare the elements within the arrays to determine if
>>>> a paragraph had been deleted, added, moved, modified, etc. If you found
>>>> a difference and identified what it was, then that paragraph could be
>>>> written away to a new 'results' document. To be completely honest, I have
>>>> never tried to do much work with the formatting of the text and I
>>>> cannot claim sole authorship of this code because I got a start from an
>>>> example I found on the 'net. Anyway, here is some very simple code to
>>>> create a Word document;
>>>> 
>>>> POIFSFileSystem fs = new POIFSFileSystem(
>>>>     new FileInputStream("..empty file.."));
>>>> HWPFDocument doc = new HWPFDocument(fs);
>>>> // centered paragraph with large font size
>>>> Range range = doc.getRange();
>>>> Paragraph par1 = range.insertAfter(new ParagraphProperties(), 0);
>>>> par1.setSpacingAfter(200);
>>>> // justification: 0=left, 1=center, 2=right, 3=left and right
>>>> par1.setJustification((byte) 1);
>>>> 
>>>> 
>>>> CharacterRun run1 = par1.insertAfter("one");
>>>> run1.setFontSize(2 * 18);
>>>> 
>>>> // paragraph with bold typeface
>>>> Paragraph par2 = run1.insertAfter(new ParagraphProperties(), 0);
>>>> par2.setSpacingAfter(200);
>>>> CharacterRun run2 = par2.insertAfter(
>>>>     "two two two two two two two two two two two two two");
>>>> run2.setBold(true);
>>>> 
>>>> // paragraph with italic typeface and a line indent in the first line
>>>> Paragraph par3 = run2.insertAfter(new ParagraphProperties(), 0);
>>>> par3.setFirstLineIndent(200);
>>>> par3.setSpacingAfter(200);
>>>> CharacterRun run3 = par3.insertAfter(
>>>>     "three three three three three three three three three "
>>>>     + "three three three three three three three "
>>>>     + "three three three three three three three "
>>>>     + "three three three three three three three "
>>>>     + "three three three three three three three");
>>>> run3.setItalic(true);
>>>> 
>>>> // add a custom document property (needs POI 3.5; POI 3.2 doesn't save
>>>> // custom properties)
>>>> DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
>>>> CustomProperties cp = dsi.getCustomProperties();
>>>> if (cp == null) {
>>>>     cp = new CustomProperties();
>>>> }
>>>> cp.put("myProperty", "prop prop prop");
>>>> dsi.setCustomProperties(cp);
>>>> 
>>>> doc.write(new FileOutputStream("..final file.."));
>>>> 
>>>> The key wrinkle is that HWPF cannot actually create a new, empty Word
>>>> document; you will need to use Word itself to create a new file that
>>>> can be used as the input to this process - I have called it the empty
>>>> file in the code above. All you need to do is open Word, select
>>>> New->Document and then save this away. Use this empty file as the input
>>>> to the process and you should be away.
>>>> 
>>>> There is a setColor() method defined on the CharacterRun class but I
>>>> have never used it myself. The only advice I can offer is to play with
>>>> it and see what the effect is on a simple bit of code such as this one.
>>>> You will have access to the usual effects such as strikethrough again
>>>> using the CharacterRun class.
>>>> 
>>>> Yours
>>>> 
>>>> Mark B
>>>> 
>>>> 
>>>> bihag wrote:
>>>>> 
>>>>> Hi All,
>>>>> 
>>>>> We want to compare two documents, and whatever things are not common
>>>>> we have to highlight with some color or in some other way ... So I
>>>>> think we have to merge the documents or create a new document which has
>>>>> the content of both documents, and show the differences with some color,
>>>>> like deleted in red, newly added in blue ... 
>>>>> 
>>>>> Mainly we are looking for an OLE2CDF doc compare solution ...
>>>>> 
>>>>> please provide some code snippet if possible ...
>>>>> 
>>>>> Thanking you in advance ...
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-compare-2-word-doc-%28OLE2CDF-or-OpenXML%29.-tp24673506p24682849.html
Sent from the POI - Dev mailing list archive at Nabble.com.

