I am working on Bug 61787, where documents containing rsidDel=000000 do not
extract the correct text. The issue is that the rsidxxx attributes are only
there to indicate which revision a particular change belongs to; they do
not necessarily indicate that a particular revision actually occurred. Bug
58067 corrected an issue where deleted text was being returned by the
XWPFParagraph.getText() method. Unfortunately, the patch for 58067 keyed
on the rsidDel attribute rather than on the delText tag, which is what
specifically marks text as deleted.
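
To illustrate the distinction, here is a minimal sketch that uses only the
JDK XML parser, not the actual POI code. The class name, the extract()
helper and the sample fragment are hypothetical; the fragment is a
hand-written paragraph in which one run carries rsidDel but holds live text
in a t element, while the genuinely deleted text sits in a delText element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DelTextSketch {
    // WordprocessingML main namespace
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical paragraph: the first run has rsidDel but its text is NOT deleted.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r w:rsidDel=\"000000\"><w:t>kept </w:t></w:r>"
        + "<w:del><w:r><w:delText>removed </w:delText></w:r></w:del>"
        + "<w:r><w:t>text</w:t></w:r>"
        + "</w:p>";

    // Collects text by element name only; the rsidDel attribute is ignored.
    static String extract(String xml, boolean includeDeleted) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            // "*" matches every element in the namespace, in document order
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "*");
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                String name = node.getLocalName();
                boolean live = "t".equals(name);
                boolean deleted = "delText".equals(name);
                if (live || (includeDeleted && deleted)) {
                    sb.append(node.getTextContent());
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, false)); // kept text
        System.out.println(extract(SAMPLE, true));  // kept removed text
    }
}
```

The run with rsidDel still contributes "kept " in both modes, because only
the element name decides whether the text counts as deleted.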

I have corrected this in XWPFParagraph and XWPFRun. Now a test on document
Tika-792 is failing because it expects getText() to return deleted text.
So what do you want to do?

The options as I see them are:

   1. Allow getText() to return all text, even deleted text, then add
   another method that returns only undeleted text.
   2. Change getText() to return only undeleted text, and add another
   method to retrieve all text.

I prefer the second option, and I suspect that the Tika test is not
particularly valid, as its comment says that its purpose is to include
CTBookmark classes in ooxmlLite. Tim, do you have a preference here, since
my change will likely affect you the most?
