https://bz.apache.org/bugzilla/show_bug.cgi?id=64418

j-lawyer.org <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #6 from j-lawyer.org <[email protected]> ---
Well, I would love to get rid of the expensive XML handling - however, I do not
see how I could avoid it given POI's API.

Is there an alternative approach for "getting all text content of text fields /
text boxes"?

Even Apache Tika seems to use the exact same approach in their
XWPFWordExtractorDecorator.java:

    // Also extract any paragraphs embedded in text boxes
    // Note "w:txbxContent//"...must look for all descendant paragraphs
    // not just the immediate children of txbxContent -- TIKA-2807
    if (config.getIncludeShapeBasedContent()) {
        for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath(
                "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' .//*/wps:txbx/w:txbxContent//w:p")) {
            extractParagraph(
                    new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()), paragraph.getBody()),
                    listManager, xhtml);
        }
    }
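
To make the XPath part of that approach concrete, here is a minimal sketch
that runs the same kind of path expression with plain JDK classes only. It
does not use POI or XMLBeans, so it says nothing about their cost; the class
name, the method name and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI or Tika.
class TextBoxTextSketch {
    static final String W =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String WPS =
            "http://schemas.microsoft.com/office/word/2010/wordprocessingShape";

    // Returns the text of every w:p found anywhere below wps:txbx/w:txbxContent.
    static List<String> textBoxParagraphs(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for prefixed XPath steps
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                if ("w".equals(prefix)) return W;
                if ("wps".equals(prefix)) return WPS;
                return XMLConstants.NULL_NS_URI;
            }
            // Reverse lookups are not needed for evaluating this expression.
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        // Descendant axis after txbxContent, as in the Tika comment above.
        NodeList paragraphs = (NodeList) xpath.evaluate(
                "//wps:txbx/w:txbxContent//w:p", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            // Concatenate all w:t runs below this paragraph.
            NodeList runs = (NodeList) xpath.evaluate(
                    ".//w:t", paragraphs.item(i), XPathConstants.NODESET);
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            result.add(text.toString());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Simplified sample: one body paragraph containing one text box.
        String xml =
                "<w:document xmlns:w='" + W + "' xmlns:wps='" + WPS + "'>"
                + "<w:body><w:p><w:r><w:t>Body text</w:t></w:r>"
                + "<wps:txbx><w:txbxContent><w:p>"
                + "<w:r><w:t>Hello </w:t></w:r><w:r><w:t>box</w:t></w:r>"
                + "</w:p></w:txbxContent></wps:txbx>"
                + "</w:p></w:body></w:document>";
        System.out.println(textBoxParagraphs(xml));
    }
}
```

The sample is heavily simplified compared to a real document.xml, where the
text box sits inside drawing markup; only the last steps of the path matter
here.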


Am I missing something?

Thanks,
Jens
