[ https://issues.apache.org/jira/browse/TIKA-195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-195.
-----------------------------

    Resolution: Later

I believe that all text from Word files is now extracted, and has been for at least a little while now.

If there's a case where document text isn't being extracted, please re-open the bug and attach a problematic file.

> MSWORD: Tika ignores text from Pieces
> -------------------------------------
>
>                 Key: TIKA-195
>                 URL: https://issues.apache.org/jira/browse/TIKA-195
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>
> If a Word document contains text which is not in paragraphs, but rather in
> some frames, the text is ignored.
> The following code extracts ALL text; however, I am not sure how it fits the
> Paragraphs model used by Tika:
>     HWPFDocument doc = new HWPFDocument(filesystem);
>     List textPieces = doc.getTextTable().getTextPieces();
>     for (Object o : textPieces) {
>         TextPiece piece = (TextPiece) o;
>         xhtml.element("p", piece.getStringBuffer().toString());
>     }