[ https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960546#comment-16960546 ]
ASF GitHub Bot commented on TIKA-2900:
--------------------------------------

dameikle commented on pull request #294: TIKA-2900: Add ability to exclude comments in Word extraction
URL: https://github.com/apache/tika/pull/294

Adds an option in OfficeParserConfig that lets comments be explicitly included in or excluded from Word extractions. Comments remain included by default for backwards compatibility.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Removing comments from *.docx, *.pdf files
> ------------------------------------------
>
>                 Key: TIKA-2900
>                 URL: https://issues.apache.org/jira/browse/TIKA-2900
>             Project: Tika
>          Issue Type: Wish
>          Components: app, example
>   Affects Versions: 1.21
>            Reporter: Md
>            Assignee: Dave Meikle
>            Priority: Major
>         Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx,
>                      Document_with_Comments_Text_extarction_Tika_APP.docx.txt
>
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files.
> Sometimes a file contains comments, and Tika extracts them and appends them
> at the end of the output. Is there a way to exclude comments when extracting
> text?
> Here is the code I am using:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
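
For readers following the thread, the behaviour the pull request describes (an include/exclude switch for comments that defaults to "included") can be sketched without Tika on the classpath. Everything below is a hypothetical stand-in: `CommentConfig`, `Run` and `extract` are illustrative names only, not Tika's API, and the actual option name on OfficeParserConfig should be taken from the pull request itself.

```java
import java.util.Arrays;
import java.util.List;

public class CommentFilterSketch {

    // Hypothetical stand-in for a parser config; NOT Tika's OfficeParserConfig.
    static class CommentConfig {
        // Defaults to true so callers that never touch the flag still get comments.
        private boolean includeComments = true;

        boolean isIncludeComments() {
            return includeComments;
        }

        void setIncludeComments(boolean includeComments) {
            this.includeComments = includeComments;
        }
    }

    // Hypothetical unit of document content: either body text or a reviewer comment.
    static class Run {
        final String text;
        final boolean isComment;

        Run(String text, boolean isComment) {
            this.text = text;
            this.isComment = isComment;
        }
    }

    // Joins the runs with newlines, skipping comment runs when the flag is off.
    static String extract(List<Run> runs, CommentConfig config) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            if (run.isComment && !config.isIncludeComments()) {
                continue;
            }
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(run.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Body text", false),
                new Run("Reviewer note", true));

        CommentConfig config = new CommentConfig();
        System.out.println(extract(runs, config)); // default: comments kept

        config.setIncludeComments(false);
        System.out.println(extract(runs, config)); // opt-out: comments dropped
    }
}
```

Keeping the default at "included" means existing callers see no change in output; only code that explicitly switches the flag off gets comment-free text.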