[ https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dave Meikle reassigned TIKA-2900:
---------------------------------

    Assignee: Dave Meikle

> Removing comments from *.docx, *.pdf files
> ------------------------------------------
>
>                 Key: TIKA-2900
>                 URL: https://issues.apache.org/jira/browse/TIKA-2900
>             Project: Tika
>          Issue Type: Wish
>          Components: app, example
>    Affects Versions: 1.21
>            Reporter: Md
>            Assignee: Dave Meikle
>            Priority: Major
>         Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx,
>                      Document_with_Comments_Text_extarction_Tika_APP.docx.txt
>
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files.
> Sometimes a file contains comments, and Tika extracts them and appends them
> at the end of the output. Is there a way to exclude comments when extracting
> text?
> Here is the code I am using:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> // Emit HTML for each document; -1 means no write limit
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> // Configure the Office parsers to skip tracked changes, headers and footers
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)