Md created TIKA-2900:
------------------------

             Summary: Removing comments from *.docx, *.pdf files
                 Key: TIKA-2900
                 URL: https://issues.apache.org/jira/browse/TIKA-2900
             Project: Tika
          Issue Type: Wish
          Components: app, example
    Affects Versions: 1.21
            Reporter: Md

Hello,

I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Sometimes a file contains comments, and Tika extracts them and appends them to the end of the extracted text. Is there a way to exclude comments when extracting text?

Here is the code I am using:

```
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.apache.tika.sax.ContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
// HTML output; a write limit of -1 means no limit
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
```

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)