[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382220#comment-16382220 ]
Md commented on TIKA-2593:
--------------------------

I would like to do a few things:
* exclude comments
* possibly exclude headers and footers (working fine)
* exclude deleted content (I found a way, and it is working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using it. Currently, here is what I am doing:

{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler=new XHTMLContentHandler(handler,metadata);
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}

I am getting the following output:

{code:java}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2018-02-28T18:33:00Z" />
. .
<meta name >
<title></title>
</head>
<body><html>
<meta name="date" content="2018-02-28T18:33:00Z" />
. .
<title></title>
<body><p> this is test. </p>
<p> MORE TEXT </p>
<p class="annotation_text"> Please be more specific. Testing points for what? </p>
<p class="annotation_text"> Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text.</p>
<p class="annotation_text"> Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text.</p>
<p class="annotation_text"> Please be more specific. What was the previous item? </p>
<p class="annotation_text">Version of what?
</p>
<p class="annotation_text">How exactly does this benefit Ontario? </p>
</body></html></body></html>
{code}

The comments are coming through as

{code:java}
<p class="annotation_text">
{code}

I can use a regular expression to remove the comments from this output, but I would like to know whether there is any way to tell Tika to exclude "annotation_text" from extraction.

Thanks

> docx with track change producing incorrect output
> -------------------------------------------------
>
>                 Key: TIKA-2593
>                 URL: https://issues.apache.org/jira/browse/TIKA-2593
>             Project: Tika
>          Issue Type: Bug
>          Components: core, handler
>    Affects Versions: 1.17
>            Reporter: Md
>            Priority: Major
>         Attachments: sample.docx
>
>
> I am using following code to extract text from docx file
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I am sending track revised files it's adding all the text deleted with
> the actual text and inserted text. Is there a way to tell parser to exclude
> the deleted text?
> Here is an example
> input Text: This is a sample text. -This part will- be deleted. +This is
> inserted.+
> outputText: This is a sample text. This part will be deleted. This is
> inserted.
> Desired output: This is a sample text. be deleted. This is inserted.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
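
As an alternative to the regular expression mentioned in the comment, the annotation paragraphs could be dropped at the SAX level before they reach the output handler. The following is only a sketch under stated assumptions: the class names AnnotationTextFilter and AnnotationFilterDemo are hypothetical, it relies only on the JDK's org.xml.sax classes, and it assumes that comments always arrive as elements carrying class="annotation_text", as in the output above. It is not a built-in Tika option.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: swallows every element whose class attribute is "annotation_text". */
class AnnotationTextFilter extends XMLFilterImpl {
    // Greater than zero while we are inside a skipped element (counts nesting).
    private int skipDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || "annotation_text".equals(atts.getValue("class"))) {
            skipDepth++;          // start skipping, or go one level deeper
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--;          // leaving a skipped element
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.ignorableWhitespace(ch, start, length);
        }
    }
}

/** Hypothetical demo: runs an XHTML string through the filter and collects the remaining text. */
class AnnotationFilterDemo {
    static String extractText(String xhtml) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            AnnotationTextFilter filter = new AnnotationTextFilter();
            // The filter forwards every event it does not swallow to this collector.
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.setContentHandler(filter);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return text.toString();
    }
}
```

Because XMLFilterImpl implements ContentHandler, the same filter could presumably be placed in front of a ToXMLContentHandler and passed to parser.parse(...) in place of the handler used above. That wiring is an assumption and has not been checked against Tika 1.17.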