[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382167#comment-16382167 ]

Md commented on TIKA-2593:
--------------------------

No, deleted content is not shown if we set

officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);

I also wanted to exclude shape-based content with

officeParserConfig.setIncludeShapeBasedContent(false);

but that is not working.

Also, is there any way I can exclude comments in a docx?

Thanks so much for your help.

> docx with track change producing incorrect output
> -------------------------------------------------
>
>                 Key: TIKA-2593
>                 URL: https://issues.apache.org/jira/browse/TIKA-2593
>             Project: Tika
>          Issue Type: Bug
>          Components: core, handler
>    Affects Versions: 1.17
>            Reporter: Md
>            Priority: Major
>         Attachments: sample.docx
>
>
> I am using the following code to extract text from a docx file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> // The ParseContext carries per-parse configuration to the parser.
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> // Ask the parser to leave out text removed under track changes.
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I send files with tracked revisions, the output contains the deleted
> text along with the original text and the inserted text. Is there a way to
> tell the parser to exclude the deleted text?
> Here is an example
> input Text: This is a sample text. -This part will- be deleted. +This is
> inserted.+
> outputText: This is a sample text. This part will be deleted. This is
> inserted.
> Desired output: This is a sample text. be deleted. This is inserted.



--
This message was sent by Atlassian JIRA (v7.6.3#76005)
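
For readers following the example in the issue description, here is a rough sketch of why a SAX-based extractor can leave out deleted text. It is not Tika's implementation. It assumes only that WordprocessingML stores a tracked deletion as `w:del` wrapping runs whose text sits in `w:delText`, while kept and inserted text sits in `w:t`. The class name `TrackedChangeSketch`, the `extract` method, and the hand-written `SAMPLE` fragment are hypothetical and exist only for illustration; the fragment is modelled on the example sentence, not taken from sample.docx.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a hypothetical stand-in, not Tika's docx extractor.
class TrackedChangeSketch {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written paragraph modelled on the example sentence in the issue.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
        + "<w:del><w:r><w:delText xml:space=\"preserve\">This part will </w:delText></w:r></w:del>"
        + "<w:r><w:t xml:space=\"preserve\">be deleted. </w:t></w:r>"
        + "<w:ins><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
        + "</w:p>";

    // Collects text from w:t elements, and from w:delText only when asked to.
    static String extract(String xml, boolean includeDeleted) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean capture = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                capture = "t".equals(localName)
                    || (includeDeleted && "delText".equals(localName));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                capture = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (capture) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the w: namespace URI
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, true));  // deleted text kept
        System.out.println(extract(SAMPLE, false)); // deleted text skipped
    }
}
```

With `includeDeleted` set to `true` the sketch reproduces the "outputText" line from the issue, and with `false` it reproduces the "Desired output" line. Comments and shape-based content live in other parts of the docx package, so this sketch says nothing about those two questions.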