[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382220#comment-16382220 ]

Md edited comment on TIKA-2593 at 3/1/18 4:51 PM:
--------------------------------------------------

I would like to do a few things:
* exclude comments
* possibly exclude header and footer (working fine)
* exclude deleted content (I found a way and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using it for now.

Here is what I am doing:
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
http://www.w3.org/1999/xhtml;> . . . . this is test. MORE TEXT Please be more specific. Testing points for what? Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Please be more specific. What was the previous item? Version of what? How exactly does this benefit Ontario?
{code}
The comments are coming as. So there are 6 comments above
{code:java}
{code}
I can use a regular expression to remove the comments from here.
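The regular-expression cleanup mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the markup around each comment was lost from the output shown in this message, so the sketch assumes every comment is emitted as a single, non-nested {{div}} or {{p}} element whose class attribute is exactly "annotation_text". The class name {{AnnotationStripper}} and the method {{strip}} are hypothetical and are not part of Tika.

```java
import java.util.regex.Pattern;

public class AnnotationStripper {
    // Assumption: each comment is one non-nested <div> or <p> element with
    // class="annotation_text" as its first attribute. Nested elements of the
    // same name inside a comment would defeat this pattern.
    private static final Pattern ANNOTATION = Pattern.compile(
            "<(div|p)\\s+class=\"annotation_text\"[^>]*>.*?</\\1>\\s*",
            Pattern.DOTALL);

    /** Returns the XHTML string with every matching comment element removed. */
    public static String strip(String xhtml) {
        return ANNOTATION.matcher(xhtml).replaceAll("");
    }

    public static void main(String[] args) {
        String xhtml = "<p>this is test.</p>"
                + "<div class=\"annotation_text\">Please be more specific.</div>"
                + "<p>MORE TEXT</p>";
        System.out.println(strip(xhtml));
    }
}
```

A regular expression is fragile against nested or reordered markup, so filtering the SAX events before they are serialized would likely be more robust if the real output turns out to be more complex than assumed here.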
I am interested to know whether there is any way I can tell/configure Tika to exclude "annotation_text" from extraction.

Thanks


was (Author: mdasadul):
I would like to do few things
* exclude comments
* possibly exclude header and footer (Working fine)
* Exclude deleted contents(I find a way and it's working)

Please ignore my comment about RecursiveParserWrapper as I will not be using that currently

here is what I am doing
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler=new XHTMLContentHandler(handler,metadata);
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting following output
{code:java}
http://www.w3.org/1999/xhtml;> . . . . this is test. MORE TEXT Please be more specific. Testing points for what? Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Please be more specific. What was the previous item? Version of what? How exactly does this benefit Ontario?
{code}
The comments are coming as. So there are 6 comments above
{code:java}
{code}
I can use regular expression to remove comments from here.

I am interested to know is there anyway I can tell tika to exclude "annotation_text" from extraction

Thanks

> docx with track change producing incorrect output
> -------------------------------------------------
>
> Key: TIKA-2593
> URL: https://issues.apache.org/jira/browse/TIKA-2593
> Project: Tika
> Issue Type: Bug
> Components: core, handler
> Affects Versions: 1.17
> Reporter: Md
> Priority: Major
> Attachments: sample.docx
>
>
> I am using following code to extract text from docx file
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata,
[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382220#comment-16382220 ]

Md edited comment on TIKA-2593 at 3/1/18 4:11 PM:
--------------------------------------------------

I would like to do a few things:
* exclude comments
* possibly exclude header and footer (working fine)
* exclude deleted content (I found a way and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using it for now.

Here is what I am doing:
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
http://www.w3.org/1999/xhtml;> . . . . this is test. MORE TEXT Please be more specific. Testing points for what? Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Please be more specific. What was the previous item? Version of what? How exactly does this benefit Ontario?
{code}
The comments are coming as. So there are 6 comments above
{code:java}
{code}
I can use a regular expression to remove the comments from here.
I am interested to know whether there is any way I can tell Tika to exclude "annotation_text" from extraction.

Thanks


was (Author: mdasadul):
I would like to do few things
* exclude comments
* possibly exclude header and footer (Working fine)
* Exclude deleted contents(I find a way and it's working)

Please ignore my comment about RecursiveParserWrapper as I will not be using that currently

here is what I am doing
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler=new XHTMLContentHandler(handler,metadata);
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting following output
{code:java}
http://www.w3.org/1999/xhtml;> . . . . this is test. MORE TEXT Please be more specific. Testing points for what? Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text. Please be more specific. What was the previous item? Version of what? How exactly does this benefit Ontario?
{code}
The comments are coming as
{code:java}
{code}
I can use regular expression to remove comments from here.

I am interested to know is there anyway I can tell tika to exclude "annotation_text" from extraction

Thanks

> docx with track change producing incorrect output
> -------------------------------------------------
>
> Key: TIKA-2593
> URL: https://issues.apache.org/jira/browse/TIKA-2593
> Project: Tika
> Issue Type: Bug
> Components: core, handler
> Affects Versions: 1.17
> Reporter: Md
> Priority: Major
> Attachments: sample.docx
>
>
> I am using following code to extract text from docx file
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
>
[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382167#comment-16382167 ]

Md edited comment on TIKA-2593 at 3/1/18 3:33 PM:
--------------------------------------------------

No, deleted content is not showing if we do
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);

I wanted to exclude shape-based content with
officeParserConfig.setIncludeShapeBasedContent(false);
but that is not working.

Also, is there any way I can exclude comments in docx during Tika extraction?

Thanks so much for your help


was (Author: mdasadul):
No deleted content is not showing if we do
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);

I wanted to exclude shape based content by
officeParserConfig.setIncludeShapeBasedContent(false);
but not working

Also Is there anyway I can exclude comments in docx?

Thanks so much for your help

> docx with track change producing incorrect output
> -------------------------------------------------
>
> Key: TIKA-2593
> URL: https://issues.apache.org/jira/browse/TIKA-2593
> Project: Tika
> Issue Type: Bug
> Components: core, handler
> Affects Versions: 1.17
> Reporter: Md
> Priority: Major
> Attachments: sample.docx
>
>
> I am using following code to extract text from docx file
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I am sending track revised files it's adding all the text deleted with
> the actual text and inserted text. Is there a way to tell parser to exclude
> the deleted text?
> Here is an example
> input Text: This is a sample text. -This part will- be deleted. +This is
> inserted.+
> outputText: This is a sample text. This part will be deleted. This is
> inserted.
> Desired output: This is a sample text. be deleted. This is inserted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output
[ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382125#comment-16382125 ]

Tim Allison edited comment on TIKA-2593 at 3/1/18 3:09 PM:
-----------------------------------------------------------

bq. I think I did figure it out. I need to set officeParserConfig.setUseSAXDocxExtractor(true);

Sorry for not responding sooner...IIRC, we can't yet remove deleted contents with our regular DOM parser, so you do have to use the SAXDocx parser.

bq. But still doesn't work for officeParserConfig.setIncludeShapeBasedContent(false);

If {{setIncludeShapeBasedContent}} is set to false, are you saying that deleted content comes through?!


was (Author: talli...@mitre.org):
bq. I think I did figure it out. I need to set officeParserConfig.setUseSAXDocxExtractor(true);

Sorry for not responsding...IIRC, we can't yet remove deleted contents with our regular DOM parser, so you do have to use the SAXDocx parser.

bq. But still doesn't work for officeParserConfig.setIncludeShapeBasedContent(false);

If {{setIncludeShapeBasedContent}} is set to false, are you saying that deleted content comes through?!
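To make "deleted content" concrete, here is a minimal JDK-only sketch of how tracked deletions can be told apart while streaming a docx body with SAX. It is not Tika's implementation. It relies only on the WordprocessingML convention that deleted runs carry their text in {{w:delText}} (inside {{w:del}}) while live and inserted runs use {{w:t}}. The class {{TrackedChangeText}}, the method {{extract}}, and the {{SAMPLE}} fragment (modelled on the sample sentence from the issue description) are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TrackedChangeText {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical fragment; revision attributes such as w:id are omitted.
    public static final String SAMPLE =
            "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
            + "<w:del><w:r><w:delText xml:space=\"preserve\">This part will </w:delText></w:r></w:del>"
            + "<w:r><w:t xml:space=\"preserve\">be deleted. </w:t></w:r>"
            + "<w:ins><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
            + "</w:p>";

    /** Collects run text, optionally including text from tracked deletions. */
    public static String extract(String xml, boolean includeDeleted) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean capture;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                // w:t holds live or inserted text; w:delText holds deleted text.
                capture = "t".equals(local) || (includeDeleted && "delText".equals(local));
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                capture = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (capture) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE, false));
        System.out.println(extract(SAMPLE, true));
    }
}
```

With {{includeDeleted}} set to false the sketch yields the issue's desired output for the sample sentence; with true it yields the text that includes the deleted phrase.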