[ https://issues.apache.org/jira/browse/TIKA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Md updated TIKA-2901:
---------------------
    Attachment: Chart_data_sample_text_possible_issue.docx.txt
                Chart_data_sample_text_possible_issue.docx

> Tika extracting points from Chart
> ----------------------------------
>
>                 Key: TIKA-2901
>                 URL: https://issues.apache.org/jira/browse/TIKA-2901
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.21
>            Reporter: Md
>            Priority: Major
>         Attachments: Chart_data_sample_text_possible_issue.docx,
>                      Chart_data_sample_text_possible_issue.docx.txt
>
>
> I am using Tika to extract content from *.docx and other files. I have noticed
> that Tika extracts the data points from charts and places them at the end of
> the extracted content.
> I am using the following code for extraction:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
> Please find attached the input file and the corresponding output from Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)