[jira] [Commented] (TIKA-2900) Removing comments from *.docx, *.pdf files

ASF GitHub Bot (Jira) Sun, 27 Oct 2019 02:25:26 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960546#comment-16960546
 ]


ASF GitHub Bot commented on TIKA-2900:
--------------------------------------

dameikle commented on pull request #294: TIKA-2900: Add ability to exclude 
comments in Word extraction
URL: https://github.com/apache/tika/pull/294
 
 
   Adds an option in OfficeParserConfig to allow comments to be explicitly 
included or excluded in Word extractions. The default remains as included for 
backwards compatibility.
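
As a rough illustration of the "default remains included" design, here is a minimal, self-contained sketch. It does not use Tika: `WordConfig`, `setIncludeComments` and `extract` are hypothetical names invented for this example, and the pull request's real method names are not shown in this message.

```java
import java.util.ArrayList;
import java.util.List;

public class CommentToggleSketch {

    /** Hypothetical stand-in for a parser config; NOT Tika's OfficeParserConfig. */
    static final class WordConfig {
        // Defaults to true so callers that never touch the flag see unchanged output.
        private boolean includeComments = true;

        boolean isIncludeComments() {
            return includeComments;
        }

        void setIncludeComments(boolean includeComments) {
            this.includeComments = includeComments;
        }
    }

    /** Joins body paragraphs, appending comments at the end only when the flag is set. */
    static String extract(List<String> body, List<String> comments, WordConfig config) {
        List<String> out = new ArrayList<>(body);
        if (config.isIncludeComments()) {
            out.addAll(comments);
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        List<String> body = new ArrayList<>();
        body.add("Body text");
        List<String> comments = new ArrayList<>();
        comments.add("Reviewer note");

        WordConfig config = new WordConfig();
        System.out.println(extract(body, comments, config)); // comments included by default

        config.setIncludeComments(false);
        System.out.println(extract(body, comments, config)); // comments excluded on request
    }
}
```

Keeping the flag's default at `true` means existing callers get the same output as before, and only callers that explicitly opt out lose the comment text.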
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Removing comments from *.docx, *.pdf files
> ------------------------------------------
>
>                 Key: TIKA-2900
>                 URL: https://issues.apache.org/jira/browse/TIKA-2900
>             Project: Tika
>          Issue Type: Wish
>          Components: app, example
>    Affects Versions: 1.21
>            Reporter: Md
>            Assignee: Dave Meikle
>            Priority: Major
>         Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx, 
> Document_with_Comments_Text_extarction_Tika_APP.docx.txt
>
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. 
> Sometimes a file contains comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments when extracting 
> text?
> Here is the code I am using:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
