[jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2900:
-
Attachment: Document_with_Comments_Text_extarction_Tika_APP.docx.txt

> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx, 
> Document_with_Comments_Text_extarction_Tika_APP.docx.txt
>
>
> Hello,
> I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
> Sometimes the files contain comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments during text 
> extraction? 
> Here is the code I am using:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
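In Tika 1.21, OfficeParserConfig exposes switches for deleted content and headers/footers, but apparently none for comments, so one workaround is to post-process the XHTML string the handler returns. A minimal sketch, assuming comments are emitted as `<div class="annotation_text">` blocks (the markup reported in TIKA-2593 below; verify against your own output first):

```java
import java.util.regex.Pattern;

public class CommentStripper {

    // Assumed markup: the SAX docx extractor wraps comment text in
    // <div class="annotation_text">...</div>; check your actual XHTML first.
    private static final Pattern ANNOTATION = Pattern.compile(
            "<div class=\"annotation_text\">.*?</div>", Pattern.DOTALL);

    /** Removes every annotation div from the extracted XHTML. */
    public static String stripComments(String xhtml) {
        return ANNOTATION.matcher(xhtml).replaceAll("");
    }

    public static void main(String[] args) {
        String xhtml = "<p>body text</p>"
                + "<div class=\"annotation_text\"><p>a comment</p></div>"
                + "<p>more text</p>";
        System.out.println(stripComments(xhtml));
    }
}
```

The non-greedy match breaks if an annotation div nests other divs; for anything beyond flat comment blocks, filter with a real XML parser instead of a regex.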



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TIKA-2901) Tika extracting points data from Chart

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2901:
-
Summary: Tika extracting points data from Chart   (was: Tika extracting 
points from Chart )

> Tika extracting points data from Chart 
> ---
>
> Key: TIKA-2901
> URL: https://issues.apache.org/jira/browse/TIKA-2901
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Chart_data_sample_text_possible_issue.docx, 
> Chart_data_sample_text_possible_issue.docx.txt
>
>
> I am using Tika to extract content from *.docx and other files. I have noticed 
> that Tika extracts the data points from charts and appends them to the end of 
> the output. 
> I am using the following code for extraction:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
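Chart text in the XHTML output is likewise easiest to drop after parsing, by skipping whole subtrees based on their `class` attribute. The class name used below is the `annotation_text` one documented elsewhere in this thread; whatever element wraps the chart data in your output is an assumption you should confirm by inspecting the XHTML (e.g. from a ToXMLContentHandler). A sketch using only the JDK's SAX parser:

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ClassFilter extends DefaultHandler {
    private final Set<String> skipClasses;
    private final StringBuilder out = new StringBuilder();
    private int skipDepth = 0;   // >0 while inside a skipped subtree

    public ClassFilter(Set<String> skipClasses) {
        this.skipClasses = skipClasses;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String cls = atts.getValue("class");
        if (skipDepth > 0 || (cls != null && skipClasses.contains(cls))) {
            skipDepth++;   // entering (or nested inside) a skipped element
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            out.append(ch, start, length);   // keep text outside skipped subtrees
        }
    }

    /** Returns the text content of xhtml minus subtrees whose class is in skipClasses. */
    public static String filter(String xhtml, Set<String> skipClasses) throws Exception {
        ClassFilter f = new ClassFilter(skipClasses);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), f);
        return f.out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<body><p>keep</p>"
                + "<div class=\"annotation_text\"><p>drop</p></div></body>";
        System.out.println(filter(xhtml, Set.of("annotation_text")));  // prints "keep"
    }
}
```

Because the filter works on nesting depth, it drops everything inside a matching element, including nested paragraphs, which a flat regex cannot do reliably.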
> Please find the attached input and output files from Tika. 





[jira] [Updated] (TIKA-2901) Tika extracting points from Chart

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2901:
-
Attachment: Chart_data_sample_text_possible_issue.docx.txt
Chart_data_sample_text_possible_issue.docx

> Tika extracting points from Chart 
> --
>
> Key: TIKA-2901
> URL: https://issues.apache.org/jira/browse/TIKA-2901
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Chart_data_sample_text_possible_issue.docx, 
> Chart_data_sample_text_possible_issue.docx.txt
>
>
> I am using Tika to extract content from *.docx and other files. I have noticed 
> that Tika extracts the data points from charts and appends them to the end of 
> the output. 
> I am using the following code for extraction:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
> Please find the attached input and output files from Tika. 





[jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2900:
-
Attachment: (was: Chart_data_sample_text_possible_issue.docx.txt)

> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx
>
>
> Hello,
> I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
> Sometimes the files contain comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments during text 
> extraction? 
> Here is the code I am using:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}





[jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2900:
-
Attachment: (was: Chart_data_sample_text_possible_issue.docx)

> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx
>
>
> Hello,
> I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
> Sometimes the files contain comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments during text 
> extraction? 
> Here is the code I am using:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}





[jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2900:
-
Attachment: Chart_data_sample_text_possible_issue.docx.txt
Chart_data_sample_text_possible_issue.docx

> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx
>
>
> Hello,
> I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
> Sometimes the files contain comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments during text 
> extraction? 
> Here is the code I am using:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}





[jira] [Created] (TIKA-2901) Tika extracting points from Chart

2019-07-08 Thread Md (JIRA)
Md created TIKA-2901:


 Summary: Tika extracting points from Chart 
 Key: TIKA-2901
 URL: https://issues.apache.org/jira/browse/TIKA-2901
 Project: Tika
  Issue Type: Bug
  Components: app
Affects Versions: 1.21
Reporter: Md


I am using Tika to extract content from *.docx and other files. I have noticed 
that Tika extracts the data points from charts and appends them to the end of 
the output. 
I am using the following code for extraction:
{code:java}
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(
        BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();

ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
{code}

Please find the attached input and output files from Tika. 





[jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2900:
-
Attachment: Document_with_Comments_Text_extarction_Tika_APP.docx

> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
> Attachments: Document_with_Comments_Text_extarction_Tika_APP.docx
>
>
> Hello,
> I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
> Sometimes the files contain comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments during text 
> extraction? 
> Here is the code I am using:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}





[jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2900:
-
Description: 
Hello,

I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
Sometimes the files contain comments, and Tika extracts them and appends them 
to the end of the output. Is there a way to exclude comments during text 
extraction?

Here is the code I am using:

{code:java}
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(
        BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();

ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
{code}

  was:
Hello,

I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
Sometimes the files contain comments, and Tika extracts them and appends them 
to the end of the output. Is there a way to exclude comments during text 
extraction?

Here is the code I am using:

```
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
```


> Removing comments from *.docx, *.pdf files
> --
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
>  Issue Type: Wish
>  Components: app, example
>Affects Versions: 1.21
>Reporter: Md
>Priority: Major
>
> Hello,
> I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
> Sometimes the files contain comments, and Tika extracts them and appends them 
> to the end of the output. Is there a way to exclude comments during text 
> extraction? 
> Here is the code I am using:
> {code:java}
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(
>         BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}





[jira] [Created] (TIKA-2900) Removing comments from *.docx, *.pdf files

2019-07-08 Thread Md (JIRA)
Md created TIKA-2900:


 Summary: Removing comments from *.docx, *.pdf files
 Key: TIKA-2900
 URL: https://issues.apache.org/jira/browse/TIKA-2900
 Project: Tika
  Issue Type: Wish
  Components: app, example
Affects Versions: 1.21
Reporter: Md


Hello,

I use Apache Tika to extract text from mostly *.doc, *.docx, and *.pdf files. 
Sometimes the files contain comments, and Tika extracts them and appends them 
to the end of the output. Is there a way to exclude comments during text 
extraction?

Here is the code I am using:

```
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
```





[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382220#comment-16382220
 ] 

Md edited comment on TIKA-2593 at 3/1/18 4:51 PM:
--

I would like to do a few things:
 * exclude comments
 * possibly exclude headers and footers (working fine)
 * exclude deleted content (I found a way, and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using 
it. Currently, here is what I am doing:

 
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));

OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
.
.
</head>
<body>
.
.
<p>this is test. </p>
<p> MORE TEXT </p>
<div class="annotation_text"><p>Please be more specific. Testing points for what? </p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Please be more specific. What was the previous item? </p></div>
<div class="annotation_text"><p>Version of what? </p></div>
<div class="annotation_text"><p>How exactly does this benefit Ontario? </p></div>
</body>
</html>
{code}
The comments are coming as "annotation_text" divs, so there are 6 comments 
above. I can use a regular expression to remove the comments from this output, 
but I am interested to know whether there is any way I can tell/configure Tika 
to exclude "annotation_text" content from extraction.
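Since the output above is well-formed XHTML, an alternative to the regex route is to parse it and delete the annotation nodes, which also yields the comment count (6 here) for free. A sketch with the JDK's DOM and XPath APIs, again assuming the "annotation_text" class:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnnotationRemover {

    /** Deletes every div with class="annotation_text"; returns how many were removed. */
    public static int removeAnnotations(Document doc) throws Exception {
        // DocumentBuilderFactory is not namespace-aware by default, so the
        // plain //div path matches even though the XHTML declares an xmlns.
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//div[@class='annotation_text']", doc, XPathConstants.NODESET);
        for (int i = hits.getLength() - 1; i >= 0; i--) {
            Node n = hits.item(i);
            n.getParentNode().removeChild(n);   // drop the whole comment subtree
        }
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<body><p>text</p>"
                + "<div class='annotation_text'>c1</div>"
                + "<div class='annotation_text'>c2</div></body>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(removeAnnotations(doc));   // number of comment divs removed
    }
}
```

After removal, the cleaned document can be serialized back to a string (e.g. with javax.xml.transform) or its text content read directly.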

 

Thanks 

 


was (Author: mdasadul):
I would like to do a few things:
 * exclude comments
 * possibly exclude headers and footers (working fine)
 * exclude deleted content (I found a way, and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using 
it. Currently, here is what I am doing:

 
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));

OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
.
.
</head>
<body>
.
.
<p>this is test. </p>
<p> MORE TEXT </p>
<div class="annotation_text"><p>Please be more specific. Testing points for what? </p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Please be more specific. What was the previous item? </p></div>
<div class="annotation_text"><p>Version of what? </p></div>
<div class="annotation_text"><p>How exactly does this benefit Ontario? </p></div>
</body>
</html>
{code}
The comments are coming as "annotation_text" divs, so there are 6 comments 
above. I can use a regular expression to remove the comments from this output, 
but I am interested to know whether there is any way I can tell Tika to 
exclude "annotation_text" content from extraction.

 

Thanks 

 

> docx with track change producing incorrect output
> -
>
> Key: TIKA-2593
> URL: https://issues.apache.org/jira/browse/TIKA-2593
> Project: Tika
>  Issue Type: Bug
>  Components: core, handler
>Affects Versions: 1.17
>Reporter: Md
>Priority: Major
> Attachments: sample.docx
>
>
> I am using following code to extract text from docx file 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> {code}

[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382220#comment-16382220
 ] 

Md edited comment on TIKA-2593 at 3/1/18 4:11 PM:
--

I would like to do a few things:
 * exclude comments
 * possibly exclude headers and footers (working fine)
 * exclude deleted content (I found a way, and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using 
it. Currently, here is what I am doing:

 
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));

OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
.
.
</head>
<body>
.
.
<p>this is test. </p>
<p> MORE TEXT </p>
<div class="annotation_text"><p>Please be more specific. Testing points for what? </p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Please be more specific. What was the previous item? </p></div>
<div class="annotation_text"><p>Version of what? </p></div>
<div class="annotation_text"><p>How exactly does this benefit Ontario? </p></div>
</body>
</html>
{code}
The comments are coming as "annotation_text" divs, so there are 6 comments 
above. I can use a regular expression to remove the comments from this output, 
but I am interested to know whether there is any way I can tell Tika to 
exclude "annotation_text" content from extraction.

 

Thanks 

 


was (Author: mdasadul):
I would like to do a few things:
 * exclude comments
 * possibly exclude headers and footers (working fine)
 * exclude deleted content (I found a way, and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using 
it. Currently, here is what I am doing:

 
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));

OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
.
.
</head>
<body>
.
.
<p>this is test. </p>
<p> MORE TEXT </p>
<div class="annotation_text"><p>Please be more specific. Testing points for what? </p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Acronyms should be spelled out upon first use, followed by the acronym itself 
in parentheses. Subsequently, only the acronym needs to be used in the text.</p></div>
<div class="annotation_text"><p>Please be more specific. What was the previous item? </p></div>
<div class="annotation_text"><p>Version of what? </p></div>
<div class="annotation_text"><p>How exactly does this benefit Ontario? </p></div>
</body>
</html>
{code}
The comments are coming as "annotation_text" divs. 
I can use a regular expression to remove the comments from this output, but I 
am interested to know whether there is any way I can tell Tika to exclude 
"annotation_text" content from extraction.

 

Thanks 

 

> docx with track change producing incorrect output
> -
>
> Key: TIKA-2593
> URL: https://issues.apache.org/jira/browse/TIKA-2593
> Project: Tika
>  Issue Type: Bug
>  Components: core, handler
>Affects Versions: 1.17
>Reporter: Md
>Priority: Major
> Attachments: sample.docx
>
>
> I am using following code to extract text from docx file 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> {code}

[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382220#comment-16382220
 ] 

Md commented on TIKA-2593:
--

I would like to do a few things:
 * exclude comments
 * possibly exclude headers and footers (working fine)
 * exclude deleted content (I found a way and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using that. Currently, here is what I am doing:

 
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler=new XHTMLContentHandler(handler,metadata);
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));

OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
<html xmlns="http://www.w3.org/1999/xhtml">
.
.
this is test.
 MORE TEXT
Please be more specific. Testing points for what?
Acronyms should be spelled out upon first use, followed by the acronym itself
in parentheses. Subsequently, only the acronym needs to be used in the text.
Acronyms should be spelled out upon first use, followed by the acronym itself
in parentheses. Subsequently, only the acronym needs to be used in the text.
Please be more specific. What was the previous item?
Version of what?
How exactly does this benefit Ontario?
{code}
The comments are coming as
{code:java}
{code}
I can use a regular expression to remove the comments from here, but I would like to know whether there is any way I can tell Tika to exclude "annotation_text" from extraction.

 

Thanks 

 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382193#comment-16382193
 ] 

Md commented on TIKA-2593:
--

I am talking about this ticket; for example, you can see the attached file in the ticket:

https://issues.apache.org/jira/browse/TIKA-1945

 

Yes, parameterizing sounds great for comments, like deleted content and headers/footers :)

I was using
{code:java}
contentHandlerFactory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
recursiveParserWrapper = new RecursiveParserWrapper(new AutoDetectParser(), contentHandlerFactory);
{code}

 

And comments were coming like "Comment by : ", but the problem with that parser is that it can't extract from zip files. Is it possible to extract from a zip file using RecursiveParserWrapper? Or is there a way I can use BasicContentHandlerFactory.HANDLER_TYPE.TEXT with AutoDetectParser so that I can get the comments?
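One workaround is to unpack the zip yourself and hand each entry to whichever parser setup you prefer. A JDK-only sketch of the unpacking side — the comment marks where a Tika call would go (the parser invocation itself is left out as an assumption about your setup):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipWalker {
    /** Walks the zip and collects file-entry names; a parser call would go where noted. */
    public static List<String> listEntries(InputStream zipStream) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                    // Here you could hand the current entry's bytes to Tika,
                    // e.g. parser.parse(...) on a stream that does not close 'zin'.
                }
                zin.closeEntry();
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory so the example is self-contained.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(buf)) {
            zout.putNextEntry(new ZipEntry("a.docx"));
            zout.write("dummy".getBytes());
            zout.closeEntry();
        }
        System.out.println(listEntries(new ByteArrayInputStream(buf.toByteArray()))); // prints [a.docx]
    }
}
```

This keeps the zip handling independent of which ContentHandler type the parser uses.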



[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382167#comment-16382167
 ] 

Md edited comment on TIKA-2593 at 3/1/18 3:33 PM:
--

No, deleted content is not showing if we do
{code:java}
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
{code}
I wanted to exclude shape-based content with officeParserConfig.setIncludeShapeBasedContent(false); but it is not working.

Also, is there any way I can exclude comments in docx during Tika extraction?

 

Thanks so much for your help


was (Author: mdasadul):
No deleted content is not showing if we do 

 officeParserConfig.setUseSAXDocxExtractor(true);

officeParserConfig.setIncludeDeletedContent(false);

I wanted to exclude shape based content by 
officeParserConfig.setIncludeShapeBasedContent(false); but not working

 

Also Is there anyway I can exclude comments in docx?

 

Thanks so much for your help



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382167#comment-16382167
 ] 

Md commented on TIKA-2593:
--

No, deleted content is not showing if we do
{code:java}
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
{code}
I wanted to exclude shape-based content with officeParserConfig.setIncludeShapeBasedContent(false); but it is not working.

Also, is there any way I can exclude comments in docx?

 

Thanks so much for your help



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382014#comment-16382014
 ] 

Md commented on TIKA-2593:
--

I think I figured it out. I need to set

officeParserConfig.setUseSAXDocxExtractor(true);

But it still doesn't work for

officeParserConfig.setIncludeShapeBasedContent(false);



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382005#comment-16382005
 ] 

Md commented on TIKA-2593:
--

I notice it works nicely when I ask it to exclude headers and footers with

officeParserConfig.setIncludeHeadersAndFooters(false); but it is not working for

officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeShapeBasedContent(false);

Is there anything I am missing here?



[jira] [Updated] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2593:
-
Description: 
I am using the following code to extract text from a docx file:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
Metadata metadata = new Metadata();

OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
When I send files with tracked revisions, it's including all the deleted text along with the actual and inserted text. Is there a way to tell the parser to exclude the deleted text?

Here is an example 

input Text: This is a sample text. -This part will- be deleted. +This is 
inserted.+

outputText: This is a sample text. This part will be deleted. This is inserted.

Desired output: This is a sample text.  be deleted. This is inserted.

  was:
I am using following code to extract text from docx file 
{code:java}
contentHandler = new BodyContentHandler();
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
Metadata metadata = new Metadata();
StringBuilder fileContent = new StringBuilder();
recursiveParserWrapper.parse(inputStream, contentHandler, metadata, 
parseContext);
System.out.println("Metadata WordCount Value: "+contentHandler.toString());

{code}
When I am sending track revised files it's adding all the text deleted with the 
actual text and inserted text. Is there a way to tell parser to exclude the 
deleted text?

Here is an example 

input Text: This is a sample text. -This part will- be deleted. +This is 
inserted.+

outputText: This is a sample text. This part will be deleted. This is inserted.

Desired output: This is a sample text.  be deleted. This is inserted.




[jira] [Created] (TIKA-2593) docx with track change producing incorrect output

2018-02-28 Thread Md (JIRA)
Md created TIKA-2593:


 Summary: docx with track change producing incorrect output
 Key: TIKA-2593
 URL: https://issues.apache.org/jira/browse/TIKA-2593
 Project: Tika
  Issue Type: Bug
  Components: core, handler
Affects Versions: 1.17
Reporter: Md
 Attachments: sample.docx

I am using the following code to extract text from a docx file:
{code:java}
contentHandler = new BodyContentHandler();
inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
Metadata metadata = new Metadata();
StringBuilder fileContent = new StringBuilder();
recursiveParserWrapper.parse(inputStream, contentHandler, metadata, 
parseContext);
System.out.println("Metadata WordCount Value: "+contentHandler.toString());

{code}
When I send files with tracked revisions, it's including all the deleted text along with the actual and inserted text. Is there a way to tell the parser to exclude the deleted text?

Here is an example 

input Text: This is a sample text. -This part will- be deleted. +This is 
inserted.+

outputText: This is a sample text. This part will be deleted. This is inserted.

Desired output: This is a sample text.  be deleted. This is inserted.





[jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

2018-02-28 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380773#comment-16380773
 ] 

Md commented on TIKA-207:
-

By the way, I am using AutoDetectParser().

> MS word doc containing tracked changes produces incorrect text
> --
>
> Key: TIKA-207
> URL: https://issues.apache.org/jira/browse/TIKA-207
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.3
> Environment: tika-0.3-standalone.jar
>Reporter: Michael McCandless
>Assignee: Jukka Zitting
>Priority: Minor
> Fix For: 0.10
>
> Attachments: TIKA-207.patch, TIKA-207.patch
>
>
> Spinoff from this discussion:
>   
> http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html
> When extracting text from an MS Word doc (2003 format) that has
> unapproved pending changes, the text from both old and new is glommed
> together.
> EG I had a doc that contained text "Field.Index.TOKENIZED", and I
> changed TOKENIZED to ANALYZED with track changes enabled, and
> then when I extract text (using TikaCLI) it produces this:
>   Field.Index.TOKENIZEDANALYZED
> So, first, it'd be nice to at least get whitespace inserted between
> old & new text.
> And, second, it'd be great to have an option to control whether it's
> old or new text that's indexed (or at least an option to only see
> "new" text, ie the current document).
> From the discussion above, it seems like POI may expose the
> fine-grained APIs to allow Tika to do this; it's just that Tika's not
> leveraging these APIs  for MS Word docs.





[jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

2018-02-28 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380752#comment-16380752
 ] 

Md commented on TIKA-207:
-

I am using Tika 1.17, but it's still extracting deleted text from files with tracked revisions. Is there a way to exclude deleted text from tracked-revision files?

 



[jira] [Commented] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967812#comment-15967812
 ] 

Md commented on TIKA-2326:
--

Thanks once again. I was going through the above-mentioned discussion; unfortunately, none of the options seems feasible for our case. As I am using RecursiveParserWrapper, option 1 is not a choice for us. Options 2 and 3 are also not feasible, as I am using the extracted text for further analysis, and I am running Tika along with RabbitMQ and some other services.

Anyway, thanks for making this nice piece of software open source.

> java.lang.OutOfMemoryError: Java heap space
> ---
>
> Key: TIKA-2326
> URL: https://issues.apache.org/jira/browse/TIKA-2326
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.8
> Environment: Ubuntu 16.04, java version "1.8.0_121"
>Reporter: Md
> Fix For: 1.13
>
> Attachments: 
> 5d3e815263c73061d8804e15db3ammn0789_CLEAN_REVISED.docx
>
>
> I am using RecursiveParserWrapper with AutoDetectParser() and here is the  
> part of my code which is doing parsing
>  
> RecursiveParserWrapper parser = null;
>   ContentHandlerFactory factory = new 
> BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
>   
> if(fileNameWithPath.toLowerCase().contains(htmlInTxtFile.toLowerCase())) {
>   parser = new RecursiveParserWrapper(new HtmlParser(), factory);
>   } else {
>   parser = new RecursiveParserWrapper(new AutoDetectParser(), 
> factory);
>   }
>   // -1 for unlimited buffer
>   ContentHandler handler = new BodyContentHandler(-1);
>   ParseContext context = new ParseContext();
>   parser.parse(inputStream, handler, metadata, context);
> Out of my 251000 files there is only one file where parser is unable to parse 
> and proving out of memory error. Here goes the error message
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
>   at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
>   at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
>   at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
>   at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
>   at 
> org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
>   at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
>   at 
> org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
>   at 
> org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
>   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
>   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
>   at 
> org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
>   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>   at 
> org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>   at 
> org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>   at 
> org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>   at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>   at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown
>  Source)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
>   at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.(XWPFDocument.java:118)
>   at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.(XWPFWordExtractor.java:59)
>   at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:182)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at 
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:130)
>   



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967786#comment-15967786
 ] 

Md commented on TIKA-2326:
--

We have many files which are archived; for them, RecursiveParserWrapper is doing a good job, and that's the reason I am using it. OK, thanks, I will use the default handler instead.

I think I started using Tika at a very early stage, and most of this code was written at that time. I did notice some difference between new AutoDetectParser() and HtmlParser(), and that's the reason I was using it, but now they look the same. Thank you for pointing it out.

Is it possible to add a timer to the parser, especially here:
parser.parse(inputStream, handler, metadata, context);
Sometimes it takes a long time, and I would like to set a prespecified timeout.

Thank you so much, I really appreciate your help.
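As far as I know, parse() itself does not take a timeout, but the call can be wrapped in a Future and cancelled when it runs over. A JDK-only sketch; the Runnable below stands in for the actual parser.parse(...) call:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedTask {
    /** Runs 'work' for at most timeoutMs milliseconds; returns true if it finished in time. */
    public static boolean runWithTimeout(Runnable work, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> future = pool.submit(work);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true);  // interrupt the worker thread
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A fast task finishes within the budget...
        System.out.println(runWithTimeout(() -> { }, 1000));  // prints true
        // ...a slow one (standing in for a long parse) is cut off.
        System.out.println(runWithTimeout(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
        }, 100));  // prints false
    }
}
```

One caveat: cancellation only interrupts the thread; a parse that never checks for interruption may keep running in the background until it finishes on its own.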


[jira] [Closed] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md closed TIKA-2326.

Resolution: Fixed

Fixed in 1.13 and later versions.

> java.lang.OutOfMemoryError: Java heap space
> ---
>
> Key: TIKA-2326
> URL: https://issues.apache.org/jira/browse/TIKA-2326
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.8
> Environment: Ubuntu 16.04, java version "1.8.0_121"
>Reporter: Md
> Fix For: 1.13
>
> Attachments: 
> 5d3e815263c73061d8804e15db3ammn0789_CLEAN_REVISED.docx
>
>
> I am using RecursiveParserWrapper with AutoDetectParser(), and here is the
> part of my code that does the parsing:
>
> RecursiveParserWrapper parser = null;
> ContentHandlerFactory factory = new
>     BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
>
> if (fileNameWithPath.toLowerCase().contains(htmlInTxtFile.toLowerCase())) {
>     parser = new RecursiveParserWrapper(new HtmlParser(), factory);
> } else {
>     parser = new RecursiveParserWrapper(new AutoDetectParser(), factory);
> }
> // -1 for an unlimited buffer
> ContentHandler handler = new BodyContentHandler(-1);
> ParseContext context = new ParseContext();
> parser.parse(inputStream, handler, metadata, context);
>
> Out of my 251,000 files there is only one file that the parser is unable to
> parse; it produces an out-of-memory error. Here is the error message:
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
>   at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:506)
>   at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:419)
>   at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:489)
>   at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2927)
>   at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.stripText(Cur.java:3130)
>   at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:3143)
>   at org.apache.xmlbeans.impl.store.Locale$SaxHandler.characters(Locale.java:3291)
>   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportCdata(Piccolo.java:992)
>   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1290)
>   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
>   at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4812)
>   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
>   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
>   at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
>   at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3479)
>   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1277)
>   at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1264)
>   at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
>   at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:136)
>   at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:166)
>   at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:118)
>   at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
>   at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:182)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:130)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
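The BodyContentHandler(-1) above disables Tika's write limit, so the whole extracted text is buffered in memory, which is exactly where a pathological file can exhaust the heap. The sketch below is a minimal, JDK-only illustration of the bounding idea (plain SAX, no Tika on the classpath); the class name CappedTextHandler and its limit are hypothetical, not part of Tika's API.

```java
import org.xml.sax.helpers.DefaultHandler;

/**
 * Minimal sketch of a character-capped SAX text collector: once the cap
 * is reached, further characters are dropped instead of buffered.
 * Illustrative only; not a Tika class.
 */
class CappedTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars;
    private boolean truncated = false;

    CappedTextHandler(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        int room = maxChars - buffer.length();
        if (room <= 0) {            // cap already reached: drop everything
            truncated = true;
            return;
        }
        int take = Math.min(room, length);
        buffer.append(ch, start, take);
        if (take < length) {        // only part of this event fit
            truncated = true;
        }
    }

    String getText() { return buffer.toString(); }
    boolean wasTruncated() { return truncated; }
}
```

In actual Tika code the equivalent knob is, if memory is the concern, passing a positive write limit instead of -1 (e.g. `new BodyContentHandler(10_000_000)`, or a non-negative writeLimit in BasicContentHandlerFactory), accepting truncated text for oversized documents.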


[jira] [Updated] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2326:
-
Fix Version/s: 1.13






[jira] [Commented] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967764#comment-15967764
 ] 

Md commented on TIKA-2326:
--

Yes, you are right. It was fixed in a recent version (1.14). Thanks so much!






[jira] [Updated] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2326:
-
Description: 
I am using RecursiveParserWrapper with AutoDetectParser(), and here is the part
of my code which is doing the parsing (code and stack trace as quoted above).


  was:
I am using RecursiveParserWrapper with AutoDetectParser() and here is the major
part where I am doing parsing (code and stack trace as quoted above).

[jira] [Updated] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Md updated TIKA-2326:
-
Attachment: 5d3e815263c73061d8804e15db3ammn0789_CLEAN_REVISED.docx

Here is the file I am having the issue with.






[jira] [Created] (TIKA-2326) java.lang.OutOfMemoryError: Java heap space

2017-04-13 Thread Md (JIRA)
Md created TIKA-2326:


 Summary: java.lang.OutOfMemoryError: Java heap space
 Key: TIKA-2326
 URL: https://issues.apache.org/jira/browse/TIKA-2326
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.8
 Environment: Ubuntu 16.04, java version "1.8.0_121"

Reporter: Md


I am using RecursiveParserWrapper with AutoDetectParser(), and here is the major
part where I am doing the parsing:

RecursiveParserWrapper parser = null;
ContentHandlerFactory factory = new
    BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);

if (fileNameWithPath.toLowerCase().contains(htmlInTxtFile.toLowerCase())) {
    parser = new RecursiveParserWrapper(new HtmlParser(), factory);
} else {
    parser = new RecursiveParserWrapper(new AutoDetectParser(), factory);
}
// -1 for an unlimited buffer
ContentHandler handler = new BodyContentHandler(-1);
ParseContext context = new ParseContext();
parser.parse(inputStream, handler, metadata, context);

Out of my 251,000 files there is only one file that the parser is unable to
parse; it produces the OutOfMemoryError whose stack trace is quoted in the
replies above.


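When one bad file out of roughly 251,000 can take down a whole batch run, a common mitigation is to bound each parse rather than let it run unchecked. The JDK-only sketch below shows the pattern with an ExecutorService and a timeout; parseOneFile is a stand-in Callable, not a Tika API, and note that an in-process OutOfMemoryError can still poison the JVM, so production pipelines often prefer stronger isolation (tika-server or a forked child process).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Sketch: run each per-file parse as a bounded task so a runaway file
 * times out and the batch continues. parseOneFile is a placeholder.
 */
class BoundedParse {
    static String parseWithTimeout(ExecutorService pool,
                                   Callable<String> parseOneFile,
                                   long timeoutSeconds) {
        Future<String> future = pool.submit(parseOneFile);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the stuck parse
            return "";             // skip this file, keep the batch going
        } catch (InterruptedException | ExecutionException e) {
            return "";             // parse failed; treat as empty text
        }
    }
}
```

With this shape, the single problematic .docx would yield empty text and a log entry instead of killing the run.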


