[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382220#comment-16382220
 ] 

Md commented on TIKA-2593:
--

I would like to do a few things:
 * exclude comments
 * possibly exclude headers and footers (working fine)
 * exclude deleted content (I found a way and it's working)

Please ignore my comment about RecursiveParserWrapper, as I will not be using 
it for now. Here is what I am doing:

 
{code:java}
ParseContext parseContext = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
XHTMLContentHandler contentHandler = new XHTMLContentHandler(handler, metadata);
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));

// Use the SAX-based docx extractor so that deleted content can be excluded
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

parser.parse(inputStream, contentHandler, metadata, parseContext);
System.out.println(contentHandler.toString());
{code}
I am getting the following output:
{code:java}
http://www.w3.org/1999/xhtml;>
.
.

.
.

this is test. 
 MORE TEXT 

Please be more specific. Testing points for what? 

Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text.

Acronyms should be spelled out upon first use, followed by the acronym itself in parentheses. Subsequently, only the acronym needs to be used in the text.

Please be more specific. What was the previous item? 

Version of what? 

How exactly does this benefit Ontario? 
{code}
The comments are coming as 
{code:java}
{code}
I could use a regular expression to remove the comments from this output, but 
I would like to know: is there any way I can tell Tika to exclude 
"annotation_text" from extraction?

 

Thanks 
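
As a possible alternative to a regular expression, the comments could be dropped at the SAX level. The following is only a sketch using JDK classes, not a Tika feature: it assumes the comments arrive in elements whose class attribute is exactly "annotation_text" (the real markup is not visible in the output above), and the names AnnotationFilterDemo, AnnotationTextFilter and stripAnnotations are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class AnnotationFilterDemo {

    /** Drops every element with class="annotation_text", together with its subtree. */
    static class AnnotationTextFilter extends XMLFilterImpl {
        private int skipDepth = 0; // greater than 0 while inside a skipped element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            // Assumption: the class attribute holds exactly "annotation_text"
            if (skipDepth > 0 || "annotation_text".equals(atts.getValue("", "class"))) {
                skipDepth++;
                return;
            }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (skipDepth > 0) {
                skipDepth--;
                return;
            }
            super.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (skipDepth == 0) {
                super.characters(ch, start, length);
            }
        }
    }

    /** Re-serializes the given XHTML without the annotation elements. */
    public static String stripAnnotations(String xhtml) throws Exception {
        SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        AnnotationTextFilter filter = new AnnotationTextFilter();
        filter.setContentHandler(serializer); // filtered events flow on to the serializer

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(filter);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input; the div element is an assumption, only the class name is from the thread
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>this is test.</p>"
                + "<div class=\"annotation_text\">Version of what?</div>"
                + "</body></html>";
        System.out.println(stripAnnotations(xhtml));
    }
}
```

Because XMLFilterImpl implements ContentHandler, a filter like this could presumably also be placed directly between the parser and the ToXMLContentHandler instead of re-parsing the printed output, though that has not been tried here.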

 

> docx with track change producing incorrect output
> -------------------------------------------------
>
> Key: TIKA-2593
> URL: https://issues.apache.org/jira/browse/TIKA-2593
> Project: Tika
> Issue Type: Bug
> Components: core, handler
> Affects Versions: 1.17
> Reporter: Md
> Priority: Major
> Attachments: sample.docx
>
>
> I am using the following code to extract text from a docx file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I send files with tracked revisions, the output includes all the 
> deleted text along with the actual text and the inserted text. Is there a way 
> to tell the parser to exclude the deleted text?
> Here is an example:
> Input text: This is a sample text. -This part will- be deleted. +This is 
> inserted.+
> Output text: This is a sample text. This part will be deleted. This is 
> inserted.
> Desired output: This is a sample text.  be deleted. This is inserted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382207#comment-16382207
 ] 

Tim Allison commented on TIKA-2593:
---

bq. Which shapes are being extracted, are you able to share an example?  
TIKA-1945
Ok, that's diagram data, and you're right, "ShapeBasedContent" so far only 
means text boxes in docx.  Fellow devs, any problem if we call diagram data 
"ShapeBasedContent"?

bq. And comments were coming like "Comment by : 
" but the problem of that parser is that it can't extract from zip 
file. Is it possible to extract from zip file by using RecursiveParserWrapper? 
or is there a way I can use BasicContentHandlerFactory.HANDLER_TYPE.TEXT with 
auto detect parser so that I can get the comment ?

I'm confused.  Which parser can't extract from zip files?  Yes, the 
RecursiveParserWrapper should handle zip files...if it can't, open a separate 
issue!  I'm not sure what zip files and comments have to do with each other.




[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382193#comment-16382193
 ] 

Md commented on TIKA-2593:
--

I am talking about the following ticket; for an example, see the file attached 
to it:

https://issues.apache.org/jira/browse/TIKA-1945

Yes, parameterizing comments sounds great, like deleted content and headers 
and footers :)

I was using

contentHandlerFactory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
recursiveParserWrapper = new RecursiveParserWrapper(new AutoDetectParser(), contentHandlerFactory);

and comments were coming through like "Comment by : ", 
but the problem with that parser is that it can't extract from zip files. Is it 
possible to extract from a zip file by using RecursiveParserWrapper? Or is 
there a way I can use BasicContentHandlerFactory.HANDLER_TYPE.TEXT with the 
auto-detect parser so that I can get the comments?



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382177#comment-16382177
 ] 

Tim Allison commented on TIKA-2593:
---

bq. I wanted to exclude shape based content by 
officeParserConfig.setIncludeShapeBasedContent(false); but not working

Which shapes are being extracted, are you able to share an example?

bq. Also Is there anyway I can exclude comments in docx during tika extraction?
Not currently, but we could parameterize that as well. :)



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382167#comment-16382167
 ] 

Md commented on TIKA-2593:
--

No, deleted content is not showing if we set

officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);

I wanted to exclude shape-based content with 
officeParserConfig.setIncludeShapeBasedContent(false); but it is not working.

Also, is there any way I can exclude comments in docx?

Thanks so much for your help



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382125#comment-16382125
 ] 

Tim Allison commented on TIKA-2593:
---

bq. I think I did figure it out. I need to set 
officeParserConfig.setUseSAXDocxExtractor(true);

Sorry for not responding...IIRC, we can't yet remove deleted content with our 
regular DOM parser, so you do have to use the SAXDocx parser.
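
To illustrate how a SAX pass can leave out deleted text, here is a rough sketch using only JDK classes. It is not Tika's implementation. It relies on the WordprocessingML convention that tracked deletions keep their text in w:delText elements while live and inserted text sits in w:t; the class name DeletedTextDemo, the method visibleText and the SAMPLE fragment are invented for the example, with the text taken from the issue description.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DeletedTextDemo {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical fragment modelled on the example in the issue description
    static final String SAMPLE = "<w:p xmlns:w=\"" + W + "\">"
            + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
            + "<w:del><w:r><w:delText>This part will</w:delText></w:r></w:del>"
            + "<w:r><w:t xml:space=\"preserve\"> be deleted. </w:t></w:r>"
            + "<w:ins><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
            + "</w:p>";

    /** Collects text from w:t elements only, so w:delText content never reaches the output. */
    public static String visibleText(String documentXml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false; // true while inside a w:t element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (W.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (W.equals(uri) && "t".equals(localName)) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        };

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed to match on namespace URI and local name
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(documentXml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints: This is a sample text.  be deleted. This is inserted.
        System.out.println(visibleText(SAMPLE));
    }
}
```

The printed line matches the "Desired output" in the issue description, including the double space left where the deleted run used to be.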

bq. But still doesn't work for 
officeParserConfig.setIncludeShapeBasedContent(false);

If {{setIncludeShapeBasedContent}} is set to false, are you saying that deleted 
content comes through?!




[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382014#comment-16382014
 ] 

Md commented on TIKA-2593:
--

I think I figured it out. I need to set 

officeParserConfig.setUseSAXDocxExtractor(true);

But it still doesn't work for 

officeParserConfig.setIncludeShapeBasedContent(false);



[jira] [Commented] (TIKA-2593) docx with track change producing incorrect output

2018-03-01 Thread Md (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382005#comment-16382005
 ] 

Md commented on TIKA-2593:
--

I notice it works nicely when I ask to exclude headers and footers with 

officeParserConfig.setIncludeHeadersAndFooters(false);

but it is not working for 

officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeShapeBasedContent(false);

Is there anything I am missing here?
