[jira] [Closed] (TIKA-2347) Underlined text is not decorated as such when extracting from word documents

2019-04-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov closed TIKA-2347.
---

> Underlined text is not decorated as such when extracting from word documents
> 
>
> Key: TIKA-2347
> URL: https://issues.apache.org/jira/browse/TIKA-2347
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.0, 1.14
>Reporter: Stuart Hendren
>Assignee: Dave Meikle
>Priority: Major
> Fix For: 1.17
>
>
> When extracting from .doc and .docx files, bold and italic text decoration is 
> extracted; underlining, however, is not. This can be demonstrated in WordParserTest 
> or OOXMLParserTest (change to docx) with the following test case.
> {code:title=WordParserTest.java|borderStyle=solid}
> @Test
> public void testTextDecoration() throws Exception {
>   XMLResult result = getXML("testWORD_various.doc");
>   String xml = result.xml;
>   assertTrue(xml.contains("Bold"));
>   assertTrue(xml.contains("italic"));
>   assertTrue(xml.contains("underline"));
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (TIKA-2601) Invalid XHTML output for some WORD documents

2019-04-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov resolved TIKA-2601.
-
Resolution: Duplicate

I'm marking this as a duplicate of TIKA-2555, which I'm currently looking into.

> Invalid XHTML output for some WORD documents
> 
>
> Key: TIKA-2601
> URL: https://issues.apache.org/jira/browse/TIKA-2601
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
> Environment: Linked is a sample document with its corresponding 
> output.
>Reporter: Filip
>Priority: Major
> Attachments: Invalid-XML.doc, Test.doc, test.html
>
>
> In some WORD (.doc, .docx) documents the XHTML elements are not closed 
> properly. This usually happens when there are link elements (<a>) as well as 
> italic or bold elements (<i>, <b>).
>  
> Fix should be done in 
> [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java]
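
A quick way to confirm the symptom reported above is to feed the extracted XHTML to an XML parser and see whether it is well-formed. The following is a minimal sketch using only the JDK's built-in parser; `WellFormedCheck` and `isWellFormed` are hypothetical names chosen for illustration and are not part of Tika. The sample strings are made up and only mimic the overlapping-tag pattern.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative helper (hypothetical name): checks markup for XML well-formedness. */
final class WellFormedCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so no external entities are resolved;
            // as a side effect, input that carries a DOCTYPE is reported as not well-formed.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Any parse failure, including mismatched close tags, lands here.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p><b><u>fox</u></b></p>")); // properly nested
        System.out.println(isWellFormed("<p><b><u>fox</b></u></p>")); // overlapping tags
    }
}
```

A check like this could sit in a regression test next to the existing string assertions, so that malformed output fails the build rather than a browser.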





[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810288#comment-16810288
 ] 

Tim Allison commented on TIKA-2847:
---

Last hope:


{noformat}
PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setMaxMainMemoryBytes(50);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, pdfParserConfig);
Parser p = new AutoDetectParser();
...
p.parse(inputstream, contentHandler, metadata, parseContext);
{noformat}

> OutOfMemoryError - tika1.19.1.jar
> -
>
> Key: TIKA-2847
> URL: https://issues.apache.org/jira/browse/TIKA-2847
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19.1
>Reporter: Ashish Tiwari
>Priority: Major
> Attachments: testCmplData.docx
>
>
> I am trying to parse a docx file and get the error below. The same issue occurs 
> if I convert the attached docx file to a PDF. 
> The attached PDF file is 3.7 MB; however, I doubt the problem is related to file 
> size, as I am able to parse a file above 30 MB without any issues.
> PS: This issue only happens when the JVM is configured with -Xmx512m; if I 
> change the value to 1024m, parsing works fine.
>  
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
>  
>  





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810256#comment-16810256
 ] 

Tim Allison commented on TIKA-2749:
---

[~rossj], this is very helpful...any recs on how to detect "not a normal scan"?

 

I have run into individual pages that contain (1000s?) of images stitched 
together that clearly require rendering to be useful...  Aside from a high 
number of images, how do we identify flipped and/or overlapping scans? 

 

Thank you!

> OCR on PDFs should "just work" out of the box
> -
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Ross Johnson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810172#comment-16810172
 ] 

Ross Johnson commented on TIKA-2749:


OCRing the inlined images directly can be tricky, in my experience. Here are a 
couple classes of problematic scenarios that I've come across:

- Sometimes images have a funky transform applied, e.g. the actual page scan is 
mirrored but flipped to look right in the PDF.
- Some fancy PDF generators / scanners use multiple overlapping images, perhaps 
utilizing image masking, to reduce file size. E.g. there may be a background 
image with the color components along with a foreground grayscale or 
1-bit-per-pixel image of the black content. I've also seen the foreground text 
split up into multiple images overlaying different parts of the background 
image.

In either situation, you could probably detect the "not a normal scan" 
condition and kick it into "render page" mode.
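
The two scenarios above suggest simple geometric checks. The following is a rough sketch, not anything that exists in Tika or PDFBox: `ScanHeuristics`, `PlacedImage`, and the `maxImages` cutoff are hypothetical, and it assumes the caller can supply each image's placement transform and its bounding box in page coordinates. A mirrored placement shows up as a transform with a negative determinant, and overlapping images show up as intersecting bounding boxes. Whether the sign convention matches what a given PDF library reports would need to be verified against real files.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical heuristic for spotting pages that are "not a normal scan". */
final class ScanHeuristics {

    /** An image as placed on the page (assumed inputs: placement transform and page-space bounds). */
    static final class PlacedImage {
        final AffineTransform transform;
        final Rectangle2D bounds;

        PlacedImage(AffineTransform transform, Rectangle2D bounds) {
            this.transform = transform;
            this.bounds = bounds;
        }
    }

    /** A negative determinant means the transform reverses orientation, i.e. mirrors the image. */
    static boolean isMirrored(PlacedImage img) {
        return img.transform.getDeterminant() < 0;
    }

    /** True if any two images on the page share some area. */
    static boolean hasOverlap(List<PlacedImage> images) {
        for (int i = 0; i < images.size(); i++) {
            for (int j = i + 1; j < images.size(); j++) {
                if (images.get(i).bounds.intersects(images.get(j).bounds)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Suggests rendering the whole page when OCR on the individual images looks unreliable. */
    static boolean shouldRenderPage(List<PlacedImage> images, int maxImages) {
        if (images.size() > maxImages || hasOverlap(images)) {
            return true;
        }
        for (PlacedImage img : images) {
            if (isMirrored(img)) {
                return true;
            }
        }
        return false;
    }
}
```

The pairwise overlap test is quadratic in the number of images, so the image-count cutoff is checked first; a page with thousands of stitched images is sent to rendering without comparing boxes.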



[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810165#comment-16810165
 ] 

Ashish Tiwari commented on TIKA-2847:
-

Yes, TikaInputStream.get(infile) gave me the same error.



[jira] [Assigned] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.

2019-04-04 Thread Konstantin Gribov (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Gribov reassigned TIKA-2555:
---

Assignee: Konstantin Gribov

> Text with [underline] + [another format] in word document generates 
> overlapping html tags.
> --
>
> Key: TIKA-2555
> URL: https://issues.apache.org/jira/browse/TIKA-2555
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.17
>Reporter: Serban Alexe
>Assignee: Konstantin Gribov
>Priority: Minor
> Attachments: Clipboard02.jpg
>
>
> I have a sample _.docx_ document which contains a single line of text.
> Making that text:
>  * +underlined+
>  ** AND at least one of the following two
>  * _italic_
>  * *bold*
> will cause the generated _.xhtml_ file to contain overlapping tags.
>  
> _+Example+_:
> *+The quick brown fox jumps over the lazy dog.+*
> will result in
> The quick brown fox jumps over the lazy dog. 
> which causes some browsers (Firefox, Chrome) to report an error and not display 
> the content of the file.
>  
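
One way to avoid overlapping tags of the kind described above is to track the open inline tags on a stack and always close the innermost one first. The sketch below is illustrative only: `NestedFormatWriter`, its `Format` enum, and the `run`/`finish` methods are hypothetical names, not Tika's WordExtractor or any Tika API, and text is appended without XML escaping to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only: emits inline formatting tags so that they always nest. */
final class NestedFormatWriter {

    /** Inline formats, each mapped to the tag it would emit. */
    enum Format {
        BOLD("b"), ITALIC("i"), UNDERLINE("u");

        final String tag;

        Format(String tag) {
            this.tag = tag;
        }
    }

    private final StringBuilder out = new StringBuilder();
    // Currently open tags, innermost on top.
    private final Deque<Format> open = new ArrayDeque<>();

    /** Appends one run of text that carries the given set of formats. */
    void run(String text, Set<Format> wanted) {
        // Close from the innermost tag outward until every tag still open is wanted.
        while (!wanted.containsAll(open)) {
            out.append("</").append(open.pop().tag).append('>');
        }
        // Open whatever is missing, nested inside the tags that stayed open.
        for (Format f : Format.values()) {
            if (wanted.contains(f) && !open.contains(f)) {
                out.append('<').append(f.tag).append('>');
                open.push(f);
            }
        }
        out.append(text);
    }

    /** Closes everything still open and returns the markup. */
    String finish() {
        while (!open.isEmpty()) {
            out.append("</").append(open.pop().tag).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        NestedFormatWriter w = new NestedFormatWriter();
        w.run("The quick ", EnumSet.of(Format.BOLD, Format.UNDERLINE));
        w.run("brown fox", EnumSet.of(Format.UNDERLINE));
        // prints <b><u>The quick </u></b><u>brown fox</u>
        System.out.println(w.finish());
    }
}
```

The trade-off is that a tag which stays active across two runs may be closed and reopened (as `<u>` is in the example), which costs a few extra characters but keeps the output well-formed.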





[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:47 PM:
---

There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file...On how to extract lines: [http://stackoverflow.com/a/38933039/535646]|See for example PDFBOX-4275's [^rotation.pdf].  If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
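
To make the table easier to discuss, here is a rough sketch of how its rows might translate into a per-page decision. Everything in it is an assumption for illustration: `OcrTriage`, `PageStats`, and `Reason` are hypothetical names, none of the fields come from a Tika or PDFBox API, and the `FEW_CHARS` and `COVERAGE` values are placeholders for the open "x%" questions. Only the 10% figure for missing unicode mappings comes from the table, where it is itself marked as tentative. The out-of-vocabulary row is left out because the table does not yet propose a diagnostic for it.

```java
/** Hypothetical sketch of the diagnostics table; every threshold is a placeholder. */
final class OcrTriage {

    enum Reason { IMAGE_ONLY, VECTOR_GRAPHICS, SCANNED, MISSING_UNICODE, NONE }

    /** Per-page measurements a caller would have to compute (assumed inputs). */
    static final class PageStats {
        int extractedChars;          // characters found by ordinary text extraction
        double imageCoverage;        // fraction of the page covered by inline images, 0..1
        int inlineImages;            // number of inline images on the page
        int vectorPathOps;           // number of path-drawing operations on the page
        double missingUnicodeRatio;  // fraction of glyphs without a unicode mapping, 0..1
    }

    static final int FEW_CHARS = 10;            // placeholder for "zero or only a few characters"
    static final double COVERAGE = 0.8;         // placeholder for "x% of the page"
    static final double MISSING_UNICODE = 0.10; // "anything over 10%???" from the table

    /** Returns the first table row that matches, or NONE if OCR does not look necessary. */
    static Reason classify(PageStats p) {
        boolean fewChars = p.extractedChars <= FEW_CHARS;
        boolean mostlyImage = p.imageCoverage >= COVERAGE;
        if (fewChars && mostlyImage) {
            return Reason.IMAGE_ONLY;
        }
        // No inline images to OCR, yet the page draws something: only rendering can help.
        if (fewChars && p.inlineImages == 0 && p.vectorPathOps > 0) {
            return Reason.VECTOR_GRAPHICS;
        }
        // Text is present but the page is mostly image: likely a scan with an existing text layer.
        if (mostlyImage) {
            return Reason.SCANNED;
        }
        if (p.missingUnicodeRatio > MISSING_UNICODE) {
            return Reason.MISSING_UNICODE;
        }
        return Reason.NONE;
    }
}
```

Returning a reason rather than a boolean leaves room for a hybrid strategy: a caller could OCR inline images for some reasons and render the full page for others.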


was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file...how do we identify vector graphics?|See for example 
PDFBOX-4275's [^rotation.pdf].  If we render the page, '', a vector 
graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR 
on the extracted inline images, OCR is never triggered because there are no 
inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan);what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |



[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810150#comment-16810150
 ] 

Tim Allison commented on TIKA-2847:
---

sorry.  I meant {{TikaInputStream.get(infile)}}...



[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810146#comment-16810146
 ] 

Ashish Tiwari commented on TIKA-2847:
-

Thanks, Tim. Setting "setUseSAXDocxExtractor" to true worked for the docx.

For the PDF file, the TikaInputStream.open API is not present. I did find and 
try a couple of the get APIs, but hit the same issue.

PS: I am using tika1.19.1.jar



[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810131#comment-16810131
 ] 

Tim Allison commented on TIKA-2847:
---

Try opening the InputStream with {{TikaInputStream.open(infile)}}...that 
triggers PDFBox to load from the underlying file instead of the InputStream.



[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:13 PM:
---

There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file...how do we identify vector graphics?|See for example 
PDFBOX-4275's [^rotation.pdf].  If we render the page, '', a vector 
graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR 
on the extracted inline images, OCR is never triggered because there are no 
inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan);what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |


was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file...how do we identify vector graphics?|See for example 
PDFBOX-2475's [^rotation.pdf].  If we render the page, '', a vector 
graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR 
on the extracted inline images, OCR is never triggered because there are no 
inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan);what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |



[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM:
---

There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file...how do we identify vector graphics?|See for example 
PDFBOX-2475's [^rotation.pdf].  If we render the page, '', a vector 
graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR 
on the extracted inline images, OCR is never triggered because there are no 
inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan);what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |


was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file|See for example PDFBOX-2475's 
[rotation.pdf|https://issues.apache.org/jira/secure/attachment/12933778/rotation.pdf].
  If we render the page, '', a vector graphic, is OCR'd as '$225'; however, 
if we extract inline images and run OCR on the extracted inline images, OCR is 
never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |

> OCR on PDFs should "just work" out of the box
> -
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on 
> inline images within PDFs.  The user has to 1) understand that these are 
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid 
> strategy between the 2 options.  Users should still be allowed to configure 
> as they wish, of course. 





[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM:
---

There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file|See for example PDFBOX-2475's 
[rotation.pdf|https://issues.apache.org/jira/secure/attachment/12933778/rotation.pdf].
  If we render the page, '', a vector graphic, is OCR'd as '$225'; however, 
if we extract inline images and run OCR on the extracted inline images, OCR is 
never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |


was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file|See for example [^rotation.pdf].  If we render the page, 
'', a vector graphic, is OCR'd as '$225'; however, if we extract inline 
images and run OCR on the extracted inline images, OCR is never triggered 
because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |



[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:08 PM:
---

There are several reasons why one might want to run OCR on a PDF page. It might 
be useful to catalog those here along with a diagnostic. I offer this as a 
first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file|See for example [^rotation.pdf].  If we render the page, 
'', a vector graphic, is OCR'd as '$225'; however, if we extract inline 
images and run OCR on the extracted inline images, OCR is never triggered 
because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |


was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page.  It 
might be useful to catalog those here along with a diagnostic.  I offer this as 
a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file||
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810071#comment-16810071
 ] 

Tim Allison commented on TIKA-2749:
---

Thank you, [~tilman].  Fixed.



[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 4:49 PM:
---

There are several reasons why one might want to run OCR on a PDF page.  It 
might be useful to catalog those here along with a diagnostic.  I offer this as 
a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual 
underlying image file||
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |




was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page.  It 
might be useful to catalog those here along with a diagnostic.  I offer this as 
a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying 
image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810059#comment-16810059
 ] 

Tilman Hausherr commented on TIKA-2749:
---

You probably mean "vector graphics".



[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809992#comment-16809992
 ] 

Ashish Tiwari commented on TIKA-2847:
-

Please find the code snippet below; the same code is used for all file types.
{code:java}
// Pass the file name along as a hint for type detection.
Metadata meta = new Metadata();
meta.set(Metadata.RESOURCE_NAME_KEY, attachName);

File infile = new File(attachName);
InputStream instream = FileUtils.openInputStream(infile);

// Detect the type, parse, and return the extracted text.
String attachString = new Tika().parseToString(instream, meta);
{code}

> OutOfMemoryError - tika1.19.1.jar
> -
>
> Key: TIKA-2847
> URL: https://issues.apache.org/jira/browse/TIKA-2847
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19.1
>Reporter: Ashish Tiwari
>Priority: Major
> Attachments: testCmplData.docx
>
>
> I am trying to parse a docx file and am getting the error below. The same issue 
> happens if I convert the attached docx file to a PDF. 
> The attached pdf file is 3.7 MB; however, I doubt it is related to the size of 
> the file, as I am able to parse a file above 30 MB without any issues.
> PS: This issue only happens if the JVM is configured with -Xmx512m; if I 
> change the value to 1024m, it works fine.
>  
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at 
> org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown
>  Source)
> at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at 
> org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
>  
>  





[jira] [Commented] (TIKA-2840) windows batch file not detected

2019-04-04 Thread chandra (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809949#comment-16809949
 ] 

chandra commented on TIKA-2840:
---

Hi Tim,

It looks like simple batch files that start with upper-case @ECHO OFF are not 
being identified, whereas @echo off at the beginning of the file is 
detected by the Tika Core API. Is it possible to add more commands to the magic 
file to harden the detection of batch files?

Thank you
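
For illustration, the case-insensitive behaviour being asked for can be sketched in plain Java. This is not how Tika's magic matching is implemented; the `BatchFileSniffer` class and `startsWithEchoOff` method are hypothetical names, and the sketch only checks the single `@echo off` marker mentioned above at offset 0.

```java
import java.nio.charset.StandardCharsets;

final class BatchFileSniffer {
    // The only marker mentioned in the report; further commands would be additions.
    private static final String MARKER = "@echo off";

    /** Returns true if the leading bytes spell "@echo off" in any letter case. */
    static boolean startsWithEchoOff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so arbitrary input decodes safely.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        // regionMatches with ignoreCase=true returns false if the text is shorter than the marker.
        return text.regionMatches(true, 0, MARKER, 0, MARKER.length());
    }

    public static void main(String[] args) {
        byte[] upper = "@ECHO OFF\r\ndir\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] lower = "@echo off\r\ndir\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithEchoOff(upper));
        System.out.println(startsWithEchoOff(lower));
    }
}
```

Both the upper-case and lower-case forms match here, which is the behaviour the report expects from detection.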

> windows batch file not detected
> ---
>
> Key: TIKA-2840
> URL: https://issues.apache.org/jira/browse/TIKA-2840
> Project: Tika
>  Issue Type: Bug
>  Components: core
>Affects Versions: 1.20
> Environment: Tika core does not detect a Windows batch file when it is 
> renamed to .txt; it results in MIME type text/plain
>Reporter: chandra
>Priority: Major
> Attachments: test.txt
>
>






[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809928#comment-16809928
 ] 

Tim Allison commented on TIKA-2847:
---

How are you loading the PDF?  Can you attach it/share it?  You may be able to 
twiddle with the way the file is loaded.

In practice, you'll probably need to bump 512m to 1g...your mileage will vary.
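
As a quick sanity check before changing anything else, it can help to confirm what heap limit the JVM actually ended up with. The sketch below uses only the standard `Runtime.maxMemory()` call; the `HeapCheck` class name is made up for illustration, and the reported figure can come out somewhat lower than the `-Xmx` value, depending on the JVM.

```java
final class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    /** Converts a byte count, such as the value of Runtime.maxMemory(), to whole MiB. */
    static long toMiB(long bytes) {
        return bytes / MIB;
    }

    public static void main(String[] args) {
        // maxMemory() is the most heap this JVM will attempt to use.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap available to this JVM: " + toMiB(maxHeap) + " MiB");
    }
}
```

Running it once with `-Xmx512m` and once with `-Xmx1g` shows whether the setting is really reaching the process that does the parsing.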



[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar

2019-04-04 Thread Ashish Tiwari (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809904#comment-16809904
 ] 

Ashish Tiwari commented on TIKA-2847:
-

Thanks, Tim. I will try setting the SAX docx parser, but what about PDFs? 
If I convert the same docx to PDF, I get an OOM as well.



[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison edited comment on TIKA-2749 at 4/4/19 1:44 PM:
---

There are several reasons why one might want to run OCR on a PDF page.  It 
might be useful to catalog those here along with a diagnostic.  I offer this as 
a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying 
image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves or if you build a custom model, it might be 
useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |




was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page.  It 
might be useful to catalog those here along with a diagnostic.  I offer this as 
a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying 
image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |





[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box

2019-04-04 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855
 ] 

Tim Allison commented on TIKA-2749:
---

There are several reasons why one might want to run OCR on a PDF page.  It 
might be useful to catalog those here along with a diagnostic.  I offer this as 
a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images 
cover x% of the page|might be a non-text containing picture or might be an 
image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying 
image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might 
be garbled (depending on quality of original scan); what are other signs of a 
scanned PDF???|As OCR improves, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else 
can we automatically identify this?| |


