[jira] [Closed] (TIKA-2347) Underlined text is not decorated as such when extracting from word documents
[ https://issues.apache.org/jira/browse/TIKA-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov closed TIKA-2347.
----------------------------------

> Underlined text is not decorated as such when extracting from word documents
> ----------------------------------------------------------------------------
>
> Key: TIKA-2347
> URL: https://issues.apache.org/jira/browse/TIKA-2347
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.0, 1.14
> Reporter: Stuart Hendren
> Assignee: Dave Meikle
> Priority: Major
> Fix For: 1.17
>
>
> When extracting from doc and docx bold and italic text decoration is
> extracted, however underlining is not. Can be demonstrated in WordParserTest
> or OOXMLParserTest (change to docx) with the following test case.
> {code:title=WordParserTest.java|borderStyle=solid}
> @Test
> public void testTextDecoration() throws Exception {
>     XMLResult result = getXML("testWORD_various.doc");
>     String xml = result.xml;
>     assertTrue(xml.contains("Bold"));
>     assertTrue(xml.contains("italic"));
>     assertTrue(xml.contains("underline"));
> }
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Resolved] (TIKA-2601) Invalid XHTML output for some WORD documents
[ https://issues.apache.org/jira/browse/TIKA-2601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov resolved TIKA-2601.
------------------------------------
Resolution: Duplicate

I am marking this as a duplicate of TIKA-2555, which I'm currently looking into.

> Invalid XHTML output for some WORD documents
> --------------------------------------------
>
> Key: TIKA-2601
> URL: https://issues.apache.org/jira/browse/TIKA-2601
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Environment: Linked is a sample document with its corresponding output.
> Reporter: Filip
> Priority: Major
> Attachments: Invalid-XML.doc, Test.doc, test.html
>
>
> In some WORD (.doc, .docx) documents the XHTML elements are not closed
> properly. This usually happens when there are link elements (<a>) as well as
> italic or bold elements (<i>, <b>).
>
> Fix should be done in
> [https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java]
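The "elements are not closed properly" symptom reported above can be checked mechanically. The following is a minimal sketch, not part of Tika, that uses only the JDK's built-in XML parser to test whether a string is well-formed XML. The class and method names (`WellFormedCheck`, `isWellFormed`) are hypothetical, and the input is assumed to be a complete XHTML document or fragment with a single root element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class.
final class WellFormedCheck {

    /** Returns true if the given string parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Overlapping or unclosed tags surface as a parse error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, `<p><b><u>fox</u></b></p>` is accepted, while the overlapping form `<p><b><u>fox</b></u></p>` is rejected.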
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810288#comment-16810288 ]

Tim Allison commented on TIKA-2847:
-----------------------------------

Last hope:
{noformat}
PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setMaxMainMemoryBytes(50);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, pdfParserConfig);
Parser p = new AutoDetectParser();
...
p.parse(inputstream, contentHandler, metadata, parseContext);
{noformat}

> OutOfMemoryError - tika1.19.1.jar
> ---------------------------------
>
> Key: TIKA-2847
> URL: https://issues.apache.org/jira/browse/TIKA-2847
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.19.1
> Reporter: Ashish Tiwari
> Priority: Major
> Attachments: testCmplData.docx
>
>
> I am trying to parse a docx file and getting below error. Same issue happens
> if i convert attached docx file to a pdf.
> Attached pdf file is of 3.7 mb, however i doubt it is related to size of the
> file, as i am able to parse a file above 30mb without any issues.
> PS : This issue only happens if we have JVM configured to -Xmx512m if i
> change value to 1024m it starts working fine.
>
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810256#comment-16810256 ]

Tim Allison commented on TIKA-2749:
-----------------------------------

[~rossj], this is very helpful...any recs on how to detect "not a normal scan"? I have run into individual pages that contain (1000s?) of images stitched together that clearly require rendering to be useful... Aside from a high number of images, how do we identify flipped and/or overlapping scans?

Thank you!

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>
> Key: TIKA-2749
> URL: https://issues.apache.org/jira/browse/TIKA-2749
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
>
> There are now two different ways (with various parameters) to trigger OCR on
> inline images within PDFs. The user has to 1) understand that these are
> available and then 2) elect to turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid
> strategy between the 2 options. Users should still be allowed to configure
> as they wish, of course.
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810172#comment-16810172 ]

Ross Johnson commented on TIKA-2749:
------------------------------------

OCRing the inlined images directly can be tricky, in my experience. Here are a couple classes of problematic scenarios that I've come across:
- Sometimes images have a funky transform applied, e.g. the actual page scan is mirrored but flipped to look right in the PDF.
- Some fancy PDF generators / scanners use multiple overlapping images, perhaps utilizing image masking, to reduce file size. E.g. there may be a background image with the color components along with a foreground grayscale or 1-bit-per-pixel image of the black content. I've also seen the foreground text split up into multiple images overlaying different parts of the background image.

In either situation, you could probably detect the "not a normal scan" condition and kick it into "render page" mode.
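The two scenarios in the comment above (a mirroring transform and overlapping images) can be expressed as simple geometric tests. The sketch below is one hypothetical way to do so and is not Tika or PDFBox code. It assumes each image's placement is available as the linear part of a transformation matrix plus an axis-aligned bounding box on the page, and that an upright, unmirrored image has a positive determinant. The names `ScanGeometry`, `PlacedImage`, `looksUnusual`, and the overlap threshold are all illustrative.

```java
import java.util.List;

// Hypothetical helper, not a Tika or PDFBox class.
final class ScanGeometry {

    /** Illustrative holder for one image placement on a page. */
    static final class PlacedImage {
        final double a, b, c, d;          // linear part of the placement matrix
        final double x, y, width, height; // axis-aligned bounding box on the page

        PlacedImage(double a, double b, double c, double d,
                    double x, double y, double width, double height) {
            this.a = a; this.b = b; this.c = c; this.d = d;
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /** A negative determinant means the placement includes a reflection. */
    static boolean isMirrored(PlacedImage img) {
        return img.a * img.d - img.b * img.c < 0;
    }

    /** Area shared by the bounding boxes of two images, or 0 if they are disjoint. */
    static double overlapArea(PlacedImage p, PlacedImage q) {
        double w = Math.min(p.x + p.width, q.x + q.width) - Math.max(p.x, q.x);
        double h = Math.min(p.y + p.height, q.y + q.height) - Math.max(p.y, q.y);
        return (w > 0 && h > 0) ? w * h : 0.0;
    }

    /**
     * True if any image is mirrored, or if any pair overlaps by more than
     * maxOverlapFraction of the smaller image's area.
     */
    static boolean looksUnusual(List<PlacedImage> images, double maxOverlapFraction) {
        for (int i = 0; i < images.size(); i++) {
            PlacedImage p = images.get(i);
            if (isMirrored(p)) {
                return true;
            }
            for (int j = i + 1; j < images.size(); j++) {
                PlacedImage q = images.get(j);
                double smaller = Math.min(p.width * p.height, q.width * q.height);
                if (smaller > 0 && overlapArea(p, q) / smaller > maxOverlapFraction) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A page for which `looksUnusual` returns true would be a candidate for the "render page" mode mentioned above. A 180-degree rotation has a positive determinant, so it is not flagged as mirrored by this test.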
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810165#comment-16810165 ]

Ashish Tiwari commented on TIKA-2847:
-------------------------------------

yes TikaInputStream.get(infile) gave me same error.
[jira] [Assigned] (TIKA-2555) Text with [underline] + [another format] in word document generates overlapping html tags.
[ https://issues.apache.org/jira/browse/TIKA-2555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Gribov reassigned TIKA-2555:
---------------------------------------
Assignee: Konstantin Gribov

> Text with [underline] + [another format] in word document generates overlapping html tags.
> ------------------------------------------------------------------------------------------
>
> Key: TIKA-2555
> URL: https://issues.apache.org/jira/browse/TIKA-2555
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.17
> Reporter: Serban Alexe
> Assignee: Konstantin Gribov
> Priority: Minor
> Attachments: Clipboard02.jpg
>
>
> I have a sample _.docx_ document which contains one single line of text**++.
> Making that text to be:
> * +underlined+
> ** AND at least one of the following two
> * _italic_
> * *bold*
> will cause the generated _.xhtml_ file to contain overlapping tags.
>
> _+Example+_:
> *+The quick brown fox jumps over the lazy dog.+*
> will result in
> The quick brown fox jumps over the lazy dog.
> which causes some browser (Firefox, Chrome) to give an error and not display
> the content of the file...
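Overlapping tags of the kind described in this issue typically arise when formatting tags are opened and closed independently of each other. One common way to avoid that, sketched below under stated assumptions, is to keep the open tags on a stack and, when the formatting changes, close tags from the top until every remaining open tag is still wanted, then reopen whatever is missing. This is an illustration only and is not the fix applied in Tika. `NestedTagEmitter` and `Style` are hypothetical names, and the text passed in is assumed to be already XML-escaped.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Hypothetical helper, not a Tika class.
final class NestedTagEmitter {

    enum Style {
        B("b"), I("i"), U("u");

        final String tag;

        Style(String tag) {
            this.tag = tag;
        }
    }

    private final StringBuilder out = new StringBuilder();
    private final Deque<Style> open = new ArrayDeque<>(); // innermost tag first

    /** Appends a run of text that should carry exactly the given styles. */
    void text(String text, Set<Style> wanted) {
        // Close innermost tags until no unwanted tag remains open.
        while (!wanted.containsAll(open)) {
            out.append("</").append(open.pop().tag).append('>');
        }
        // Open any wanted tag that is not currently open, in a fixed order.
        for (Style s : Style.values()) {
            if (wanted.contains(s) && !open.contains(s)) {
                out.append('<').append(s.tag).append('>');
                open.push(s);
            }
        }
        out.append(text);
    }

    /** Closes all remaining tags and returns the markup. */
    String finish() {
        while (!open.isEmpty()) {
            out.append("</").append(open.pop().tag).append('>');
        }
        return out.toString();
    }
}
```

Because tags are only ever closed from the top of the stack, the output is always properly nested. The cost is that a style may be closed and reopened when an enclosing style ends first.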
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ]

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:47 PM:
-----------------------------------------------------------

There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file...On how to extract lines: [http://stackoverflow.com/a/38933039/535646]|See for example PDFBOX-4275's [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |

was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file...how do we identify vector graphics?|See for example PDFBOX-4275's [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
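The diagnostics in the table above could be combined into a single per-page decision. The following is a rough sketch of that idea, not an implementation from Tika. The class name, the `Reason` values, and the thresholds `FEW_CHARS` and `MIN_IMAGE_COVERAGE` are hypothetical placeholders. Only the 10% figure for missing unicode mappings comes from the table, where it is itself marked as a question. Vector graphics and incorrect character mappings are left out because the table does not yet propose a concrete diagnostic for them.

```java
// Hypothetical helper, not a Tika class.
final class OcrDecision {

    enum Reason { IMAGE_ONLY, SCANNED, MISSING_UNICODE, NONE }

    // Placeholder thresholds chosen for illustration only.
    static final int FEW_CHARS = 10;
    static final double MIN_IMAGE_COVERAGE = 0.5;
    // From the table: "anything over 10%???"
    static final double MAX_UNMAPPED_RATIO = 0.10;

    /**
     * @param extractedChars number of characters extracted from the page
     * @param imageCoverage  fraction of the page area covered by inline images (0..1)
     * @param unmappedRatio  fraction of characters without a unicode mapping (0..1)
     * @return the first matching reason to run OCR, or NONE
     */
    static Reason diagnose(int extractedChars, double imageCoverage, double unmappedRatio) {
        boolean mostlyImages = imageCoverage >= MIN_IMAGE_COVERAGE;
        if (mostlyImages && extractedChars <= FEW_CHARS) {
            return Reason.IMAGE_ONLY;
        }
        if (mostlyImages) {
            // Text was extracted, but it may come from an earlier, lower-quality OCR pass.
            return Reason.SCANNED;
        }
        if (unmappedRatio > MAX_UNMAPPED_RATIO) {
            return Reason.MISSING_UNICODE;
        }
        return Reason.NONE;
    }
}
```

A caller might run OCR whenever the result is not `NONE`, while still letting users override the thresholds through configuration.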
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810150#comment-16810150 ]

Tim Allison commented on TIKA-2847:
-----------------------------------

sorry. I meant {{TikaInputStream.get(infile)}}...
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810146#comment-16810146 ]

Ashish Tiwari commented on TIKA-2847:
-------------------------------------

Thanks Tim setting "setUseSAXDocxExtractor" to true worked for the docx.

In case of PDF file, TikaInputStream.open API is not present, i did find and tried couple of get API's but same issue.

PS : i am using tika1.19.1.jar
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810131#comment-16810131 ]

Tim Allison commented on TIKA-2847:
-----------------------------------

Try opening the InputStream with {{TikaInputStream.open(infile)}}...that triggers PDFBox to load from the underlying file instead of the InputStream.
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ]

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:13 PM:
-----------------------------------------------------------

There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file...how do we identify vector graphics?|See for example PDFBOX-4275's [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |

was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file...how do we identify vector graphics?|See for example PDFBOX-2475's [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ]

Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM:
-----------------------------------------------------------

There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file...how do we identify vector graphics?|See for example PDFBOX-2475's [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |

was (Author: talli...@mitre.org):
There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.

||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example PDFBOX-2475's [rotation.pdf|https://issues.apache.org/jira/secure/attachment/12933778/rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan); what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 5:12 PM: --- There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example PDFBOX-2475's [rotation.pdf|https://issues.apache.org/jira/secure/attachment/12933778/rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
was (Author: talli...@mitre.org): There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
> OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 5:08 PM: --- There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file|See for example [^rotation.pdf]. If we render the page, '', a vector graphic, is OCR'd as '$225'; however, if we extract inline images and run OCR on the extracted inline images, OCR is never triggered because there are no inline images!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
was (Author: talli...@mitre.org): There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file||
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
> OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810071#comment-16810071 ] Tim Allison commented on TIKA-2749: --- Thank you, [~tilman]. Fixed. > OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 4:49 PM: --- There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|Vector graphics|With vector graphics, PDFs can draw an image...with no actual underlying image file||
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
was (Author: talli...@mitre.org): There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
> OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810059#comment-16810059 ] Tilman Hausherr commented on TIKA-2749: --- You probably mean "vector graphics". > OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809992#comment-16809992 ] Ashish Tiwari commented on TIKA-2847: - Please find the code snippet below; the same code is used for all file types.
{code:java}
Metadata meta = new Metadata();
// Pass the file name along as a hint for type detection
meta.set(Metadata.RESOURCE_NAME_KEY, attachName);
File infile = new File(attachName);
InputStream instream = FileUtils.openInputStream(infile);
// Parse with the Tika facade and get the extracted text back as a String
String attachString = new Tika().parseToString(instream, meta);
{code}
> OutOfMemoryError - tika1.19.1.jar > - > > Key: TIKA-2847 > URL: https://issues.apache.org/jira/browse/TIKA-2847 > Project: Tika > Issue Type: Bug >Affects Versions: 1.19.1 >Reporter: Ashish Tiwari >Priority: Major > Attachments: testCmplData.docx > > > I am trying to parse a docx file and getting below error. Same issue happens > if i convert attached docx file to a pdf. > Attached pdf file is of 3.7 mb, however i doubt it is related to size of the > file, as i am able to parse a file above 30mb without any issues. > PS : This issue only happens if we have JVM configured to -Xmx512m if i > change value to 1024m it starts working fine.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2840) windows batch file not detected
[ https://issues.apache.org/jira/browse/TIKA-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809949#comment-16809949 ] chandra commented on TIKA-2840: --- Hi Tim, it looks like simple batch files that start with upper-case @ECHO OFF are not being identified, whereas @echo off at the beginning of the file is detected by the Tika Core API. Is it possible to add more commands to the magic file to harden the detection of batch files? Thank you. > windows batch file not detected > --- > > Key: TIKA-2840 > URL: https://issues.apache.org/jira/browse/TIKA-2840 > Project: Tika > Issue Type: Bug > Components: core >Affects Versions: 1.20 > Environment: tika core not detecting windows batch file when its > renamed with .txt, it results in mime type text/plain >Reporter: chandra >Priority: Major > Attachments: test.txt > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
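The case sensitivity described in this comment can be illustrated with a standalone check. This is not how Tika detects types (Tika uses magic entries in its mime-types definition file); it is only a plain-Java sketch of a case-insensitive match on the start of a file. The class name and the list of leading commands are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper; not part of Tika. */
final class BatchFileSniffer {

    // Assumed set of commands that commonly open a batch file; extend with care,
    // because every extra pattern raises the risk of false positives on plain text.
    private static final String[] LEADING_COMMANDS = {"@echo off", "@echo on", "setlocal"};

    /** Returns true if the first bytes of a file look like the start of a Windows batch file. */
    static boolean looksLikeBatch(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so arbitrary input never fails to decode.
        String start = new String(head, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);   // makes "@ECHO OFF" and "@echo off" equivalent
        for (String command : LEADING_COMMANDS) {
            if (start.startsWith(command)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] upper = "@ECHO OFF\r\ndir\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeBatch(upper)); // prints true
    }
}
```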
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809928#comment-16809928 ] Tim Allison commented on TIKA-2847: --- How are you loading the PDF? Can you attach it/share it? You may be able to twiddle with the way the file is loaded. In practice, you'll probably need to bump 512m to 1g...your mileage will vary. > OutOfMemoryError - tika1.19.1.jar > - > > Key: TIKA-2847 > URL: https://issues.apache.org/jira/browse/TIKA-2847 > Project: Tika > Issue Type: Bug >Affects Versions: 1.19.1 >Reporter: Ashish Tiwari >Priority: Major > Attachments: testCmplData.docx > > > I am trying to parse a docx file and getting below error. Same issue happens > if i convert attached docx file to a pdf. > Attached pdf file is of 3.7 mb, however i doubt it is related to size of the > file, as i am able to parse a file above 30mb without any issues. > PS : This issue only happens if we have JVM configured to -Xmx512m if i > change value to 1024m it starts working fine.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2847) OutOfMemoryError - tika1.19.1.jar
[ https://issues.apache.org/jira/browse/TIKA-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809904#comment-16809904 ] Ashish Tiwari commented on TIKA-2847: - Thanks, Tim. I will check by setting the SAX docx parser, but what about PDFs? If I convert the same docx to a PDF, I get an OOM as well. > OutOfMemoryError - tika1.19.1.jar > - > > Key: TIKA-2847 > URL: https://issues.apache.org/jira/browse/TIKA-2847 > Project: Tika > Issue Type: Bug >Affects Versions: 1.19.1 >Reporter: Ashish Tiwari >Priority: Major > Attachments: testCmplData.docx > > > I am trying to parse a docx file and getting below error. Same issue happens > if i convert attached docx file to a pdf. > Attached pdf file is of 3.7 mb, however i doubt it is related to size of the > file, as i am able to parse a file above 30mb without any issues. > PS : This issue only happens if we have JVM configured to -Xmx512m if i > change value to 1024m it starts working fine.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
> at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
> at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
> at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
> at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
> at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3414)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
> at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
> at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
> at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:178)
> at org.apache.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:184)
> at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:138)
> at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
> at org.apache.poi.ooxml.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:228)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:116)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.Tika.parseToString(Tika.java:527)
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
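The stack trace above shows the main document part being loaded into an in-memory XMLBeans object model before any text is extracted, which is one plausible reason a small docx can exhaust a 512m heap. The sketch below is not Tika's SAX docx extractor; it only illustrates the streaming idea behind it using the JDK's own SAX parser: text runs are collected as the parser passes over them, and no document tree is built. The class name and sample XML are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of streaming text extraction from WordprocessingML; not Tika code. */
final class SaxTextSketch {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(InputStream documentXml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();

        StringBuilder out = new StringBuilder();
        parser.parse(documentXml, new DefaultHandler() {
            private boolean inText;   // true while inside a w:t (text run) element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    out.append('\n');   // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space=\"preserve\"> world</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.print(extractText(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Memory use here grows with the amount of extracted text rather than with the size of the XML markup, although collecting everything into one string still has a cost for very large documents.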
[jira] [Comment Edited] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison edited comment on TIKA-2749 at 4/4/19 1:44 PM: --- There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves or if you build a custom model, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
was (Author: talli...@mitre.org): There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
> OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
[ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809855#comment-16809855 ] Tim Allison commented on TIKA-2749: --- There are several reasons why one might want to run OCR on a PDF page. It might be useful to catalog those here along with a diagnostic. I offer this as a first draft for discussion, and I welcome modifications.
||Issue||Diagnostic||Notes||
|Image only PDF|zero or only a few characters are extracted; inline images cover x% of the page|might be a non-text containing picture or might be an image of text...who knows?|
|"Drawn" image|I've seen PDFs that "draw" an image...with no actual underlying image file|I can't remember the name for this...help!|
|Scanned PDF|inline images cover x% of the page; text is extracted but it might be garbled (depending on quality of original scan);what are other signs of a scanned PDF???|As OCR improves, it might be useful to run OCR again on the PDF|
|Missing unicode mappings|TIKA-2846's statistics|anything over 10%???|
|Incorrect unicode/character mappings|out of vocabulary (OOV) stats?; how else can we automatically identify this?| |
> OCR on PDFs should "just work" out of the box > - > > Key: TIKA-2749 > URL: https://issues.apache.org/jira/browse/TIKA-2749 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > > There are now two different ways (with various parameters) to trigger OCR on > inline images within PDFs. The user has to 1) understand that these are > available and then 2) elect to turn one of those on. > I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid > strategy between the 2 options. Users should still be allowed to configure > as they wish, of course. -- This message was sent by Atlassian JIRA (v7.6.3#76005)