[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2153:
---------------------------------
    Attachment: Jinwoo_032910.pptx

> TaggedIOException on a valid Powerpoint file
> --------------------------------------------
>
> Key: TIKA-2153
> URL: https://issues.apache.org/jira/browse/TIKA-2153
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Attachments: Jinwoo_032910.pptx, tika_2153_unzipping.png
>
> On the following Powerpoint file, which opens fine with Powerpoint:
> https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
> the Tika parser throws the following error:
> org.apache.tika.io.TaggedIOException: invalid stored block lengths
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
>     at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
>     at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
>     at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
>     at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
>     at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at java.io.FilterInputStream.read(FilterInputStream.java:107)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
>     ... 13 more
> Caused by: java.util.zip.ZipException: invalid stored block lengths
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>     at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>     ... 19 more
> Could be similar to #2130.
> EDIT: similar exception on the attached Jinwoo_032910.pptx

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
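The trace in this report is three exceptions deep: both TaggedIOException wrappers repeat the message of the java.util.zip.ZipException at the bottom, which is the actual failure. A minimal sketch of walking the cause chain to report that innermost exception follows. The names RootCause and of are illustrative and not part of Tika, and plain IOException stands in for TaggedIOException so that the example runs without Tika on the classpath.

```java
import java.io.IOException;
import java.util.zip.ZipException;

public class RootCause {
    /** Returns the innermost cause of t, or t itself if it has no cause. */
    public static Throwable of(Throwable t) {
        Throwable current = t;
        // Follow getCause() until the chain ends.
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the chain in the report: wrapper -> wrapper -> ZipException.
        ZipException zip = new ZipException("invalid stored block lengths");
        IOException inner = new IOException(zip.getMessage(), zip);
        IOException outer = new IOException(inner.getMessage(), inner);

        Throwable root = of(outer);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

Logging the root cause next to the outermost exception makes it easier to tell a damaged or unusual embedded ZIP entry apart from an I/O problem in the caller.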
[jira] [Updated] (TIKA-2153) TaggedIOException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2153:
---------------------------------
    Description:
On the following Powerpoint file, which opens fine with Powerpoint:
https://dl.dropboxusercontent.com/u/92341073/Data%20Club%202%20March%2028.pptx
the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:82)
    at org.apache.tika.mime.MimeTypes.readMagicHeader(MimeTypes.java:258)
    at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:471)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedFile(AbstractOOXMLExtractor.java:298)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:199)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
Caused by: org.apache.tika.io.TaggedIOException: invalid stored block lengths
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:78)
    ... 13 more
Caused by: java.util.zip.ZipException: invalid stored block lengths
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
    at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:213)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 19 more
Could be similar to #2130.
EDIT: similar exception on the attached Jinwoo_032910.pptx
[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down
[ https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687795#comment-15687795 ] Ashish Basran commented on TIKA-2180: - I am calling Tika (http://localhost:8080/tika) using HttpClient (.NET) from the same tika-server box. I used Task.Run to create requests for all 22 documents. It has 4 CPUs. > Multiple requests on Tika to extract text slows down > > > Key: TIKA-2180 > URL: https://issues.apache.org/jira/browse/TIKA-2180 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.13, 1.14 > Environment: Windows OS, Open JDK, 4 core 32 GB RAM >Reporter: Ashish Basran > > I observed that if I send multiple requests to Tika (eg. > http://localhost:8080/tika) with around 5MB files, Tika is very slow in > completing the action. I tried with ~20 random files, it took 170 seconds to > process all the files in sequence. If I pass all files in parallel, it took > around 780 seconds to process same set of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down
[ https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687776#comment-15687776 ] Tim Allison commented on TIKA-2180: --- Thank you for this. That isn't by design... that I'm aware of. How many threads are you running and how many cpus are on the tika-server box? > Multiple requests on Tika to extract text slows down > > > Key: TIKA-2180 > URL: https://issues.apache.org/jira/browse/TIKA-2180 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.13, 1.14 > Environment: Windows OS, Open JDK, 4 core 32 GB RAM >Reporter: Ashish Basran > > I observed that if I send multiple requests to Tika (eg. > http://localhost:8080/tika) with around 5MB files, Tika is very slow in > completing the action. I tried with ~20 random files, it took 170 seconds to > process all the files in sequence. If I pass all files in parallel, it took > around 780 seconds to process same set of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down
[ https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687660#comment-15687660 ]

Ashish Basran commented on TIKA-2180:
-------------------------------------

I tested with Word and Excel documents, and I observed this in 1.13 too. I passed 22 documents to the Tika server for processing: two 5 MB documents, and the rest smaller than 1 MB each. The following are the processing times in seconds (totals in the last row) when the documents are processed in parallel and when they are processed one after another. I am not sure if this behavior is by design, but the difference in processing time is huge.

Sequence        Parallel
 77.4790976      22.6876726
  0.9335904      17.9678267
  0.8854624      26.0525849
  5.0577852      15.5999804
  0.8060567      26.6077107
  0.7831427      17.7433509
  0.8196296      26.7486071
  0.7667276      26.7675274
  0.7648827      26.8234494
  0.7632169      22.8773994
  0.8247712      16.9681799
  0.9260035      26.9742814
 79.6387803      21.0023846
  0.7795755      14.0186599
  0.7646085      27.0261048
  0.8339278      26.0542291
  0.8345049      15.0697296
  0.8402716      24.0850932
  0.7785933      20.1221993
  0.9135003      13.1501129
  0.9229104     170.2784636
  0.8859913     178.3212539
178.0030304     782.9468017

> Multiple requests on Tika to extract text slows down
> ----------------------------------------------------
>
> Key: TIKA-2180
> URL: https://issues.apache.org/jira/browse/TIKA-2180
> Project: Tika
> Issue Type: Bug
> Components: server
> Affects Versions: 1.13, 1.14
> Environment: Windows OS, Open JDK, 4 core 32 GB RAM
> Reporter: Ashish Basran
>
> I observed that if I send multiple requests to Tika (eg.
> http://localhost:8080/tika) with around 5MB files, Tika is very slow in
> completing the action. I tried with ~20 random files, it took 170 seconds to
> process all the files in sequence. If I pass all files in parallel, it took
> around 780 seconds to process same set of files.
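One comparison these numbers invite is between sending all 22 requests at once and capping the number in flight at the CPU count (4 on this box). The sketch below is not a diagnosis of TIKA-2180; it is only a hypothetical Java client harness (the reporter used .NET) for running that comparison. BoundedRunner, runAll, and the extract function are illustrative names, and extract stands in for whatever sends one file to the server and returns its text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class BoundedRunner {
    /**
     * Applies extract to every input using at most maxInFlight worker threads
     * and returns the results in input order.
     */
    public static <I, O> List<O> runAll(List<I> inputs, int maxInFlight,
                                        Function<I, O> extract) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxInFlight);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                futures.add(pool.submit(() -> extract.apply(input)));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = List.of("a.docx", "b.xlsx", "c.docx");
        // Placeholder for a real call to the server.
        Function<String, String> fakeExtract = name -> "text of " + name;

        long start = System.nanoTime();
        List<String> out = runAll(docs, 4, fakeExtract);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(out.size() + " documents in " + elapsedMs + " ms");
    }
}
```

Running it once with maxInFlight equal to the number of documents and once with 4 would show whether the total time tracks the number of concurrent requests.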
[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Seva Alekseyev updated TIKA-2161:
---------------------------------
    Description:
On the attached Powerpoint file, which opens fine with Powerpoint, the Tika parser throws the following error:
org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream
    at org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at java.nio.file.Files.copy(Files.java:2908)
    at java.nio.file.Files.copy(Files.java:3027)
    at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587)
    at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377)
    at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116)
    at org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
Caused by: java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
    at org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    ... 22 more
EDIT: Tika 1.14 throws EOFException
[jira] [Updated] (TIKA-2161) EOFException on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seva Alekseyev updated TIKA-2161: - Summary: EOFException on a valid Powerpoint file (was: TaggedIOException from EOFException on a valid Powerpoint file) > EOFException on a valid Powerpoint file > --- > > Key: TIKA-2161 > URL: https://issues.apache.org/jira/browse/TIKA-2161 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Attachments: Erik-LymeChipBranchSeminar.ppt > > > On the attached Powerpoint file, which opens fine with Powerpoint, the Tika > parser throws the following error: > org.apache.tika.io.TaggedIOException: Unexpected end of ZLIB input stream > at > org.apache.tika.io.TaggedInputStream.handleIOException(TaggedInputStream.java:133) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:103) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at java.nio.file.Files.copy(Files.java:2908) > at java.nio.file.Files.copy(Files.java:3027) > at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:587) > at org.apache.tika.io.TikaInputStream.getFile(TikaInputStream.java:615) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:377) > at > org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:443) > at > org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112) > at > org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72) > at > 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102) > at > org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:140) > at > org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:116) > at > org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedResources(HSLFExtractor.java:368) > at > org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:138) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > Caused by: java.io.EOFException: Unexpected end of ZLIB input stream > at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) > at > org.apache.poi.util.BoundedInputStream.read(BoundedInputStream.java:121) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99) > ... 22 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2182) Investigate rare IllegalArgumentException in macro extraction
[ https://issues.apache.org/jira/browse/TIKA-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2182: -- Description: poi bug [60279|https://bz.apache.org/bugzilla/show_bug.cgi?id=60279] (was: poi bug 60279) > Investigate rare IllegalArgumentException in macro extraction > - > > Key: TIKA-2182 > URL: https://issues.apache.org/jira/browse/TIKA-2182 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > > poi bug [60279|https://bz.apache.org/bugzilla/show_bug.cgi?id=60279] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2118) Misleading exception on a password protected XLS
[ https://issues.apache.org/jira/browse/TIKA-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2118. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > Misleading exception on a password protected XLS > > > Key: TIKA-2118 > URL: https://issues.apache.org/jira/browse/TIKA-2118 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Fix For: 2.0, 1.15 > > Attachments: BUSJDRVGZF7FKDA6L4PNTNATHQCLRW4O.xls, Copy of I-LHD > 3E.xls > > > When parsing the attached password protected Excel file "Copy of I-LHD > 3E.xls", Tika emits an IllegalArgumentException with a message "Unsupported > codepage requested". The inability to parse has nothing to do with codepage, > that error is misleading. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2104) Upgrade to a version of POI that fixes common bugs in macro extraction, when available
[ https://issues.apache.org/jira/browse/TIKA-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687415#comment-15687415 ] Tim Allison commented on TIKA-2104: --- moved poi bug 60279 to separate issue: TIKA-2182 > Upgrade to a version of POI that fixes common bugs in macro extraction, when > available > -- > > Key: TIKA-2104 > URL: https://issues.apache.org/jira/browse/TIKA-2104 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison > Fix For: 2.0, 1.15 > > Attachments: newExceptionsInBByMimeTypeByStackTrace.xlsx, > newExceptionsInBDetails.xlsx > > > On TIKA-2069, we found two bugs in POI that prevented the extraction of > macros from MSOffice files. Let's use this issue to track fixes in POI. > Current known bugs are POI: > -60162- duplicate of -59302- > -60158- > -59830- > -59858- > -60273- > After we release Tika 1.14, let's remove the catch blocks in Tika and rerun > against our regression corpus to help identify the most common bugs and find > new ones. > As always, patches are welcome on POI! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-2104) Upgrade to a version of POI that fixes common bugs in macro extraction, when available
[ https://issues.apache.org/jira/browse/TIKA-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2104: -- Description: On TIKA-2069, we found two bugs in POI that prevented the extraction of macros from MSOffice files. Let's use this issue to track fixes in POI. Current known bugs are POI: -60162- duplicate of -59302- -60158- -59830- -59858- -60273- After we release Tika 1.14, let's remove the catch blocks in Tika and rerun against our regression corpus to help identify the most common bugs and find new ones. As always, patches are welcome on POI! was: On TIKA-2069, we found two bugs in POI that prevented the extraction of macros from MSOffice files. Let's use this issue to track fixes in POI. Current known bugs are POI: -60162- duplicate of -59302- -60158- -59830- -59858- -60273- 60279 After we release Tika 1.14, let's remove the catch blocks in Tika and rerun against our regression corpus to help identify the most common bugs and find new ones. As always, patches are welcome on POI! > Upgrade to a version of POI that fixes common bugs in macro extraction, when > available > -- > > Key: TIKA-2104 > URL: https://issues.apache.org/jira/browse/TIKA-2104 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison > Attachments: newExceptionsInBByMimeTypeByStackTrace.xlsx, > newExceptionsInBDetails.xlsx > > > On TIKA-2069, we found two bugs in POI that prevented the extraction of > macros from MSOffice files. Let's use this issue to track fixes in POI. > Current known bugs are POI: > -60162- duplicate of -59302- > -60158- > -59830- > -59858- > -60273- > After we release Tika 1.14, let's remove the catch blocks in Tika and rerun > against our regression corpus to help identify the most common bugs and find > new ones. > As always, patches are welcome on POI! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-2182) Investigate rare IllegalArgumentException in macro extraction
Tim Allison created TIKA-2182: - Summary: Investigate rare IllegalArgumentException in macro extraction Key: TIKA-2182 URL: https://issues.apache.org/jira/browse/TIKA-2182 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial poi bug 60279 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2104) Upgrade to a version of POI that fixes common bugs in macro extraction, when available
[ https://issues.apache.org/jira/browse/TIKA-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2104. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > Upgrade to a version of POI that fixes common bugs in macro extraction, when > available > -- > > Key: TIKA-2104 > URL: https://issues.apache.org/jira/browse/TIKA-2104 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison > Fix For: 2.0, 1.15 > > Attachments: newExceptionsInBByMimeTypeByStackTrace.xlsx, > newExceptionsInBDetails.xlsx > > > On TIKA-2069, we found two bugs in POI that prevented the extraction of > macros from MSOffice files. Let's use this issue to track fixes in POI. > Current known bugs are POI: > -60162- duplicate of -59302- > -60158- > -59830- > -59858- > -60273- > After we release Tika 1.14, let's remove the catch blocks in Tika and rerun > against our regression corpus to help identify the most common bugs and find > new ones. > As always, patches are welcome on POI! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2158) NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2158.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 1.15
                   2.0

> NullPointerException on a valid Word file
> -----------------------------------------
>
> Key: TIKA-2158
> URL: https://issues.apache.org/jira/browse/TIKA-2158
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
> Attachments: RTOP_Template01112015063856.docx
>
> On the attached Word file, which opens fine with Word, the Tika parser throws
> the following error:
> java.lang.NullPointerException
>     at org.apache.poi.xwpf.usermodel.XWPFSDTContentCell.<init>(XWPFSDTContentCell.java:49)
>     at org.apache.poi.xwpf.usermodel.XWPFSDTCell.<init>(XWPFSDTCell.java:35)
>     at org.apache.poi.xwpf.usermodel.XWPFTableRow.getTableICells(XWPFTableRow.java:147)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractTable(XWPFWordExtractorDecorator.java:359)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:111)
>     at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
[jira] [Resolved] (TIKA-2160) POIXMLException from NullPointerException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2160.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 1.15
                   2.0

> POIXMLException from NullPointerException on a valid Word file
> --------------------------------------------------------------
>
> Key: TIKA-2160
> URL: https://issues.apache.org/jira/browse/TIKA-2160
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Environment: Windows 7 x64, JVM 1.8.0_101
> Reporter: Seva Alekseyev
> Fix For: 2.0, 1.15
> Attachments: test_16022016081053.docx
>
> On the attached Word file, which opens fine with Word (albeit with no text),
> the Tika parser throws the following error:
> org.apache.poi.POIXMLException: java.lang.NullPointerException
>     at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:130)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:208)
>     at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:160)
>     at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:124)
>     at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>     at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:237)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
> Caused by: java.lang.NullPointerException
>     at org.apache.poi.xwpf.usermodel.AbstractXWPFSDT.<init>(AbstractXWPFSDT.java:37)
>     at org.apache.poi.xwpf.usermodel.XWPFSDT.<init>(XWPFSDT.java:38)
>     at org.apache.poi.xwpf.usermodel.XWPFFooter.onDocumentRead(XWPFFooter.java:124)
>     ... 9 more
[jira] [Resolved] (TIKA-2142) ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/TIKA-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2142. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > ArrayIndexOutOfBoundsException > -- > > Key: TIKA-2142 > URL: https://issues.apache.org/jira/browse/TIKA-2142 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Fix For: 2.0, 1.15 > > Attachments: HPV8dHinge Confocal Results.ppt > > > On the attached PowerPoint presentation, which opens fine with PowerPoint, > the Tika parser throws the following error: > java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at > org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.readPictures(HSLFSlideShowImpl.java:438) > at > org.apache.poi.hslf.usermodel.HSLFSlideShowImpl.getPictureData(HSLFSlideShowImpl.java:772) > at > org.apache.poi.hslf.usermodel.HSLFSlideShow.getPictureData(HSLFSlideShow.java:547) > at > org.apache.tika.parser.microsoft.HSLFExtractor.handleSlideEmbeddedPictures(HSLFExtractor.java:305) > at > org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:193) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2145) InvalidFormatException on a valid Word file
[ https://issues.apache.org/jira/browse/TIKA-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2145. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > InvalidFormatException on a valid Word file > --- > > Key: TIKA-2145 > URL: https://issues.apache.org/jira/browse/TIKA-2145 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Fix For: 2.0, 1.15 > > Attachments: safety_analysis_report_FINAL2.docx > > > On the attached Word file, which opens fine with Word, the Tika parser throws > the following exception: > org.apache.tika.exception.TikaException: Error creating OOXML extractor > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > Caused by: java.lang.IllegalArgumentException: Date for created could not be > parsed: 2015-07-27 > at > org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:408) > at > org.apache.poi.openxml4j.opc.internal.unmarshallers.PackagePropertiesUnmarshaller.unmarshall(PackagePropertiesUnmarshaller.java:124) > at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:743) > at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:230) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:69) > ... 
3 more > Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Date > 2015-07-27 not well formatted, expected format in: yyyy-MM-dd'T'HH:mm:ssz, > yyyy-MM-dd'T'HH:mm:ss.SSSz, yyyy-MM-dd'T'HH:mm:ss'Z', > yyyy-MM-dd'T'HH:mm:ss.SS'Z' > at > org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setDateValue(PackagePropertiesPart.java:615) > at > org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart.setCreatedProperty(PackagePropertiesPart.java:406) > ... 7 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
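The trace above shows a date-only value ("2015-07-27") being rejected because only date-time patterns are accepted. As an illustration of a more tolerant approach, and not the change that was made in POI or Tika, the sketch below tries a full ISO-8601 date-time first and falls back to a plain date interpreted as midnight UTC. The class name LenientCoreDate and the choice of UTC for the fallback are assumptions made for this example.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of POI or Tika.
public class LenientCoreDate {

    /**
     * Parses a core-properties style date. Accepts a full ISO-8601
     * date-time with offset (e.g. 2015-07-27T10:15:30Z) or a bare
     * date (e.g. 2015-07-27). Returns empty instead of throwing.
     */
    static Optional<Instant> parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        String v = value.trim();
        try {
            // First choice: date-time with an explicit offset or 'Z'.
            return Optional.of(OffsetDateTime.parse(v).toInstant());
        } catch (DateTimeParseException ignored) {
            // Fall through to the date-only attempt.
        }
        try {
            // Assumption: a bare date means the start of that day in UTC.
            return Optional.of(LocalDate.parse(v).atStartOfDay(ZoneOffset.UTC).toInstant());
        } catch (DateTimeParseException ignored) {
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional lets a caller drop an unreadable "created" property while still extracting the rest of the document, rather than failing the whole parse.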
[jira] [Resolved] (TIKA-2132) NullPointerException on a valid Excel file
[ https://issues.apache.org/jira/browse/TIKA-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2132. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > NullPointerException on a valid Excel file > -- > > Key: TIKA-2132 > URL: https://issues.apache.org/jira/browse/TIKA-2132 > Project: Tika > Issue Type: Bug > Components: parser > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Fix For: 2.0, 1.15 > > Attachments: 2a-Executive_Summary_50_Work.xlsm > > > The attached XLSM file, which opens fine in Excel, causes the following error > in the Tika parser: > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.ooxml.OOXMLParser@a5bd950 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at gov.nih.niaid.fscanner.Extract.ExtractContents(Extract.java:62) > at gov.nih.niaid.temp.Main.main(Main.java:60) > Caused by: java.lang.NullPointerException > at > org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.next(XSSFReader.java:254) > at > org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:124) > at > org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:109) > at > org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) > at > org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2125) XmlValueOutOfRangeException on a good Word document
[ https://issues.apache.org/jira/browse/TIKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2125. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > XmlValueOutOfRangeException on a good Word document > --- > > Key: TIKA-2125 > URL: https://issues.apache.org/jira/browse/TIKA-2125 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Fix For: 2.0, 1.15 > > Attachments: LMVR Mentoring Activities brm.docx > > > On the attached Word document, which opens fine with Word, the Tika parser > throws a TikaException caused by > org.apache.xmlbeans.impl.values.XmlValueOutOfRangeException with message > "string value 'odd' is not a valid enumeration value for ST_HdrFtr in > namespace http://schemas.openxmlformats.org/wordprocessingml/2006/main" -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2129) IllegalArgumentException/"Unknown shape type" on a valid Powerpoint file
[ https://issues.apache.org/jira/browse/TIKA-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2129. --- Resolution: Fixed Fix Version/s: 1.15 2.0 Thank you, [~kiwiwings]! > IllegalArgumentException/"Unknown shape type" on a valid Powerpoint file > > > Key: TIKA-2129 > URL: https://issues.apache.org/jira/browse/TIKA-2129 > Project: Tika > Issue Type: Bug > Components: parser > Environment: Windows 7 x64, JVM 1.8.0_101 >Reporter: Seva Alekseyev > Fix For: 2.0, 1.15 > > Attachments: 10.1056-NEJMra020100Figure01.ppt > > > The attached valid Powerpoint file, when parsed with Tika, throws the > following error: > java.lang.IllegalArgumentException: Unknown shape type: 4095 > at org.apache.poi.sl.usermodel.ShapeType.forId(ShapeType.java:314) > at > org.apache.poi.hslf.usermodel.HSLFShapeFactory.createSimpleShape(HSLFShapeFactory.java:98) > at > org.apache.poi.hslf.usermodel.HSLFShapeFactory.createShape(HSLFShapeFactory.java:62) > at org.apache.poi.hslf.usermodel.HSLFSheet.getShapes(HSLFSheet.java:173) > at > org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:93) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2115) OOM caused by corrupt embedded OLE object
[ https://issues.apache.org/jira/browse/TIKA-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2115. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > OOM caused by corrupt embedded OLE object > - > > Key: TIKA-2115 > URL: https://issues.apache.org/jira/browse/TIKA-2115 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.13 > Environment: Generic test with tika-app-1.13.jar on test document >Reporter: Thomas Galla > Fix For: 2.0, 1.15 > > Attachments: TikaTestcase.pptx > > > There is a size field when parsing an embedded OLE object in a Powerpoint > presentation that says there are 2GB of data that needs to be read and the > code simply tries to allocate a buffer for that, which results in OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
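The problem described above, trusting a size field from the file and allocating a buffer of that size up front, can be guarded against by validating the declared length and reading in small chunks. The following is a minimal sketch of that idea only; it is not the fix that went into POI or Tika. The class BoundedRead, the method readDeclared, and the 10 MB cap are hypothetical names and values chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of POI or Tika.
public class BoundedRead {

    // Illustrative cap; a real parser would pick its own limit.
    static final int MAX_RECORD_LENGTH = 10_000_000;

    /**
     * Reads exactly declaredLength bytes from the stream, but never
     * allocates declaredLength bytes up front. A corrupt length field
     * is rejected before any large allocation happens.
     */
    static byte[] readDeclared(InputStream in, long declaredLength, int maxLength)
            throws IOException {
        if (declaredLength < 0 || declaredLength > maxLength) {
            throw new IOException("Declared length " + declaredLength
                    + " is outside the allowed range 0.." + maxLength);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long remaining = declaredLength;
        while (remaining > 0) {
            int n = in.read(chunk, 0, (int) Math.min(chunk.length, remaining));
            if (n < 0) {
                // The stream is shorter than the size field claimed.
                throw new IOException("Stream ended with " + remaining
                        + " bytes still expected");
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

With this shape, a size field claiming 2 GB fails fast with an IOException, and a field that is merely larger than the remaining stream fails once the stream runs out, in both cases without an OutOfMemoryError.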
[jira] [Updated] (TIKA-2116) Upgrade to POI 3.16-beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-2116: -- Fix Version/s: 1.15 2.0 > Upgrade to POI 3.16-beta1 when available > > > Key: TIKA-2116 > URL: https://issues.apache.org/jira/browse/TIKA-2116 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > Fix For: 2.0, 1.15 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-2116) Upgrade to POI 3.16-beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-2116. --- Resolution: Fixed > Upgrade to POI 3.16-beta1 when available > > > Key: TIKA-2116 > URL: https://issues.apache.org/jira/browse/TIKA-2116 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1658) unable to parse microsoft visio files with tika
[ https://issues.apache.org/jira/browse/TIKA-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1658. --- Resolution: Fixed Fix Version/s: 1.15 2.0 > unable to parse microsoft visio files with tika > --- > > Key: TIKA-1658 > URL: https://issues.apache.org/jira/browse/TIKA-1658 > Project: Tika > Issue Type: Bug > Components: metadata >Affects Versions: 0.9, 1.1, 1.3, 1.4, 1.5, 1.8 > Environment: ubuntu 14.04 and windows 7 >Reporter: senthil > Fix For: 2.0, 1.15 > > Attachments: Connection Types.vsd > > > hi > Parsing a Microsoft Visio file throws an exception. > org.apache.tika.exception.TikaException: Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@13d28e3 > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > > Caused by: java.lang.RuntimeException: TODO > at > org.apache.poi.hdgf.pointers.PointerFactory.createPointer(PointerFactory.java:45) > at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:99) > application/vnd.visio > at > org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200) > at > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) > ... 4 more > Please help with a resolution > regards > sentil -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2116) Upgrade to POI 3.16-beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687264#comment-15687264 ] Hudson commented on TIKA-2116: -- SUCCESS: Integrated in Jenkins build tika-2.x #175 (See [https://builds.apache.org/job/tika-2.x/175/]) TIKA-2116 upgrade to POI 3.16-beta1 (tallison: rev 8c01e4d8e7b37bdcb1a1aa1bf99675dfb01d49e4) * (edit) CHANGES.txt * (edit) tika-parser-modules/pom.xml > Upgrade to POI 3.16-beta1 when available > > > Key: TIKA-2116 > URL: https://issues.apache.org/jira/browse/TIKA-2116 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2143) POI deprecated method used in TIKA 1.13
[ https://issues.apache.org/jira/browse/TIKA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687243#comment-15687243 ] sbathrutheen commented on TIKA-2143: We have requested our client for opts details. We will update you as soon as we have the details. > POI deprecated method used in TIKA 1.13 > > > Key: TIKA-2143 > URL: https://issues.apache.org/jira/browse/TIKA-2143 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9, 1.13 > Environment: Windows java application >Reporter: sbathrutheen > Fix For: 1.13 > > > We see that Tika throws a long list of errors when extracting ppt files. When we > tested with the standalone Tika application (1.13), we could not reproduce the issue. > We took a look at the POI source code and observed that the class "HSLFSlideShow" > defines the deprecated method below: > * > /** > - * Get the lookup from slide numbers to their offsets inside > - * _ptrData, used when adding or moving slides. > - * > - * @deprecated since POI 3.11, not supported anymore > - */ > - @Deprecated > - public Hashtable<Integer,Integer> getSlideOffsetDataLocationsLookup() { > - throw new > UnsupportedOperationException("PersistPtrHolder.getSlideOffsetDataLocationsLookup() > is not supported since 3.12-Beta1"); > - } > * > We think the Tika library may still be calling this deprecated method, causing this > runtime exception: > Caused by: org.apache.tika.exception.TikaException: Unexpected > RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@204c3b78 > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > com.searchtechnologies.aspire.docprocessing.extracttext.ExtractTextStage.process(ExtractTextStage.java:140) > ... 14 more > Caused by: java.lang.UnsupportedOperationException > at java.util.AbstractMap$SimpleImmutableEntry.setValue(Unknown Source) > at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:293) > at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:273) > at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:188) > at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61) > at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149) > at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117) > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281) > ... 17 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
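Note that the innermost frame of the trace above is java.util.AbstractMap$SimpleImmutableEntry.setValue, not getSlideOffsetDataLocationsLookup(). In the JDK, SimpleImmutableEntry.setValue always throws UnsupportedOperationException, with no message, which matches the bare exception shown. The small sketch below only demonstrates that JDK behaviour; the class name ImmutableEntryDemo is made up for the example, and it says nothing about why HSLFSlideShow.read ended up holding an immutable entry in the reporter's environment.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical demo class, not part of POI or Tika.
public class ImmutableEntryDemo {

    /**
     * Returns true if calling setValue on a SimpleImmutableEntry is
     * rejected with UnsupportedOperationException, as documented.
     */
    static boolean setValueIsRejected() {
        Map.Entry<Integer, Integer> entry = new AbstractMap.SimpleImmutableEntry<>(1, 100);
        try {
            entry.setValue(200);
            return false;
        } catch (UnsupportedOperationException expected) {
            // The value is left unchanged.
            return entry.getValue() == 100;
        }
    }
}
```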
[jira] [Commented] (TIKA-2116) Upgrade to POI 3.16-beta1 when available
[ https://issues.apache.org/jira/browse/TIKA-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687159#comment-15687159 ] Hudson commented on TIKA-2116: -- FAILURE: Integrated in Jenkins build tika-2.x-windows #76 (See [https://builds.apache.org/job/tika-2.x-windows/76/]) TIKA-2116 upgrade to POI 3.16-beta1 (tallison: rev 8c01e4d8e7b37bdcb1a1aa1bf99675dfb01d49e4) * (edit) tika-parser-modules/pom.xml * (edit) CHANGES.txt > Upgrade to POI 3.16-beta1 when available > > > Key: TIKA-2116 > URL: https://issues.apache.org/jira/browse/TIKA-2116 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
tika-2.x-windows - Build # 76 - Still Failing
The Apache Jenkins build system has built tika-2.x-windows (build #76) Status: Still Failing Check console output at https://builds.apache.org/job/tika-2.x-windows/76/ to view the results.
[jira] [Created] (TIKA-2181) Upgrade to POI 3.16-beta2 when available
Tim Allison created TIKA-2181: - Summary: Upgrade to POI 3.16-beta2 when available Key: TIKA-2181 URL: https://issues.apache.org/jira/browse/TIKA-2181 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-2143) POI deprecated method used in TIKA 1.13
[ https://issues.apache.org/jira/browse/TIKA-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15686939#comment-15686939 ] Tim Allison commented on TIKA-2143: --- Any further info on this issue, [~sbathrutheen]?
[jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down
[ https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15686698#comment-15686698 ] Tim Allison commented on TIKA-2180: --- Is this new in 1.14? Are the files of a particular format? I regret that I haven't done any performance tests on tika-server. > Multiple requests on Tika to extract text slows down > > > Key: TIKA-2180 > URL: https://issues.apache.org/jira/browse/TIKA-2180 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.13, 1.14 > Environment: Windows OS, Open JDK, 4 core 32 GB RAM >Reporter: Ashish Basran > > I observed that if I send multiple requests to Tika (eg. > http://localhost:8080/tika) with around 5MB files, Tika is very slow in > completing the action. I tried with ~20 random files, it took 170 seconds to > process all the files in sequence. If I pass all files in parallel, it took > around 780 seconds to process same set of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
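One way to make the sequential-versus-parallel comparison in this report repeatable is a small harness that times the same list of tasks both ways. The sketch below is such a harness under stated assumptions: the class SeqVsParallelTimer and its methods are hypothetical, and each task is a placeholder Callable; in a real measurement each task would send one file to the server endpoint mentioned above (http://localhost:8080/tika) and wait for the response.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical timing harness, not part of Tika.
public class SeqVsParallelTimer {

    /** Runs the tasks one after another; returns elapsed milliseconds. */
    static long timeSequential(List<Callable<Void>> tasks) throws Exception {
        long start = System.nanoTime();
        for (Callable<Void> task : tasks) {
            task.call();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Runs the tasks on a fixed pool; returns elapsed milliseconds. */
    static long timeParallel(List<Callable<Void>> tasks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            // invokeAll blocks until every task has completed.
            for (Future<Void> result : pool.invokeAll(tasks)) {
                result.get(); // rethrows any task failure
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }
}
```

Running both methods against the same file set, and varying the thread count, would show whether the slowdown grows with concurrency, which may help separate server-side contention from client-side effects.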