tika-trunk-jdk1.7 - Build # 202 - Still Failing
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #202). Status: Still Failing. Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/202/ to view the results.
[jira] [Updated] (TIKA-1405) German content detected as French
[ https://issues.apache.org/jira/browse/TIKA-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaheer Beig updated TIKA-1405: -- Description: Hi, We are using Apache Tika 1.4 for document conversion to text and language detection in one of our projects. We are facing the issue below with language detection: 1. When the text is in all UPPER CASE, it is detected as Estonian even though the language is English. Any update on this would be very helpful. was: Hi, We are using Apache Tika 1.4 for document conversion to text and language detection in one of our projects. We are facing the issues below with language detection: 1. When the text is in all UPPER CASE, it is detected as Estonian even though the language is English. 2. For much of our German content, the language is detected as French (though this is not the case for all German content). Any update on this would be very helpful. German content detected as French - Key: TIKA-1405 URL: https://issues.apache.org/jira/browse/TIKA-1405 Project: Tika Issue Type: Bug Components: languageidentifier Affects Versions: 1.4 Environment: Linux Reporter: Zaheer Beig Labels: newbie -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1284) TikaException for Microsoft Powerpoint Document [ ppt ]
[ https://issues.apache.org/jira/browse/TIKA-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1284. -- Resolution: Fixed Fix Version/s: 1.6 Marking as fixed based on Felix's comments TikaException for Microsoft Powerpoint Document [ ppt ] Key: TIKA-1284 URL: https://issues.apache.org/jira/browse/TIKA-1284 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.1, 1.3, 1.5 Reporter: Chetan Laddha Fix For: 1.6 Attachments: Problem2.ppt, problem1.ppt Attach PPT file is not getting extracted. Giving exception as Exception in thread main org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2d536558 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112) Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException Cause was : 
java.lang.ArrayIndexOutOfBoundsException: 20
[jira] [Resolved] (TIKA-1189) Fails to parse PPT file
[ https://issues.apache.org/jira/browse/TIKA-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1189. -- Resolution: Fixed Fix Version/s: 1.6 Marking as fixed based on Felix's comments Fails to parse PPT file --- Key: TIKA-1189 URL: https://issues.apache.org/jira/browse/TIKA-1189 Project: Tika Issue Type: Bug Components: cli, gui Environment: OSX 10.9, OSX 10.6 Reporter: Aimee Dev Fix For: 1.6 Attachments: CDT_Data_Retention-PPT.ppt Out of the box tika application when presented with the file results in Apache Tika was unable to parse the document at /Volumes/FREECOM_HDD/Test/CDT_Data_Retention-PPT.ppt. The full exception stack trace is included below: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@224f9db at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320) at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279) at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94) at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77) at javax.swing.TransferHandler.importData(TransferHandler.java:826) at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1536) at java.awt.dnd.DropTarget.drop(DropTarget.java:450) at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1274) at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:537) at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:127) at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:851) at 
sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:775) at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48) at java.awt.Component.dispatchEventImpl(Component.java:4716) at java.awt.Container.dispatchEventImpl(Container.java:2287) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832) at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4566) at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4417) at java.awt.Container.dispatchEventImpl(Container.java:2273) at java.awt.Window.dispatchEventImpl(Window.java:2719) at java.awt.Component.dispatchEvent(Component.java:4687) at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735) at java.awt.EventQueue.access$200(EventQueue.java:103) at java.awt.EventQueue$3.run(EventQueue.java:694) at java.awt.EventQueue$3.run(EventQueue.java:692) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87) at java.awt.EventQueue$4.run(EventQueue.java:708) at java.awt.EventQueue$4.run(EventQueue.java:706) at java.security.AccessController.doPrivileged(Native Method) at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76) at java.awt.EventQueue.dispatchEvent(EventQueue.java:705) at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242) at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161) at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146) at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138) at java.awt.EventDispatchThread.run(EventDispatchThread.java:91) Caused by: 
java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class
[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1413. Resolution: Fixed OOXML thumbnail name added to body -- Key: TIKA-1413 URL: https://issues.apache.org/jira/browse/TIKA-1413 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Andrzej Bialecki
AbstractOOXMLExtractor.handleThumbnail processes thumbnails using EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike other embedded parts in handleEmbeddedParts(...)). This results in adding the thumbnail name to the main body of the document (as a package-entry), which in my opinion is wrong. Example:
{code}
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="meta:slide-count" content="1"/>
<meta name="cp:revision" content="5"/>
<meta name="meta:last-author" content="Nick Burch"/>
<meta name="Slide-Count" content="1"/>
<meta name="Last-Author" content="Nick Burch"/>
<meta name="meta:save-date" content="2010-09-08T16:15:14Z"/>
<meta name="Content-Length" content="202969"/>
<meta name="subject" content="Gym class featuring a brown fox and lazy dog"/>
<meta name="Application-Name" content="Microsoft Office PowerPoint"/>
<meta name="Author" content="Nevin Nollop"/>
<meta name="dcterms:created" content="1601-01-01T00:00:00Z"/>
<meta name="Application-Version" content="12."/>
<meta name="date" content="2010-09-08T16:15:14Z"/>
<meta name="Total-Time" content="2"/>
<meta name="extended-properties:Template" content=""/>
<meta name="publisher" content=""/>
<meta name="creator" content="Nevin Nollop"/>
<meta name="Word-Count" content="9"/>
<meta name="meta:paragraph-count" content="1"/>
<meta name="extended-properties:AppVersion" content="12."/>
<meta name="Creation-Date" content="1601-01-01T00:00:00Z"/>
<meta name="meta:author" content="Nevin Nollop"/>
<meta name="cp:subject" content="Gym class featuring a brown fox and lazy dog"/>
<meta name="extended-properties:Application" content="Microsoft Office PowerPoint"/>
<meta name="resourceName" content="testPPT_embeded.pptx"/>
<meta name="Paragraph-Count" content="1"/>
<meta name="dc:title" content="The quick brown fox jumps over the lazy dog"/>
<meta name="Last-Save-Date" content="2010-09-08T16:15:14Z"/>
<meta name="custom:Version" content="1"/>
<meta name="Revision-Number" content="5"/>
<meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
<meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
<meta name="meta:creation-date" content="1601-01-01T00:00:00Z"/>
<meta name="dcterms:modified" content="2010-09-08T16:15:14Z"/>
<meta name="Template" content=""/>
<meta name="dc:creator" content="Nevin Nollop"/>
<meta name="meta:word-count" content="9"/>
<meta name="extended-properties:Company" content=""/>
<meta name="Last-Modified" content="2010-09-08T16:15:14Z"/>
<meta name="extended-properties:PresentationFormat" content="On-screen Show (4:3)"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="modified" content="2010-09-08T16:15:14Z"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="extended-properties:TotalTime" content="2"/>
<meta name="dc:publisher" content=""/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
<meta name="Presentation-Format" content="On-screen Show (4:3)"/>
<title>The quick brown fox jumps over the lazy dog</title>
</head>
<body><p>The quick brown fox jumps over the lazy dog</p>
<div class="embedded" id="slide1_rId4"/>
<div class="embedded" id="slide1_rId5"/>
<div class="embedded" id="slide1_rId6"/>
<div class="embedded" id="slide1_rId7"/>
<div class="embedded" id="slide1_rId8"/>
<div class="embedded" id="slide1_rId9"/>
<div class="embedded" id="thumbnail_0.jpeg"/><div class="package-entry"><h1>thumbnail_0.jpeg</h1></div></body></html>
{code}
The extracted plain text looks like this (using tika-app):
{code}
The quick brown fox jumps over the lazy dog
thumbnail_0.jpeg
{code}
The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false. I think also that the id attribute should be set to the real thumbnail path within the package (i.e. tPart.getPartName().getName()) instead of the artificially created sequential name.
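To make the effect of the outputHtml flag in the report above concrete, here is a minimal sketch in plain Java. It does not use Tika itself: the class ThumbnailFlagSketch and its methods renderEmbedded and toPlainText are hypothetical stand-ins invented for illustration, and the markup they emit is modelled only on the example output quoted in the issue, not on the actual AbstractOOXMLExtractor code.

```java
// Hypothetical, simplified model of the behaviour described in TIKA-1413.
// None of these names exist in Tika; they only mimic the quoted XHTML output.
class ThumbnailFlagSketch {

    /**
     * Emits the placeholder div for an embedded part. When outputHtml is true,
     * a "package-entry" div with the resource name as a heading is added too,
     * which is what leaks the thumbnail name into the document body.
     */
    static String renderEmbedded(String id, String resourceName, boolean outputHtml) {
        StringBuilder xhtml = new StringBuilder();
        xhtml.append("<div class=\"embedded\" id=\"").append(id).append("\"/>");
        if (outputHtml) {
            xhtml.append("<div class=\"package-entry\">");
            if (resourceName != null && !resourceName.isEmpty()) {
                xhtml.append("<h1>").append(resourceName).append("</h1>");
            }
            xhtml.append("</div>");
        }
        return xhtml.toString();
    }

    /** Crude text extraction: replace tags with spaces, then collapse whitespace. */
    static String toPlainText(String xhtml) {
        return xhtml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String body = "<p>The quick brown fox jumps over the lazy dog</p>";
        String name = "thumbnail_0.jpeg";
        // Flag set to true: the thumbnail name shows up in the extracted text.
        System.out.println(toPlainText(body + renderEmbedded(name, name, true)));
        // Flag set to false: only the slide text remains.
        System.out.println(toPlainText(body + renderEmbedded(name, name, false)));
    }
}
```

With the flag set to true the sketch prints the slide text followed by thumbnail_0.jpeg; with the flag set to false it prints the slide text only, which is the behaviour the reporter asks for.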
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949 ] Hong-Thai Nguyen commented on TIKA-1413: I agree. Fixed in r1623819 and _id_ is now from partName(). OOXML thumbnail name added to body -- Key: TIKA-1413 URL: https://issues.apache.org/jira/browse/TIKA-1413 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Andrzej Bialecki
buildbot success in ASF Buildbot on tika-trunk
The Buildbot has detected a restored build on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/190 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: lares_ubuntu Build Reason: scheduler Build Source Stamp: [branch tika/trunk] 1623819 Blamelist: thaichat04 Build succeeded! sincerely, -The Buildbot
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126994#comment-14126994 ] Hudson commented on TIKA-1413: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #203 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/203/]) TIKA-1413 - Remove embedded thumbnail from body (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819) * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java OOXML thumbnail name added to body -- Key: TIKA-1413 URL: https://issues.apache.org/jira/browse/TIKA-1413 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Andrzej Bialecki
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127015#comment-14127015 ] Hudson commented on TIKA-1413: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #181 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/181/]) TIKA-1413 - Remove embedded thumbnail from body (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819) * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java OOXML thumbnail name added to body -- Key: TIKA-1413 URL: https://issues.apache.org/jira/browse/TIKA-1413 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.6 Reporter: Andrzej Bialecki
[jira] [Commented] (TIKA-1268) Extract images from PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127173#comment-14127173 ] Lewis John McGibbney commented on TIKA-1268: Was there ever a patch for this issue, I wonder? It would have been great to see what it looked like. Extract images from PDF documents - Key: TIKA-1268 URL: https://issues.apache.org/jira/browse/TIKA-1268 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Jukka Zitting Fix For: 1.6 It would be nice if images within PDF documents could be extracted much like embedded attachments are now being handled.