[jira] [Commented] (TIKA-1268) Extract images from PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127173#comment-14127173 ]

Lewis John McGibbney commented on TIKA-1268:

Was there ever a patch for this issue, I wonder? It would have been great to see what it looked like.

> Extract images from PDF documents
>
> Key: TIKA-1268
> URL: https://issues.apache.org/jira/browse/TIKA-1268
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 1.6
>
> It would be nice if images within PDF documents could be extracted much like
> embedded attachments are now being handled.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127015#comment-14127015 ]

Hudson commented on TIKA-1413:

SUCCESS: Integrated in tika-trunk-jdk1.6 #181 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/181/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java

> OOXML thumbnail name added to body
>
> Key: TIKA-1413
> URL: https://issues.apache.org/jira/browse/TIKA-1413
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Reporter: Andrzej Bialecki
>
> AbstractOOXMLExtractor.handleThumbnail processes thumbnails using
> EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike
> other embedded parts in handleEmbeddedParts(...)).
> This results in adding the thumbnail name to the main body of the document
> (as a package-entry), which in my opinion is wrong.
> Example:
> {code}
> [XHTML output; the element markup was stripped from this archived copy and only the fragments below survive]
> xmlns="http://www.w3.org/1999/xhtml"
> content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> The quick brown fox jumps over the lazy dog
> The quick brown fox jumps over the lazy dog
> class="package-entry">thumbnail_0.jpeg
> {code}
> The extracted plain text looks like this (using tika-app):
> {code}
> The quick brown fox jumps over the lazy dog
> thumbnail_0.jpeg
> {code}
> The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false.
> I think also that the id attribute should be set to the real thumbnail path
> within the package (i.e. tPart.getPartName().getName()) instead of the
> artificially created sequential name.
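The effect of the outputHtml flag described in the TIKA-1413 report above can be illustrated with a small toy model. This is only a sketch and not Tika's actual code: the class ThumbnailFlagSketch and the methods renderBody and toPlainText are hypothetical names invented for the illustration, and the tag stripping is a crude stand-in for real text extraction. It shows why a package-entry element written into the body ends up as an extra line of plain text.

```java
/** Toy model only; this is NOT Tika's AbstractOOXMLExtractor. All names are hypothetical. */
class ThumbnailFlagSketch {

    /** Builds a tiny XHTML-like body; the thumbnail entry is written only when outputHtml is true. */
    static String renderBody(String slideText, String thumbnailName, boolean outputHtml) {
        StringBuilder body = new StringBuilder();
        body.append("<p>").append(slideText).append("</p>\n");
        if (outputHtml) {
            // Mirrors the symptom in the report: the entry name becomes body content.
            body.append("<div class=\"package-entry\">")
                .append(thumbnailName)
                .append("</div>\n");
        }
        return body.toString();
    }

    /** Crude plain-text extraction: drop anything that looks like a tag, keep the character data. */
    static String toPlainText(String xhtml) {
        return xhtml.replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        // Flag set to true: the thumbnail name leaks into the extracted text.
        System.out.println(toPlainText(renderBody(text, "thumbnail_0.jpeg", true)));
        System.out.println("---");
        // Flag set to false: only the slide text remains.
        System.out.println(toPlainText(renderBody(text, "thumbnail_0.jpeg", false)));
    }
}
```

In this model the embedded thumbnail can still be handed to an extractor separately; the flag only decides whether its name is echoed into the parent document's body, which is the behaviour the reporter asked to switch off.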
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126994#comment-14126994 ]

Hudson commented on TIKA-1413:

SUCCESS: Integrated in tika-trunk-jdk1.7 #203 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/203/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java
buildbot success in ASF Buildbot on tika-trunk
The Buildbot has detected a restored build on builder tika-trunk while building ASF Buildbot. Full details are available at: http://ci.apache.org/builders/tika-trunk/builds/190 Buildbot URL: http://ci.apache.org/ Buildslave for this Build: lares_ubuntu Build Reason: scheduler Build Source Stamp: [branch tika/trunk] 1623819 Blamelist: thaichat04 Build succeeded! sincerely, -The Buildbot
[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1413.
    Resolution: Fixed
[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body
[ https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949 ]

Hong-Thai Nguyen commented on TIKA-1413:

I agree. Fixed in r1623819 and _id_ is now from partName().
[jira] [Resolved] (TIKA-1189) Fails to parse PPT file
[ https://issues.apache.org/jira/browse/TIKA-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1189.
    Resolution: Fixed
    Fix Version/s: 1.6

Marking as fixed based on Felix's comments

> Fails to parse PPT file
>
> Key: TIKA-1189
> URL: https://issues.apache.org/jira/browse/TIKA-1189
> Project: Tika
> Issue Type: Bug
> Components: cli, gui
> Environment: OSX 10.9, OSX 10.6
> Reporter: Aimee Dev
> Fix For: 1.6
>
> Attachments: CDT_Data_Retention-PPT.ppt
>
> Out of the box tika application when presented with the file results in
> Apache Tika was unable to parse the document
> at /Volumes/FREECOM_HDD/Test/CDT_Data_Retention-PPT.ppt.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@224f9db
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>   at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>   at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>   at javax.swing.TransferHandler.importData(TransferHandler.java:826)
>   at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1536)
>   at java.awt.dnd.DropTarget.drop(DropTarget.java:450)
>   at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1274)
>   at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:537)
>   at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:127)
>   at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:851)
>   at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:775)
>   at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48)
>   at java.awt.Component.dispatchEventImpl(Component.java:4716)
>   at java.awt.Container.dispatchEventImpl(Container.java:2287)
>   at java.awt.Component.dispatchEvent(Component.java:4687)
>   at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832)
>   at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4566)
>   at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4417)
>   at java.awt.Container.dispatchEventImpl(Container.java:2273)
>   at java.awt.Window.dispatchEventImpl(Window.java:2719)
>   at java.awt.Component.dispatchEvent(Component.java:4687)
>   at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735)
>   at java.awt.EventQueue.access$200(EventQueue.java:103)
>   at java.awt.EventQueue$3.run(EventQueue.java:694)
>   at java.awt.EventQueue$3.run(EventQueue.java:692)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87)
>   at java.awt.EventQueue$4.run(EventQueue.java:708)
>   at java.awt.EventQueue$4.run(EventQueue.java:706)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
>   at java.awt.EventQueue.dispatchEvent(EventQueue.java:705)
>   at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
>   at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
>   at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
>   at java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
> Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : jav
[jira] [Resolved] (TIKA-1284) TikaException for Microsoft Powerpoint Document [ ppt ]
[ https://issues.apache.org/jira/browse/TIKA-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1284.
    Resolution: Fixed
    Fix Version/s: 1.6

Marking as fixed based on Felix's comments

> TikaException for Microsoft Powerpoint Document [ ppt ]
>
> Key: TIKA-1284
> URL: https://issues.apache.org/jira/browse/TIKA-1284
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1, 1.3, 1.5
> Reporter: Chetan Laddha
> Fix For: 1.6
>
> Attachments: Problem2.ppt, problem1.ppt
>
> The attached PPT file is not getting extracted. It gives the following exception:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2d536558
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
> Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.ArrayIndexOutOfBoundsException: 20
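The nested "Cause was" chain in the TIKA-1284 trace above ends in an ArrayIndexOutOfBoundsException, which is the part most useful to log. A minimal sketch of pulling out that innermost exception follows. It assumes every level is linked through Throwable.getCause(); the trace does not confirm that for the POI exceptions, since the "Cause was" text appears inside the exception message. RootCauseSketch and rootCause are hypothetical names, and the exceptions built in main only imitate the shape of the reported trace.

```java
/** Sketch only: walks a standard Java cause chain. Names are hypothetical. */
class RootCauseSketch {

    /** Returns the innermost exception reachable through getCause(). */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Imitation of the reported nesting, built by hand for the example.
        Throwable inner = new ArrayIndexOutOfBoundsException("20");
        Throwable middle = new RuntimeException(
                "Couldn't instantiate the class for type with id 4012", inner);
        Throwable outer = new RuntimeException(
                "Unexpected RuntimeException from parser", middle);

        // Prints the innermost exception instead of the generic wrapper.
        System.out.println(rootCause(outer));
    }
}
```

A caller wrapping a parse call could apply the same helper to the caught exception before logging, so that reports like the one above lead with the underlying failure.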
[jira] [Updated] (TIKA-1405) German content detected as French
[ https://issues.apache.org/jira/browse/TIKA-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zaheer Beig updated TIKA-1405:
    Description:
Hi,
We are using Apache Tika 1.4 for document conversion to text and language detection in one of our projects. We are facing the issue below with language detection:
1. When the text is in all UPPER CASE, even though the language is English, it gets detected as Estonian.
Any update on this will be very helpful.

  was:
Hi,
We are using Apache Tika 1.4 for document conversion to text and language detection in one of our projects. We are facing the issues below with language detection:
1. When the text is in all UPPER CASE, even though the language is English, it gets detected as Estonian.
2. For much of our German content, the language gets detected as French (though this is not the case for all German content).
Any update on this will be very helpful.

> German content detected as French
>
> Key: TIKA-1405
> URL: https://issues.apache.org/jira/browse/TIKA-1405
> Project: Tika
> Issue Type: Bug
> Components: languageidentifier
> Affects Versions: 1.4
> Environment: Linux
> Reporter: Zaheer Beig
> Labels: newbie
>
> Hi,
> We are using Apache Tika 1.4 for document conversion to text and language
> detection in one of our projects. We are facing the issue below with language
> detection:
> 1. When the text is in all UPPER CASE, even though the language is English,
> it gets detected as Estonian.
> Any update on this will be very helpful.
tika-trunk-jdk1.7 - Build # 202 - Still Failing
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #202) Status: Still Failing Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/202/ to view the results.