[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-09 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127173#comment-14127173
 ] 

Lewis John McGibbney commented on TIKA-1268:


Was there ever a patch for this issue I wonder? It would have been great to see 
what it looked like.

> Extract images from PDF documents
> -
>
> Key: TIKA-1268
> URL: https://issues.apache.org/jira/browse/TIKA-1268
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 1.6
>
>
> It would be nice if images within PDF documents could be extracted much like 
> embedded attachments are now being handled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127015#comment-14127015
 ] 

Hudson commented on TIKA-1413:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #181 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/181/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java


> OOXML thumbnail name added to body
> --
>
> Key: TIKA-1413
> URL: https://issues.apache.org/jira/browse/TIKA-1413
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Reporter: Andrzej Bialecki
>
> AbstractOOXMLExtractor.handleThumbnail processes thumbnails using 
> EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike 
> other embedded parts in handleEmbeddedParts(...)).
> This results in adding the thumbnail name to the main body of the document 
> (as a package-entry), which in my opinion is wrong.
> Example:
> {code}
>  xmlns="http://www.w3.org/1999/xhtml">
>
>  content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
>
>  content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
>
> The quick brown fox jumps over the lazy dog
>
> The quick brown fox jumps over the lazy dog
>
>  class="package-entry">thumbnail_0.jpeg
> {code}
> The extracted plain text looks like this (using tika-app):
> {code}
> The quick brown fox jumps over the lazy dog
> thumbnail_0.jpeg
> {code}
> The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false.
> I also think that the id attribute should be set to the real thumbnail path 
> within the package (i.e. tPart.getPartName().getName()) instead of the 
> artificially created sequential name.
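
The effect of the flag described above can be illustrated with a small stand-alone model. This is a sketch only, not Tika's code: the class `ThumbnailFlagSketch` and its method `extractText` are hypothetical names, and the model simply assumes, as the report states, that an embedded part's name reaches the parent body only when outputHtml is true.

```java
/** Toy model of the behaviour reported in TIKA-1413; hypothetical, not Tika's implementation. */
final class ThumbnailFlagSketch {

    /**
     * Returns the plain text a caller would see for a document whose body is
     * bodyText and which carries one embedded thumbnail.
     * Assumption: with outputHtml == true the embedded part's name is written
     * into the parent body; with outputHtml == false it is not.
     */
    static String extractText(String bodyText, String thumbnailName, boolean outputHtml) {
        StringBuilder out = new StringBuilder(bodyText);
        if (outputHtml) {
            // Models the reported symptom: the thumbnail name leaks into the body text.
            out.append('\n').append(thumbnailName);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "The quick brown fox jumps over the lazy dog";
        System.out.println(extractText(body, "thumbnail_0.jpeg", true));
        System.out.println("---");
        System.out.println(extractText(body, "thumbnail_0.jpeg", false));
    }
}
```

With the flag set to true the model reproduces the two-line plain text quoted in the report; with false only the slide text remains, which is the outcome the proposed fix aims for.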





[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126994#comment-14126994
 ] 

Hudson commented on TIKA-1413:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #203 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/203/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java




buildbot success in ASF Buildbot on tika-trunk

2014-09-09 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/190

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: lares_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1623819
Blamelist: thaichat04

Build succeeded!

sincerely,
 -The Buildbot





[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1413.

Resolution: Fixed



[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949
 ] 

Hong-Thai Nguyen commented on TIKA-1413:


I agree. Fixed in r1623819; the _id_ attribute is now taken from partName().



[jira] [Resolved] (TIKA-1189) Fails to parse PPT file

2014-09-09 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1189.
--
   Resolution: Fixed
Fix Version/s: 1.6

Marking as fixed based on Felix's comments

> Fails to parse PPT file
> ---
>
> Key: TIKA-1189
> URL: https://issues.apache.org/jira/browse/TIKA-1189
> Project: Tika
> Issue Type: Bug
> Components: cli, gui
> Environment: OSX 10.9, OSX 10.6
> Reporter: Aimee Dev
> Fix For: 1.6
>
> Attachments: CDT_Data_Retention-PPT.ppt
>
>
> Out of the box, the Tika application, when presented with the attached file, 
> reports:
> Apache Tika was unable to parse the document
> at /Volumes/FREECOM_HDD/Test/CDT_Data_Retention-PPT.ppt.
> The full exception stack trace is included below:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@224f9db
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>   at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>   at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>   at javax.swing.TransferHandler.importData(TransferHandler.java:826)
>   at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1536)
>   at java.awt.dnd.DropTarget.drop(DropTarget.java:450)
>   at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1274)
>   at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:537)
>   at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:127)
>   at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:851)
>   at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:775)
>   at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48)
>   at java.awt.Component.dispatchEventImpl(Component.java:4716)
>   at java.awt.Container.dispatchEventImpl(Container.java:2287)
>   at java.awt.Component.dispatchEvent(Component.java:4687)
>   at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832)
>   at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4566)
>   at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4417)
>   at java.awt.Container.dispatchEventImpl(Container.java:2273)
>   at java.awt.Window.dispatchEventImpl(Window.java:2719)
>   at java.awt.Component.dispatchEvent(Component.java:4687)
>   at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735)
>   at java.awt.EventQueue.access$200(EventQueue.java:103)
>   at java.awt.EventQueue$3.run(EventQueue.java:694)
>   at java.awt.EventQueue$3.run(EventQueue.java:692)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87)
>   at java.awt.EventQueue$4.run(EventQueue.java:708)
>   at java.awt.EventQueue$4.run(EventQueue.java:706)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
>   at java.awt.EventQueue.dispatchEvent(EventQueue.java:705)
>   at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
>   at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
>   at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
>   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
>   at java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
> Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : jav

[jira] [Resolved] (TIKA-1284) TikaException for Microsoft Powerpoint Document [ ppt ]

2014-09-09 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1284.
--
   Resolution: Fixed
Fix Version/s: 1.6

Marking as fixed based on Felix's comments

> TikaException for Microsoft Powerpoint Document [ ppt ] 
> 
>
> Key: TIKA-1284
> URL: https://issues.apache.org/jira/browse/TIKA-1284
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1, 1.3, 1.5
> Reporter: Chetan Laddha
> Fix For: 1.6
>
> Attachments: Problem2.ppt, problem1.ppt
>
>
> The attached PPT file is not being extracted. Parsing fails with the following exception:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2d536558
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
>   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
>   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
> Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
> Cause was : java.lang.ArrayIndexOutOfBoundsException: 20
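
In a nested trace like the one above, the innermost cause (here the ArrayIndexOutOfBoundsException) is usually the most useful line for triage. The following is a minimal sketch of pulling it out programmatically. It assumes the wrapping exceptions are linked through getCause(); whether POI links them that way or only embeds them in the message text is not confirmed by this report. `RootCause` and `rootCause` are illustrative names, not part of Tika or POI.

```java
import java.lang.reflect.InvocationTargetException;

/** Illustrative helper; not part of Tika or POI. */
final class RootCause {

    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // The self-reference check guards against a throwable that names itself as cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hand-built chain that mimics the shape of the trace above.
        Throwable inner = new ArrayIndexOutOfBoundsException("20");
        Throwable wrapped = new RuntimeException(
                "Couldn't instantiate the class for type with id 4012",
                new InvocationTargetException(inner));
        System.out.println(rootCause(wrapped));
    }
}
```

A caller catching TikaException could pass it to such a helper to log only the innermost failure alongside the document name.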





[jira] [Updated] (TIKA-1405) German content detected as French

2014-09-09 Thread Zaheer Beig (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zaheer Beig updated TIKA-1405:
--
Description: 
Hi,
We are using Apache Tika 1.4 for document conversion to text and language 
detection in one of our projects. We are facing the following issue with 
language detection:

1. When the text is in all UPPER CASE, even though the language is English, it 
gets detected as Estonian.


Any update on this will be very helpful.

  was:
Hi,
We are using Apache Tika 1.4 for document conversion to text and language 
detection in one of our projects. We are facing the following issues with 
language detection:

1. When the text is in all UPPER CASE, even though the language is English, it 
gets detected as Estonian.
2. For much of our German content, the language gets detected as French (though 
this is not the case for all German content).

Any update on this will be very helpful.


> German content detected as French
> -
>
> Key: TIKA-1405
> URL: https://issues.apache.org/jira/browse/TIKA-1405
> Project: Tika
> Issue Type: Bug
> Components: languageidentifier
> Affects Versions: 1.4
> Environment: Linux
> Reporter: Zaheer Beig
> Labels: newbie
>
> Hi,
> We are using Apache Tika 1.4 for document conversion to text and language 
> detection in one of our projects. We are facing the following issue with 
> language detection:
> 1. When the text is in all UPPER CASE, even though the language is English, 
> it gets detected as Estonian.
> Any update on this will be very helpful.





tika-trunk-jdk1.7 - Build # 202 - Still Failing

2014-09-09 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #202)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/202/ to 
view the results.