tika-trunk-jdk1.7 - Build # 202 - Still Failing

2014-09-09 Thread Apache Jenkins Server
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #202)

Status: Still Failing

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/202/ to 
view the results.

[jira] [Updated] (TIKA-1405) German content detected as French

2014-09-09 Thread Zaheer Beig (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zaheer Beig updated TIKA-1405:
--
Description: 
Hi,
We are using Apache Tika 1.4 for document conversion to text and language 
detection in one of our projects. We are facing the issue below with language 
detection:

1. When the text is in all UPPER CASE, even though the language is English, it 
gets detected as Estonian.


Any update on this will be very helpful.

  was:
Hi,
We are using Apache Tika 1.4 for document conversion to text and language 
detection in one of our projects. We are facing the issues below with language 
detection:

1. When the text is in all UPPER CASE, even though the language is English, it 
gets detected as Estonian.
2. For much of our German content, the language gets detected as French (though 
this is not the case for all German content).

Any update on this will be very helpful.


 German content detected as French
 -

 Key: TIKA-1405
 URL: https://issues.apache.org/jira/browse/TIKA-1405
 Project: Tika
  Issue Type: Bug
  Components: languageidentifier
Affects Versions: 1.4
 Environment: Linux
Reporter: Zaheer Beig
  Labels: newbie




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1284) TikaException for Microsoft Powerpoint Document [ ppt ]

2014-09-09 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1284.
--
   Resolution: Fixed
Fix Version/s: 1.6

Marking as fixed based on Felix's comments

 TikaException for Microsoft Powerpoint Document [ ppt ] 
 

 Key: TIKA-1284
 URL: https://issues.apache.org/jira/browse/TIKA-1284
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1, 1.3, 1.5
Reporter: Chetan Laddha
 Fix For: 1.6

 Attachments: Problem2.ppt, problem1.ppt


 The attached PPT file is not being extracted. It fails with the following 
 exception:
 Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2d536558
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
 Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
 Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
 Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class class org.apache.poi.hslf.record.BinaryTagDataBlob : java.lang.reflect.InvocationTargetException
 Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 4012 on class class org.apache.poi.hslf.record.StyleTextProp9Atom : java.lang.reflect.InvocationTargetException
 Cause was : java.lang.ArrayIndexOutOfBoundsException: 20





[jira] [Resolved] (TIKA-1189) Fails to parse PPT file

2014-09-09 Thread Nick Burch (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-1189.
--
   Resolution: Fixed
Fix Version/s: 1.6

Marking as fixed based on Felix's comments

 Fails to parse PPT file
 ---

 Key: TIKA-1189
 URL: https://issues.apache.org/jira/browse/TIKA-1189
 Project: Tika
  Issue Type: Bug
  Components: cli, gui
 Environment: OSX 10.9, OSX 10.6
Reporter: Aimee Dev
 Fix For: 1.6

 Attachments: CDT_Data_Retention-PPT.ppt


 The out-of-the-box Tika application, when presented with the file, reports:
 Apache Tika was unable to parse the document
 at /Volumes/FREECOM_HDD/Test/CDT_Data_Retention-PPT.ppt.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@224f9db
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
    at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
    at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
    at javax.swing.TransferHandler.importData(TransferHandler.java:826)
    at javax.swing.TransferHandler$DropHandler.drop(TransferHandler.java:1536)
    at java.awt.dnd.DropTarget.drop(DropTarget.java:450)
    at javax.swing.TransferHandler$SwingDropTarget.drop(TransferHandler.java:1274)
    at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(SunDropTargetContextPeer.java:537)
    at sun.lwawt.macosx.CDropTargetContextPeer.processDropMessage(CDropTargetContextPeer.java:127)
    at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(SunDropTargetContextPeer.java:851)
    at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(SunDropTargetContextPeer.java:775)
    at sun.awt.dnd.SunDropTargetEvent.dispatch(SunDropTargetEvent.java:48)
    at java.awt.Component.dispatchEventImpl(Component.java:4716)
    at java.awt.Container.dispatchEventImpl(Container.java:2287)
    at java.awt.Component.dispatchEvent(Component.java:4687)
    at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4832)
    at java.awt.LightweightDispatcher.processDropTargetEvent(Container.java:4566)
    at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4417)
    at java.awt.Container.dispatchEventImpl(Container.java:2273)
    at java.awt.Window.dispatchEventImpl(Window.java:2719)
    at java.awt.Component.dispatchEvent(Component.java:4687)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:735)
    at java.awt.EventQueue.access$200(EventQueue.java:103)
    at java.awt.EventQueue$3.run(EventQueue.java:694)
    at java.awt.EventQueue$3.run(EventQueue.java:692)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:87)
    at java.awt.EventQueue$4.run(EventQueue.java:708)
    at java.awt.EventQueue$4.run(EventQueue.java:706)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$1.doIntersectionPrivilege(ProtectionDomain.java:76)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:705)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:242)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:161)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:150)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:146)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:138)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:91)
 Caused by: java.lang.RuntimeException: Couldn't instantiate the class for type with id 5000 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
 Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5002 on class class org.apache.poi.hslf.record.DummyPositionSensitiveRecordWithChildren : java.lang.reflect.InvocationTargetException
 Cause was : java.lang.RuntimeException: Couldn't instantiate the class for type with id 5003 on class 

[jira] [Resolved] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1413.

Resolution: Fixed

 OOXML thumbnail name added to body
 --

 Key: TIKA-1413
 URL: https://issues.apache.org/jira/browse/TIKA-1413
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki 

 AbstractOOXMLExtractor.handleThumbnail processes thumbnails using 
 EmbeddedDocumentExtractor, but with the outputHtml flag set to true (unlike 
 other embedded parts in handleEmbeddedParts(...)).
 This results in adding the thumbnail name to the main body of the document 
 (as a package-entry), which in my opinion is wrong.
 Example:
 {code}
 <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="meta:slide-count" content="1"/>
 <meta name="cp:revision" content="5"/>
 <meta name="meta:last-author" content="Nick Burch"/>
 <meta name="Slide-Count" content="1"/>
 <meta name="Last-Author" content="Nick Burch"/>
 <meta name="meta:save-date" content="2010-09-08T16:15:14Z"/>
 <meta name="Content-Length" content="202969"/>
 <meta name="subject" content="Gym class featuring a brown fox and lazy dog"/>
 <meta name="Application-Name" content="Microsoft Office PowerPoint"/>
 <meta name="Author" content="Nevin Nollop"/>
 <meta name="dcterms:created" content="1601-01-01T00:00:00Z"/>
 <meta name="Application-Version" content="12."/>
 <meta name="date" content="2010-09-08T16:15:14Z"/>
 <meta name="Total-Time" content="2"/>
 <meta name="extended-properties:Template" content=""/>
 <meta name="publisher" content=""/>
 <meta name="creator" content="Nevin Nollop"/>
 <meta name="Word-Count" content="9"/>
 <meta name="meta:paragraph-count" content="1"/>
 <meta name="extended-properties:AppVersion" content="12."/>
 <meta name="Creation-Date" content="1601-01-01T00:00:00Z"/>
 <meta name="meta:author" content="Nevin Nollop"/>
 <meta name="cp:subject" content="Gym class featuring a brown fox and lazy dog"/>
 <meta name="extended-properties:Application" content="Microsoft Office PowerPoint"/>
 <meta name="resourceName" content="testPPT_embeded.pptx"/>
 <meta name="Paragraph-Count" content="1"/>
 <meta name="dc:title" content="The quick brown fox jumps over the lazy dog"/>
 <meta name="Last-Save-Date" content="2010-09-08T16:15:14Z"/>
 <meta name="custom:Version" content="1"/>
 <meta name="Revision-Number" content="5"/>
 <meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
 <meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
 <meta name="meta:creation-date" content="1601-01-01T00:00:00Z"/>
 <meta name="dcterms:modified" content="2010-09-08T16:15:14Z"/>
 <meta name="Template" content=""/>
 <meta name="dc:creator" content="Nevin Nollop"/>
 <meta name="meta:word-count" content="9"/>
 <meta name="extended-properties:Company" content=""/>
 <meta name="Last-Modified" content="2010-09-08T16:15:14Z"/>
 <meta name="extended-properties:PresentationFormat" content="On-screen Show (4:3)"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
 <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
 <meta name="modified" content="2010-09-08T16:15:14Z"/>
 <meta name="xmpTPg:NPages" content="1"/>
 <meta name="extended-properties:TotalTime" content="2"/>
 <meta name="dc:publisher" content=""/>
 <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
 <meta name="Presentation-Format" content="On-screen Show (4:3)"/>
 <title>The quick brown fox jumps over the lazy dog</title>
 </head>
 <body><p>The quick brown fox jumps over the lazy dog</p>
 <div class="embedded" id="slide1_rId4"/>
 <div class="embedded" id="slide1_rId5"/>
 <div class="embedded" id="slide1_rId6"/>
 <div class="embedded" id="slide1_rId7"/>
 <div class="embedded" id="slide1_rId8"/>
 <div class="embedded" id="slide1_rId9"/>
 <div class="embedded" id="thumbnail_0.jpeg"/><div class="package-entry"><h1>thumbnail_0.jpeg</h1></div></body></html>
 {code}
 The extracted plain text looks like this (using tika-app):
 {code}
 The quick brown fox jumps over the lazy dog
 thumbnail_0.jpeg
 {code}
 The fix is trivial - change the flag in AbstractOOXMLExtractor:158 to false.
 I think also that the id attribute should be set to the real thumbnail path 
 within the package (i.e. tPart.getPartName().getName()) instead of the 
 artificially created sequential name.
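
 To make the effect of the flag concrete, here is a minimal, self-contained 
 sketch. It is a simplified model, not the actual Tika source: the class 
 ThumbnailFlagSketch, its methods, and the part name /docProps/thumbnail.jpeg 
 are all hypothetical and chosen only for illustration. The assumption is that 
 an embedded-document handler emits a package-entry heading only when it is 
 asked to output HTML, which is what the report above describes.

```java
public class ThumbnailFlagSketch {

    /**
     * Simplified stand-in for an embedded-document extractor (an assumption,
     * not Tika's real class). When outputHtml is true it emits a
     * "package-entry" div whose h1 carries the resource name; when false it
     * emits nothing.
     */
    public static String renderEmbedded(String resourceName, boolean outputHtml) {
        StringBuilder out = new StringBuilder();
        if (outputHtml) {
            out.append("<div class=\"package-entry\">")
               .append("<h1>").append(resourceName).append("</h1>")
               .append("</div>");
        }
        return out.toString();
    }

    /** Hypothetical thumbnail handler: placeholder div, then the embedded part. */
    public static String handleThumbnail(String id, String resourceName, boolean outputHtml) {
        return "<div class=\"embedded\" id=\"" + id + "\"/>"
                + renderEmbedded(resourceName, outputHtml);
    }

    /** Crude plain-text view: replace tags with spaces, collapse whitespace. */
    public static String toPlainText(String xhtml) {
        return xhtml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String body = "<p>The quick brown fox jumps over the lazy dog</p>";

        // Reported behaviour: flag is true and the id is a sequential name.
        String reported = body + handleThumbnail("thumbnail_0.jpeg", "thumbnail_0.jpeg", true);

        // Proposed behaviour: flag is false and the id is the part name
        // (the path used here is illustrative only).
        String proposed = body + handleThumbnail("/docProps/thumbnail.jpeg", "thumbnail_0.jpeg", false);

        System.out.println(toPlainText(reported));
        System.out.println(toPlainText(proposed));
    }
}
```

 In this model the thumbnail name reaches the plain text only through the h1 
 heading, so turning the flag off removes it from the body while the 
 placeholder div with its id attribute is still emitted.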





[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126949#comment-14126949
 ] 

Hong-Thai Nguyen commented on TIKA-1413:


I agree. Fixed in r1623819 and _id_ is now from partName().



buildbot success in ASF Buildbot on tika-trunk

2014-09-09 Thread buildbot
The Buildbot has detected a restored build on builder tika-trunk while building 
ASF Buildbot.
Full details are available at:
 http://ci.apache.org/builders/tika-trunk/builds/190

Buildbot URL: http://ci.apache.org/

Buildslave for this Build: lares_ubuntu

Build Reason: scheduler
Build Source Stamp: [branch tika/trunk] 1623819
Blamelist: thaichat04

Build succeeded!

sincerely,
 -The Buildbot





[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126994#comment-14126994
 ] 

Hudson commented on TIKA-1413:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #203 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/203/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java




[jira] [Commented] (TIKA-1413) OOXML thumbnail name added to body

2014-09-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127015#comment-14127015
 ] 

Hudson commented on TIKA-1413:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #181 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/181/])
TIKA-1413 - Remove embedded thumbnail from body (thaichat04: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1623819)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/rtf/RTFParserTest.java




[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-09 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127173#comment-14127173
 ] 

Lewis John McGibbney commented on TIKA-1268:


I wonder, was there ever a patch for this issue? It would have been great to 
see what it looked like.

 Extract images from PDF documents
 -

 Key: TIKA-1268
 URL: https://issues.apache.org/jira/browse/TIKA-1268
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
 Fix For: 1.6


 It would be nice if images within PDF documents could be extracted much like 
 embedded attachments are now being handled.


