[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-10 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128398#comment-14128398
 ] 

Tim Allison commented on TIKA-1268:
---

These should do it, no?

Either with svn commandline: svn diff -c 1586159

Or: 

[viewvc PDF2XHTML|http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java?annotate=1586159]

[patch PDF2XHTML|http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java?r1=1586158&r2=1586159;]

[patch PDFParserTest|http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java?r1=1575116&r2=1586159&view=patch]

 Extract images from PDF documents
 -

 Key: TIKA-1268
 URL: https://issues.apache.org/jira/browse/TIKA-1268
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Jukka Zitting
 Fix For: 1.6


 It would be nice if images within PDF documents could be extracted much like 
 embedded attachments are now being handled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-10 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128534#comment-14128534
 ] 

Lewis John McGibbney commented on TIKA-1268:


They sure do, [~talli...@mitre.org], thank you very much.
This is a nice feature, and I wanted to see exactly what the code looked like.
Thank you.



[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-10 Thread Jeremy Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128855#comment-14128855
 ] 

Jeremy Anderson commented on TIKA-1268:
---

I created the TIKA-1285 patch after making that comment, but never linked it 
back here.  Last week I did some work on TIKA-1285 to re-sync the snapshot 
builds of the two projects, though things became even more complicated with 
PDFBox's snapshot transitioning from Jempbox to Xmpbox.  The current patch 
files for that issue should work with the snapshots, though Xmpbox's 
DomXmpParser needs some refactoring to work properly with Tika's test files.  
I believe metadata is being dropped for a few of Tika's test files.



[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-10 Thread Jeremy Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128921#comment-14128921
 ] 

Jeremy Anderson commented on TIKA-1268:
---

Take a look at my last comment in TIKA-1285 to see some of the common 
exceptions that prevented DomXmpParser from getting all of the information 
out of a file's XMP data.  Some of them had no effect on the results of Tika's 
JUnit tests.  (The inner workings of XMP files are a bit outside my area of 
knowledge.)

My patch removed Keyword and Comment metadata from a JUnit case or two in 
JempboxExtractorTest and JpegParserTest, and from some extended ones in 
PDFParserTest.  Take a look at my patch and search for "//TODO: Fix once 
DomXmpParser error fixed:", which I placed by every test case that I commented 
out.

I believe the root reason Xmpbox's DomXmpParser fails is that it requires 
stricter adherence to the standards than the XMP content of real files 
provides, plus a few missed cases in its handling of Bag and Seq data.

If you apply the TIKA-1285 patch, you can uncomment the System.err.println 
calls to see which messages DomXmpParser fails with, but be sure to also apply 
the PDFBOX-2318 patch, which fixes a few of the easier issues with that parser.



[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-09-09 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127173#comment-14127173
 ] 

Lewis John McGibbney commented on TIKA-1268:


Was there ever a patch for this issue, I wonder? It would have been great to 
see what it looked like.



[jira] [Commented] (TIKA-1268) Extract images from PDF documents

2014-04-29 Thread Jeremy Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984984#comment-13984984
 ] 

Jeremy Anderson commented on TIKA-1268:
---

This fix will break once PDFBox 2.0.0 is released and Tika upgrades to it.  I 
may open a new TIKA issue at some point to track a 2.0.0 upgrade, with a patch 
if I implement one rather than commenting out this code.  (I'm currently 
building tika, pdfbox, and poi using daily snapshots.)

See: PDFBOX-1893.

Essentially, the org.apache.pdfbox.pdmodel.graphics.xobject package was removed 
and the logic from its classes was refactored across various other classes.  
This TIKA fix relies heavily on classes from that package.
