date:20140924

[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147302#comment-14147302
 ] 

Hudson commented on TIKA-1420:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #208 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/208/])
TIKA-1420, refactor the phone number extraction to use a custom method of 
de-obfuscating numbers. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627446)
* /tika/trunk/tika-example/pom.xml
* 
/tika/trunk/tika-example/src/main/java/org/apache/tika/example/CleanPhoneText.java
* 
/tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* 
/tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* 
/tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt


> Add Metadata Extraction to Arbitrary Parsers
> 
>
> Key: TIKA-1420
> URL: https://issues.apache.org/jira/browse/TIKA-1420
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tyler Palsulich
>Priority: Minor
>
> Suppose you wish to extract information from arbitrary file types and add it 
> to a Metadata Object. This type of task is best handled by a... Handler. But, 
> Handlers do not have access to the Metadata Object passed to a Parser. 
> So, I see a few ways we could do this using existing functionality.
> 1) Make an intermediate XML representation of the desired metadata in a 
> handler, then convert the XML to the Metadata after parsing. 
> 2) Create a second Parser which extracts the desired information.
>  a) Assume the Handler passed to this Parser is already filled with 
> content. So, we could simply get whatever content from the Handler and 
> populate the Metadata directly.
>  b) Create a new Stream in the first Parser to pass to the second, which 
> in turn populates the Metadata.
> None of these options seem ideal. Is there a better way to handle this 
> scenario? Or, can we create some sort of... wrapper for a Handler which can 
> accept a Metadata Object to populate directly? 
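
The wrapper idea raised in the last question could look roughly like the following. This is a minimal sketch, not the code committed for this issue: MetadataPopulatingHandler and the "word-count" key are hypothetical names, and a plain Map stands in for Tika's Metadata class so that the example compiles against the JDK alone. Only characters() and endDocument() are forwarded to the wrapped handler to keep it short; a real decorator would forward every SAX event.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch: wraps another handler, buffers the character content
 * it sees, and writes a derived value into a metadata map at end of document.
 * The Map is an assumed stand-in for Tika's Metadata class.
 */
class MetadataPopulatingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    private final Map<String, List<String>> metadata;
    private final StringBuilder text = new StringBuilder();

    MetadataPopulatingHandler(ContentHandler delegate, Map<String, List<String>> metadata) {
        this.delegate = delegate;
        this.metadata = metadata;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        text.append(ch, start, length);          // keep a copy for extraction
        delegate.characters(ch, start, length);  // pass the event through unchanged
    }

    @Override
    public void endDocument() throws SAXException {
        delegate.endDocument();
        // Example extraction only: record the number of whitespace-separated words.
        String trimmed = text.toString().trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        metadata.computeIfAbsent("word-count", k -> new ArrayList<>())
                .add(String.valueOf(words));
    }

    public static void main(String[] args) throws SAXException {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        MetadataPopulatingHandler handler =
                new MetadataPopulatingHandler(new DefaultHandler(), metadata);
        char[] chunk = "some extracted text".toCharArray();
        handler.startDocument();
        handler.characters(chunk, 0, chunk.length);
        handler.endDocument();
        System.out.println(metadata);
    }
}
```

Because the handler holds a reference to the same metadata object the caller passes to the parser, anything it adds is visible to the caller once parsing finishes, with no intermediate XML or second parser involved.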



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147285#comment-14147285
 ] 

Hudson commented on TIKA-1420:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #230 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/230/])
TIKA-1420, refactor the phone number extraction to use a custom method of 
de-obfuscating numbers. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627446)
* /tika/trunk/tika-example/pom.xml
* 
/tika/trunk/tika-example/src/main/java/org/apache/tika/example/CleanPhoneText.java
* 
/tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* 
/tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* 
/tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt



[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147272#comment-14147272
 ] 

Tyler Palsulich commented on TIKA-1420:
---

Just made some more updates in r1627446. I added a lot more documentation, 
removed the dependency on libphonenumber, and added custom phone number 
deobfuscation code. The solution given assumes that the file's text will fit in 
a String, which may not be true. But, we can iterate on that later.

In my opinion, this is worth more than just an example. Parse any file and get 
a list of phone numbers out.
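
To illustrate the general approach described above (whole text in one String, de-obfuscate, then match), here is a minimal sketch. It is not the code from r1627446: the class name PhoneSketch, the look-alike character substitutions, and the ten-digit 3-3-4 pattern are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of String-based phone number extraction. */
class PhoneSketch {
    // Ten digits in 3-3-4 groups; the area code may be in parentheses and the
    // groups may be separated by one space, dot, or hyphen. The lookarounds
    // stop matches that start or end in the middle of a word.
    private static final Pattern PHONE = Pattern.compile(
            "(?<!\\w)\\(?(\\d{3})\\)?[\\s.-]?(\\d{3})[\\s.-]?(\\d{4})(?!\\w)");

    /** Replaces a few letters commonly used in place of digits (assumed set). */
    static String deobfuscate(String text) {
        return text.replace('O', '0').replace('o', '0')
                   .replace('l', '1').replace('I', '1');
    }

    /** Returns every match as a bare ten-digit string, in document order. */
    static List<String> extract(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(deobfuscate(text));
        while (m.find()) {
            numbers.add(m.group(1) + m.group(2) + m.group(3));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("Call 555-867-53O9 or (212) 555 0l42."));
    }
}
```

As noted in the comment, holding the full text in memory is the main limitation; a streaming variant would need to handle numbers that straddle two characters() calls.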


[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146976#comment-14146976
 ] 

Nick Burch commented on TIKA-1420:
--

Since it's an example, it might be good to put in a hefty amount of class-level 
JavaDoc explaining how it works, why you might want to use something like that 
etc!


[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146826#comment-14146826
 ] 

Hudson commented on TIKA-1420:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #207 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/207/])
TIKA-1420, create an example of a PhoneNumberContentExtractor. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627397)
* /tika/trunk/tika-example/pom.xml
* 
/tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* 
/tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* /tika/trunk/tika-example/src/test/resources
* /tika/trunk/tika-example/src/test/resources/org
* /tika/trunk/tika-example/src/test/resources/org/apache
* /tika/trunk/tika-example/src/test/resources/org/apache/tika
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example
* 
/tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt



[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146808#comment-14146808
 ] 

Hudson commented on TIKA-1420:
--

FAILURE: Integrated in tika-trunk-jdk1.7 #229 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/229/])
TIKA-1420, create an example of a PhoneNumberContentExtractor. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627397)
* /tika/trunk/tika-example/pom.xml
* 
/tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* 
/tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* /tika/trunk/tika-example/src/test/resources
* /tika/trunk/tika-example/src/test/resources/org
* /tika/trunk/tika-example/src/test/resources/org/apache
* /tika/trunk/tika-example/src/test/resources/org/apache/tika
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example
* 
/tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt



tika-trunk-jdk1.7 - Build # 229 - Failure

2014-09-24 Thread Apache Jenkins Server

The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #229)

Status: Failure

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/229/ to 
view the results.

[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146775#comment-14146775
 ] 

Tyler Palsulich commented on TIKA-1420:
---

Initial example added in r1627397. 


[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146588#comment-14146588
 ] 

Tim Allison commented on TIKA-1396:
---

Y, I can think of a few options.  We still need to add tags in the PDFParser 
and RTFParser, and I'll do that on TIKA-1427...thank you for opening that.

You could use a ParserContainerExtractor to extract each file, or you could use 
an EmbeddedDocumentExtractor (see TikaCLI in tika-app or UnpackerResource in 
tika-server for examples).

You might also try the RecursiveParserWrapper that I just added to trunk if you 
know that your docs will be small enough to hold in memory.  With that, you 
parse a document and then call getMetadata() on the parser.  It returns a list 
of Metadata objects -- the first one is the parent document and then one 
metadata object for each attachment.  The text can be stored in a metadata 
field depending on what ContentHandlerFactory you pass in...but you would just 
iterate through the list to get the metadata and content for each embedded doc.



> Embedded images in PDF documents
> 
>
> Key: TIKA-1396
> URL: https://issues.apache.org/jira/browse/TIKA-1396
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
> Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>Reporter: Damiano
>Priority: Critical
> Fix For: 1.6
>
> Attachments: tika_images.pdf
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika can not find the image.
> Is this a PDF-related problem? If I do the same operation with a DOC 
> document, Tika finds the image correctly.




[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread James Baker (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146564#comment-14146564
 ] 

James Baker commented on TIKA-1396:
---

Issue created, TIKA-1427.


[jira] [Created] (TIKA-1427) PDF Images don't appear in structured view

2014-09-24 Thread James Baker (JIRA)

James Baker created TIKA-1427:
-

 Summary: PDF Images don't appear in structured view
 Key: TIKA-1427
 URL: https://issues.apache.org/jira/browse/TIKA-1427
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: James Baker


When viewing, say, a Word Document, any images appear in the 'structured view' 
of the document as <img> tags. The same is not true of PDF documents, and we 
lose both the fact that there is an image present, and where it is in the 
document.

Some discussion of this issue in the comments of TIKA-1396.




[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread James Baker (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146558#comment-14146558
 ] 

James Baker commented on TIKA-1396:
---

That will affect my processing, yes. My use case is trying to split a document 
into separate documents based on a delimiter in the text. If we don't know 
where the image is on the page, we don't know which document it should be in! 
Any ideas how that could be worked around?
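
For context, this is roughly what the splitting use case needs from the structured output. The sketch below assumes images appear inline as img elements with a src attribute, which is exactly what is missing for PDFs today; the SectionSplitter class, the "=====" delimiter, and the sample markup are hypothetical. It uses only the JDK's SAX parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: assigns inline images to delimiter-separated sections. */
class SectionSplitter extends DefaultHandler {
    static final String DELIMITER = "=====";  // assumed delimiter text

    /** One list of image references per section, in document order. */
    final List<List<String>> imagesPerSection = new ArrayList<>();
    private final StringBuilder pending = new StringBuilder();

    SectionSplitter() {
        imagesPerSection.add(new ArrayList<>());  // section 0 starts immediately
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        if ("img".equals(qName)) {
            // The image belongs to whichever section is current at this point.
            imagesPerSection.get(imagesPerSection.size() - 1).add(atts.getValue("src"));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        pending.append(ch, start, length);  // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
    }

    /** Opens a new section for each delimiter found in the buffered text. */
    private void flushText() {
        int idx = 0;
        while ((idx = pending.indexOf(DELIMITER, idx)) >= 0) {
            imagesPerSection.add(new ArrayList<>());
            idx += DELIMITER.length();
        }
        pending.setLength(0);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<body><p>first part</p><img src=\"image0.png\"/>"
                + "<p>=====</p><p>second part</p><img src=\"image1.png\"/></body>";
        SectionSplitter splitter = new SectionSplitter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), splitter);
        System.out.println(splitter.imagesPerSection);
    }
}
```

Without the img events in the stream, the handler has no way to tell which section an extracted image belongs to. A delimiter split across two elements would also be missed by this sketch.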


[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146537#comment-14146537
 ] 

Tim Allison commented on TIKA-1422:
---

Sorry, user error.  Needed to force update.  Thank you!

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
> Fix For: 1.7
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more 
> tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< 
> FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at 
> org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at 
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at 
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at 
> org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:

[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-09-24 Thread Tyler Palsulich (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146523#comment-14146523
 ] 

Tyler Palsulich commented on TIKA-1422:
---

The Hudson builds are now stable with the fix from TIKA-1421. So, this is only 
a failure when Tesseract is installed. It has something to do with how 
attachments are parsed, but I'm not sure exactly what this test is or why it's 
failing. As I understand it, there are 4 invocations of the handler without 
Tesseract installed and 5 with. So, it may not be an actual problem...

But, if you think we should disable it temporarily, that's fine by me! We could 
also comment out the failing Assert in this test.


[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146360#comment-14146360
 ] 

Hudson commented on TIKA-1419:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #206 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/206/])
TIKA-1419: upgrade to PDFBox 1.8.7 and update CHANGES.txt for this and a few 
recent changes (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627308)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml


> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.




[jira] [Commented] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146361#comment-14146361
 ] 

Hudson commented on TIKA-1424:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #206 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/206/])
TIKA-1424: clear PDFont's resources after each document (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627304)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java


> Clear PDFont's resources after each file to prevent memory leak
> ---
>
> Key: TIKA-1424
> URL: https://issues.apache.org/jira/browse/TIKA-1424
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>
> PDFBox-2200 identified a memory-leak/caching strategy that can cause problems 
> for some documents.  A workaround of clearing the cache was recommended for 
> now.  Let's add that to Tika.




[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146338#comment-14146338
 ] 

Hudson commented on TIKA-1419:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #228 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/228/])
TIKA-1419: upgrade to PDFBox 1.8.7 and update CHANGES.txt for this and a few 
recent changes (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627308)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml


> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.




[jira] [Commented] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak

2014-09-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146339#comment-14146339
 ] 

Hudson commented on TIKA-1424:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #228 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/228/])
TIKA-1424: clear PDFont's resources after each document (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627304)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java


> Clear PDFont's resources after each file to prevent memory leak
> ---
>
> Key: TIKA-1424
> URL: https://issues.apache.org/jira/browse/TIKA-1424
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>
> PDFBox-2200 identified a memory-leak/caching strategy that can cause problems 
> for some documents.  A workaround of clearing the cache was recommended for 
> now.  Let's add that to Tika.




[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146294#comment-14146294
 ] 

Tim Allison commented on TIKA-1419:
---

Happy to help (and again my apologies for the post-hoc run!)...and I look 
forward to the day when you can run your own regression tests on our shared vm!

Email would be great or you could open a ticket on tika for the upgrade before 
it is officially released.

Thank you, again!



> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.




[jira] [Resolved] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-24 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1419.
---
Resolution: Fixed

r1627308

> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.




[jira] [Resolved] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak

2014-09-24 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1424.
---
Resolution: Fixed

r1627304

> Clear PDFont's resources after each file to prevent memory leak
> ---
>
> Key: TIKA-1424
> URL: https://issues.apache.org/jira/browse/TIKA-1424
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
>
> PDFBox-2200 identified a memory-leak/caching strategy that can cause problems 
> for some documents.  A workaround of clearing the cache was recommended for 
> now.  Let's add that to Tika.




[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails

2014-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146283#comment-14146283
 ] 

Tim Allison commented on TIKA-1422:
---

While work is going on to get the TesseractOCRParser tests to pass on systems 
with and without Tesseract, would it be possible to temporarily ignore or 
comment out the things that are causing failures so that trunk will build 
cleanly?

I got a clean build if I removed TesseractOCRParser from the services list and 
commented out this line in TikaMimeTypesTest:
{noformat}
assertEquals("org.apache.tika.parser.ocr.TesseractOCRParser",
        bmp.get("parser"));
{noformat}
 
To be clear, I'm extremely grateful for all of the work that has gone into 
integrating OCR, and apologies if you are just about to commit the fixes!

> org.apache.tika.parser.mail.RFC822ParserTest fails
> --
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
> Fix For: 1.7
>
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests:   testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): 
> (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 
> 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: 
> "amd64", family: "unix"
> [mattmann@memex tika]$ 
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
>  
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.152 sec  <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations: 
> xHTMLContentHandler.startElement(
> "http://www.w3.org/1999/xhtml",
> "div",
> "div",
> isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation: 
> Undesired invocation:
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at 
> org.apache.tika.parser.mail.R

[jira] [Resolved] (TIKA-1297) Images not being extracted from PDFs

2014-09-24 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1297.
---
   Resolution: Fixed
Fix Version/s: 1.6

> Images not being extracted from PDFs
> 
>
> Key: TIKA-1297
> URL: https://issues.apache.org/jira/browse/TIKA-1297
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: James Baker
> Fix For: 1.6
>
>
> Images embedded within PDF documents are not being extracted by Tika. I have 
> tested this via the command line (where the -z option fails to extract any 
> images), and by inspecting the XHTML version of the PDF produced by Tika 
> (where the image tags are not included in the output).
> The images are extractable by PDFBox, so Tika should be able to extract them 
> and include them in the XHTML output.




[jira] [Closed] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison closed TIKA-1396.
-
Resolution: Not a Problem

> Embedded images in PDF documents
> 
>
> Key: TIKA-1396
> URL: https://issues.apache.org/jira/browse/TIKA-1396
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
> Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>Reporter: Damiano
>Priority: Critical
> Fix For: 1.6
>
> Attachments: tika_images.pdf
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? If I do the same operation with a DOC 
> document, Tika finds the image correctly.




[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

2014-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146230#comment-14146230
 ] 

Tim Allison commented on TIKA-1396:
---

Ah, ok.  Yes, please open another issue.  I should also add meta tags to the 
RTFParser while I'm at it.  Should I follow the model used by the Microsoft 
parsers?

{noformat}
// Emit an empty <div class="embedded" id="..."/> marker for the embedded resource.
AttributesImpl attributes = new AttributesImpl();
attributes.addAttribute("", "class", "class", "CDATA", "embedded");
attributes.addAttribute("", "id", "id", "CDATA", id);
xhtml.startElement("div", attributes);
xhtml.endElement("div");
{noformat}

For the PDFParser, the inline images are extracted at the "bottom" of each 
page, not the actual coordinates, and regular attachments are extracted at the 
end of the document.  Will this wreck your processing?
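
For anyone consuming that output, here is a rough sketch of how the markers could be picked up downstream. It only assumes the div/class/id shape shown above; the class name EmbeddedMarkerCollector, the sample XHTML and the id value "image0.jpg" are made up for illustration and are not taken from the Tika code base.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical consumer: records the id of every <div class="embedded"> marker, in document order. */
class EmbeddedMarkerCollector extends DefaultHandler {
    private final List<String> ids = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default SAXParserFactory is not namespace-aware, so match on qName.
        if ("div".equals(qName) && "embedded".equals(atts.getValue("class"))) {
            ids.add(atts.getValue("id"));
        }
    }

    List<String> getIds() {
        return ids;
    }

    /** Parses an XHTML string and returns the ids of all embedded markers found. */
    static List<String> collect(String xhtml) {
        EmbeddedMarkerCollector collector = new EmbeddedMarkerCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return collector.getIds();
    }

    public static void main(String[] args) {
        // Made-up sample: one page with an inline image marker at the bottom of the page.
        String page = "<html><body><div class=\"page\"><p>Some text</p>"
                + "<div class=\"embedded\" id=\"image0.jpg\"/></div></body></html>";
        System.out.println(collect(page)); // prints [image0.jpg]
    }
}
```

Since the markers carry no coordinates, a consumer like this only learns which page (or the end of the document) a resource belongs to, not where on the page it sat.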

> Embedded images in PDF documents
> 
>
> Key: TIKA-1396
> URL: https://issues.apache.org/jira/browse/TIKA-1396
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
> Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>Reporter: Damiano
>Priority: Critical
> Fix For: 1.6
>
> Attachments: tika_images.pdf
>
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? If I do the same operation with a DOC 
> document, Tika finds the image correctly.




[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers

2014-09-24 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146187#comment-14146187
 ] 

Nick Burch commented on TIKA-1420:
--

For now, I'd suggest putting this into the Examples package, then the 
additional dependency should be fine.

Characters-wise, you might need to use some sort of rolling buffer for the 
detection, in case the number gets split between multiple characters() calls 
(e.g. part of it is styled and part is not, so the pieces land in different 
tags, or it just falls across a text size boundary). For the initial version, 
though, just checking the characters before passing them on should work fine.
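
A minimal sketch of the rolling-buffer idea, under stated assumptions: the class name RollingPhoneHandler and the NNN-NNN-NNNN pattern are purely illustrative (this is not the pattern or handler used in tika-example), and a real version would presumably wrap another handler, e.g. via Tika's ContentHandlerDecorator, and forward the characters on. Here the handler only collects matches, carrying a short tail of text between characters() calls so a number split across calls is still seen whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: finds phone-like numbers even when split across characters() calls. */
class RollingPhoneHandler extends DefaultHandler {
    // Illustrative pattern only: NNN-NNN-NNNN with no digit directly before or after.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)\\d{3}-\\d{3}-\\d{4}(?!\\d)");
    // 12 characters for a complete match plus 1 character of left context.
    private static final int TAIL = 13;

    // Starts with a one-space sentinel so index 0 always acts as left context only.
    private final StringBuilder buffer = new StringBuilder(" ");
    private final List<String> numbers = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        buffer.append(ch, start, length);
        scan(false);
    }

    @Override
    public void endDocument() throws SAXException {
        scan(true); // no more text can follow, so a match at the very end is final
    }

    private void scan(boolean flush) {
        Matcher m = PHONE.matcher(buffer);
        // Search from index 1; transparent bounds let the lookbehind still see index 0.
        m.region(1, buffer.length()).useTransparentBounds(true);
        int keepFrom = Math.max(0, buffer.length() - TAIL);
        while (m.find()) {
            if (!flush && m.end() == buffer.length()) {
                break; // more digits may arrive in the next call, decide later
            }
            numbers.add(m.group());
            keepFrom = Math.max(keepFrom, m.end());
        }
        buffer.delete(0, keepFrom); // keep only the tail that could still matter
    }

    List<String> getNumbers() {
        return numbers;
    }

    /** Convenience for trying the handler: feeds each chunk as one characters() call. */
    static List<String> extract(String... chunks) {
        RollingPhoneHandler handler = new RollingPhoneHandler();
        try {
            for (String chunk : chunks) {
                handler.characters(chunk.toCharArray(), 0, chunk.length());
            }
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.getNumbers();
    }

    public static void main(String[] args) {
        System.out.println(extract("Call 555-", "123-", "4567 now")); // prints [555-123-4567]
    }
}
```

Element events are deliberately ignored here, so a number that is partly styled (and therefore spread over several tags) is still matched; whether that is the right call for every document type is an open question.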

> Add Metadata Extraction to Arbitrary Parsers
> 
>
> Key: TIKA-1420
> URL: https://issues.apache.org/jira/browse/TIKA-1420
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tyler Palsulich
>Priority: Minor
>
> Suppose you wish to extract information from arbitrary file types and add it 
> to a Metadata Object. This type of task is best handled by a... Handler. But, 
> Handlers do not have access to the Metadata Object passed to a Parser. 
> So, I see a few ways we could do this using existing functionality.
> 1) Make an intermediate XML representation of the desired metadata in a 
> handler, then convert the XML to the Metadata after parsing. 
> 2) Create a second Parser which extracts the desired information.
>  a) Assume the Handler passed to this Parser is already filled with 
> content. So, we could simply get whatever content from the Handler and 
> populate the Metadata directly.
>  b) Create a new Stream in the first Parser to pass to the second, which 
> in turn populates the Metadata.
> None of these options seem ideal. Is there a better way to handle this 
> scenario? Or, can we create some sort of... wrapper for a Handler which can 
> accept a Metadata Object to populate directly? 




tika-trunk-jdk1.7 - Build # 227 - Failure

2014-09-24 Thread Apache Jenkins Server

The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #227)

Status: Failure

Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/227/ to 
view the results.
