[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147302#comment-14147302 ]

Hudson commented on TIKA-1420:
--
SUCCESS: Integrated in tika-trunk-jdk1.6 #208 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/208/])
TIKA-1420, refactor the phone number extraction to use a custom method of de-obfuscating numbers. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627446)
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/CleanPhoneText.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt

> Add Metadata Extraction to Arbitrary Parsers
>
> Key: TIKA-1420
> URL: https://issues.apache.org/jira/browse/TIKA-1420
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tyler Palsulich
> Priority: Minor
>
> Suppose you wish to extract information from arbitrary file types and add it
> to a Metadata Object. This type of task is best handled by a... Handler. But,
> Handlers do not have access to the Metadata Object passed to a Parser.
> So, I see a few ways we could do using existing functionality.
> 1) Make an intermediate XML representation of the desired metadata in a
> handler, then convert the XML to the Metadata after parsing.
> 2) Create a second Parser which extracts the desired information.
>    a) Assume the Handler passed to this Parser is already filled with
>    content. So, we could simply get whatever content from the Handler and
>    populate the Metadata directly.
>    b) Create a new Stream in the first Parser to pass to the second, which
>    in turn populates the Metadata.
> None of these options seem ideal. Is there a better way to handle this
> scenario? Or, can we create some sort of... wrapper for a Handler which can
> accept a Metadata Object to populate directly?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147285#comment-14147285 ]

Hudson commented on TIKA-1420:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #230 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/230/])
TIKA-1420, refactor the phone number extraction to use a custom method of de-obfuscating numbers. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627446)
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/CleanPhoneText.java
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147272#comment-14147272 ]

Tyler Palsulich commented on TIKA-1420:
---
Just made some more updates in r1627446. I added a lot more documentation, removed the dependency on libphonenumber, and added custom phone number deobfuscation code. The solution given assumes that the file's text will fit in a String, which may not be true. But, we can iterate on that later. In my opinion, this is worth more than just an example. Parse any file and get a list of phone numbers out.
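The approach described in the comment above (a handler that gathers text and writes what it finds into the metadata) can be sketched with JDK classes alone. This is a minimal sketch, not the committed PhoneExtractingContentHandler: the class name `PhoneMetadataHandlerSketch`, the `phonenumbers` key, and the simplified regular expression are assumptions made for illustration, a plain `Map` stands in for Tika's Metadata object, and the JDK's `XMLFilterImpl` stands in for a Tika handler decorator. Like the solution in r1627446, it buffers the whole text in memory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: wraps another ContentHandler and fills a metadata map.
class PhoneMetadataHandlerSketch extends XMLFilterImpl {
    // Assumed metadata key, chosen for this example only.
    static final String PHONE_KEY = "phonenumbers";

    // Simplified pattern for 10-digit numbers such as (555) 123-4567 or 555.987.6543.
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}");

    private final Map<String, List<String>> metadata; // stand-in for Tika's Metadata
    private final StringBuilder text = new StringBuilder();

    PhoneMetadataHandlerSketch(ContentHandler delegate, Map<String, List<String>> metadata) {
        setContentHandler(delegate); // XMLFilterImpl forwards every event to the delegate
        this.metadata = metadata;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        text.append(ch, start, length);      // buffer the text (assumes it fits in memory)
        super.characters(ch, start, length); // keep the wrapped handler working as before
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            String digits = m.group().replaceAll("\\D", ""); // keep digits only
            metadata.computeIfAbsent(PHONE_KEY, k -> new ArrayList<>()).add(digits);
        }
    }

    // Convenience driver: feeds a string through the handler as SAX character events.
    static List<String> extract(String body) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        PhoneMetadataHandlerSketch handler =
                new PhoneMetadataHandlerSketch(new DefaultHandler(), metadata);
        try {
            handler.startDocument();
            char[] chars = body.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return metadata.getOrDefault(PHONE_KEY, new ArrayList<>());
    }

    public static void main(String[] args) {
        System.out.println(extract("Call (555) 123-4567 or 555.987.6543."));
    }
}
```

Because the wrapper only adds behaviour in `characters` and `endDocument`, the handler it decorates still receives every event, which is the property the issue asks for: the caller keeps its usual handler and gets the metadata populated as a side effect.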
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146976#comment-14146976 ]

Nick Burch commented on TIKA-1420:
--
Since it's an example, it might be good to put in a hefty amount of class-level JavaDoc explaining how it works, why you might want to use something like that etc!
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146826#comment-14146826 ]

Hudson commented on TIKA-1420:
--
SUCCESS: Integrated in tika-trunk-jdk1.6 #207 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/207/])
TIKA-1420, create an example of a PhoneNumberContentExtractor. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627397)
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* /tika/trunk/tika-example/src/test/resources
* /tika/trunk/tika-example/src/test/resources/org
* /tika/trunk/tika-example/src/test/resources/org/apache
* /tika/trunk/tika-example/src/test/resources/org/apache/tika
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146808#comment-14146808 ]

Hudson commented on TIKA-1420:
--
FAILURE: Integrated in tika-trunk-jdk1.7 #229 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/229/])
TIKA-1420, create an example of a PhoneNumberContentExtractor. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627397)
* /tika/trunk/tika-example/pom.xml
* /tika/trunk/tika-example/src/main/java/org/apache/tika/example/PhoneExtractingContentHandler.java
* /tika/trunk/tika-example/src/test/java/org/apache/tika/example/PhoneExtractingContentHandlerTest.java
* /tika/trunk/tika-example/src/test/resources
* /tika/trunk/tika-example/src/test/resources/org
* /tika/trunk/tika-example/src/test/resources/org/apache
* /tika/trunk/tika-example/src/test/resources/org/apache/tika
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example
* /tika/trunk/tika-example/src/test/resources/org/apache/tika/example/testPhoneNumberExtractor.odt
tika-trunk-jdk1.7 - Build # 229 - Failure
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #229) Status: Failure Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/229/ to view the results.
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146775#comment-14146775 ]

Tyler Palsulich commented on TIKA-1420:
---
Initial example added in r1627397.
[jira] [Commented] (TIKA-1396) Embedded images in PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146588#comment-14146588 ]

Tim Allison commented on TIKA-1396:
---
Y, I can think of a few options. We still need to add tags in the PDFParser and RTFParser, and I'll do that on TIKA-1427...thank you for opening that. You could use a ParserContainerExtractor to extract each file, or you could use an EmbeddedDocumentExtractor (see TikaCLI in tika-app or UnpackerResource in tika-server for examples). You might also try the RecursiveParserWrapper that I just added to trunk if you know that your docs will be small enough to hold in memory. With that, you parse a document and then call getMetadata() on the parser. It returns a list of Metadata objects -- the first one is the parent document and then one metadata object for each attachment. The text can be stored in a metadata field depending on what ContentHandlerFactory you pass in...but you would just iterate through the list to get the metadata and content for each embedded doc.

> Embedded images in PDF documents
>
> Key: TIKA-1396
> URL: https://issues.apache.org/jira/browse/TIKA-1396
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.5
> Environment: *OS:*
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
> Reporter: Damiano
> Priority: Critical
> Fix For: 1.6
>
> Attachments: tika_images.pdf
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika can not find the image.
> Is this a PDF related problem? Because if i do the same operation with a DOC
> document Tika finds the image correctly.
[jira] [Commented] (TIKA-1396) Embedded images in PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146564#comment-14146564 ]

James Baker commented on TIKA-1396:
---
Issue created, TIKA-1427.
[jira] [Created] (TIKA-1427) PDF Images don't appear in structured view
James Baker created TIKA-1427:
-
Summary: PDF Images don't appear in structured view
Key: TIKA-1427
URL: https://issues.apache.org/jira/browse/TIKA-1427
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Reporter: James Baker

When viewing, say, a Word Document, any images appear in the 'structured view' of the document as tags. The same is not true of PDF documents, and we lose both the fact that there is an image present, and where it is in the document. Some discussion of this issue in the comments of TIKA-1396.
[jira] [Commented] (TIKA-1396) Embedded images in PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146558#comment-14146558 ]

James Baker commented on TIKA-1396:
---
That will affect my processing, yes. My use case is trying to split a document into separate documents based on a delimiter in the text. If we don't know where the image is on the page, we don't know which document it should be in! Any ideas how that could be worked around?
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146537#comment-14146537 ]

Tim Allison commented on TIKA-1422:
---
Sorry, user error. Needed to force update. Thank you!

> org.apache.tika.parser.mail.RFC822ParserTest fails
>
> Key: TIKA-1422
> URL: https://issues.apache.org/jira/browse/TIKA-1422
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Fix For: 1.7
>
> I'm seeing test failures from:
> {noformat}
> Results :
> Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): (..)
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1
> {noformat}
> CentOS6 VM image, running:
> {noformat}
> [mattmann@memex tika]$ java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
> [mattmann@memex tika]$ mvn -version
> Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T09:37:52-08:00)
> Maven home: /usr/share/apache-maven
> Java version: 1.7.0_65, vendor: Oracle Corporation
> Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: "amd64", family: "unix"
> [mattmann@memex tika]$
> {noformat}
> Here are the surefire reports - no clue what's up here:
> {noformat}
> [mattmann@memex tika]$ more tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt
> ---
> Test set: org.apache.tika.parser.mail.RFC822ParserTest
> ---
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< FAILURE!
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: 0.152 sec <<< FAILURE!
> org.mockito.exceptions.verification.TooManyActualInvocations:
> xHTMLContentHandler.startElement(
>     "http://www.w3.org/1999/xhtml",
>     "div",
>     "div",
>     isA(org.xml.sax.Attributes)
> );
> Wanted 4 times but was 5
>   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87)
> Caused by: org.mockito.exceptions.cause.UndesiredInvocation:
> Undesired invocation:
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243)
>   at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247)
>   at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
>   at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
>   at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
>   at org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:84)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146523#comment-14146523 ]

Tyler Palsulich commented on TIKA-1422:
---
The Hudson builds are now stable with the fix from TIKA-1421. So, this is only a failure when Tesseract is installed. It has something to do with how attachments are parsed, but I'm not sure exactly what this test is or why it's failing. As I understand it, there are 4 invocations of the handler without Tesseract installed and 5 with. So, it may not be an actual problem... But, if you think we should disable it temporarily, that's fine by me! We could also comment out the failing Assert in this test.
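As a rough illustration of why a fixed invocation count is fragile, as discussed in the TIKA-1422 comment above, the sketch below counts `div` start events on a SAX handler. The class name `DivCountSketch` and both XHTML strings are made up for illustration; they are not the output of RFC822Parser. One extra embedded part, such as text contributed by an optional OCR parser, is assumed to add one more `div`, which is enough to turn 4 into 5.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: counts how often startElement fires for <div>.
class DivCountSketch {

    static int countDivs(String xhtml) {
        final int[] count = {0};
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The default factory is not namespace-aware, so compare qName.
                if ("div".equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), counter);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // Made-up documents: four body parts, then the same with one extra part.
        String base = "<body><div>a</div><div>b</div><div>c</div><div>d</div></body>";
        String withExtra = "<body><div>a</div><div>b</div><div>c</div><div>d</div>"
                + "<div>text from an optional parser</div></body>";
        System.out.println(countDivs(base) + " vs " + countDivs(withExtra));
    }
}
```

A test that asserts an exact count passes for the first document and fails for the second, even though both are well-formed; asserting a lower bound, or making the optional parser explicit in the test setup, would avoid depending on what happens to be installed.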
[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146360#comment-14146360 ]

Hudson commented on TIKA-1419:
--
SUCCESS: Integrated in tika-trunk-jdk1.6 #206 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/206/])
TIKA-1419: upgrade to PDFBox 1.8.7 and update CHANGES.txt for this and a few recent changes (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627308)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml

> Upgrade to PDFBox 1.8.7
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
> Will run against govdocs1 early next week and then upgrade if no major
> regressions are found.
[jira] [Commented] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak
[ https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146361#comment-14146361 ]

Hudson commented on TIKA-1424:
--
SUCCESS: Integrated in tika-trunk-jdk1.6 #206 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/206/])
TIKA-1424: clear PDFont's resources after each document (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627304)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java

> Clear PDFont's resources after each file to prevent memory leak
>
> Key: TIKA-1424
> URL: https://issues.apache.org/jira/browse/TIKA-1424
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
>
> PDFBox-2200 identified a memory-leak/caching strategy that can cause problems
> for some documents. A workaround of clearing the cache was recommended for
> now. Let's add that to Tika.
[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146338#comment-14146338 ]

Hudson commented on TIKA-1419:
--
SUCCESS: Integrated in tika-trunk-jdk1.7 #228 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/228/])
TIKA-1419: upgrade to PDFBox 1.8.7 and update CHANGES.txt for this and a few recent changes (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627308)
* /tika/trunk/CHANGES.txt
* /tika/trunk/tika-parsers/pom.xml
[jira] [Commented] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak
[ https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146339#comment-14146339 ] Hudson commented on TIKA-1424: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #228 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/228/]) TIKA-1424: clear PDFont's resources after each document (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1627304) * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java > Clear PDFont's resources after each file to prevent memory leak > --- > > Key: TIKA-1424 > URL: https://issues.apache.org/jira/browse/TIKA-1424 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > > PDFBox-2200 identified a memory-leak/caching strategy that can cause problems > for some documents. A workaround of clearing the cache was recommended for > now. Let's add that to Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146294#comment-14146294 ] Tim Allison commented on TIKA-1419: --- Happy to help (and again my apologies for the post-hoc run!)...and I look forward to the day when you can run your own regression tests on our shared vm! Email would be great or you could open a ticket on tika for the upgrade before it is officially released. Thank you, again! > Upgrade to PDFBox 1.8.7 > --- > > Key: TIKA-1419 > URL: https://issues.apache.org/jira/browse/TIKA-1419 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv > > > Will run against govdocs1 early next week and then upgrade if no major > regressions are found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1419) Upgrade to PDFBox 1.8.7
[ https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1419. --- Resolution: Fixed r1627308 > Upgrade to PDFBox 1.8.7 > --- > > Key: TIKA-1419 > URL: https://issues.apache.org/jira/browse/TIKA-1419 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv > > > Will run against govdocs1 early next week and then upgrade if no major > regressions are found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1424) Clear PDFont's resources after each file to prevent memory leak
[ https://issues.apache.org/jira/browse/TIKA-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1424. --- Resolution: Fixed r1627304 > Clear PDFont's resources after each file to prevent memory leak > --- > > Key: TIKA-1424 > URL: https://issues.apache.org/jira/browse/TIKA-1424 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > > PDFBox-2200 identified a memory-leak/caching strategy that can cause problems > for some documents. A workaround of clearing the cache was recommended for > now. Let's add that to Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1422) org.apache.tika.parser.mail.RFC822ParserTest fails
[ https://issues.apache.org/jira/browse/TIKA-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146283#comment-14146283 ] Tim Allison commented on TIKA-1422: --- While work is going on to get the TesseractOCRParser tests to pass on systems with and without Tesseract, would it be possible to temporarily ignore or comment out the things that are causing failures so that trunk will build cleanly? I got a clean build if I removed TesseractOCRParser from the services list and commented out this line in TikaMimeTypesTest: {noformat} assertEquals("org.apache.tika.parser.ocr.TesseractOCRParser", bmp.get("parser")); {noformat} To be clear, I'm extremely grateful for all of the work that has gone into integrating OCR, and apologies if you are just about to commit the fixes! > org.apache.tika.parser.mail.RFC822ParserTest fails > -- > > Key: TIKA-1422 > URL: https://issues.apache.org/jira/browse/TIKA-1422 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Chris A. Mattmann > Fix For: 1.7 > > > I'm seeing test failures from: > {noformat} > Results : > Failed tests: testMultipart(org.apache.tika.parser.mail.RFC822ParserTest): > (..) 
> Tests run: 538, Failures: 1, Errors: 0, Skipped: 1 > {noformat} > CentOS6 VM image, running: > {noformat} > [mattmann@memex tika]$ java -version > java version "1.7.0_67" > Java(TM) SE Runtime Environment (build 1.7.0_67-b01) > Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) > [mattmann@memex tika]$ mvn -version > Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; > 2014-02-14T09:37:52-08:00) > Maven home: /usr/share/apache-maven > Java version: 1.7.0_65, vendor: Oracle Corporation > Java home: /data/home/mattmann/dist/jdk1.7.0_65/jre > Default locale: en_US, platform encoding: UTF-8 > OS name: "linux", version: "2.6.32-431.23.3.el6.centos.plus.x86_64", arch: > "amd64", family: "unix" > [mattmann@memex tika]$ > {noformat} > Here are the surefire reports - no clue what's up here: > {noformat} > [mattmann@memex tika]$ more > tika-parsers/target/surefire-reports/org.apache.tika.parser.mail.RFC822ParserTest.txt > > --- > Test set: org.apache.tika.parser.mail.RFC822ParserTest > --- > Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.699 sec <<< > FAILURE! > testMultipart(org.apache.tika.parser.mail.RFC822ParserTest) Time elapsed: > 0.152 sec <<< FAILURE! 
> org.mockito.exceptions.verification.TooManyActualInvocations: > xHTMLContentHandler.startElement( > "http://www.w3.org/1999/xhtml", > "div", > "div", > isA(org.xml.sax.Attributes) > ); > Wanted 4 times but was 5 > at > org.apache.tika.parser.mail.RFC822ParserTest.testMultipart(RFC822ParserTest.java:87) > Caused by: org.mockito.exceptions.cause.UndesiredInvocation: > Undesired invocation: > at > org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) > at > org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) > at > org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) > at > org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) > at > org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60) > at > org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) > at > org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) > at > org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) > at > org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126) > at > org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264) > at > org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254) > at > org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284) > at > org.apache.tika.parser.ocr.TesseractOCRParser.extractOutput(TesseractOCRParser.java:243) > at > org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:155) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:247) > at > org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102) > at > org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133) > at 
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76) > at > org.apache.tika.parser.mail.R
[jira] [Resolved] (TIKA-1297) Images not being extracted from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1297. --- Resolution: Fixed Fix Version/s: 1.6 > Images not being extracted from PDFs > > > Key: TIKA-1297 > URL: https://issues.apache.org/jira/browse/TIKA-1297 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 >Reporter: James Baker > Fix For: 1.6 > > > Images embedded within PDF documents are not being extracted by Tika. I have > tested this via the command line (where the -z option fails to extract any > images), and by inspecting the XHTML version of the PDF produced by Tika > (where the image tags are not included in the output). > The images are extractable by PDFBox, so Tika should be able to extract them > and include them in the XHTML output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (TIKA-1396) Embedded images in PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison closed TIKA-1396. - Resolution: Not a Problem > Embedded images in PDF documents > > > Key: TIKA-1396 > URL: https://issues.apache.org/jira/browse/TIKA-1396 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: *OS:* > Ubuntu 14.04.1 LTS > *KERNEL:* > 3.13.0-33-generic > gcc version 4.8.2 > *JAVA:* > java version "1.8.0_11" > Java(TM) SE Runtime Environment (build 1.8.0_11-b12) > Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode) >Reporter: Damiano >Priority: Critical > Fix For: 1.6 > > Attachments: tika_images.pdf > > > Hello! > I just found a problem with PDF documents that have embedded images. > Doing: > java -jar tika-app-1.5.jar --extract tika.pdf > Tika can not find the image. > Is this a PDF related problem? Because if i do the same operation with a DOC > document Tika finds the image correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1396) Embedded images in PDF documents
[ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146230#comment-14146230 ] Tim Allison commented on TIKA-1396: --- Ah, ok. Yes, please open another issue. I should also add meta tags to the RTFParser while I'm at it. Is the model I should use the one from the Microsoft parsers? {noformat} AttributesImpl attributes = new AttributesImpl(); attributes.addAttribute("", "class", "class", "CDATA", "embedded"); attributes.addAttribute("", "id", "id", "CDATA", id); xhtml.startElement("div", attributes); xhtml.endElement("div"); {noformat} For the PDFParser, the inline images are extracted at the "bottom" of each page, not at their actual coordinates, and regular attachments are extracted at the end of the document. Will this wreck your processing? > Embedded images in PDF documents > > > Key: TIKA-1396 > URL: https://issues.apache.org/jira/browse/TIKA-1396 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5 > Environment: *OS:* > Ubuntu 14.04.1 LTS > *KERNEL:* > 3.13.0-33-generic > gcc version 4.8.2 > *JAVA:* > java version "1.8.0_11" > Java(TM) SE Runtime Environment (build 1.8.0_11-b12) > Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode) >Reporter: Damiano >Priority: Critical > Fix For: 1.6 > > Attachments: tika_images.pdf > > > Hello! > I just found a problem with PDF documents that have embedded images. > Doing: > java -jar tika-app-1.5.jar --extract tika.pdf > Tika can not find the image. > Is this a PDF related problem? Because if i do the same operation with a DOC > document Tika finds the image correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
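For readers following the TIKA-1396 comment above, here is a minimal sketch of the marker-div pattern it quotes, written against the plain JDK SAX API rather than Tika's XHTMLContentHandler so that it runs without Tika on the classpath. The class names EmbeddedMarkerDemo and RecordingHandler, the writeEmbeddedMarker helper, and the "image1.png" id are hypothetical and used only for illustration; the two-argument startElement("div", attributes) call in the comment is Tika's convenience form, which this sketch replaces with the standard four-argument SAX call.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Tika.
class EmbeddedMarkerDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits an empty <div class="embedded" id="..."/> marker for an embedded resource.
    static void writeEmbeddedMarker(ContentHandler handler, String id) throws SAXException {
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "class", "class", "CDATA", "embedded");
        attributes.addAttribute("", "id", "id", "CDATA", id);
        handler.startElement(XHTML, "div", "div", attributes);
        handler.endElement(XHTML, "div", "div");
    }

    // Records the SAX events as text so the result can be inspected.
    static class RecordingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }
    }

    public static void main(String[] args) throws SAXException {
        RecordingHandler handler = new RecordingHandler();
        writeEmbeddedMarker(handler, "image1.png"); // assumed id of an embedded image
        System.out.println(handler.out); // <div class="embedded" id="image1.png"></div>
    }
}
```

A consumer of the XHTML output could then look for div elements with class "embedded" to learn which embedded resources were found, regardless of where on the page the parser emitted them.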
[jira] [Commented] (TIKA-1420) Add Metadata Extraction to Arbitrary Parsers
[ https://issues.apache.org/jira/browse/TIKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146187#comment-14146187 ] Nick Burch commented on TIKA-1420: -- For now, I'd suggest putting this into the Examples package; then the additional dependency should be fine. Characters-wise, you might need to use some sort of rolling buffer for the detection, in case the number gets split between multiple characters() calls (e.g. part of it is styled and part is not, so the parts land in different tags, or it just falls across a text-size boundary), but for the initial version just checking the characters before passing them on should work fine. > Add Metadata Extraction to Arbitrary Parsers > > > Key: TIKA-1420 > URL: https://issues.apache.org/jira/browse/TIKA-1420 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: Tyler Palsulich >Priority: Minor > > Suppose you wish to extract information from arbitrary file types and add it > to a Metadata Object. This type of task is best handled by a... Handler. But, > Handlers do not have access to the Metadata Object passed to a Parser. > So, I see a few ways we could do using existing functionality. > 1) Make an intermediate XML representation of the desired metadata in a > handler, then convert the XML to the Metadata after parsing. > 2) Create a second Parser which extracts the desired information. > a) Assume the Handler passed to this Parser is already filled with > content. So, we could simply get whatever content from the Handler and > populate the Metadata directly. > b) Create a new Stream in the first Parser to pass to the second, which > in turn populates the Metadata. > None of these options seem ideal. Is there a better way to handle this > scenario? Or, can we create some sort of... wrapper for a Handler which can > accept a Metadata Object to populate directly? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
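The rolling-buffer idea from the TIKA-1420 comment above could be sketched roughly as follows. This is not the PhoneExtractingContentHandler that was committed to tika-example; RollingPhoneHandler, its fixed NNN-NNN-NNNN pattern, and the KEEP constant are assumptions made only for illustration, and the sketch collects numbers in a list instead of forwarding events or populating a Metadata object. It uses only the JDK SAX classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; a sketch of a rolling buffer, not Tika's implementation.
class RollingPhoneHandler extends DefaultHandler {
    // Assumed format: exactly 12 characters, e.g. 555-123-4567.
    private static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
    // An unfinished number at the end of the buffer is at most 11 characters long.
    private static final int KEEP = 11;

    private final StringBuilder buffer = new StringBuilder();
    private final List<String> numbers = new ArrayList<String>();

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);

        // Report every complete number currently in the buffer.
        Matcher matcher = PHONE.matcher(buffer);
        int consumed = 0;
        while (matcher.find()) {
            numbers.add(matcher.group());
            consumed = matcher.end();
        }

        // Drop text that was already matched so it is never reported twice,
        // then keep only the tail that could still be the start of a number.
        buffer.delete(0, consumed);
        if (buffer.length() > KEEP) {
            buffer.delete(0, buffer.length() - KEEP);
        }
    }

    List<String> getNumbers() {
        return numbers;
    }

    public static void main(String[] args) {
        RollingPhoneHandler handler = new RollingPhoneHandler();
        // The number is deliberately split across two characters() calls.
        char[] first = "Call 555-12".toCharArray();
        char[] second = "3-4567 now".toCharArray();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);
        System.out.println(handler.getNumbers()); // [555-123-4567]
    }
}
```

Keeping only a short tail bounds memory use on large documents. The trade-off is that the sketch only works for patterns with a known maximum length, and it does not attempt the de-obfuscation mentioned in the commit message.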
tika-trunk-jdk1.7 - Build # 227 - Failure
The Apache Jenkins build system has built tika-trunk-jdk1.7 (build #227) Status: Failure Check console output at https://builds.apache.org/job/tika-trunk-jdk1.7/227/ to view the results.