Fwd: tika install fail on os x 10.9.2
Hi all,

I have a new computer running OS X 10.9.2 (13C64). I am attempting to get Tika up and running, but am getting errors in the Maven install phase. My steps are as follows:

[annies-mbp:~/tika/] % svn co https://svn.apache.org/repos/asf/tika/trunk tmp
[annies-mbp:~/tika/tmp]% setenv MAVEN_OPTS "-Xms128m -Xmx256m"
[annies-mbp:~/tika/tmp]% mvn install

Results :

Tests in error:
  testiBooksParser(org.apache.tika.parser.ibooks.iBooksParserTest): Premature end of file.

Tests run: 506, Failures: 0, Errors: 1, Skipped: 1

[INFO]
[INFO] Reactor Summary:
[INFO] Apache Tika parent SUCCESS [ 0.626 s]
[INFO] Apache Tika core .. SUCCESS [ 6.631 s]
[INFO] Apache Tika parsers ... FAILURE [ 23.323 s]
. . .
[INFO]
[INFO] BUILD FAILURE
[INFO]

My Maven version is:

[annies-mbp:~/Development/tika/tmp]% mvn --version
Apache Maven 3.2.1 (ea8b2b07643dbb1b84b6d16e1f08391b666bc1e9; 2014-02-14T08:37:52-09:00)
Maven home: /usr/local/Cellar/maven/3.2.1/libexec
Java version: 1.8.0_05, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.2", arch: "x86_64", family: "mac"

Does anyone have any insight as to why this is failing at 'iBooksParserTest'?

Thanks!
Annie

--
Ann Bryant Burgess, PhD
Postdoctoral Fellow
Computer Science Department
University of Southern California
Viterbi School of Engineering
Los Angeles, CA

Alaska Science Center/USGS
Anchorage, AK

Cell: (585) 738-7549
Office: (907) 786-7059
Fax: (907) 786-7150
E-mail: anniebryant.burg...@gmail.com
Office Address: 4210 University Dr., Anchorage, AK 99508-4626
---
[jira] [Commented] (TIKA-1204) DWFX files detection
[ https://issues.apache.org/jira/browse/TIKA-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992899#comment-13992899 ]

Nick Burch commented on TIKA-1204:
----------------------------------

Mimetype and detector logic added in r1593322. Leaving the issue open for now, pending a small file which can be used to add unit tests for this.

> DWFX files detection
> --------------------
>
> Key: TIKA-1204
> URL: https://issues.apache.org/jira/browse/TIKA-1204
> Project: Tika
> Issue Type: Improvement
> Components: detector, mime
> Affects Versions: 1.4
> Reporter: Marco Quaranta
> Priority: Minor
> Attachments: General assembly filter.dwfx
>
>
> DWFX are AutoCAD [Design web format|http://en.wikipedia.org/wiki/Design_Web_Format] files and follow [Open Packaging Conventions|http://en.wikipedia.org/wiki/Open_Packaging_Conventions].
> Tika "correctly" detects these files as application/zip.
> It would be better if Tika could recognize the true mimetype: model/vnd.dwfx+xps. (y)
> Please add logic to ZipContainerDetector so that DWFX files can be detected. We need a method behaving like detectOfficeOpenXML(OPCPackage pkg):
> {noformat}
> PackageRelationshipCollection core =
>     pkg.getRelationshipsByType("http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence");
> if (core.size() != 1) {
>     // Invalid DWFX Package received
>     return null;
> }
> PackagePart corePart = pkg.getPart(core.getRelationship(0));
> String coreType = corePart.getContentType();
> return MediaType.parse(coreType);
> {noformat}
> Thank you,
> Marco

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-941) Detecting KML / KMZ files
[ https://issues.apache.org/jira/browse/TIKA-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-941.
----------------------------

Resolution: Fixed
Fix Version/s: 1.6

Thanks, additional namespaces added in r1593315.

> Detecting KML / KMZ files
> -------------------------
>
> Key: TIKA-941
> URL: https://issues.apache.org/jira/browse/TIKA-941
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.1
> Reporter: Marco Quaranta
> Assignee: Jukka Zitting
> Priority: Minor
> Labels: google, kml, kmz
> Fix For: 1.6, 1.2
>
> Attachments: ZipContainerDetector.java
>
>
> The KML format is a subtype of application/xml with a "kml" root node and (an optional?) "http://www.opengis.net/kml/2.2" namespace.
>
>   http://www.opengis.net/kml/2.2" localName="kml"/>
>   KML
>   <_comment>Keyhole Markup Language
>
> KMZ files (https://developers.google.com/kml/documentation/kmzarchives) are zip archives with a KML file inside (the file should be called doc.kml) and one or more folders. A naive approach consists of adding a further check in ZipContainerDetector (find attached).

--
This message was sent by Atlassian JIRA (v6.2#6252)
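The "naive approach" from the TIKA-941 description could look roughly like the sketch below. It is only an illustration using java.util.zip, not the attached ZipContainerDetector.java (which is not reproduced in this thread). The class name KmzSniffer and its helper methods are hypothetical, and treating any top-level *.kml entry as sufficient is an assumption; the description only says the file should be called doc.kml.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika.
final class KmzSniffer {

    /** Returns true if the zip bytes contain a top-level entry ending in ".kml". */
    static boolean looksLikeKmz(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                // Assumption: only entries at the archive root count.
                if (!entry.isDirectory() && name.indexOf('/') < 0 && name.endsWith(".kml")) {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory zip with the given entry names, for demonstration only. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                if (!name.endsWith("/")) {
                    zip.write("<kml/>".getBytes(StandardCharsets.UTF_8));
                }
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeKmz(zipOf("doc.kml", "images/")));
        System.out.println(looksLikeKmz(zipOf("readme.txt")));
    }
}
```

A real detector would work on a stream it must not consume, so it would need mark/reset or a temporary file rather than a byte array; that part is left out here.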
[jira] [Resolved] (TIKA-1175) MS Money files wrongly detected as True Type Font
[ https://issues.apache.org/jira/browse/TIKA-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-1175.
-----------------------------

Resolution: Fixed
Fix Version/s: 1.6

Thanks for this, magic added in r1593311.

> MS Money files wrongly detected as True Type Font
> -------------------------------------------------
>
> Key: TIKA-1175
> URL: https://issues.apache.org/jira/browse/TIKA-1175
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.3, 1.4
> Reporter: Boris Naguet
> Priority: Minor
> Fix For: 1.6
>
>
> The TTF magic is probably not specific enough, because it incorrectly detects MS Money files as TTF files, and then the parsing generates an Exception.
> {quote}
> Caused by: ! java.io.IOException: head is mandatory
> ! at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:107)
> {quote}
> Here is the magic detection code that I added to {{custom-mimetypes.xml}}, which solves it:
> {code:xml}
>   type="string" offset="0" />
> {code}
> It can replace the existing empty {{application/x-msmoney}} mime-type in {{tika-mimetypes.xml}}.
> The magic comes from http://filesignatures.net/index.php?search=mny&mode=EXT

--
This message was sent by Atlassian JIRA (v6.2#6252)
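To illustrate why a short magic can shadow a more specific one, here is a minimal sketch in plain Java. It is not Tika's magic matcher. The class MagicSketch is hypothetical, the "application/x-font-ttf" name is an assumption about how the TTF type is labelled, and the MS Money byte sequence (the TrueType version bytes followed by "MSISAM Database") is my recollection of the signature listing linked in the issue, not something recoverable from the message above, so please verify it before relying on it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not Tika code.
final class MagicSketch {

    // TrueType fonts start with the 4-byte sfnt version 0x00010000.
    static final byte[] TTF_MAGIC = {0x00, 0x01, 0x00, 0x00};

    // ASSUMPTION: MS Money files start with the same 4 bytes followed by
    // the ASCII text "MSISAM Database".
    static final byte[] MONEY_MAGIC =
            concat(TTF_MAGIC, "MSISAM Database".getBytes(StandardCharsets.US_ASCII));

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Checks the longer, more specific magic first so it is not shadowed by the TTF prefix. */
    static String detect(byte[] header) {
        if (startsWith(header, MONEY_MAGIC)) {
            return "application/x-msmoney";
        }
        if (startsWith(header, TTF_MAGIC)) {
            return "application/x-font-ttf";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(detect(MONEY_MAGIC));
        System.out.println(detect(new byte[] {0x00, 0x01, 0x00, 0x00, 0x00, 0x0c}));
    }
}
```

If only the 4-byte rule existed, both inputs in main would come back as a font, which is the behaviour reported in the issue.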
[jira] [Comment Edited] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995491#comment-13995491 ] Tim Allison edited comment on TIKA-1294 at 5/12/14 7:16 PM: Great. Just to make sure that I understand correctly...I think I was going to head this route at one point via subclassing EmbeddedResourceHandler. Can your MediaTypeDisablingDocumentSelector tell the difference between a jpeg that was attached to a PDF (basic attachment) and one that was derived from a PDXObjectImage? was (Author: talli...@mitre.org): Great. Just to make sure that I understand correctly...I think I was going to head this route at one point. Can your MediaTypeDisablingDocumentSelector tell the difference between a jpeg that was attached to a PDF (basic attachment) and one that was derived from a PDXObjectImage? > Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs > --- > > Key: TIKA-1294 > URL: https://issues.apache.org/jira/browse/TIKA-1294 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: TIKA-1294.patch > > > TIKA-1268 added the capability to extract embedded images as regular embedded > resources...a great feature! > However, for some use cases, it might not be desirable to extract those types > of embedded resources. I see two ways of allowing the client to choose > whether or not to extract those images: > 1) set a value in the metadata for the extracted images that identifies them > as embedded PDXObjectImages vs regular image attachments. The client can > then choose not to process embedded resources with a given metadata value. > 2) allow the client to set a parameter in the PDFConfig object. > My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995945#comment-13995945 ] Ray Gauss II commented on TIKA-1295: +1 for the data model more accurately reflecting the standard and for multilingual fields, but with a simple text bag how would you know which value corresponds to which language? I think this is another example that highlights the need for a more structured underlying metadata store as mentioned in section IV of the [metadata roadmap|http://wiki.apache.org/tika/MetadataRoadmap]. > Make some Dublin Core items multi-valued > > > Key: TIKA-1295 > URL: https://issues.apache.org/jira/browse/TIKA-1295 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Minor > Fix For: 1.6 > > > According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, > dc:title, dc:description and dc:rights should allow multiple values because > of language alternatives. Unless anyone objects in the next few days, I'll > switch those to Property.toInternalTextBag() from Property.toInternalText(). > I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
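Ray's question about the text bag can be made concrete with a small sketch. This is plain Java, not the Tika Metadata or Property API; the class LangAltSketch, its methods and the sample titles are all hypothetical. It relies on the XMP convention that a language alternative has an "x-default" entry, and it shows that flattening such a structure into a bag keeps the values but drops the language tags.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Tika code.
final class LangAltSketch {

    /** Made-up dc:title with a default value and a French alternative. */
    static Map<String, String> sampleTitle() {
        Map<String, String> title = new LinkedHashMap<>();
        title.put("x-default", "Annual report");
        title.put("fr", "Rapport annuel");
        return title;
    }

    /** Flattens to a simple bag of values: the language of each value is lost. */
    static List<String> toTextBag(Map<String, String> langAlt) {
        return new ArrayList<>(langAlt.values());
    }

    /** With the structured form, a caller can ask for a language and fall back to x-default. */
    static String valueFor(Map<String, String> langAlt, String lang) {
        return langAlt.getOrDefault(lang, langAlt.get("x-default"));
    }

    public static void main(String[] args) {
        Map<String, String> title = sampleTitle();
        System.out.println(toTextBag(title));
        System.out.println(valueFor(title, "fr"));
        System.out.println(valueFor(title, "de"));
    }
}
```

Nothing in the bag returned by toTextBag says which entry is French, which is the gap a more structured metadata store would close.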
[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995298#comment-13995298 ] Ray Gauss II commented on TIKA-1278: Hi [~tallison], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. > Expose PDF Avg Char and Spacing Tolerance Config Params > --- > > Key: TIKA-1278 > URL: https://issues.apache.org/jira/browse/TIKA-1278 > Project: Tika > Issue Type: Improvement > Components: parser >Affects Versions: 1.5 >Reporter: Ray Gauss II >Assignee: Ray Gauss II > Fix For: 1.6 > > > {{PDFParserConfig}} should allow for override of PDFBox's > {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO > comment in {{PDF2XHTML}}. > Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed > slightly to allow for extension of that config class and its configuration > behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997801#comment-13997801 ] Tim Allison commented on TIKA-1294: --- https://github.com/kryton/flaming-sailor/blob/master/src/main/java/com/zilbo/flamingSailor/TE/PDFParser.java Code below this comment looks just like ours {quote} /* The following code is REALLY raw. initial testing seemed to show memory leaks, and was REALLY slow*/ {quote} > Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs > --- > > Key: TIKA-1294 > URL: https://issues.apache.org/jira/browse/TIKA-1294 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Priority: Trivial > Attachments: TIKA-1294.patch > > > TIKA-1268 added the capability to extract embedded images as regular embedded > resources...a great feature! > However, for some use cases, it might not be desirable to extract those types > of embedded resources. I see two ways of allowing the client to choose > whether or not to extract those images: > 1) set a value in the metadata for the extracted images that identifies them > as embedded PDXObjectImages vs regular image attachments. The client can > then choose not to process embedded resources with a given metadata value. > 2) allow the client to set a parameter in the PDFConfig object. > My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: JAXRS, endpoints and a / welcome page - any ideas why it's broken?
Hi Nick,

On 7 May 2014, at 12:48, Nick Burch wrote:

> Hi All
>
> One for our JAXRS gurus here…

OK, not a guru here but I have a hunch.

> At ApacheCon, we came up with the idea of having a welcome page on the Tika
> Server, so that we could point people to it to try Tika, and let them
> discover what it offered. Based on that, and the mailing list discussions, we
> raised TIKA-1269.
>
> (Related to that is TIKA-1270, which aims to add endpoints similar to the
> --list- ones the Tika CLI has, which is in progress)
>
> While we work out the best way to allow users to discover + learn about + try
> the various REST endpoints on TIKA-1269, I've started with something basic.
> This is done with the simple TikaWelcome class, which has a Path of /
>
> The problem - when the MetadataEP and UnpackerResource are enabled, it
> doesn't work! With those two there, when you request / you get a 404 and the
> server logs:
> org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
> WARNING: No operation matching request path "/" is found, Relative Path: /,
> HTTP Method: GET, ContentType: */*, Accept: */*,. Please enable FINE/TRACE
> log level for more details.
>
> However, if you comment out those two endpoint classes from the
> sf.setResourceClasses() call in TikaServerCli, then the request gets
> correctly routed to the welcome page.
>
> Neither MetadataEP nor UnpackerResource have a path that clashes, so I've no
> idea why having them active stops / working. Any ideas?

I am having a quick look whilst traveling, but from peeking at the code it looks like the following to me:

* MetadataEP - we have no @Produces which will fail in the introspection code on the TikaWelcome class
* UnpackerResource - as there is no class level @Path I am suspecting this is clashing with the TikaWelcome as it builds the routes with the method ones being placed on the root as well.

I don’t have time to test it just now but I wonder what would happen if you reordered TikaWelcome to the top, above UnpackerResource? If my hunch is correct it should make the / request work using the self-generated documentation.

> (Patch below if you want to try disabling them yourself to investigate)
>
> Nick

Cheers,
Dave
[jira] [Updated] (TIKA-1275) Upgrade Commons compress to 1.8.1
[ https://issues.apache.org/jira/browse/TIKA-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fabian Lange updated TIKA-1275: --- Description: Hi, I am using Tika to detect content also from archives. But because the raw input stream is a CipherInputStream I ran into https://issues.apache.org/jira/browse/COMPRESS-277 which compress kindly solved for me. To be able to use Tika without patching my stack, I would like to see an upgrade of commons compress to 1.8.1 as soon as it is out. This may, or may not be in 1.6 timeframe. Thanks! was: Hi, I am using Tika to detect content also from archives. But because the raw input stream is a CipherInputStream I ran into https://issues.apache.org/jira/browse/COMPRESS-277 which compress kindly solved for me. To be able to use Tika without patching my stack, I would like to see an upgrade of commons compress to 1.9 as soon as it is out. This may, or may not be in 1.6 timeframe. Thanks! Summary: Upgrade Commons compress to 1.8.1 (was: Upgrade Commons compress (to 1.9)) Hi Nick, compress 1.8.1 was released: http://markmail.org/message/rkwsqhs76hwrhrrw this contains the fixes to the compressed streams. I updated the ticket to reflect the 1.8.1 version number. Would be nice to upgrade, so that the detection support works nicely for all archives. Fabian > Upgrade Commons compress to 1.8.1 > - > > Key: TIKA-1275 > URL: https://issues.apache.org/jira/browse/TIKA-1275 > Project: Tika > Issue Type: Bug >Reporter: Fabian Lange > > Hi, > I am using Tika to detect content also from archives. But because the raw > input stream is a CipherInputStream I ran into > https://issues.apache.org/jira/browse/COMPRESS-277 > which compress kindly solved for me. > To be able to use Tika without patching my stack, I would like to see an > upgrade of commons compress to 1.8.1 as soon as it is out. > This may, or may not be in 1.6 timeframe. > Thanks! -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: JAXRS, endpoints and a / welcome page - any ideas why it's broken?
On Wed, 14 May 2014, Sergey Beryozkin wrote:
> UnpackerResource has no Path annotation so it is defaulted to "/".

Every endpoint method within the class does have one though. I would've expected it to match based on those, is that not the case?

> However, the selection between multiple root resources with the same
> top-level Path is more expensive so ideally we could introduce a dedicated
> @Path to UnpackerResource.

As we add more endpoints, that would seem to make sense to me. I'm not sure how widely used the unpacker resource is, so I don't know how much disruption it would be if we added a /unpacker/ prefix to the path?

> The other option is to consider implementing a Welcome functionality in a
> JAX-RS 2.0 ContainerRequestFilter (supported in CXF 2.7.x), build a welcome
> info there and abort/block a request

Is that the more "normal" way to handle it in JAX-RS, or is what we've got so far a generally known practice?

Nick
[jira] [Resolved] (TIKA-1112) Parsing for OGV file with invalid checksum
[ https://issues.apache.org/jira/browse/TIKA-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Burch resolved TIKA-1112. -- Resolution: Fixed Fix Version/s: 1.6 Fixed with upgrade to 0.6 in r1593570. > Parsing for OGV file with invalid checksum > -- > > Key: TIKA-1112 > URL: https://issues.apache.org/jira/browse/TIKA-1112 > Project: Tika > Issue Type: Bug > Components: metadata, parser >Affects Versions: 1.3 > Environment: OS X 10.8.3 > JDK 1.6.0_45 64-bit >Reporter: Alexander Chow > Fix For: 1.6 > > > When parsing any OGV file (e.g., > [Typing_example.ogv|http://commons.wikimedia.org/wiki/File:Typing_example.ogv]), > log will output something like the following: > {code} > Warning - invalid checksum on page 2 of stream 155f (5471) > Warning - invalid checksum on page 3 of stream 155f (5471) > Warning - invalid checksum on page 4 of stream 155f (5471) > Warning - invalid checksum on page 5 of stream 155f (5471) > Warning - invalid checksum on page 6 of stream 155f (5471) > Warning - invalid checksum on page 7 of stream 155f (5471) > ... 
> Warning - invalid checksum on page 3071 of stream 155f (5471)
> Warning - invalid checksum on page 3072 of stream 155f (5471)
> Warning - invalid checksum on page 3073 of stream 155f (5471)
> Warning - invalid checksum on page 3074 of stream 155f (5471)
> Exception in thread "main" java.io.IOException: Asked to read 4228 bytes from 0 but hit EoF at 2884
>     at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:39)
>     at org.gagravarr.ogg.IOUtils.readFully(IOUtils.java:31)
>     at org.gagravarr.ogg.OggPage.<init>(OggPage.java:82)
>     at org.gagravarr.ogg.OggPacketReader.getNextPacket(OggPacketReader.java:116)
>     at org.gagravarr.tika.OggDetector.detect(OggDetector.java:79)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:61)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:113)
>     at com.test.OGVTest.main(OGVTest.java:31)
> {code}
> My test code was the following:
> {code:java}
> void parse(String fileName) throws Exception {
>     InputStream inputStream = new FileInputStream(fileName);
>
>     Metadata metadata = new Metadata();
>
>     Parser parser = new AutoDetectParser();
>
>     ParseContext parserContext = new ParseContext();
>     parserContext.set(Parser.class, parser);
>     ContentHandler contentHandler = new WriteOutContentHandler(
>             new DummyWriter());
>     parser.parse(inputStream, contentHandler, metadata,
>             parserContext);
>
>     System.out.println(metadata);
> }
> {code}

--
This message was sent by Atlassian JIRA (v6.2#6252)
JAXRS, endpoints and a / welcome page - any ideas why it's broken?
Hi All

One for our JAXRS gurus here...

At ApacheCon, we came up with the idea of having a welcome page on the Tika Server, so that we could point people to it to try Tika, and let them discover what it offered. Based on that, and the mailing list discussions, we raised TIKA-1269.

(Related to that is TIKA-1270, which aims to add endpoints similar to the --list- ones the Tika CLI has, which is in progress)

While we work out the best way to allow users to discover + learn about + try the various REST endpoints on TIKA-1269, I've started with something basic. This is done with the simple TikaWelcome class, which has a Path of /

The problem - when the MetadataEP and UnpackerResource are enabled, it doesn't work! With those two there, when you request / you get a 404 and the server logs:

org.apache.cxf.jaxrs.utils.JAXRSUtils findTargetMethod
WARNING: No operation matching request path "/" is found, Relative Path: /, HTTP Method: GET, ContentType: */*, Accept: */*,. Please enable FINE/TRACE log level for more details.

However, if you comment out those two endpoint classes from the sf.setResourceClasses() call in TikaServerCli, then the request gets correctly routed to the welcome page.

Neither MetadataEP nor UnpackerResource have a path that clashes, so I've no idea why having them active stops / working. Any ideas?

(Patch below if you want to try disabling them yourself to investigate)

Nick

Index: src/main/java/org/apache/tika/server/TikaServerCli.java
===================================================================
--- src/main/java/org/apache/tika/server/TikaServerCli.java (revision 1592656)
+++ src/main/java/org/apache/tika/server/TikaServerCli.java (working copy)
@@ -92,10 +92,20 @@
       JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
       // Note - at least one of these stops TikaWelcome matching on /
       // This prevents TikaWelcome acting as a partial solution to TIKA-1269
-      sf.setResourceClasses(MetadataEP.class, MetadataResource.class,
-              TikaResource.class, UnpackerResource.class,
-              TikaDetectors.class, TikaMimeTypes.class,
-              TikaVersion.class, TikaWelcome.class);
+//      sf.setResourceClasses(MetadataEP.class, MetadataResource.class,
+//              TikaResource.class, UnpackerResource.class,
+//              TikaDetectors.class, TikaMimeTypes.class,
+//              TikaVersion.class, TikaWelcome.class);
+      sf.setResourceClasses(
+//              MetadataEP.class,
+              MetadataResource.class,
+              TikaResource.class,
+//              UnpackerResource.class,
+              TikaDetectors.class,
+              TikaMimeTypes.class,
+              TikaVersion.class,
+              TikaWelcome.class
+      );
 
       List providers = new ArrayList();
       providers.add(new TarWriter());