[jira] [Created] (TIKA-1901) tika detect consumes stream when streams contains msoffice file
Gerard van der Hoorn created TIKA-1901:
--

Summary: tika detect consumes stream when streams contains msoffice file
Key: TIKA-1901
URL: https://issues.apache.org/jira/browse/TIKA-1901
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.12
Reporter: Gerard van der Hoorn

When tika.detect is used on an MS Office file (Word or Excel 2003), the stream is consumed, which is not expected. According to the documentation, when the stream supports marking, the position in the file will be returned to the original position.
A test case is added.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1901) tika detect consumes stream when streams contains msoffice file
[ https://issues.apache.org/jira/browse/TIKA-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerard van der Hoorn updated TIKA-1901:
---
Attachment: test.xls
            test.pdf
            test.doc
            TikaStreamConsumingIssue.java

Added a JUnit test with test files.

> tika detect consumes stream when streams contains msoffice file
> ---
>
> Key: TIKA-1901
> URL: https://issues.apache.org/jira/browse/TIKA-1901
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.12
> Reporter: Gerard van der Hoorn
> Attachments: TikaStreamConsumingIssue.java, test.doc, test.pdf, test.xls
>
> When tika.detect is used on an MS Office file (Word or Excel 2003), the stream
> is consumed, which is not expected. According to the documentation, when
> the stream supports marking, the position in the file will be returned to the
> original position.
> A test case is added.
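For readers following TIKA-1901, the mark/reset contract that the report relies on can be sketched with plain java.io, without Tika. This is only an illustration under assumptions: `MarkResetSketch`, `peek`, and `positionPreserved` are hypothetical names, not Tika API, and the attached TikaStreamConsumingIssue.java is not reproduced here. The idea is that a detector reads a few leading bytes and then rewinds, so the caller can still read the stream from its original position.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: peeks at leading bytes and rewinds.
class MarkResetSketch {

    // Reads up to 'length' leading bytes, then restores the read position.
    static byte[] peek(InputStream in, int length) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(length);
            byte[] buffer = new byte[length];
            int total = 0;
            while (total < length) {
                int n = in.read(buffer, total, length - total);
                if (n == -1) {
                    break; // end of stream reached before 'length' bytes
                }
                total += n;
            }
            in.reset(); // rewind to the marked position
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if peeking left the number of available bytes unchanged.
    static boolean positionPreserved(InputStream in, int peekLength) {
        try {
            int before = in.available();
            peek(in, peekLength);
            return in.available() == before;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Leading bytes of an OLE2 container, the format behind .doc and .xls files.
        byte[] ole2Header = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                             (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        InputStream in = new ByteArrayInputStream(ole2Header);
        System.out.println("position preserved: " + positionPreserved(in, 8));
    }
}
```

A check of the same shape, run against the value returned by the real detector call instead of `peek`, is one way to express the expectation described in the issue.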
[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor
[ https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193622#comment-15193622 ]

Ray Gauss II commented on TIKA-1894:

The {{tika-xmp}} project deals with converting a populated Tika {{Metadata}} object into XMP. Perhaps that project should be renamed to something more specific at some point, but regardless, I don't think it's the right spot for this sort of shared parser code.

I'd vote for the simpler shared util jar, but I think it can still live next to the modules, something like {{/tika-parsers-modules/tika-parser-xmp-commons}}?

> Add XMPMM metadata extraction to JempboxExtractor
> -
>
> Key: TIKA-1894
> URL: https://issues.apache.org/jira/browse/TIKA-1894
> Project: Tika
> Issue Type: New Feature
> Reporter: Tim Allison
> Priority: Minor
>
> The XMP Media Management (XMPMM) section of xmp carries some useful
> information. We currently have keys for many of the important attributes in
> tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but
> the wiring between the two has not yet been installed.
Re: [DISCUSS] options for XMP parsing?
Hi Tim,

Consolidated handling of XMP would be great, I'm glad you're taking a look at it and I'll try to help out where I can.

> You've been happy with it at Alfresco?

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray

[1] http://stackoverflow.com/a/22661992

> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. wrote:
>
> Hi Ray,
> Got it. Thank you.
>
> That'd be great. In follow-up discussion with PDFBox devs, they mentioned that it is not a design feature/restriction on XMPBox that it doesn't handle non PDF/A files...only a matter of patching and building out their current code base. The downside is there's quite a bit to do, the upside is that it is a living code base.
>
> I'll experiment with Adobe's xmp-core. If you have any pointers/examples, let me know...I'll be starting with: https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/. You've been happy with it at Alfresco?
>
> No matter which package we use, it would be nice to build out uniform extraction of XMP for all image and PDF files for the common elements -- with special handling by file type if necessary. As you mentioned, it would also be great to add or modify our XMPScanner to extract all XMP packets from a file...I've started dabbling with this here: https://github.com/tballison/tika/tree/xmp_scanner . I'd be interested to hear more about what happens with InDesign files. In our own test set, we have a PDF file with two packets containing conflicting authorship info IIRC! :) It would be nice to expose both the canonical XMP info (with proper processing of "later-xmp-overrides-earlier") as well as all of the info that can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two different use cases.
>
> Thank you, again.
>
> Cheers,
>
> Tim
>
> -----Original Message-----
> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
>
> To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.
>
> I'm not sure how much of that code would be useful but I may be able to contribute some of it.
>
> Regards,
>
> Ray
>
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. wrote:
>>
>> Thank you. Will take a look.
>>
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>>
>> Hi Tim,
>>
>> We're already using Adobe's xmpcore in tika-xmp which works fine for parsing XMP (though has not seen updates in a while), but getting the XMP packets out of the files is trickier.
>>
>> We have XMPPacketScanner which works for many cases, but not all. InDesign files for example do some strange things.
>>
>> In the past we've used different packet scanners depending on the file type (including Exiftool command-line) to get the XMP out then used xmpcore to parse into simple flattened properties.
>>
>> Regards,
>>
>> Ray
>>
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. wrote:
>>>
>>> All,
>>>
>>> PDFBox 2.0 is soon to be released. In the course of its development, the project has migrated from Jempbox (which we're now using) to XmpBox; and Jempbox is now on its last legs.
>>>
>>> XmpBox was "written for PDF/A checking," not for robust processing of common variants of XMPs in the wild; I found that it fails on roughly 40% of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
>>>
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>>
>>> Has anyone had any luck with an Apache-friendly XMP parser? Are there better options than copying and pasting jempbox into Tika and maintaining it ourselves (yuk!)?
>>>
>>> Best,
>>>
>>> Tim
>>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: d...@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>>
>>> I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>>
>>> And no, there are no plans for anything on XMP at this time...
>>>
>>> Tilman
>>>
>>> On 07.03.2016 at 19:31, Allison, Timothy B. wrote: All, When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current re
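As an illustration of the "extract all XMP packets from a file" idea discussed in this thread, here is a minimal sketch that scans a byte array for the `<?xpacket begin=` and `<?xpacket end=` processing instructions that wrap an XMP packet. It is not the XMPPacketScanner mentioned above and not the code on the xmp_scanner branch; `XmpPacketScanSketch` and `findPackets` are hypothetical names. It assumes the packets are stored in a single-byte-compatible encoding such as UTF-8, so UTF-16 packets and format-specific layouts (the InDesign case, for example) would need extra handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner: collects every XMP packet found in raw file bytes.
class XmpPacketScanSketch {

    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    static List<String> findPackets(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so offsets are preserved.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(BEGIN, from);
            if (start < 0) {
                break;
            }
            int endMarker = text.indexOf(END, start);
            if (endMarker < 0) {
                break; // unterminated packet: stop scanning
            }
            int close = text.indexOf("?>", endMarker);
            if (close < 0) {
                break;
            }
            packets.add(text.substring(start, close + 2));
            from = close + 2; // continue after this packet
        }
        return packets;
    }

    public static void main(String[] args) {
        String file = "junk<?xpacket begin=\"\"?><x>authorXYZ</x><?xpacket end=\"w\"?>"
                + "more<?xpacket begin=\"\"?><x>authorQRS</x><?xpacket end=\"r\"?>tail";
        List<String> packets = findPackets(file.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(packets.size() + " packets found");
    }
}
```

Returning every packet in file order keeps both use cases from the thread open: a caller can take the last packet as the canonical one, or inspect all of them for conflicting values.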
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845 ]

Ray Gauss II commented on TIKA-1607:

Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files?

> Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for Phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection HashMap> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...)
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the core Metadata API. I hope that
> the Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
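To make the String-to-Object proposal in TIKA-1607 concrete, here is a minimal sketch of what such a container could look like. It is an assumption-laden illustration, not the attached patches and not Tika's Metadata class: `ObjectMetadataSketch`, `PhoneNumber`, `getObjects`, and `getValues` are hypothetical names. Values are stored as objects, and a String[] view is kept so that existing callers of a multi-valued String API could continue to work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container: maps a metadata key to a list of arbitrary objects.
class ObjectMetadataSketch {

    // Hypothetical structured value: a number plus its descriptive attributes.
    static final class PhoneNumber {
        final String number;
        final Map<String, String> attributes;

        PhoneNumber(String number, Map<String, String> attributes) {
            this.number = number;
            this.attributes = attributes;
        }

        @Override
        public String toString() {
            return number; // the plain-string form used by the legacy view
        }
    }

    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    void add(String key, Object value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Structured view: returns the stored objects for a key.
    List<Object> getObjects(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }

    // Legacy-style view: every stored object rendered as a String.
    String[] getValues(String key) {
        List<Object> objects = getObjects(key);
        String[] result = new String[objects.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = String.valueOf(objects.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("LibPN-CountryCode", "US");
        attrs.put("LibPN-NumberType", "International");
        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.add("phonenumbers", new PhoneNumber("+162648743476", attrs));
        System.out.println(String.join(", ", metadata.getValues("phonenumbers")));
    }
}
```

Keeping a String[] view next to the object view is one possible way to soften the backwards-compatibility concern raised in the issue; whether that is acceptable for the real API is a design question this sketch does not settle.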
[jira] [Created] (TIKA-1902) Error while parsing a file using ContentHandler object (initialized using the BodyContentHandler object) for some files
Harsh Fatepuria created TIKA-1902:
-

Summary: Error while parsing a file using ContentHandler object (initialized using the BodyContentHandler object) for some files
Key: TIKA-1902
URL: https://issues.apache.org/jira/browse/TIKA-1902
Project: Tika
Issue Type: Bug
Components: handler, parser
Affects Versions: 1.12
Environment: Java
Reporter: Harsh Fatepuria

Java Code:

    public static String parseBodyToHTML(String filePath) throws IOException, SAXException, TikaException
    {
        ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (FileInputStream stream = new FileInputStream(new File(filePath))) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }

While using this function for some files, I get the following error:

Exception in thread "main" org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
    at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
    at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:291)
    at org.apache.tika.parser.pdf.PDF2XHTML.startPage(PDF2XHTML.java:225)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:437)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
    at TTR.TTRAnalysis.parseBodyToHTML(TTRAnalysis.java:39)
[jira] [Updated] (TIKA-1902) Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object)
[ https://issues.apache.org/jira/browse/TIKA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh Fatepuria updated TIKA-1902:
--
Summary: Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object) (was: Error while parsing a file using ContentHandler object (initialized using the BodyContentHandler object) for some files)

> Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object)
> -
>
> Key: TIKA-1902
> URL: https://issues.apache.org/jira/browse/TIKA-1902
> Project: Tika
> Issue Type: Bug
> Components: handler, parser
> Affects Versions: 1.12
> Environment: Java
> Reporter: Harsh Fatepuria
> Labels: handler, java, parser, tika
[jira] [Commented] (TIKA-1902) Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object)
[ https://issues.apache.org/jira/browse/TIKA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194148#comment-15194148 ]

Harsh Fatepuria commented on TIKA-1902:
---

I am using the BodyContentHandler object to get only the body part of the extracted XHTML. If I modify the code to:

    ContentHandler handler = new ToXMLContentHandler();

I get the full XHTML code with the metadata.

> Error while parsing some files using ContentHandler object (initialized using the BodyContentHandler object)
> -
>
> Key: TIKA-1902
> URL: https://issues.apache.org/jira/browse/TIKA-1902
> Project: Tika
> Issue Type: Bug
> Components: handler, parser
> Affects Versions: 1.12
> Environment: Java
> Reporter: Harsh Fatepuria
> Labels: handler, java, parser, tika
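One possible reading of the TIKA-1902 stack trace, offered as an assumption and not confirmed anywhere in this thread, is that a body-only filter drops the namespace prefix mapping that is announced before the `<html>` element, so a strict serializer further down the chain sees an element in a namespace it was never told about. The sketch below reproduces that general SAX pitfall with JDK classes only. `StrictSink` and `BodyOnlyFilter` are hypothetical stand-ins and do not reproduce the behavior of Tika's ToXMLContentHandler or BodyContentHandler.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class PrefixMappingSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical strict sink: rejects elements whose namespace was never declared.
    static class StrictSink extends DefaultHandler {
        final Map<String, String> prefixByUri = new HashMap<>();
        final StringBuilder out = new StringBuilder();

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            prefixByUri.put(uri, prefix);
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (!uri.isEmpty() && !prefixByUri.containsKey(uri)) {
                throw new SAXException("Namespace " + uri + " not declared");
            }
            out.append('<').append(local).append('>');
        }
    }

    // Hypothetical filter: forwards only events that occur inside <body>.
    static class BodyOnlyFilter extends DefaultHandler {
        private final ContentHandler delegate;
        private int depth = 0; // 0 = outside body, 1 = directly in body, >1 = nested

        BodyOnlyFilter(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startPrefixMapping(String prefix, String uri) throws SAXException {
            if (depth > 0) {
                delegate.startPrefixMapping(prefix, uri); // mappings before <body> are lost
            }
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0) {
                depth++;
                delegate.startElement(uri, local, qName, atts);
            } else if ("body".equals(local)) {
                depth = 1;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            if (depth > 1) {
                depth--;
                delegate.endElement(uri, local, qName);
            } else if (depth == 1) {
                depth = 0; // leaving <body>
            }
        }
    }

    // Emits a tiny XHTML document as SAX events; returns the sink output or the error text.
    static String run(boolean filtered) {
        StrictSink sink = new StrictSink();
        ContentHandler handler = filtered ? new BodyOnlyFilter(sink) : sink;
        Attributes none = new AttributesImpl();
        try {
            handler.startPrefixMapping("", XHTML);
            for (String name : new String[] {"html", "body", "p"}) {
                handler.startElement(XHTML, name, name, none);
            }
            return sink.out.toString();
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("direct:   " + run(false));
        System.out.println("filtered: " + run(true));
    }
}
```

If this reading applies, it would also be consistent with the observation in the comment above that using the serializing handler on its own works, since the prefix mapping then reaches it directly.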
[jira] [Commented] (TIKA-1898) backslashes in mime-type : application/vnd.mif are wrong
[ https://issues.apache.org/jira/browse/TIKA-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194363#comment-15194363 ]

Nick Burch commented on TIKA-1898:
--

I've just tried with your test file, and Tika is able to detect the file correctly with the data only (no filename). That makes me think that the mimetype is correct:
{code}
$ java -jar tika-app-1.13-SNAPSHOT.jar --detect < test.mif
application/vnd.mif
{code}

Are you able to produce a junit unit test that shows your detection issue, and ideally shows your proposed patch fixes it? (Bonus marks if it's as a Github Pull Request or a Patch attached to the JIRA!)

> backslashes in mime-type : application/vnd.mif are wrong
> -
>
> Key: TIKA-1898
> URL: https://issues.apache.org/jira/browse/TIKA-1898
> Project: Tika
> Issue Type: Bug
> Components: config, core
> Environment: Win64, Eclipse
> Reporter: Steffen Netz
> Priority: Minor
> Labels: easyfix, patch
> Attachments: test.doc, test.fm, test.mif, tika-bug.log
>
> In
> tika-core\src\main\resources\org\apache\tika\mime\tika-mimetypes.xml
> there are the lines:
>
> wrong.
> the backslashes must be removed.