Hi Tim,

Consolidated handling of XMP would be great. I'm glad you're taking a look at it, and I'll try to help out where I can.

> You've been happy with it at Alfresco?

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray

[1] http://stackoverflow.com/a/22661992

> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> Hi Ray,
> Got it. Thank you.
>
> That'd be great. In follow up discussion with PDFBox devs, they mentioned
> that it is not a design feature/restriction on XMPBox that it doesn't handle
> non PDF/A files...only a matter of patching and building out their current
> code base. The downside is there's quite a bit to do, the upside is that it
> is a living code base.
>
> I'll experiment with Adobe's xmp-core. If you have any pointers/examples,
> let me know...I'll be starting with:
> https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/.
> You've been happy with it at Alfresco?
>
> No matter which package we use, it would be nice to build out uniform
> extraction of XMP for all image and PDF files for the common elements -- with
> special handling by file type if necessary. As you mentioned, it would also
> be great to add or modify our XMPScanner to extract all XMP packets from a
> file...I've started dabbling with this here:
> https://github.com/tballison/tika/tree/xmp_scanner . I'd be interested to
> hear more about what happens with InDesign files. In our own test set, we
> have a PDF file with two packets containing conflicting authorship info IIRC!
> :) It would be nice to expose both the canonical XMP info (with proper
> processing of "later-xmp-overrides-earlier") as well as all of the info that
> can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two
> different use cases.
>
> Thank you, again.
>
> Cheers,
>
> Tim
>
>
> -----Original Message-----
> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
>
> To clarify... the 'we' in my third sentence was referring to Alfresco, not
> Tika.
>
> I'm not sure how much of that code would be useful but I may be able to
> contribute some of it.
>
> Regards,
>
> Ray
>
>
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>
>> Thank you. Will take a look.
>>
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>>
>> Hi Tim,
>>
>> We're already using Adobe's xmpcore in tika-xmp which works fine for parsing
>> XMP (though has not seen updates in a while), but getting the XMP packets
>> out of the files is trickier.
>>
>> We have XMPPacketScanner which works for many cases, but not all. InDesign
>> files for example do some strange things.
>>
>> In the past we've used different packet scanners depending on the file type
>> (including Exiftool command-line) to get the XMP out then used xmpcore to
>> parse into simple flattened properties.
>>
>> Regards,
>>
>> Ray
>>
>>
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>>
>>> All,
>>>
>>> PDFBox 2.0 is soon to be released. In the course of its development, the
>>> project has migrated from Jempbox (which we're now using) to XmpBox; and
>>> Jempbox is now on its last legs.
>>>
>>> XmpBox was "written for PDF/A checking," not for robust processing of
>>> common variants of XMPs in the wild; I found that it fails on roughly 40%
>>> of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
>>>
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>>
>>> Has anyone had any luck with an Apache-friendly XMP parser?
>>> Are there better options than copying and pasting jempbox into Tika and
>>> maintaining it ourselves (yuk!)?
>>>
>>> Best,
>>>
>>> Tim
>>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: d...@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>>
>>> I think the problem is that XmpBox was written for PDF/A checking, so it
>>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the
>>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>>
>>> And no, there are no plans for anything on XMP at this time...
>>>
>>> Tilman
>>>
>>>
>>> On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
>>>> All,
>>>>
>>>> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch
>>>> from our current reliance on jempbox to XMPBox. I recently extracted ~70k
>>>> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser,
>>>> there were exceptions on roughly 40% of the XMPs.
>>>>
>>>> I’m including a table below of the counts of exception messages. Are
>>>> there any plans to make XMPBox more lenient or is this what we can expect
>>>> going forward?
>>>>
>>>> As always, I’m more than happy to help with files and tests. Let me know
>>>> what I can do.
>>>>
>>>> Cheers,
>>>>
>>>> Tim
>>>>
>>>> No XmpParsingException on 42,022 files.
>>>>
>>>> Exceptions:
>>>>
>>>> Count  Exception message
>>>> -----  -----------------
>>>> 13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>>>>  3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>  3113  Missing pdfaSchema:property in type definition
>>>>  2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>>>   927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>>>>   723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>>>>   710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>>>>   654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>>>>   522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>>>   492  Failed to parse
>>>>   370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>>>>   262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>>>>   188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>>>>   144  Failed to instanciate property in xmp:CreateDate
>>>>   125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>>>    94  Expecting local name 'xmpmeta' and found 'xapmeta'
>>>>    84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>>>>    74  Failed to instanciate property in xap:CreateDate
>>>>    68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>>>>    49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>>>>    46  Cannot find a definition for the namespace http://www.sap.com
>>>>    33  Failed to instanciate property in exif:ColorSpace
>>>>    28  Failed to instanciate property in xmpMM:History
>>>>    26  xmp should start with a processing instruction
>>>>    24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>>>>    21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>>>>    14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>>>>    14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>>>>    12  Failed to instanciate property in xmp:MetadataDate
>>>>    10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>>>>    10  Failed to instanciate property in xap:ModifyDate
>>>>    10  Failed to instanciate property in xmp:ModifyDate
>>>>     9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>>>     8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>>>>     8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>     7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>>>     7  Cannot find a definition for the namespace ptc
>>>>     6  Failed to instanciate property in xapMM:History
>>>>     5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>>>>     5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>>>     4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>>>>     4  Excepted xpacket 'end' attribute (must be present and placed in first)
>>>>     3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>>>>     3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>>>     2  no message (NPE)
>>>>     2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>>>>     2  Failed to instanciate property in xapRights:Marked
>>>>     2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>>>>     1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>>>>     1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>>>>     1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>>>     1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>>>>     1  Failed to instanciate property in xmpRights:Marked
>>>>     1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>>>>     1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
>>>
>>
>
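
On the "extract all XMP packets from a file" idea raised in the thread, a minimal sketch of a byte-level scan might look like the following. This is not Tika's XMPPacketScanner and not the code in the xmp_scanner branch; the class name `SimpleXmpPacketScanner` and the method `findPackets` are hypothetical. It assumes packets are stored uncompressed in an 8-bit encoding with the usual `<?xpacket begin=...?>` header and `<?xpacket end=...?>` trailer, so it would likely miss packets inside compressed streams or UTF-16 packets, and it says nothing about the InDesign layout mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: collects every XMP packet found in a byte array, in file order. */
final class SimpleXmpPacketScanner {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    private SimpleXmpPacketScanner() {
    }

    /** Returns each complete packet (header through trailer) decoded as UTF-8 (assumption). */
    static List<String> findPackets(byte[] data) {
        List<String> packets = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, BEGIN, pos);
            if (start < 0) {
                break; // no further packet header
            }
            int trailer = indexOf(data, END, start + BEGIN.length);
            if (trailer < 0) {
                break; // header without trailer: treat as truncated and stop
            }
            int close = indexOf(data, PI_CLOSE, trailer + END.length);
            if (close < 0) {
                break; // trailer processing instruction never closes
            }
            int stop = close + PI_CLOSE.length;
            packets.add(new String(data, start, stop - start, StandardCharsets.UTF_8));
            pos = stop; // continue after this packet so later packets are also reported
        }
        return packets;
    }

    /** Naive byte search; returns the first index of needle at or after from, or -1. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

Because the list preserves file order, a caller could cover both use cases from the thread: pass every element to a parser such as xmpcore to expose all scraped values, or apply a naive "later overrides earlier" policy by preferring the last element. Whether the last packet is really the canonical one for a given file type is a separate question this sketch does not answer.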