Hi Tim,

Consolidated handling of XMP would be great, I'm glad you're taking a look at it 
and I'll try to help out where I can.

> You've been happy with it at Alfresco? 

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].
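
On the "extract all XMP packets from a file" idea quoted below, here is a rough 
sketch of what a naive scan could look like. It is not Tika's XMPPacketScanner 
or anything from Alfresco; the class name NaiveXmpPacketScanner is made up for 
illustration. It assumes each packet is stored contiguously and in an 
ASCII-compatible encoding such as UTF-8, so it would miss UTF-16 packets and 
may not work for formats that split the packet up.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or xmpcore.
final class NaiveXmpPacketScanner {

    // Markers of the XMP packet wrapper: <?xpacket begin=...?> ... <?xpacket end=...?>
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns every begin..end packet found in the data, in file order. */
    static List<String> scan(byte[] data) {
        List<String> packets = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = indexOf(data, BEGIN, pos);
            if (start < 0) break;
            int end = indexOf(data, END, start + BEGIN.length);
            if (end < 0) break;
            // The packet stops at the "?>" that closes the end processing instruction.
            int close = indexOf(data, PI_CLOSE, end + END.length);
            if (close < 0) break;
            int stop = close + PI_CLOSE.length;
            // Assumption: the packet bytes are UTF-8.
            packets.add(new String(data, start, stop - start, StandardCharsets.UTF_8));
            pos = stop;
        }
        return packets;
    }

    // Plain byte-wise search; returns -1 when the needle is not found.
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two packets with conflicting author values, surrounded by unrelated bytes.
        String sample = "junk"
                + "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?><x>authorXYZ</x><?xpacket end='w'?>"
                + "more junk"
                + "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?><x>authorQRS</x><?xpacket end='w'?>"
                + "tail";
        List<String> packets = scan(sample.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < packets.size(); i++) {
            System.out.println("packet" + (i + 1) + ": " + packets.get(i));
        }
    }
}
```

Each returned string could then be handed to an XMP parser separately, which 
keeps both use cases open: report every packet as found, or apply a 
"later overrides earlier" rule on top of the ordered list.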

Regards,

Ray


[1] http://stackoverflow.com/a/22661992


> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
> Hi Ray,
>  Got it.  Thank you.
> 
> That'd be great.  In follow up discussion with PDFBox devs, they mentioned 
> that it is not a design feature/restriction on XMPBox that it doesn't handle 
> non PDF/A files...only a matter of patching and building out their current 
> code base.   The downside is there's quite a bit to do, the upside is that it 
> is a living code base.
> 
> I'll experiment with Adobe's xmp-core.  If you have any pointers/examples, 
> let me know...I'll be starting with: 
> https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/.
>  You've been happy with it at Alfresco? 
> 
> No matter which package we use, it would be nice to build out uniform 
> extraction of XMP for all image and PDF files for the common elements -- with 
> special handling by file type if necessary.  As you mentioned, it would also 
> be great to add or modify our XMPScanner to extract all XMP packets from a 
> file...I've started dabbling with this here: 
> https://github.com/tballison/tika/tree/xmp_scanner .  I'd be interested to 
> hear more about what happens with InDesign files. In our own test set, we 
> have a PDF file with two packets containing conflicting authorship info IIRC! 
> :)  It would be nice to expose both the canonical XMP info (with proper 
> processing of "later-xmp-overrides-earlier") as well as all of the info that 
> can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two 
> different use cases.
> 
> Thank you, again.
> 
>             Cheers,
> 
>                   Tim 
> 
> 
> 
> 
> -----Original Message-----
> From: Ray Gauss [mailto:ray.ga...@alfresco.com] 
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
> 
> To clarify... the 'we' in my third sentence was referring to Alfresco, not 
> Tika.
> 
> I'm not sure how much of that code would be useful but I may be able to 
> contribute some of it.
> 
> Regards,
> 
> Ray
> 
> 
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>> 
>> Thank you.  Will take a look.
>> 
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.ga...@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>> 
>> Hi Tim,
>> 
>> We're already using Adobe's xmpcore in tika-xmp which works fine for parsing 
>> XMP (though has not seen updates in a while), but getting the XMP packets 
>> out of the files is trickier.  
>> 
>> We have XMPPacketScanner which works for many cases, but not all.  InDesign 
>> files for example do some strange things.
>> 
>> In the past we've used different packet scanners depending on the file type 
>> (including Exiftool command-line) to get the XMP out then used xmpcore to 
>> parse into simple flattened properties.
>> 
>> Regards,
>> 
>> Ray
>> 
>> 
>>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>> 
>>> All,
>>> 
>>> PDFBox 2.0 is soon to be released.  In the course of its development, the 
>>> project has migrated from Jempbox (which we're now using) to XmpBox; and 
>>> Jempbox is now on its last legs.  
>>> 
>>> XmpBox was "written for PDF/A checking," not for robust processing of 
>>> common variants of XMPs in the wild; I found that it fails on roughly 40% 
>>> of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
>>> 
>>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>> 
>>> Has anyone had any luck with an Apache-friendly XMP parser?  Are there 
>>> better options than copying and pasting jempbox into Tika and maintaining 
>>> it ourselves (yuk!)?
>>> 
>>>        Best,
>>> 
>>>               Tim
>>> 
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>>> Sent: Tuesday, March 08, 2016 12:13 PM
>>> To: d...@pdfbox.apache.org
>>> Subject: Re: roadmap for XMPBox?
>>> 
>>> I think the problem is that XmpBox was written for PDF/A checking, so it 
>>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the 
>>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>> 
>>> And no, there are no plans for anything on XMP at this time...
>>> 
>>> Tilman
>>> 
>>> 
>>> Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
>>>> All,
>>>> 
>>>> 
>>>> 
>>>> When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch 
>>>> from our current reliance on jempbox to XMPBox.  I recently extracted ~70k 
>>>> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, 
>>>> there were exceptions on roughly 40% of the XMPs.
>>>> 
>>>> 
>>>> 
>>>> I’m including a table below of the counts of exception messages.  Are 
>>>> there any plans to make XMPBox more lenient or is this what we can expect 
>>>> going forward?
>>>> 
>>>> 
>>>> 
>>>> As always, I’m more than happy to help with files and tests.  Let me know 
>>>> what I can do.
>>>> 
>>>> 
>>>> 
>>>>            Cheers,
>>>> 
>>>> 
>>>> 
>>>>                     Tim
>>>> 
>>>> 
>>>> 
>>>> No XmpParsingException on 42,022 files.
>>>> 
>>>> Exceptions:
>>>> 
>>>> Count  Message
>>>> -----  -------
>>>> 13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>>>>  3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>  3113  Missing pdfaSchema:property in type definition
>>>>  2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>>>   927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>>>>   723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>>>>   710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>>>>   654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>>>>   522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>>>   492  Failed to parse
>>>>   370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>>>>   262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>>>>   188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>>>>   144  Failed to instanciate property in xmp:CreateDate
>>>>   125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>>>    94  Expecting local name 'xmpmeta' and found 'xapmeta'
>>>>    84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>>>>    74  Failed to instanciate property in xap:CreateDate
>>>>    68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>>>>    49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>>>>    46  Cannot find a definition for the namespace http://www.sap.com
>>>>    33  Failed to instanciate property in exif:ColorSpace
>>>>    28  Failed to instanciate property in xmpMM:History
>>>>    26  xmp should start with a processing instruction
>>>>    24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>>>>    21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>>>>    14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>>>>    14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>>>>    12  Failed to instanciate property in xmp:MetadataDate
>>>>    10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>>>>    10  Failed to instanciate property in xap:ModifyDate
>>>>    10  Failed to instanciate property in xmp:ModifyDate
>>>>     9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>>>     8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>>>>     8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>>     7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>>>     7  Cannot find a definition for the namespace ptc
>>>>     6  Failed to instanciate property in xapMM:History
>>>>     5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>>>>     5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>>>     4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>>>>     4  Excepted xpacket 'end' attribute (must be present and placed in first)
>>>>     3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>>>>     3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>>>     2  no message (NPE)
>>>>     2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>>>     2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>>>>     2  Failed to instanciate property in xapRights:Marked
>>>>     2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>>>>     2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>>>>     1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>>>>     1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>>>>     1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>>>>     1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>>>>     1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>>>     1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>>>>     1  Failed to instanciate property in xmpRights:Marked
>>>>     1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>>>>     1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
