Re: Question XMP metadata extraction

Jeremias Maerki Wed, 28 Oct 2009 00:02:53 -0700

Robin,

could you just post the XMP packet (or the full PDF if that's easier) here?
That way we could test why this happens.
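
In the meantime, one thing that can be checked by hand is how dc:description
is encoded inside the packet. The sketch below uses only the JDK's DOM parser,
not PDFBox or JempBox; the class name DcDescriptionProbe and its two methods
are made up for illustration. It assumes the property is written as a
dc:description element, either as plain text or wrapped in an rdf:Alt language
alternative (the form the XMP specification defines for dc:description).
Whether that difference has anything to do with getDescription coming back
empty is only a guess until we see the real packet.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting an XMP packet; not part of PDFBox or JempBox.
final class DcDescriptionProbe {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns "alt" if dc:description wraps an rdf:Alt, "text" if it holds
    // plain text, and "missing" if there is no dc:description element.
    static String describeEncoding(String xmpPacket) {
        Element description = findDescription(xmpPacket);
        if (description == null) {
            return "missing";
        }
        boolean hasAlt = description.getElementsByTagNameNS(RDF_NS, "Alt").getLength() > 0;
        return hasAlt ? "alt" : "text";
    }

    // Returns the first language alternative if there is one, otherwise the
    // element's own text, or null if dc:description is absent.
    static String readDescription(String xmpPacket) {
        Element description = findDescription(xmpPacket);
        if (description == null) {
            return null;
        }
        NodeList items = description.getElementsByTagNameNS(RDF_NS, "li");
        if (items.getLength() > 0) {
            return items.item(0).getTextContent().trim();
        }
        return description.getTextContent().trim();
    }

    // Parses the packet namespace-aware and returns the first dc:description element.
    private static Element findDescription(String xmpPacket) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmpPacket)));
            NodeList hits = doc.getElementsByTagNameNS(DC_NS, "description");
            return hits.getLength() == 0 ? null : (Element) hits.item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample packet; replace with the raw XMP printed from the PDF.
        String packet =
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
          + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
          + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
          + "<dc:description><rdf:Alt>"
          + "<rdf:li xml:lang=\"x-default\">Sample text</rdf:li>"
          + "</rdf:Alt></dc:description>"
          + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(describeEncoding(packet));
        System.out.println(readDescription(packet));
    }
}
```

Pass it the string obtained from printing the raw XMP. Note that the sketch
does not cover the RDF shorthand where a property is written as an attribute
on rdf:Description; such a packet would be reported as "missing".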


On 27.10.2009 21:15:49 Robin Diederen wrote:
> Hi Jeremias and Andreas,
> 
> Thanks for the support; I've been toying quite a bit with PDFbox and I'm 
> finally getting somewhere. I did not know about the objects / schemas which 
> can contain metadata information (heck two weeks ago I didn't know of XMP 
> metadata ;-)).
> 
> Anyhow.. I finally am getting some results. For this specific project I was
> after the metadata which Adobe Reader shows as "description". I learned, by
> printing the raw XMP metadata, that this piece of information was stored in
> the Dublin Core schema. Nice as that is, using the getDescription method
> from the Dublin Core schema did not give me any results. However, by using
> getTextProperty("dc:description"), I get all the data I am after.
> 
> I do not have any clue why the getDescription call does not return anything,
> but I guess it's a bug of some kind.
> 
> Thanks for all your help!
> 
> Best, Robin
> 
> -----Original message-----
> From: Jeremias Maerki [mailto:[email protected]]
> Sent: Saturday, 24 October 2009 14:52
> To: [email protected]
> Subject: Re: Question XMP metadata extraction
> 
> I've just added an example that shows how to extract document-level XMP
> metadata:
> http://svn.apache.org/viewvc?rev=829357&view=rev
> 
> As Andreas noted, PDF supports attaching metadata to many different objects 
> (including pages, XObjects, fonts etc.). The most interesting packet will 
> certainly be that attached to the document catalog. I hope the new example 
> will help you solve your requirement, Robin.
> 
> On 23.10.2009 12:35:17 Andreas Lehmkühler wrote:
> > Hi,
> > 
> > Sent: Thu, 22 Oct 2009 From: Robin Diederen <[email protected]>
> > 
> > > Hi,
> > > 
> > > Thanks for looking into the code; I'm a bit confused though. I guess 
> > > it's your suggestion to inspect the three locations for metadata "by 
> > > hand"?  What would be the best way to proceed?
> > 
> > As I've already said, I'm not an XMP expert; I'm just trying to find the
> > possible locations where metadata is used within pdfbox.
> > 
> > PDPage-metadata:
> > - load the document
> > - get all pages calling document.getDocumentCatalog().getAllPages()
> > - iterate through all pages and check them for metadata calling 
> > getMetadata()
> > 
> > PDXObject:
> > - load the document
> > - get all pages calling document.getDocumentCatalog().getAllPages()
> > - iterate through all pages and get all XObjects by calling 
> > getXObjects()
> > - iterate through all XObjects and check them for metadata calling 
> > getMetadata()
> > 
> > I don't know if that really works, but give it a try.
> > 
> > BR
> > Andreas Lehmkühler
> > > 
> > > Best, Robin
> > >  
> > > -----Original message-----
> > > From: Andreas Lehmkühler <[email protected]>
> > > Sent: Thu 22-10-2009 22:36
> > > To: [email protected];
> > > Subject: Re: Question XMP metadata extraction
> > > 
> > > 
> > > Robin Diederen wrote:
> > > > Andreas,
> > > > 
> > > > According to the JavaDoc
> > > > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > > > the extractxmpmetadata method should be able to do this. Or am I
> > > > missing something?
> > > OK, I had a deeper look and it seems that there are 3 supported
> > > locations for metadata within pdfbox: PDDocumentCatalog, PDPage and
> > > PDXObject. The "classic" metadata is located in the catalog.
> > > Perhaps you will find the metadata you are looking for in the two other
> > > objects?
> > > 
> > > BR
> > > Andreas Lehmkühler
> > > 
> > > > Thanks for your help, greatly appreciated!
> > > > 
> > > >  
> > > > 
> > > > Best, Robin
> > > >  
> > > > -----Original message-----
> > > > From: Andreas Lehmkühler <[email protected]>
> > > > Sent: Thu 22-10-2009 22:09
> > > > To: [email protected];
> > > > Subject: Re: Question XMP metadata extraction
> > > > 
> > > > Hi,
> > > > 
> > > > Robin Diederen wrote:
> > > > >> Hello Andreas,
> > > > >>
> > > > >> I did have a look at the PrintDocumentMetaData.java file; there
> > > > >> I find that the metadata is extracted using PDDocumentInformation.
> > > > >> This code is useful for PDF files with "classic" metadata, but not
> > > > >> for PDF files only carrying XMP metadata, right?
> > > > OK, I see. I'm not that familiar with the XMP stuff, but I guess I 
> > > > understand your problem.
> > > > 
> > > > >> There's my issue.. as soon as I have a PDF file with only XMP
> > > > >> metadata I need some other way to extract this metadata..
> > > > > I'm afraid that pdfbox is still limited to the handling of "classic"
> > > > > metadata.
> > > > 
> > > > 
> > > >> Best, Robin
> > > >>  
> > > >> -----Original message-----
> > > >> From: Andreas Lehmkühler <[email protected]>
> > > >> Sent: Thu 22-10-2009 21:47
> > > >> To: [email protected];
> > > >> Subject: Re: Question XMP metadata extraction
> > > >>
> > > >> Hi,
> > > >>
> > > > >> Robin Diederen wrote:
> > > > >>> Hello all,
> > > > >>>
> > > > >>> I'm quite new to PDFbox and currently figuring out how to
> > > > >>> extract XMP-format metadata from PDF files.
> > > > >>>
> > > > >>> I have a few files containing XMP metadata, but I can not get
> > > > >>> any of those to work. And I can't seem to figure out where I am
> > > > >>> failing.
> > > >>>
> > > >>> A code snippet (all non-relevant code was deleted):
> > > >>>
> > > > >>> String inputFile = "/some/file.pdf";
> > > > >>>
> > > > >>> PDDocument pdfDocument = null;
> > > > >>> pdfDocument = new PDDocument();
> > > > >>> pdfDocument = PDDocument.load(inputFile);
> > > > >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> > > > >>>
> > > > >>> int metadataLength = pdfMetaData.getLength();
> > > > >>> System.out.println(pdfMetaData.getLength());
> > > > >>>
> > > > >>> pdfMetaData.exportXMPMetadata();
> > > > >>>
> > > > >>> The getLength call always returns 0; the exportXMPMetadata call
> > > > >>> returns an error:
> > > >>>
> > > >>> [Fatal Error] :-1:-1: Premature end of file.
> > > >>> Exception in thread "main" java.io.IOException: Premature end of file.
> > > >>>     at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> > > > >>>     at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> > > > >>>     at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> > > > >>>     at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> > > > >>>     at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> > > >>>
> > > >>>  
> > > >>>
> > > > >>> This happens for every PDF I test. Extracting metadata from the
> > > > >>> DocumentInformation table works like a charm. I'm using PDFbox
> > > > >>> 0.80 on Java 1.5.
> > > > >> Have a look at PrintDocumentMetaData as an example of how to
> > > > >> extract the document's metadata.
> > > >>
> > > >> HTH
> > > >> Andreas Lehmkühler
> > > >>
> > > >>
> > > > BR
> > > > Andreas Lehmkühler
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > 
> > --- end of original message ---
> 
> 
> 
> 
> Jeremias Maerki
> 
> 
> 




Jeremias Maerki
