Re: RE: Re: Question XMP metadata extraction

Andreas Lehmkühler Fri, 23 Oct 2009 03:35:51 -0700

Hi,

Sent: Thu, 22 Oct 2009 From: Robin Diederen <[email protected]>


> Hi,
> 
> Thanks for looking into the code; I'm a bit confused though. I guess it's
> your suggestion to inspect the three locations for metadata "by hand"?  What
> would be the best way to proceed?
As I've already said, I'm not an XMP expert; I'm just trying to find the
possible locations where metadata is used within pdfbox.

PDPage-metadata:
- load the document
- get all pages calling document.getDocumentCatalog().getAllPages()
- iterate through all pages and check them for metadata calling getMetadata()

PDXObject:
- load the document
- get all pages calling document.getDocumentCatalog().getAllPages()
- iterate through all pages and get all XObjects by calling getXObjects()
- iterate through all XObjects and check them for metadata calling getMetadata()

I don't know if that really works, but give it a try.
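
The two walks above can be sketched as a single traversal. This is a minimal, library-neutral sketch, not pdfbox code: the class name MetadataWalk and the method collect are hypothetical, and the three accessor functions are placeholders for whatever calls the library provides (getMetadata() on a page, getXObjects() for its XObjects, getMetadata() on an XObject, as named in the steps above). It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: walks pages and their XObjects looking for metadata. */
final class MetadataWalk {

    /**
     * Visits every page, then every XObject of that page, and collects each
     * non-null metadata object in the order it is encountered.
     *
     * P = page type, X = XObject type, M = metadata type.
     */
    static <P, X, M> List<M> collect(
            Collection<P> pages,
            Function<P, M> pageMetadata,
            Function<P, Collection<X>> pageXObjects,
            Function<X, M> xObjectMetadata) {
        List<M> found = new ArrayList<>();
        for (P page : pages) {
            // Step 1: metadata attached directly to the page, if any.
            M onPage = pageMetadata.apply(page);
            if (onPage != null) {
                found.add(onPage);
            }
            // Step 2: metadata attached to the page's XObjects, if any.
            Collection<X> xObjects = pageXObjects.apply(page);
            if (xObjects == null) {
                continue; // page has no XObjects
            }
            for (X xObject : xObjects) {
                M onXObject = xObjectMetadata.apply(xObject);
                if (onXObject != null) {
                    found.add(onXObject);
                }
            }
        }
        return found;
    }
}
```

With pdfbox, the pages argument would be the list returned by document.getDocumentCatalog().getAllPages(), and the accessors would wrap the getMetadata() and getXObjects() calls. An empty result would mean that neither location carries metadata, which leaves the catalog as the remaining place to look.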

BR 
Andreas Lehmkühler
> 
> Best, Robin
>  
> -----Original message-----
> From: Andreas Lehmkühler <[email protected]>
> Sent: Thu 22-10-2009 22:36
> To: [email protected]; 
> Subject: Re: Question XMP metadata extraction
> 
> 
> Robin Diederen wrote:
> > Andreas,
> > 
> > According to the JavaDoc
> > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > the exportXMPMetadata method should be able to do this. Or am I missing
> > something?
> Ok, I had a deeper look and it seems that there are 3 supported
> locations for metadata within pdfbox: PDDocumentCatalog, PDPage and
> PDXObject. The "classic" metadata are located in the catalog. Perhaps
> you will find the metadata you are looking for in the two other objects?
> 
> BR
> Andreas Lehmkühler
> 
> > Thanks for your help, greatly appreciated!
> > 
> >  
> > 
> > Best, Robin
> >  
> > -----Original message-----
> > From: Andreas Lehmkühler <[email protected]>
> > Sent: Thu 22-10-2009 22:09
> > To: [email protected]; 
> > Subject: Re: Question XMP metadata extraction
> > 
> > Hi,
> > 
> > Robin Diederen wrote:
> >> Hello Andreas,
> >>
> >> I did have a look at the PrintDocumentMetaData.java file; there I find
> >> that the metadata is extracted using PDDocumentInformation. This code is
> >> useful for PDF files with "classic" metadata, but not for PDF files only
> >> carrying XMP metadata, right?
> > OK, I see. I'm not that familiar with the XMP stuff, but I guess I
> > understand your problem.
> > 
> >> That's my issue: as soon as I have a PDF file with only XMP metadata, I
> >> need some other way to extract this metadata.
> > I'm afraid that pdfbox is still limited to handling "classic" metadata.
> > 
> > 
> >> Best, Robin
> >>  
> >> -----Original message-----
> >> From: Andreas Lehmkühler <[email protected]>
> >> Sent: Thu 22-10-2009 21:47
> >> To: [email protected]; 
> >> Subject: Re: Question XMP metadata extraction
> >>
> >> Hi,
> >>
> >> Robin Diederen wrote:
> >>> Hello all,
> >>>
> >>> I'm quite new to PDFbox and am currently figuring out how to extract
> >>> metadata in XMP format from PDF files.
> >>>
> >>> I have a few files containing XMP metadata, but I cannot get any of
> >>> them to work, and I can't figure out where I am going wrong.
> >>>
> >>> A code snippet (all non-relevant code was deleted):
> >>>
> >>> String inputFile = "/some/file.pdf";
> >>>
> >>> PDDocument pdfDocument = null;
> >>> pdfDocument = new PDDocument();
> >>> pdfDocument = PDDocument.load(inputFile);     
> >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> >>>             
> >>> int metadataLength = pdfMetaData.getLength();
> >>> System.out.println(pdfMetaData.getLength());
> >>>  
> >>>
> >>> pdfMetaData.exportXMPMetadata();
> >>>  
> >>>
> >>> The getLength call always returns 0; the exportXMPMetadata call returns
> >>> an error:
> >>>
> >>> [Fatal Error] :-1:-1: Premature end of file.
> >>> Exception in thread "main" java.io.IOException: Premature end of file.
> >>>     at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> >>>     at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> >>>     at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> >>>     at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> >>>     at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> >>>
> >>>  
> >>>
> >>> This happens for every PDF I test. Extracting metadata from the
> >>> DocumentInformation table works like a charm. I'm using PDFbox 0.80 on
> >>> Java 1.5.
> >> Have a look at PrintDocumentMetaData as an example of how to extract the
> >> document's metadata.
> >>
> >> HTH
> >> Andreas Lehmkühler
> >>
> >>
> > BR
> > Andreas Lehmkühler
> > 
> > 
> > 
> 
> 
> 

--- End of original message ---
