Robin,
could you just post the XMP packet (or the full PDF if that's easier) here?
That way we could test why this happens.
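
In the meantime, here is a rough sketch (plain JDK only, no PDFBox or JempBox) for checking how dc:description is stored in a packet. The XMP specification defines dc:description as a language alternative (rdf:Alt with rdf:li entries), but some producers write it as plain element text. My guess, and it is only a guess until we see your packet, is that getDescription looks for the language-alternative form while getTextProperty reads the plain text. The class name, the two sample packets and the "alt:"/"plain:" labels are made up for illustration; the sketch also ignores the attribute shorthand (dc:description="...") that XMP allows.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PDFBox or JempBox.
class XmpDescriptionProbe {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Illustrative packet: dc:description stored as plain element text.
    static final String PLAIN =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='" + DC_NS + "'>"
        + "<rdf:Description rdf:about=''>"
        + "<dc:description>Sample text</dc:description>"
        + "</rdf:Description></rdf:RDF>";

    // Illustrative packet: dc:description stored as a language alternative.
    static final String ALT =
        "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='" + DC_NS + "'>"
        + "<rdf:Description rdf:about=''>"
        + "<dc:description><rdf:Alt>"
        + "<rdf:li xml:lang='x-default'>Sample text</rdf:li>"
        + "</rdf:Alt></dc:description>"
        + "</rdf:Description></rdf:RDF>";

    /** Returns "alt: text", "plain: text", or "missing" for the first dc:description. */
    static String describe(String xmpPacket) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on namespace URIs
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmpPacket)));

            NodeList found = doc.getElementsByTagNameNS(DC_NS, "description");
            if (found.getLength() == 0) {
                return "missing";
            }
            Element description = (Element) found.item(0);

            // A language alternative keeps its values in rdf:li children.
            NodeList items = description.getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() > 0) {
                return "alt: " + items.item(0).getTextContent().trim();
            }
            return "plain: " + description.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(PLAIN));
        System.out.println(describe(ALT));
    }
}
```

If your packet comes back as "plain: ...", that would at least be consistent with the behaviour you describe.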
On 27.10.2009 21:15:49 Robin Diederen wrote:
> Hi Jeremias and Andreas,
>
> Thanks for the support; I've been toying quite a bit with PDFBox and I'm
> finally getting somewhere. I did not know about the objects / schemas which
> can contain metadata information (heck, two weeks ago I didn't know of XMP
> metadata ;-)).
>
> Anyhow.. I finally am getting some results. For this specific project I was
> after the metadata which Adobe Reader shows as "description". I learned, by
> printing the raw XMP metadata, that this piece of information was stored in
> the Dublin Core schema. Nice as that is, by using the getDescription method
> from the Dublin Core schema, I did not get any results. However, by using
> getTextProperty("dc:description"), I get all the data I am after.
>
> I do not have any clue why the getDescription call does not return anything,
> but I guess it's a bug of some kind.
>
> Thanks for all your help!
>
> Best, Robin
>
> -----Original message-----
> From: Jeremias Maerki [mailto:[email protected]]
> Sent: Saturday, 24 October 2009 14:52
> To: [email protected]
> Subject: Re: Question XMP metadata extraction
>
> I've just added an example that shows how to extract document-level XMP
> metadata:
> http://svn.apache.org/viewvc?rev=829357&view=rev
>
> As Andreas noted, PDF supports attaching metadata to many different objects
> (including pages, XObjects, fonts etc.). The most interesting packet will
> certainly be that attached to the document catalog. I hope the new example
> will help you solve your requirement, Robin.
>
> On 23.10.2009 12:35:17 Andreas Lehmkühler wrote:
> > Hi,
> >
> > Sent: Thu, 22 Oct 2009 From: Robin Diederen <[email protected]>
> >
> > > Hi,
> > >
> > > Thanks for looking into the code; I'm a bit confused though. I guess
> > > it's your suggestion to inspect the three locations for metadata "by
> > > hand"? What would be the best way to proceed?
> > As I've already said, I'm not an XMP expert; I just try to find possible
> > locations where metadata are used within pdfbox.
> >
> > PDPage-metadata:
> > - load the document
> > - get all pages calling document.getDocumentCatalog().getAllPages()
> > - iterate through all pages and check them for metadata calling
> > getMetadata()
> >
> > PDXObject:
> > - load the document
> > - get all pages calling document.getDocumentCatalog().getAllPages()
> > - iterate through all pages and get all XObjects by calling
> > getXObjects()
> > - iterate through all XObjects and check them for metadata calling
> > getMetadata()
> >
> > I don't know if that really works, but give it a try.
> >
> > BR
> > Andreas Lehmkühler
> > >
> > > Best, Robin
> > >
> > > -----Original message-----
> > > From: Andreas Lehmkühler <[email protected]>
> > > Sent: Thu 22-10-2009 22:36
> > > To: [email protected];
> > > Subject: Re: Question XMP metadata extraction
> > >
> > >
> > > Robin Diederen wrote:
> > > > Andreas,
> > > >
> > > > According to the JavaDoc
> > > > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > > > the extractxmpmetadata method should be able to do this. Or am I
> > > > missing something?
> > > Ok, I had a deeper look and it seems that there are 3 supported
> > > locations for metadata within pdfbox: PDDocumentCatalog, PDPage and
> > > PDXObject. The "classic" metadata are located in the catalog.
> > > Perhaps you will find the metadata you are looking for in the two other
> > > objects?
> > >
> > > BR
> > > Andreas Lehmkühler
> > >
> > > > Thanks for your help, greatly appreciated!
> > > >
> > > >
> > > >
> > > > Best, Robin
> > > >
> > > > -----Original message-----
> > > > From: Andreas Lehmkühler <[email protected]>
> > > > Sent: Thu 22-10-2009 22:09
> > > > To: [email protected];
> > > > Subject: Re: Question XMP metadata extraction
> > > >
> > > > Hi,
> > > >
> > > > Robin Diederen wrote:
> > > >> Hello Andreas,
> > > >>
> > > > >> I did have a look at the PrintDocumentMetaData.java file; there I
> > > > >> find that the metadata is extracted using PDDocumentInformation.
> > > > >> This code is useful for PDF files with "classic" metadata, but not
> > > > >> for PDF files carrying only XMP metadata, right?
> > > > OK, I see. I'm not that familiar with the XMP stuff, but I guess I
> > > > understand your problem.
> > > >
> > > > >> There's my issue: as soon as I have a PDF file with only XMP
> > > > >> metadata, I need some other way to extract this metadata.
> > > > > I'm afraid that pdfbox is still limited to the handling of "classic"
> > > > > metadata.
> > > >
> > > >
> > > >> Best, Robin
> > > >>
> > > >> -----Original message-----
> > > >> From: Andreas Lehmkühler <[email protected]>
> > > >> Sent: Thu 22-10-2009 21:47
> > > >> To: [email protected];
> > > >> Subject: Re: Question XMP metadata extraction
> > > >>
> > > >> Hi,
> > > >>
> > > > >> Robin Diederen wrote:
> > > >>> Hello all,
> > > >>>
> > > > >>> I'm quite new to PDFBox and currently figuring out how to extract
> > > > >>> metadata from PDF files which is in XMP format.
> > > > >>>
> > > > >>> I have a few files containing XMP metadata, but I can not get any
> > > > >>> of those to work. And I can't seem to figure out where I am failing.
> > > >>>
> > > >>> A code snippet (all non-relevant code was deleted):
> > > >>>
> > > > >>> String inputFile = "/some/file.pdf";
> > > > >>>
> > > > >>> PDDocument pdfDocument = null;
> > > > >>> pdfDocument = new PDDocument();
> > > > >>> pdfDocument = PDDocument.load(inputFile);
> > > > >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> > > >>>
> > > >>> int metadataLength = pdfMetaData.getLength();
> > > >>> System.out.println(pdfMetaData.getLength());
> > > >>>
> > > >>>
> > > >>> pdfMetaData.exportXMPMetadata();
> > > >>>
> > > >>>
> > > > >>> The getLength call always returns 0; the exportXMPMetadata call
> > > > >>> returns an error:
> > > > >>>
> > > > >>> [Fatal Error] :-1:-1: Premature end of file.
> > > > >>> Exception in thread "main" java.io.IOException: Premature end of file.
> > > > >>> at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> > > > >>> at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> > > > >>> at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> > > > >>> at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> > > > >>> at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> > > >>>
> > > >>>
> > > >>>
> > > > >>> This happens for every PDF I test. Extracting metadata from the
> > > > >>> DocumentInformation table works like a charm. I'm using PDFBox 0.80
> > > > >>> on Java 1.5.
> > > > >> Have a look at PrintDocumentMetaData as an example of how to extract
> > > > >> the document's metadata.
> > > >>
> > > >> HTH
> > > >> Andreas Lehmkühler
> > > >>
> > > >>
> > > > BR
> > > > Andreas Lehmkühler
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> > --- end of original message ----
>
>
>
>
> Jeremias Maerki
>
>
>
Jeremias Maerki