Thank you for letting us know about this and for sharing a file. I believe we should trust the XMP metadata over the PDFInfo for DC metadata keys like TikaCoreProperties.CREATED. I'll take a look.

On Mon, May 11, 2020 at 11:40 AM Tucker B <[email protected]> wrote:

> I have a PDF with XMP metadata with two rdf:Description tags with
> different namespaces. The first namespace is DublinCore the other is
> XMPSchemaBasic. I can confirm jempbox is able to read the XMP metadata
> properly and properly identify the namespaces. However, it appears the
> PDFParser in Tika is not adding XMPSchemaBasic metadata to the extracted
> metadata, specifically the CreateDate. I'm curious if this is expected
> behaviour. Ideally, the PDFParser would set the TikaCoreProperties.CREATED
> to the value in the XMP metadata absent the presence of a created date in
> the PDDocumentInformation. Or at least a Property such as "xmp:CreateDate".
> I've attached the XMP packet and a PDF with the XMP metadata. I'm using
> Tika 1.24.1. Any help or guidance would be greatly appreciated.
>
> Also, I noticed the XMP packet id is "W5M0MpCehiHzreSzNTczkc9d" which is
> base64 encoded string "[42!573]". Curious if anyone knows the
> significance of this.
>
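
For anyone following along, here is a rough sketch of the fallback described in the quoted message: use the document-information creation date when it is present, otherwise read xmp:CreateDate from the XMP packet. This is not Tika's PDFParser code and it does not use jempbox; it only uses the JDK's XML parser. The class name, the method names, and the sample packet (including its date value) are made up for illustration, and the sample leaves out the xpacket wrapper.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmpCreateDateSketch {
    static final String NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String NS_XMP = "http://ns.adobe.com/xap/1.0/";

    // Hypothetical packet body: two rdf:Description elements, one Dublin Core
    // and one XMP Basic, mirroring the structure described in the message.
    static final String SAMPLE_XMP =
        "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
      + "<rdf:RDF xmlns:rdf=\"" + NS_RDF + "\">"
      + "<rdf:Description rdf:about=\"\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
      + "<dc:format>application/pdf</dc:format>"
      + "</rdf:Description>"
      + "<rdf:Description rdf:about=\"\" xmlns:xmp=\"" + NS_XMP + "\">"
      + "<xmp:CreateDate>2020-05-11T11:40:00Z</xmp:CreateDate>"
      + "</rdf:Description>"
      + "</rdf:RDF></x:xmpmeta>";

    // Returns the xmp:CreateDate value, or null if the packet has none.
    static String xmpCreateDate(String xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URI, not prefix
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xmp)));

        // Element form: <xmp:CreateDate>...</xmp:CreateDate>
        NodeList elements = doc.getElementsByTagNameNS(NS_XMP, "CreateDate");
        if (elements.getLength() > 0) {
            return elements.item(0).getTextContent().trim();
        }
        // Attribute form: <rdf:Description xmp:CreateDate="...">
        NodeList descriptions = doc.getElementsByTagNameNS(NS_RDF, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            String value = ((Element) descriptions.item(i)).getAttributeNS(NS_XMP, "CreateDate");
            if (!value.isEmpty()) {
                return value;
            }
        }
        return null;
    }

    // Fallback order from the quoted message: document info first, then XMP.
    static String resolveCreated(String docInfoCreated, String xmp) throws Exception {
        if (docInfoCreated != null && !docInfoCreated.isEmpty()) {
            return docInfoCreated;
        }
        return xmp == null ? null : xmpCreateDate(xmp);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolveCreated(null, SAMPLE_XMP));
    }
}
```

Swapping the two checks in resolveCreated would give the other precedence (XMP over the document information), if that turns out to be the preferred behaviour.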
