RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Gilvary, Joseph Thu, 02 Jan 2020 05:55:27 -0800

Happy New Year, Sebastian,

Thank you. That looks promising. Hope you enjoy the holiday!


 Joe 

-----Original Message-----
From: Sebastian Nagel <[email protected]> 
Sent: Thursday, January 2, 2020 7:42 AM
To: [email protected]
Subject: Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

Hi Joseph,

this could be related to
   
https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.
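
To illustrate why mixed-case keys might go missing, here is a minimal Python
sketch. It is not Nutch code: select_fields is a hypothetical helper, and it
ASSUMES that the names configured in index.parse.md are lowercased before an
exact, case-sensitive lookup in the parse metadata. Under that assumption the
all-lowercase pdf:docinfo:* keys are found, while xmp:CreatorTool and
xmpTPg:NPages are not, which matches the indexchecker output quoted below.

```python
# Parse metadata as reported by parsechecker in this thread (subset).
parse_metadata = {
    "pdf:docinfo:title": "Test File",
    "pdf:docinfo:created": "2011-04-27T18:33:06Z",
    "xmp:CreatorTool": "PScript5.dll Version 5.2.2",
    "xmpTPg:NPages": "23",
}

# Field names as listed in index.parse.md (subset).
configured = [
    "pdf:docinfo:created",
    "pdf:docinfo:title",
    "xmp:CreatorTool",
    "xmpTPg:NPages",
]


def select_fields(metadata, names):
    """Hypothetical indexer step, not the actual Nutch implementation.

    ASSUMPTION: each configured name is lowercased, then looked up with an
    exact (case-sensitive) match in the parse metadata.
    """
    selected = {}
    for name in names:
        key = name.lower()
        if key in metadata:
            selected[key] = metadata[key]
    return selected


indexed = select_fields(parse_metadata, configured)
# Configured names whose lowercased form was not found in the metadata.
missing = [name for name in configured if name.lower() not in indexed]

print("indexed:", sorted(indexed))
print("missing:", missing)
```

If the assumption holds, the keys listed as missing are exactly the ones that
contain uppercase letters.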

I'm happy to check whether the attached patch fixes your problem when I'm back 
from holidays in a few days.

Best,
Sebastian

On 12/31/19 5:43 PM, Gilvary, Joseph wrote:
> Thanks, Markus,
> 
> Those are the tools I've been using to debug because it's quicker than 
> reindexing even a test collection in Solr. So parsechecker shows that these 
> fields are in the parse metadata, but I can't figure out how to get them into 
> the index. The pdf:docinfo:fields will index as pdf_docinfo_fields, but the 
> other namespaces using ':' aren't making it through and I'm at a loss.
> 
> Nutch schema.xml:
> 
> <field name="pdf_docinfo_created" type="pdates"/>
> <field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>
> 
> nutch-site.xml:
> 
>   <property>
>     <name>index.parse.md</name>
>     <value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages</value>
>   </property>
> 
> 
> Parsechecker sees the values for the xmp stuff:
> 
> Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4 
> pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2 
> access_permission:blah_blah_blah xmpTPg:NPages=23 
> access_permission:can_modify=true pdf:docinfo:producer=Acrobat 
> Distiller 7.0.5 (Windows) pdf:docinfo:created=2011-04-27T18:33:06Z
> 
> 
> Indexchecker doesn't:
> 
> fetching: http://127.0.01/test.pdf
> robots.txt whitelist not configured.
> parsing: http://127.0.01/test.pdf
> pdf:docinfo:title :     Test File
> tstamp :        Tue Dec 31 11:23:28 EST 2019
> pdf:docinfo:modified :  2011-04-27T18:36:58Z
> pdf:docinfo:created :   2011-04-27T18:33:06Z
> 
> 
> The Dublin Core values use a dot '.' rather than a colon ':' and they show 
> up fine. There are embedded spaces in some of the xmp values, but the 
> pdf:docinfo:title has them, too, and it shows up in the indexchecker output. 
> I'm wondering if there's anything special about the pdf:docinfo handling that 
> isn't generalized, or is somehow configurable for generalization, to other 
> namespaces. 
> 
>  Thanks,
> 
>  Joe
> 
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: Tuesday, December 31, 2019 8:30 AM
> To: [email protected]
> Subject: RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
> 
> Hello Joseph,
> 
>> Is there more documentation on having Nutch get what Tika sees into what 
>> Solr will see?
> 
> No, but I believe you would want to check out the parsechecker and 
> indexchecker tools. These tools display what Tika sees and what will be sent 
> to Solr.
> 
> Regards,
> Markus
>  
> -----Original message-----
>> From:Gilvary, Joseph <[email protected]>
>> Sent: Tuesday 31st December 2019 14:19
>> To: [email protected]
>> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15
>>
>> Happy New Year,
>>
>> I've searched the archives and the web as best I can, tinkered with 
>> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the 
>> parse metadata into the Solr (7.6) index.
>>
>> I want to index stuff like:
>>
>> xmp:CreatorTool=PScript5.dll Version 5.2.2
>> xmpTPg:NPages=23
>>
>> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping 
>> out ':' for '_' isn't working for the xmp stuff.
>>
>> Is there more documentation on having Nutch get what Tika sees into what 
>> Solr will see?
>>
>> Any help appreciated.
>>
>> Thanks,
>>
>> Joe
>>
