RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Gilvary, Joseph Tue, 31 Dec 2019 08:44:07 -0800

Thanks, Markus,

Those are the tools I've been using to debug, because it's quicker than
reindexing even a test collection in Solr. Parsechecker shows that these
fields are in the parse metadata, but I can't figure out how to get them into
the index. The pdf:docinfo:* fields index as pdf_docinfo_*, but the other
namespaces that use ':' aren't making it through, and I'm at a loss.


Nutch schema.xml:

<field name="pdf_docinfo_created" type="pdates"/>
<field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>

nutch-site.xml:

  <property>
    <name>index.parse.md</name>
    <value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages</value>
  </property>


Parsechecker sees the values for the xmp stuff:

Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4 
pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2 
access_permission:blah_blah_blah xmpTPg:NPages=23 
access_permission:can_modify=true pdf:docinfo:producer=Acrobat Distiller 7.0.5 
(Windows) pdf:docinfo:created=2011-04-27T18:33:06Z
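
To confirm which keys are really present in a long "Parse Metadata:" line, a small script can split it into key/value pairs. This is only a sketch: the function name parse_metadata_line and the regular expression are made up for illustration, and the split is a heuristic that assumes no value contains whitespace followed by "word=". The sample string is a shortened copy of the output above, with the elided access_permission entry left out.

```python
import re

# Heuristic (an assumption, not Nutch's own format definition): a key either
# starts the text or follows whitespace, and runs up to the next '='.
KEY = re.compile(r"(?:^|\s)([A-Za-z_][\w:.-]*)=")

def parse_metadata_line(line):
    """Split a 'Parse Metadata: k=v k=v ...' line into a dict of strings."""
    text = line.split("Parse Metadata:", 1)[-1].strip()
    matches = list(KEY.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its key to the start of the next key.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip()
    return fields

sample = (
    "Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4 "
    "pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2 "
    "xmpTPg:NPages=23 pdf:docinfo:created=2011-04-27T18:33:06Z"
)
fields = parse_metadata_line(sample)
for name in ("xmp:CreatorTool", "xmpTPg:NPages"):
    print(name, "->", fields.get(name))
```

Values with embedded spaces, such as the title and the creator tool, stay intact because a new key is only recognized when it is followed by '='.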


Indexchecker doesn't:

fetching: http://127.0.01/test.pdf
robots.txt whitelist not configured.
parsing: http://127.0.01/test.pdf
pdf:docinfo:title :     Test File
tstamp :        Tue Dec 31 11:23:28 EST 2019
pdf:docinfo:modified :  2011-04-27T18:36:58Z
pdf:docinfo:created :   2011-04-27T18:33:06Z
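
The comparison between what is configured and what indexchecker reports can be scripted. The following is a rough sketch, not part of Nutch: the helper names are hypothetical, the configured value is a shortened subset of the index.parse.md list above, and the parsing assumes indexchecker prints each field as "name :" followed by its value, as in the output shown here.

```python
def configured_fields(value):
    """Split a comma-separated index.parse.md value, dropping stray whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]

def indexed_fields(indexchecker_output):
    """Collect field names from lines shaped like 'name :<whitespace>value'."""
    fields = set()
    for line in indexchecker_output.splitlines():
        name, sep, _ = line.partition(" :")
        if sep:  # status lines such as 'fetching: ...' have no ' :' separator
            fields.add(name.strip())
    return fields

def missing_fields(value, indexchecker_output):
    """Configured names that never appear as a field in the indexchecker output."""
    present = indexed_fields(indexchecker_output)
    return [name for name in configured_fields(value) if name not in present]

# Shortened subset of the configured value (an assumption for illustration).
value = "pdf:docinfo:created,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages"

output = """fetching: http://127.0.01/test.pdf
robots.txt whitelist not configured.
parsing: http://127.0.01/test.pdf
pdf:docinfo:title :     Test File
tstamp :        Tue Dec 31 11:23:28 EST 2019
pdf:docinfo:modified :  2011-04-27T18:36:58Z
pdf:docinfo:created :   2011-04-27T18:33:06Z
"""

print(missing_fields(value, output))  # ['xmp:CreatorTool', 'xmpTPg:NPages']
```

This only reports which configured names are absent from the output; it says nothing about why they were dropped.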


The Dublin Core values use a dot '.' rather than a colon ':' and they show up
fine. There are embedded spaces in some of the xmp values, but
pdf:docinfo:title has those too, and it shows up in the indexchecker output.
I'm wondering if there's anything special about pdf:docinfo that isn't
generalized, or that is somehow configurable to generalize to other namespaces.

 Thanks,

 Joe

-----Original Message-----
From: Markus Jelsma <[email protected]> 
Sent: Tuesday, December 31, 2019 8:30 AM
To: [email protected]
Subject: RE: Extracting XMP metadata from PDF for indexing Nutch 1.15 

Hello Joseph,

> Is there more documentation on having Nutch get what Tika sees into what Solr 
> will see?

No, but I believe you would want to check out the parsechecker and indexchecker
tools. These tools display what Tika sees and what will be sent to Solr.

Regards,
Markus
 
-----Original message-----
> From:Gilvary, Joseph <[email protected]>
> Sent: Tuesday 31st December 2019 14:19
> To: [email protected]
> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15 
> 
> Happy New Year,
> 
> I've searched the archives and the web as best I can, tinkered with 
> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the 
> parse metadata into the Solr (7.6) index.
> 
> I want to index stuff like:
> 
> xmp:CreatorTool=PScript5.dll Version 5.2.2
> xmpTPg:NPages=23
> 
> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping 
> out ':' for '_' isn't working for the xmp stuff.
> 
> Is there more documentation on having Nutch get what Tika sees into what Solr 
> will see?
> 
> Any help appreciated.
> 
> Thanks,
> 
> Joe
>
