Thanks, Markus,
Those are the tools I've been using to debug, because they're quicker than
reindexing even a test collection in Solr. Parsechecker shows that these
fields are in the parse metadata, but I can't figure out how to get them into
the index. The pdf:docinfo:* fields index as pdf_docinfo_*, but the other
namespaces using ':' aren't making it through and I'm at a loss.
Nutch schema.xml:
<field name="pdf_docinfo_created" type="pdates"/>
<field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>
nutch-site.xml:
<property>
<name>index.parse.md</name>
<value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages
</value>
</property>
Parsechecker sees the values for the xmp stuff:
Parse Metadata:
  date=2011-04-27T18:36:58Z
  pdf:PDFVersion=1.4
  pdf:docinfo:title=Test File
  xmp:CreatorTool=PScript5.dll Version 5.2.2
  access_permission:blah_blah_blah
  xmpTPg:NPages=23
  access_permission:can_modify=true
  pdf:docinfo:producer=Acrobat Distiller 7.0.5 (Windows)
  pdf:docinfo:created=2011-04-27T18:33:06Z
Indexchecker doesn't:
fetching: http://127.0.01/test.pdf
robots.txt whitelist not configured.
parsing: http://127.0.01/test.pdf
pdf:docinfo:title : Test File
tstamp : Tue Dec 31 11:23:28 EST 2019
pdf:docinfo:modified : 2011-04-27T18:36:58Z
pdf:docinfo:created : 2011-04-27T18:33:06Z
The Dublin Core keys use a dot '.' rather than a colon ':' and they show up
fine. Some of the xmp values contain embedded spaces, but so does
pdf:docinfo:title, and it shows up in the indexchecker output. I'm wondering
if there's anything special about the pdf:docinfo handling that isn't
generalized, or that can be configured to generalize to other namespaces.
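
As a sanity check on the two outputs above, here is a minimal Python sketch
that lists which configured keys appear in the parse metadata but not in the
indexchecker output. The key lists are copied by hand from this message
(the parsechecker list is abbreviated, as above), and the function name
dropped_keys is just an illustrative helper, not anything from Nutch. The
mixed-case check at the end is only an observation about these key names;
whether letter case matters to Nutch is an assumption I haven't confirmed.

```python
# Keys configured in index.parse.md (copied from the nutch-site.xml above).
configured = (
    "description,keywords,dcterms.created,dcterms.modified,dcterms.subject,"
    "pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,"
    "xmp:CreatorTool,xmpTPg:NPages"
).split(",")

# Keys parsechecker listed in the (abbreviated) parse metadata above.
parsed = {
    "date", "pdf:PDFVersion", "pdf:docinfo:title", "xmp:CreatorTool",
    "xmpTPg:NPages", "access_permission:can_modify",
    "pdf:docinfo:producer", "pdf:docinfo:created",
}

# Keys indexchecker printed above.
indexed = {
    "pdf:docinfo:title", "tstamp",
    "pdf:docinfo:modified", "pdf:docinfo:created",
}


def dropped_keys(configured, parsed, indexed):
    """Configured keys present in the parse metadata but missing from the index output."""
    return [k for k in configured if k in parsed and k not in indexed]


dropped = dropped_keys(configured, parsed, indexed)

# Observation only: configured keys that contain any uppercase letter.
mixed_case = [k for k in configured if k != k.lower()]

print("dropped:   ", dropped)
print("mixed case:", mixed_case)
```

With the keys as pasted here, the dropped keys and the mixed-case keys come
out as the same two entries, which at least suggests something to test next
(for example, a key from another namespace that is all lowercase).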
Thanks,
Joe
-----Original Message-----
From: Markus Jelsma <[email protected]>
Sent: Tuesday, December 31, 2019 8:30 AM
To: [email protected]
Subject: RE: Extracting XMP metadata from PDF for indexing Nutch 1.15
Hello Joseph,
> Is there more documentation on having Nutch get what Tika sees into what Solr
> will see?
No, but I believe you would want to check out the parsechecker and indexchecker
tools. These tools display what Tika sees and what will be sent to Solr.
Regards,
Markus
-----Original message-----
> From: Gilvary, Joseph <[email protected]>
> Sent: Tuesday 31st December 2019 14:19
> To: [email protected]
> Subject: Extracting XMP metadata from PDF for indexing Nutch 1.15
>
> Happy New Year,
>
> I've searched the archives and the web as best I can, tinkered with
> nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the
> parse metadata into the Solr (7.6) index.
>
> I want to index stuff like:
>
> xmp:CreatorTool=PScript5.dll Version 5.2.2
> xmpTPg:NPages=23
>
> I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping
> out ':' for '_' isn't working for the xmp stuff.
>
> Is there more documentation on having Nutch get what Tika sees into what Solr
> will see?
>
> Any help appreciated.
>
> Thanks,
>
> Joe
>