Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

David Ferrero Thu, 08 Feb 2018 23:32:16 -0800

Thank you for this information. Since this is very much related to Any23 and 
microdata parsing, I’m going to ask what I believe is a related question but 
keep this same thread so it will be organized in one place:


I noticed that a lot of job boards, such as dice.com, monster.com, etc., use 
http://schema.org/JobPosting information; however, many seem to use 
<script type="application/ld+json">…</script> rather than RDF.
In summer 2017, Google announced structured data guidance for Jobs:
https://developers.google.com/search/docs/data-types/job-posting
and a testing tool to validate your HTML: 
https://search.google.com/structured-data/testing-tool
I verified a few sample listings from the above-mentioned job boards with 
Google’s testing tool, and they validate OK.

So after looking at http://any23.apache.org/getting-started.html for the 
supported extractors, I see that Any23 says it supports JSON-LD input, so I 
added this to nutch-site.xml to override the same property in 
nutch-default.xml:

<property>
    <name>any23.extractors</name>
    <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
    <description>Comma-separated list of Any23 extractors (a list of extractors 
is available here: http://any23.apache.org/getting-started.html)</description>
</property>

I expected to see additional information from nutch parsechecker after adding 
the jsonld extractors; however, I see NO changes to the Any23-Triples 
microdata that is parsed.

What might I be doing wrong?
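
One way to narrow this down outside of Nutch is to confirm that a page really 
serves JSON-LD blocks that parse cleanly. The following is only a minimal 
sketch using the Python standard library; JsonLdCollector, extract_jsonld and 
the sample HTML are hypothetical names made up for illustration and are not 
part of Nutch or Any23. It says nothing about how the Any23 extractors behave.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # The type attribute may be missing or valueless, hence the "or ''".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html):
    """Returns the parsed JSON-LD documents embedded in an HTML string."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    # json.loads raises ValueError if a block is not valid JSON.
    return [json.loads(block) for block in parser.blocks]


# Hypothetical sample page, not taken from any real job board.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "JobPosting", "title": "Example Engineer"}
</script>
</head><body></body></html>"""

items = extract_jsonld(SAMPLE_HTML)
print([item.get("@type") for item in items])
```

If a saved copy of a job listing yields no blocks here, or json.loads raises 
an error, the problem is likely in the page markup rather than in the Nutch 
configuration.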

> On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney <[email protected]> wrote:
> 
> Hi David,
> Answers inline
> 
> On Thu, Feb 8, 2018 at 9:19 AM, <[email protected]> wrote:
> 
>> 
>> From: David Ferrero <[email protected]>
>> To: [email protected]
>> Cc:
>> Bcc:
>> Date: Thu, 8 Feb 2018 10:19:52 -0700
>> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
>> Pull request #205 was recently merged into master branch for Nutch 1.x in
>> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
>> 
>> I am new to Nutch and Solr and have just started crawling and indexing a
>> few select websites. Using the built-in HTML parsing/indexing, I am getting
>> searchable fields like url, content, host, sometimes a title, and a few
>> other indexing-related fields like digest, boost, segment, and tstamp. That
>> said, I realized very quickly that I need better results. While exploring
>> the source of the website, I noticed references to schema.org and got
>> excited by what I saw. That’s how I stumbled upon NUTCH-1129.
>> 
>> I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 parser/indexer.
>> 
> 
> Excellent.
> 
> 
>> 
>> Q: Now what?  How do I gain Any23 microdata parsing / indexing
>> capabilities introduced by NUTCH-1129?
>> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
>> plugin.includes with something like parse-(html | tika |
>> any23)|index-(basic | anchor | any23)
>> 
> 
> No, you just add 'any23' to the list of plugins within the plugin.includes
> property of nutch-site.xml
> 
> 
>> Q: How do I expose the discovered microdata structure / items to end-user
>> such as Solr? For example, what are the microdata items and do I need to
>> map them to Solr in solrindex-mapping.xml?
>> 
> 
> OK, so the current configuration for the Any23 plugin is to store extracted
> structured data markup in the Nutch Metadata object under the key
> "Any23-Triples". You can locate it using something like the ParserChecker
> tool provided via the 'nutch' script. Likewise, you can also locate it, as a
> representation of what would be indexed, by using the IndexerChecker
> tooling also provided within the 'nutch' script.
> 
> For example, after crawling https://smartive.ch/jobs, the data is indexed
> as follows:
> 
> 
>          "structured_data": [
>            {
>              "node": "<https://smartive.ch/jobs>",
>              "value": "\"IE-edge,chrome=1\"@de",
>              "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
>              "short_key": "X-UA-Compatible"
>            },
>            {
>              "node": "<https://smartive.ch/jobs>",
>              "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
>              "key": "<http://vocab.sindice.net/any23#description>",
>              "short_key": "description"
>            },
>            {
>              "node": "<https://smartive.ch/jobs>",
>              "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
>              "key": "<http://vocab.sindice.net/any23#viewport>",
>              "short_key": "viewport"
>            },
>            {
>              "node": "<https://smartive.ch/jobs>",
>              "value": "\"width=device-width,initial-scale=1\"@de",
>              "key": "<http://vocab.sindice.net/any23#viewport>",
>              "short_key": "viewport"
>            },
>            {
>              "node": "<https://smartive.ch/jobs>",
>              "value": "\"ie=edge\"@de",
>              "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
>              "short_key": "x-ua-compatible"
>            }
>          ],
> 
> 
> Note from the above that the 'key' field (the predicate of each triple) is
> very useful for quickly filtering for, say, Hotel Ratings or something
> similar.
> 
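
The filtering Lewis mentions can be sketched roughly as follows. This is an 
illustration only: values_for is a hypothetical helper, the entries are an 
abbreviated copy of the structured_data output quoted above, and matching 
short_key case-insensitively is an assumption made here because the sample 
contains both "X-UA-Compatible" and "x-ua-compatible".

```python
# Abbreviated copy of the "structured_data" entries from the example output.
STRUCTURED_DATA = [
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"IE-edge,chrome=1"@de',
        "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
        "short_key": "X-UA-Compatible",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"width=device-width, initial-scale=1, shrink-to-fit=no"@de',
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"width=device-width,initial-scale=1"@de',
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"ie=edge"@de',
        "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
        "short_key": "x-ua-compatible",
    },
]


def values_for(entries, short_key):
    """Returns the values of all entries whose short_key matches, ignoring case."""
    wanted = short_key.lower()
    return [
        entry["value"]
        for entry in entries
        if entry.get("short_key", "").lower() == wanted
    ]


viewport_values = values_for(STRUCTURED_DATA, "viewport")
compat_values = values_for(STRUCTURED_DATA, "x-ua-compatible")
print(viewport_values)
print(compat_values)
```

The same idea applies to any predicate of interest: swap in the short_key (or 
compare against the full key IRI) for the property you care about.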
> 
>> 
>> I’d also be interested to learn how to point at a specific URL and see how
>> nutch sees the microdata (best case), then learn how to leverage this into
>> nutch and finally into solr.
>> 
>> 
> See the tooling for ParserChecker and IndexerChecker as explained above.
> If you have any further questions, please let me know.
> Lewis
