Thank you for this information. Since this is closely related to Any23 and microdata parsing, I’m going to ask what I believe is a related question in this same thread, so that everything stays organized in one place:
I noticed that a lot of job boards, such as dice.com and monster.com, use http://schema.org/JobPosting information. However, many seem to use <script type="application/ld+json">…</script> rather than RDF.

In summer 2017, Google announced structured data guidance for job postings:
https://developers.google.com/search/docs/data-types/job-posting
and a testing tool to validate your HTML:
https://search.google.com/structured-data/testing-tool

I verified a few sample listings from the above-mentioned job boards with Google’s testing tool and they validate OK.

After looking at http://any23.apache.org/getting-started.html for the supported extractors, I see that Any23 says it supports JSON-LD input, so I added this to nutch-site.xml to override the same property in nutch-default.xml:

  <property>
    <name>any23.extractors</name>
    <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
    <description>Comma-separated list of Any23 extractors (a list of extractors is available here: http://any23.apache.org/getting-started.html)</description>
  </property>

I expected to see additional information from nutch parsechecker after adding the jsonld extractors, however I see NO change in the parsed Any23-Triples microdata.

What might I be doing wrong?

> On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney <lewi...@apache.org> wrote:
>
> Hi David,
> Answers inline
>
> On Thu, Feb 8, 2018 at 9:19 AM, <user-digest-h...@nutch.apache.org> wrote:
>
>>
>> From: David Ferrero <david.ferr...@zion.com>
>> To: user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Thu, 8 Feb 2018 10:19:52 -0700
>> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
>> Pull request #205 was recently merged into master branch for Nutch 1.x in
>> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
>>
>> I am new to nutch and solr and have just started crawling and indexing a
>> few select websites. Using the built-in html parsing/indexing, I am getting
>> searchable fields like url, content, host, sometimes a title, and a few
>> other indexing related fields like digest, boost, segment, and tstamp. That
>> said, I realized very quickly that I need better results. While exploring
>> the source of the website, I noticed references to schema.org and got
>> excited by what I saw. That’s how I stumbled upon NUTCH-1129.
>>
>> I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 parser/indexer.
>>
>
> Excellent.
>
>
>>
>> Q: Now what? How do I gain Any23 microdata parsing / indexing
>> capabilities introduced by NUTCH-1129?
>> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
>> plugin.includes with something like
>> parse-(html | tika | any23)|index-(basic | anchor | any23)
>>
>
> No, you just add 'any23' to the list of plugins within the plugin.includes
> property of nutch-site.xml
>
>
>> Q: How do I expose the discovered microdata structure / items to end-user
>> such as Solr? For example, what are the microdata items and do I need to
>> map them to Solr in solrindex-mapping.xml?
>>
>
> OK, so the current configuration for the Any23 plugin is to store extracted
> structured data markup in the Nutch Metadata object with a key
> "Any23-Triples". You can locate it using something like the ParserChecker
> tool provided via the 'nutch' script. Likewise, you can also locate it, as a
> representation of what would be indexed, by using the IndexerChecker
> tooling also provided within the 'nutch' script.
>
> As an example, data is now indexed as follows (after crawling
> https://smartive.ch/jobs):
>
>
> "structured_data": [
>   {
>     "node": "<https://smartive.ch/jobs>",
>     "value": "\"IE-edge,chrome=1\"@de",
>     "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
>     "short_key": "X-UA-Compatible"
>   },
>   {
>     "node": "<https://smartive.ch/jobs>",
>     "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
>     "key": "<http://vocab.sindice.net/any23#description>",
>     "short_key": "description"
>   },
>   {
>     "node": "<https://smartive.ch/jobs>",
>     "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
>     "key": "<http://vocab.sindice.net/any23#viewport>",
>     "short_key": "viewport"
>   },
>   {
>     "node": "<https://smartive.ch/jobs>",
>     "value": "\"width=device-width,initial-scale=1\"@de",
>     "key": "<http://vocab.sindice.net/any23#viewport>",
>     "short_key": "viewport"
>   },
>   {
>     "node": "<https://smartive.ch/jobs>",
>     "value": "\"ie=edge\"@de",
>     "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
>     "short_key": "x-ua-compatible"
>   }
> ],
>
>
> Note from the above that the 'key' (predicate) field is very useful for
> quickly filtering for, for example, hotel ratings or something similar.
>
>
>>
>> I’d also be interested to learn how to point at a specific URL and see how
>> nutch sees the microdata (best case), then learn how to leverage this into
>> nutch and finally into solr.
>>
>>
> See the tooling for ParserChecker and IndexerChecker as explained above.
> If you have any further questions, please let me know.
> Lewis
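
The filtering idea mentioned in the quoted reply can be sketched roughly as follows. This is only an illustration in Python, not part of Nutch or Any23: the function name filter_by_short_key and the STRUCTURED_DATA list are hypothetical, and the entries merely mimic the shape (node, value, key, short_key) of the "structured_data" field in the quoted output.

```python
# Hypothetical sample mimicking the "structured_data" entries quoted above.
# Only the field names (node, value, key, short_key) come from the thread;
# a real document would come from the search index, not a literal list.
STRUCTURED_DATA = [
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"IE-edge,chrome=1"@de',
        "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
        "short_key": "X-UA-Compatible",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"width=device-width, initial-scale=1, shrink-to-fit=no"@de',
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"width=device-width,initial-scale=1"@de',
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": '"ie=edge"@de',
        "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
        "short_key": "x-ua-compatible",
    },
]


def filter_by_short_key(entries, short_key):
    """Return the values of all entries whose short_key matches exactly."""
    return [
        entry["value"]
        for entry in entries
        if entry.get("short_key") == short_key
    ]


if __name__ == "__main__":
    # Print every value stored under the "viewport" predicate.
    for value in filter_by_short_key(STRUCTURED_DATA, "viewport"):
        print(value)
```

The match is deliberately exact and case-sensitive, so "X-UA-Compatible" and "x-ua-compatible" are treated as different keys, just as they appear as separate entries in the quoted output.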