Andrzej, thanks so much. It's great that Nutch follows HEAD <link>, since
that's the preferred place for autodiscovery of RDF/OWL data. The type
attribute of the <link> tag can be set to "application/owl+xml" or
"application/rdf+xml" so that the Nutch crawler knows the linked resource
has RDF/OWL content.
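
For example, a page could advertise its RDF version with something like
this in its HEAD (the href and title here are just placeholders):

    <link rel="alternate" type="application/rdf+xml"
          title="RDF version of this page"
          href="http://example.org/doc.rdf" />

(rel="alternate" is the same convention already used for RSS
autodiscovery.)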

A related question: if I want Nutch to fetch only RDF/OWL files, is it
possible to generate the fetch list from URLs whose type is
"application/owl+xml" or "application/rdf+xml"? Filtering by file extension
does not always work, because the resource URL may not have an extension
like ".rdf". If Nutch kept the content type for each <link> item it finds,
that type could be used later when selecting URLs for the fetch list.
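
As a workaround outside of Nutch (just a sketch, not an existing Nutch
feature), I could probe each candidate URL with an HTTP HEAD request and
keep only those whose Content-Type matches:

    import java.net.HttpURLConnection;
    import java.net.URL;

    /** Sketch only: check a URL's Content-Type via an HTTP HEAD request. */
    public class MimeProbe {
      static boolean isRdfOrOwl(String url) {
        try {
          HttpURLConnection conn =
              (HttpURLConnection) new URL(url).openConnection();
          conn.setRequestMethod("HEAD");        // fetch headers only, no body
          String type = conn.getContentType();  // e.g. "application/rdf+xml; charset=..."
          conn.disconnect();
          return type != null
              && (type.startsWith("application/rdf+xml")
                  || type.startsWith("application/owl+xml"));
        } catch (Exception e) {
          return false;                         // unreachable or malformed URL
        }
      }
    }

Of course, an extra HEAD request per URL is costly at crawl scale, which
is why keeping the <link> type from the parse would be preferable.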

I plan to use Nutch to crawl specifically for RDF/OWL files and then parse
them into Lucene documents for storage in a Lucene index. This index of
semantic data will then be searched from the same Nutch search interface.
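
Roughly, I have in mind something like the following (a sketch only,
assuming Jena for the RDF parsing; the URL, index path, and field names
are placeholders):

    import com.hp.hpl.jena.rdf.model.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /** Sketch: fetch one RDF file, flatten it to text, index as one doc. */
    public class RdfIndexer {
      public static void main(String[] args) throws Exception {
        String url = "http://example.org/data.rdf";   // placeholder URL
        Model model = ModelFactory.createDefaultModel();
        model.read(url);                  // Jena fetches and parses the RDF/XML

        // Flatten all triples into a single searchable text field.
        StringBuffer text = new StringBuffer();
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
          Statement s = it.nextStatement();
          text.append(s.getSubject()).append(' ')
              .append(s.getPredicate()).append(' ')
              .append(s.getObject()).append('\n');
        }

        Document doc = new Document();
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", text.toString(),
                          Field.Store.YES, Field.Index.TOKENIZED));

        IndexWriter writer =
            new IndexWriter("rdf-index", new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
      }
    }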

Thanks,
AJ

On 6/16/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

AJ Chen wrote:
> I'm about to use Nutch to crawl semantic data. Links to semantic data
> files (RDF, OWL, etc.) can be placed in two places: (1) HEAD <link>;
> (2) BODY <a href...>. Does the Nutch crawler follow the HEAD <link>?

Yes. Please see parse-html/..../DOMContentUtils.java for details.

>
> I'm also creating a semantic data publishing tool, and I would appreciate
> any suggestions regarding the best way to make RDF files visible to the
> Nutch crawler.

Well, Nutch is certainly not a competitor to an RDF triple-store ;) It
may be used to collect RDF files, and then map-reduce jobs can be used
to massively process these files to annotate large numbers of target
resources (e.g. add metadata to pages in the crawldb). You could also
load them into a triple store and use that to annotate resources in
Nutch, to provide a better search experience (e.g. searching by concept
or by semantic relationship, finding similar concepts in other
ontologies, etc.).

In the end, the model that Nutch supports best is the Lucene model:
an unordered bag of documents, each with multiple fields (properties).
If you can translate your required model into this, then you're all set.
Nutch/Hadoop also provides a scalable processing framework, which is
quite useful for enriching the existing data with data from external
sources (e.g. databases, triple stores, ontologies, semantic nets and
such).
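
For instance, one natural translation (a sketch only, using Jena types;
the field naming is just one possible convention) is one Lucene document
per RDF subject, with predicate local names as field names:

    import com.hp.hpl.jena.rdf.model.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    /** Sketch: map one RDF subject to one Lucene document. */
    public class ResourceMapper {
      static Document toDocument(Model model, Resource subject) {
        Document doc = new Document();
        // Assumes a named resource, not a blank node.
        doc.add(new Field("uri", subject.getURI(),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        // One field per statement about this subject.
        StmtIterator it = model.listStatements(subject, null, (RDFNode) null);
        while (it.hasNext()) {
          Statement s = it.nextStatement();
          doc.add(new Field(s.getPredicate().getLocalName(),
                            s.getObject().toString(),
                            Field.Store.YES, Field.Index.TOKENIZED));
        }
        return doc;
      }
    }

Such documents can then be queried by field, e.g. title:foo or
creator:bar, which already gives you a crude form of search-by-property.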

In some cases, when this external infrastructure is efficient enough,
it's possible to combine it on the fly (I have successfully used this
approach with WordNet, Wikipedia and DMOZ); in other cases you will need
to do some batch pre-processing to make this external metadata available
as part of the Nutch documents ... again, the map/reduce and DFS
framework is very useful for that (and I have used this approach too,
even with the same data as above).

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


