Andrzej, thanks so much. It's great that nutch follows HEAD, since
it's the preferred place for autodiscovery of RDF/OWL data. The type
property inside the tag can be set to "application/owl+xml" or
"application/rdf+xml" so that the nutch crawler knows the linked resource
has RDF/OWL content.
A related
AJ Chen wrote:
> I'm about to use nutch to crawl semantic data. Links to semantic data
> files (RDF, OWL, etc.) can be placed in two places: (1) HEAD ; (2)
> BODY href...>. Does the nutch crawler follow the HEAD ?
Yes. Please see parse-html//DOMContentUtils.java for details.
>
> I'm also c
I was going to suggest the same approach. Seems simple enough and would force
the person to edit the config. What is entered in place of EDITME is another
story, but maybe some code can enforce some rules on that, too.
Otis
- Original Message
From: Teruhiko Kurosaka <[EMAIL PROTECTED
I'm about to use nutch to crawl semantic data. Links to semantic data files
(RDF, OWL, etc.) can be placed in two places: (1) HEAD ; (2) BODY . Does the nutch crawler follow the HEAD ?
I'm also creating a semantic data publishing tool; I would appreciate any
suggestion regarding the best way to mak
I guess that's the middle of the road approach, with the two extremes being
raw data and standardized approach.
I agree that we should make some kind of open web directory or info. I think a
decentralized approach will make it more difficult to distribute the data
whereas a centralized exposes us
That does sound fairly brilliant. One thing you'll have to keep
in mind is that different plugins index different things and sometimes
the same things in different ways. You'll need to make sure that crawl
data is labeled with both the plugins used and the versions of each of
the plugins.
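One way to picture the labeling idea is a small manifest that travels with each crawl. This is only a sketch: the class CrawlLabel, its methods, and the version numbers are hypothetical and not part of nutch; the plugin names are used purely as examples.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical label attached to a distributed crawl; not a nutch class. */
class CrawlLabel {
    // Plugin id -> plugin version, kept sorted so labels compare and print stably.
    private final Map<String, String> pluginVersions;

    CrawlLabel(Map<String, String> pluginVersions) {
        this.pluginVersions = Collections.unmodifiableMap(new TreeMap<>(pluginVersions));
    }

    /** Two crawls are treated as mergeable only if plugins and versions match exactly. */
    boolean sameToolchainAs(CrawlLabel other) {
        return pluginVersions.equals(other.pluginVersions);
    }

    /** Renders the label as "plugin=version;plugin=version" in sorted order. */
    String describe() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : pluginVersions.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> plugins = new TreeMap<>();
        plugins.put("parse-html", "1.0");   // version numbers are made up
        plugins.put("index-basic", "1.0");
        System.out.println(new CrawlLabel(plugins).describe());
    }
}
```

An exact-match rule is the strictest choice; a real scheme might allow compatible version ranges instead.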
Michael,
Superb idea! And if those crawls could be distributed through a protocol
like bittorrent, it would spread out the load versus having a single
bottleneck somewhere. I haven't thought it through, but here's some
information (the pdf is the best place to start).
http://www.bittorrent.com/bi
How about introducing these changes in an effort to force the nutch admins
to properly edit the bot identity strings?
1. Add the http.agent.* entries to nutch-site.xml with the value being
"EDITME".
The description should clearly state that these values *must* be edited
to reflect the true
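A rough sketch of how code could refuse to run while the placeholder is still in place. Everything here is an assumption for illustration: the class AgentConfigCheck and its method are hypothetical, the properties are passed as a plain Map rather than read from nutch's own configuration, and the two property names merely stand in for the http.agent.* entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical startup check; not part of nutch. */
class AgentConfigCheck {
    static final String PLACEHOLDER = "EDITME";

    /** Returns the required keys that are missing, blank, or still set to the placeholder. */
    static List<String> uneditedKeys(Map<String, String> props, List<String> requiredKeys) {
        List<String> bad = new ArrayList<>();
        for (String key : requiredKeys) {
            String value = props.get(key);
            if (value == null || value.trim().isEmpty() || PLACEHOLDER.equals(value.trim())) {
                bad.add(key);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Example values only; a real check would read the site configuration.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("http.agent.name", "EDITME");
        props.put("http.agent.url", "http://example.org/bot.html");

        List<String> bad = uneditedKeys(props, Arrays.asList("http.agent.name", "http.agent.url"));
        if (!bad.isEmpty()) {
            System.err.println("Refusing to start; please edit: " + bad);
        }
    }
}
```

This only catches the untouched placeholder and empty values; rules about what counts as an acceptable identity string would need more than this.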
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v3.patch
Version of patch that doesn't "...process the String twice if it contains some
illegal characters!". Its name
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Version: 0.8-dev
(was: 0.7)
Was version 0.7. Changed 'Affects Version' to 0.8-dev.
> OpenSearchServlet outputs illegal xml characters
[
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12416526 ]
Jerome Charron commented on NUTCH-258:
--
> Because we don't want one RuntimeException killing all subsequent fetching
> tasks
Chris, a RuntimeException will not kill all su
[
http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416523 ]
Jerome Charron commented on NUTCH-110:
--
This patch process the String twice if it contains some illegal characters!
> OpenSearchServlet outputs illegal xml characters
> --
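For reference, a single-pass way to drop illegal XML characters could look like the sketch below. It is not the code from any of the attached patches; the class XmlCharFilter and its methods are hypothetical. It uses the XML 1.0 character ranges and only allocates a buffer once the first illegal character is seen, so a clean string is scanned once and returned unchanged.

```java
/** Hypothetical helper; not taken from the NUTCH-110 patches. */
class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes illegal characters in one pass; returns the same instance if nothing was removed. */
    static String stripIllegal(String s) {
        StringBuilder out = null; // created lazily at the first illegal character
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                // Copy the clean prefix seen so far, then skip the bad character.
                out = new StringBuilder(s.length());
                out.append(s, 0, i);
            }
            i += Character.charCount(cp);
        }
        return out == null ? s : out.toString();
    }

    public static void main(String[] args) {
        String dirty = "a" + (char) 0 + "b" + (char) 1 + "c";
        System.out.println(stripIllegal(dirty));
    }
}
```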
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
John VanDyk updated NUTCH-110:
--
Attachment: fixIllegalXmlChars08-v2.patch
Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did
it.
This patch fixes the illegal XML charac
Paul Sutter wrote:
> I think that Nutch has to solve the problem: if you leave the problem to the
> websites, they're more likely to cut you off than they are to implement
> their own index storage scheme. Besides, they'd get it wrong, have stale
> data, etc.
>
agreed
> Maybe what is needed is
I'm somewhat worried about the possible clash in the conf name-space -
usually, when we store Objects in a Configuration instance, we use their
full class name, or at least a long and most probably unique string. In
this case, we use just "http", "https", "ftp", "file" and so on ...
Would it make s
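To illustrate the name-space concern, here is a small sketch of keys built from a class name plus the short name. It does not use the real Configuration class; the map-based cache, the class ObjectCacheKeys, and the stand-in Protocol interface are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of namespaced cache keys; not nutch code. */
class ObjectCacheKeys {

    /** Stand-in for a protocol plugin interface. */
    interface Protocol { }

    /** Builds a key such as "<owner class name>.http" instead of a bare "http". */
    static String key(Class<?> owner, String shortName) {
        return owner.getName() + "." + shortName;
    }

    public static void main(String[] args) {
        Map<String, Object> cache = new HashMap<>();

        // Some unrelated component happens to use the bare name.
        cache.put("http", "unrelated value");

        // The namespaced key cannot collide with it.
        String protocolKey = key(Protocol.class, "http");
        cache.put(protocolKey, new Object());

        System.out.println(cache.size() + " entries, protocol key = " + protocolKey);
    }
}
```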
[EMAIL PROTECTED] wrote:
> Author: siren
> Date: Thu Jun 15 13:53:14 2006
> New Revision: 414681
>
> URL: http://svn.apache.org/viewvc?rev=414681&view=rev
> Log:
> protocols are now instantiated and configured only once
>
>
[...]
> +
> + if (conf.getObject(protocolName) != null) {
> +