Re: [Nutch-dev] does nutch follow HEAD element?

2006-06-16 Thread AJ Chen
Andrzej, thanks so much. It's great that nutch follows HEAD <link>, since it's the preferred place for autodiscovery of rdf/owl data. The type attribute inside the <link> tag can be set to "application/owl+xml" or "application/rdf+xml" so that the nutch crawler knows the linked resource has rdf/owl content. A related
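
To illustrate the autodiscovery idea above, here is a minimal sketch of picking out HEAD links whose type marks them as RDF/OWL. It is not Nutch's DOMContentUtils code; the class name SemanticLinkFinder is hypothetical, and it assumes the page is well-formed XHTML so the JDK's XML parser can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
final class SemanticLinkFinder {
    // The two MIME types named in the message above.
    private static final Set<String> SEMANTIC_TYPES =
        Set.of("application/rdf+xml", "application/owl+xml");

    // Returns the href of every <link> inside <head> whose type is RDF or OWL.
    // Assumption: the input is well-formed XHTML.
    public static List<String> find(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("page is not well-formed XHTML", e);
        }
        List<String> hrefs = new ArrayList<>();
        NodeList heads = doc.getElementsByTagName("head");
        for (int i = 0; i < heads.getLength(); i++) {
            NodeList links = ((Element) heads.item(i)).getElementsByTagName("link");
            for (int j = 0; j < links.getLength(); j++) {
                Element link = (Element) links.item(j);
                String type = link.getAttribute("type").trim().toLowerCase(Locale.ROOT);
                if (SEMANTIC_TYPES.contains(type) && link.hasAttribute("href")) {
                    hrefs.add(link.getAttribute("href"));
                }
            }
        }
        return hrefs;
    }
}
```

Links in BODY are ignored on purpose here, since the sketch only covers the HEAD case discussed in this thread.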

Re: [Nutch-dev] does nutch follow HEAD element?

2006-06-16 Thread Andrzej Bialecki
AJ Chen wrote: > I'm about to use nutch to crawl semantic data. Links to semantic data > files > (RDF, OWL, etc.) can be placed in two places: (1) HEAD <link>; (2) > BODY <a href...>. Does nutch crawler follow the HEAD <link>? Yes. Please see parse-html//DOMContentUtils.java for details. > > I'm also c

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread ogjunk-nutch
I was going to suggest the same approach. Seems simple enough and would force the person to edit the config. What is entered in place of EDITME is another story, but maybe some code can enforce some rules on that, too. Otis - Original Message From: Teruhiko Kurosaka <[EMAIL PROTECTED

[Nutch-dev] does nutch follow HEAD element?

2006-06-16 Thread AJ Chen
I'm about to use nutch to crawl semantic data. Links to semantic data files (RDF, OWL, etc.) can be placed in two places: (1) HEAD <link>; (2) BODY <a href...>. Does nutch crawler follow the HEAD <link>? I'm also creating a semantic data publishing tool, and I would appreciate any suggestion regarding the best way to mak

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread peter decrem
I guess that's the middle-of-the-road approach, with the two extremes being raw data and a standardized approach. I agree that we should make some kind of open web directory or info. I think a decentralized approach will make it more difficult to distribute the data whereas a centralized one exposes us

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Vanderdray, Jacob
That does sound fairly brilliant. One thing you'll have to keep in mind is that different plugins index different things and sometimes the same things in different ways. You'll need to make sure that crawl data is labeled with both the plugins used and the versions of each of the plugins.

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Paul Sutter
Michael, Superb idea! And if those crawls could be distributed through a protocol like bittorrent, it would spread out the load versus having a single bottleneck somewhere. I haven't thought it through, but here's some information (the pdf is the best place to start). http://www.bittorrent.com/bi

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Teruhiko Kurosaka
How about introducing these changes in an effort to force the nutch admins to properly edit the bot identity strings? 1. Add the http.agent.* entries to nutch-site.xml with the value being "EDITME". The description should clearly state that these values *must* be edited to reflect the true
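
A rough sketch of how such a check might look, assuming the configuration is available as a plain key/value map. The class AgentConfigCheck and its methods are hypothetical and do not use Nutch's Configuration API; treating a blank value as unedited is an extra assumption beyond the "EDITME" rule proposed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical startup check, not part of Nutch.
final class AgentConfigCheck {
    static final String PLACEHOLDER = "EDITME";
    static final String PREFIX = "http.agent.";

    // Returns, in sorted order, every http.agent.* key still holding the placeholder.
    // Assumption: a blank value counts as unedited too.
    static List<String> uneditedKeys(Map<String, String> conf) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(conf).entrySet()) {
            if (!e.getKey().startsWith(PREFIX)) {
                continue; // only the bot identity properties are checked
            }
            String value = e.getValue() == null ? "" : e.getValue().trim();
            if (value.isEmpty() || value.equalsIgnoreCase(PLACEHOLDER)) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }

    // Refuses to continue while any identity property is unedited.
    static void requireEdited(Map<String, String> conf) {
        List<String> bad = uneditedKeys(conf);
        if (!bad.isEmpty()) {
            throw new IllegalStateException("Edit these properties before crawling: " + bad);
        }
    }
}
```

Calling requireEdited before a crawl starts would make the crawler fail fast instead of running with a placeholder identity. It cannot tell whether the edited values are truthful.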

[Nutch-dev] [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars08-v3.patch Version of patch that doesn't "...process the String twice if it contains some illegal characters!". Its name
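
For readers unfamiliar with the issue, the sketch below shows one way to drop illegal XML 1.0 characters in a single pass. It is not the attached patch, whose contents are not shown here; the class XmlCharFilter is hypothetical, and the legal ranges follow the Char production of the XML 1.0 specification.

```java
// Hypothetical filter, not the NUTCH-110 patch.
final class XmlCharFilter {
    // Legal code points per the XML 1.0 "Char" production.
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Scans the input once. A copy is started only at the first illegal
    // character, so clean strings are returned unchanged without copying.
    static String strip(String s) {
        StringBuilder out = null;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (isLegal(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                out = new StringBuilder(s.length());
                out.append(s, 0, i); // keep the clean prefix seen so far
            }
            i += Character.charCount(cp);
        }
        return out == null ? s : out.toString();
    }
}
```

Unpaired surrogates are removed as well, because codePointAt reports them as code points in the D800-DFFF range, which the Char production excludes.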

[Nutch-dev] [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread [EMAIL PROTECTED] (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Version: 0.8-dev (was: 0.7) Was version 0.7. Changed 'Affects Version' to 0.8-dev. > OpenSearchServlet outputs illegal xml characters

[Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-16 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12416526 ] Jerome Charron commented on NUTCH-258: -- > Because we don't want one RuntimeException killing all subsequent fetching > tasks Chris, a RuntimeException will not kill all su

[Nutch-dev] [jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416523 ] Jerome Charron commented on NUTCH-110: -- This patch processes the String twice if it contains some illegal characters! > OpenSearchServlet outputs illegal xml characters > --

[Nutch-dev] [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2006-06-16 Thread John VanDyk (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] John VanDyk updated NUTCH-110: -- Attachment: fixIllegalXmlChars08-v2.patch Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did it. This patch fixes the illegal XML charac

Re: [Nutch-dev] IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Michael Wechner
Paul Sutter wrote: > I think that Nutch has to solve the problem: if you leave the problem to the > websites, they're more likely to cut you off than they are to implement > their own index storage scheme. Besides, they'd get it wrong, have stale > data, etc. > agreed > Maybe what is needed is

Re: [Nutch-dev] [Nutch-cvs] svn commit: r414681 - /lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ProtocolFactory.java

2006-06-16 Thread Jérôme Charron
I'm somewhat worried about the possible clash in the conf name-space - usually, when we store Objects in a Configuration instance, we use their full class name, or at least a long and most probably unique string. In this case, we use just "http", "https", "ftp", "file" and so on ... Would it make s
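
The naming concern can be sketched as follows, with a plain map standing in for the object cache of a Configuration instance. NamespacedCache and its methods are hypothetical, not the ProtocolFactory code from r414681; the only point is that a key built from the full class name plus the protocol name is far less likely to clash than a bare "http".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a per-configuration object cache.
final class NamespacedCache {
    private final Map<String, Object> objects = new HashMap<>();

    // Builds a key such as "java.lang.String.http" instead of just "http".
    static String key(Class<?> owner, String name) {
        return owner.getName() + "." + name;
    }

    // Creates the object on first use and returns the cached instance afterwards.
    <T> T getOrCreate(Class<T> type, String name, Function<String, T> factory) {
        String k = key(type, name);
        Object cached = objects.get(k);
        if (cached == null) {
            cached = factory.apply(name);
            objects.put(k, cached);
        }
        return type.cast(cached);
    }
}
```

Two components that both cache something under the short name "http" then get separate entries, as long as they store different types.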

Re: [Nutch-dev] [Nutch-cvs] svn commit: r414681 - /lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ProtocolFactory.java

2006-06-16 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: > Author: siren > Date: Thu Jun 15 13:53:14 2006 > New Revision: 414681 > > URL: http://svn.apache.org/viewvc?rev=414681&view=rev > Log: > protocols are now instantiated and configured only once > > [...] > + > + if (conf.getObject(protocolName) != null) { > +