RE: Writable vs Externalizable

2005-08-08 Thread Chirag Chaman
In our experience, we use flavors of Nutch RPC, RMI and Externalizable. RMI has been easy to implement when only one server needs to be accessed (such as a status check) and the class has many functions. The Nutch RPC is excellent for distribution -- yes, one needs to serialize by hand and create the

Re: Writable vs Externalizable

2005-08-08 Thread Stefan Groschupf
What do others think? I think RMI isn't a good idea; I wasted a lot of time with it. I like the Nutch RPC very much. However, I think using Externalizable is a good idea: first, it is a very small change. Second, many users use Nutch for very custom things and usage of Externalizable mak

Re: Writable vs Externalizable

2005-08-08 Thread Doug Cutting
Stefan Groschupf wrote: can someone please tell me what is the technical difference between org.apache.nutch.io.Writable and java.io.Externalizable? For me that looks very similar and Externalizable is available since jdk 1.1. What do I miss? You don't miss much! I avoided using Java's bui

Writable vs Externalizable

2005-08-08 Thread Stefan Groschupf
Hi, can someone please tell me what is the technical difference between org.apache.nutch.io.Writable and java.io.Externalizable? For me that looks very similar and Externalizable is available since jdk 1.1. What do I miss? Thanks for any hints. Stefan
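The comparison asked about above can be made concrete with a small sketch. `SimpleWritable` below is a hypothetical stand-in that only assumes the general shape of a Writable-style interface (a `write(DataOutput)` and `readFields(DataInput)` pair); it is not copied from the Nutch source, and `PageScore` is an invented example type. The sketch serializes the same object both ways so the two byte streams can be compared.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical Writable-style interface (an assumption, not the Nutch source).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Invented example type that implements both contracts.
class PageScore implements SimpleWritable, Externalizable {
    String url = "";
    float score;

    public PageScore() {}  // Externalizable needs a public no-arg constructor

    PageScore(String url, float score) {
        this.url = url;
        this.score = score;
    }

    @Override public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    @Override public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        score = in.readFloat();
    }

    // ObjectOutput extends DataOutput (and ObjectInput extends DataInput),
    // so the Externalizable methods can simply delegate.
    @Override public void writeExternal(ObjectOutput out) throws IOException {
        write(out);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        readFields(in);
    }
}

final class WritableVsExternalizable {
    // Raw field bytes only: the reader must already know the type.
    static byte[] viaWritable(PageScore p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            p.write(out);
        }
        return bytes.toByteArray();
    }

    // Java object stream: adds a stream header and a class descriptor.
    static byte[] viaExternalizable(PageScore p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        return bytes.toByteArray();
    }

    static PageScore readWritable(byte[] data) throws IOException {
        PageScore p = new PageScore();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            p.readFields(in);
        }
        return p;
    }

    static PageScore readExternalizable(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (PageScore) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        PageScore p = new PageScore("http://example.org/", 0.5f);
        System.out.println("Writable-style bytes: " + viaWritable(p).length);
        System.out.println("Externalizable bytes: " + viaExternalizable(p).length);
    }
}
```

In this sketch the two per-object methods are nearly identical; the visible difference is in the surrounding stream, since `ObjectOutputStream` also records which class to instantiate while the Writable-style reader has to create the instance itself.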

Re: svn commit: r230887 - /lucene/nutch/trunk/conf/nutch-default.xml

2005-08-08 Thread Doug Cutting
[EMAIL PROTECTED] wrote: - http://www.nutch.org/docs/en/bot.html + http://lucene.apache.org/nutch/bot.html I think this should now be: http://lucene.apache.org/nutch/bot.html The docs/en pages have mostly been reduced to the "about" page, whose translations I hate to throw away, even thoug

Re: User agent string

2005-08-08 Thread Doug Cutting
+1 Piotr Kosiorowski wrote: Hello, We should probably change user agent string in nutch-default.xml to point to Apache site. The only question is http.agent.version - should we set it to 0.07 for release and 0.08-dev for future work? I do not know how it was used previously. Current values:

User agent string

2005-08-08 Thread Piotr Kosiorowski
Hello, We should probably change user agent string in nutch-default.xml to point to Apache site. The only question is http.agent.version - should we set it to 0.07 for release and 0.08-dev for future work? I do not know how it was used previously. Current values: http.agent.url http://ww

Re: ndfs problem needs fix

2005-08-08 Thread Jay Pound
#2 from your response: "I'm not yet sure how disk failures appear to a JVM. Things are currently written so that if an exception is thrown during disk i/o then the datanode should take itself offline, initiating replication of its data. We'll see if that's sufficient." the data is replicate

Re: ndfs problem needs fix

2005-08-08 Thread Doug Cutting
Jay Pound wrote: 1.) we need to split up chunks of data into sub-folders so as not to hit the filesystem's limit on the number of files in a single directory, like the way squid splits up its data into directories. I agree. I am currently using reiser with NDFS so this is no

Re: svn commit: r230867 - /lucene/nutch/trunk/conf/crawl-urlfilter.txt.template

2005-08-08 Thread Piotr Kosiorowski
No problem for me. I have just run the test crawl on http://lucene.apache.org/nutch as described in the new tutorial, and a lot of PDF and PNG files were causing big exceptions and stack traces in the log. I thought that people (usually using Nutch for the first time) might think that they did something

Re: svn commit: r230867 - /lucene/nutch/trunk/conf/crawl-urlfilter.txt.template

2005-08-08 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Skipping png and pdf files. I think the undocumented convention has been to, by default, still fetch content types that are not parsed by default, but which may be parsed by simply enabling a plugin. That way folks need to only change one place in order to, e.g., st

Re: regex-url filter

2005-08-08 Thread Jay Pound
Is there any way to filter results to English via search, so I can set up a multi-language search? I thought I saw somewhere that you could put something into the HTML form, a switch sent while submitting the form, that would use a plugin to filter the results. I know I had seen some benchmarks on a

RE: regex-url filter

2005-08-08 Thread Chirag Chaman
Here's a better way: http://([a-z0-9]*\.)*(com|org|net|biz|edu|mil|us|info|cc)/ FYI, this will not remove non-English sites as such, only international sites that follow the two-letter country-code convention. CC- -Original Message- From: Jay Pound [mailto:[EMAIL PROTECTED] Sent: Monday, August 08, 2
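A pattern of this kind is easy to check against sample URLs before using it on a large index. The sketch below is only an illustration: it assumes a variant of the pattern with no `.` between the host group and the TLD group and with `biz` listed once, and the `^` anchor, the use of `find()`, and the class and method names are assumptions of this example rather than a description of how the Nutch filter applies its rules.

```java
import java.util.regex.Pattern;

final class TldFilterCheck {
    // Assumed variant of the pattern discussed in the thread.
    static final Pattern GENERIC_TLD = Pattern.compile(
            "^http://([a-z0-9]*\\.)*(com|org|net|biz|edu|mil|us|info|cc)/");

    // True when the URL's host ends in one of the listed TLDs.
    static boolean accepts(String url) {
        return GENERIC_TLD.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/",
            "http://example.org/page.html",
            "http://www.example.de/",
            "http://example.co.uk/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Note that `[a-z0-9]*` does not allow hyphens or upper-case letters in host labels, so hosts containing them would be rejected by this variant as well.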

regex-url filter

2005-08-08 Thread Jay Pound
I would like confirmation from someone that this will work. I've edited the regex filter in hopes of weeding out non-English sites from my search results. I'll be testing pruning on my current 40mil index to see if it works there, or maybe there is a way to set the search to return only English resu

Re: Nutch website deployment

2005-08-08 Thread Piotr Kosiorowski
Thanks. I will add it to the Wiki (but not today). P. Doug Cutting wrote: Piotr Kosiorowski wrote: So I have installed forrest and modified src/site/src/documentation/content/xdocs. Then I ran 'forrest', and it generated content in src/site/build/site. And now the questions: Should I copy src/site

Re: Tutorial

2005-08-08 Thread Doug Cutting
+1 Piotr Kosiorowski wrote: Hello, Some time ago someone mentioned on the list a problem with the Nutch tutorial (I cannot find this email now). I have checked it today and he/she was right. If you follow the Nutch Intranet Crawling tutorial you will end up with a not very interesting index. This is

Re: Nutch website deployment

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: So I have installed forrest and modified src/site/src/documentation/content/xdocs. Then I ran 'forrest', and it generated content in src/site/build/site. And now the questions: Should I copy src/site/build/site to site and commit it? Yes. I'm impressed that you got th

Re: JIRA access

2005-08-08 Thread Piotr Kosiorowski
Thanks. It works. Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be able

Re: JIRA access

2005-08-08 Thread Doug Cutting
Piotr Kosiorowski wrote: Looking around in JIRA I found out I cannot resolve an issue. I am not sure how it works but I suspect I lack some rights to do so. Am I right? I have added you to the nutch-developers Jira group. Now you should be able to resolve issues, etc. Doug

NUTCH 79 Fault tolerant searching.

2005-08-08 Thread Piotr Kosiorowski
Hello, I just created an issue in JIRA http://issues.apache.org/jira/browse/NUTCH-79 containing the code for fault tolerant searching. I think it is too late to include it in 0.7 release but I would wait for comments and test it in the meantime. I would like to commit it when release would be d

Re: Ignore external links from crawled domains

2005-08-08 Thread Ken Krugler
A very basic facility seems to be missing in Nutch. If I have a list of 2000 URLs in the Nutch DB and want to ignore external links, I have to build a regex-filter with thousands of different domains I want to crawl. There is no parameter to crawl only those domains and ignore external links. At these t
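The behaviour being asked for, keeping only links that stay on the host of the page they were found on, can be sketched independently of Nutch. Everything below is hypothetical: `SameHostLinks`, `hostOf` and `internalOnly` are invented names for illustration and do not correspond to a Nutch API or configuration option.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SameHostLinks {
    // Lower-cased host of a URL, or null when it cannot be determined.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;  // malformed URL: treat as having no host
        }
    }

    // Keep only the outlinks whose host equals the host of the source page.
    static List<String> internalOnly(String pageUrl, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        String pageHost = hostOf(pageUrl);
        if (pageHost == null) {
            return kept;
        }
        for (String link : outlinks) {
            if (pageHost.equals(hostOf(link))) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("http://example.org/about.html");
        links.add("http://other.example.net/");
        System.out.println(internalOnly("http://example.org/index.html", links));
    }
}
```

Comparing whole hosts is the simplest choice; treating `www.example.org` and `example.org` as the same site would need an extra normalization step that this sketch leaves out.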

Re: luke??

2005-08-08 Thread Jay Pound
I got it to work now. It wasn't selecting the directory I had chosen, so I typed it in and it works fine. BTW, very cool tool. -J - Original Message - From: "Fredrik Andersson" <[EMAIL PROTECTED]> To: Sent: Sunday, August 07, 2005 6:16 PM Subject: Re: luke?? > That's odd, Luke is working g

Re: Tutorial

2005-08-08 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: I can commit such changes for 0.7 release (it means today) if I got positive feedback from other committers. +1 -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic

Tutorial

2005-08-08 Thread Piotr Kosiorowski
Hello, Some time ago someone mentioned on the list a problem with the Nutch tutorial (I cannot find this email now). I have checked it today and he/she was right. If you follow the Nutch Intranet Crawling tutorial you will end up with a not very interesting index. This is because it recommends users to

Creation of a Graph File with the DB Link Graph Database

2005-08-08 Thread Nils Hoeller
Hi, my Searcher is now running on the index I made with Nutch. Everything seems to work, so I am moving on to a main part of my app. Before Nutch I used Arachnid as a crawler. During crawling I used my method /** * Each page considered to be inserted in the sitemap graph is stored in