GNU Getopt

2005-12-19 Thread Andrew McNabb
I'm on break right now, and I'm hoping to have a chance to get some stuff done. One thing I would like to do for Nutch is to use GNU Getopt (should be familiar for C coders out there) to make the command-line utilities behave properly. I'm especially thinking of NDFS. Would there be any objectio
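For readers unfamiliar with the GNU Getopt conventions mentioned above (short and long options, options freely mixed with operands), here is a minimal sketch. It uses Python's standard `getopt.gnu_getopt`, which follows the same conventions; it is not the Java library the message presumably has in mind, and the option names `-v/--verbose` and `-n/--namenode` are hypothetical, not actual Nutch or NDFS flags.

```python
import getopt


def parse_args(argv):
    """Parse a hypothetical NDFS-style command line using GNU getopt rules."""
    # "v" takes no argument; "n:" requires one. The long options mirror them.
    # gnu_getopt allows options and operands to be intermixed.
    opts, operands = getopt.gnu_getopt(argv, "vn:", ["verbose", "namenode="])
    settings = {"verbose": False, "namenode": None}
    for flag, value in opts:
        if flag in ("-v", "--verbose"):
            settings["verbose"] = True
        elif flag in ("-n", "--namenode"):
            settings["namenode"] = value
    return settings, operands
```

Called as `parse_args(["put", "-v", "--namenode=host:9000", "file.txt"])`, the options are recognized even though they sit between the operands `put` and `file.txt`, which is the behavior C programmers expect from GNU-style tools.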

RE: [Nutch-dev] distributed search

2005-12-19 Thread Ledio Ago
Rafi, Based on what you're saying, this tool splits a fetchlist into several fetchlists so that we can crawl/fetch the URLs from different fetchers, right? If so, that is not what I'm after. I'm trying to split an existing index into smaller partitions, so that I can make those partitions se

RE: [Nutch-dev] distributed search

2005-12-19 Thread Ledio Ago
I have the book so I'll check what I can do with the API. Thanks Stefan, Ledio -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Monday, December 19, 2005 3:38 PM To: nutch-dev@lucene.apache.org Subject: Re: [Nutch-dev] distributed search > By the way, is there

Re: [Nutch-dev] distributed search

2005-12-19 Thread Rafi Iz
Check the following command: FetchListTool (-local | -ndfs ) [-refetchonly] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays] This command calls a function named emitMultipleLists, which emits several fetchlists so that you can fetch across several machines.

[jira] Updated: (NUTCH-145) build of war file fails on Chinese (zh) .xml files due to UTF-8 BOM

2005-12-19 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-145?page=all ] KuroSaka TeruHiko updated NUTCH-145: Summary: build of war file fails on Chinese (zh) .xml files due to UTF-8 BOM (was: ant build of the war fie fails on Chinese (zh) .xml files due to UTF

[jira] Updated: (NUTCH-145) ant build of the war fie fails on Chinese (zh) .xml files due to UTF-8 BOM

2005-12-19 Thread KuroSaka TeruHiko (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-145?page=all ] KuroSaka TeruHiko updated NUTCH-145: Attachment: NUTCH-145-fix.zip header.xml should go to src/web/include/zh/header.xml, and other *.xml should go to src/web/pages/zh/ > ant build of the

[jira] Commented: (NUTCH-145) ant build of the war fie fails on Chinese (zh) .xml files due to UTF-8 BOM

2005-12-19 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-145?page=comments#action_12360876 ] Stefan Groschupf commented on NUTCH-145: Patch files are always welcome, even if it takes some time for them to be committed. :) However, just create a patch file like below and atta

[jira] Created: (NUTCH-145) ant build of the war fie fails on Chinese (zh) .xml files due to UTF-8 BOM

2005-12-19 Thread KuroSaka TeruHiko (JIRA)
ant build of the war fie fails on Chinese (zh) .xml files due to UTF-8 BOM -- Key: NUTCH-145 URL: http://issues.apache.org/jira/browse/NUTCH-145 Project: Nutch Type: Bug Components: web gui
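The issue title attributes the build failure to a UTF-8 byte order mark (the bytes EF BB BF) at the start of the .xml files. As an illustration only, a file's contents can be checked for that prefix and cleaned roughly as follows; this is a sketch in Python and is not the content of the attached fix.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Input that does not start with a BOM is returned unchanged, so the function is safe to apply to every file in a directory.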

Re: [Nutch-dev] distributed search

2005-12-19 Thread Stefan Groschupf
By the way, is there an easy way to split the index I already have? I would hate to recrawl all of the 1.9MM URLs again and waste bandwidth. Well, I do not know of any tool that comes with Nutch, or any other tool, that does it; maybe there is one. But to write a java class that creates two smal
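The thread is about splitting an existing index into smaller partitions without recrawling. One common way to decide which partition a document belongs to is to hash a stable key such as its URL. The sketch below shows only that assignment step on a plain list of URLs; it does not touch the Lucene or Nutch APIs, and the function names are hypothetical.

```python
import zlib


def partition_of(url: str, num_partitions: int) -> int:
    """Map a URL to a partition number using a stable CRC32 hash."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


def split_urls(urls, num_partitions):
    """Distribute URLs into num_partitions lists; each URL lands in exactly one."""
    parts = [[] for _ in range(num_partitions)]
    for url in urls:
        parts[partition_of(url, num_partitions)].append(url)
    return parts
```

Because the hash depends only on the URL, the same document is always assigned to the same partition, so the split can be repeated or extended later without reshuffling by hand.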

RE: [Nutch-dev] distributed search

2005-12-19 Thread Ledio Ago
I tried moving Tomcat to a separate machine, and bingo: performance went up by 30%. Right now I only have two machines with 900K URLs each that act as Nutch servers and one machine that hosts Tomcat. At this time I no longer suspect that Tomcat is synchronously requesting

Re: Latest version of Mapred

2005-12-19 Thread Jérôme Charron
> Thanks for the fast response, > Do you know where I can find a compressed version? Here are the nightly builds: http://cvs.apache.org/dist/lucene/nutch/nightly/ Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Latest version of Mapred

2005-12-19 Thread Rafi Iz
Thanks for the fast response, Do you know where I can find a compressed version? Thanks, Rafi From: Stefan Groschupf <[EMAIL PROTECTED]> Reply-To: nutch-dev@lucene.apache.org To: nutch-dev@lucene.apache.org Subject: Re: Latest version of Mapred Date: Mon, 19 Dec 2005 19:00:29 +0100 mapred is

Re: [VOTE] Commiter access for Stefan Groschupf

2005-12-19 Thread Piotr Kosiorowski
+1 - especially for the amount of support Stefan gives to Nutch users. P. Andrzej Bialecki wrote: Hi, During the past year and more Stefan participated actively in the development, and contributed many high-quality patches. He's been spending considerable effort on addressing many issues in JIRA, an

Re: problems http-client

2005-12-19 Thread Andrzej Bialecki
Stefan Groschupf wrote: OK, I will do that tomorrow! However, in case it is known to be buggy, we probably should not set it up as the default HTTP protocol plugin, as it is today. Newbies checking out Nutch will use the version that does not fetch all pages, since most people start with the standard config

Re: problems http-client

2005-12-19 Thread Michael
The same problem occurs on FreeBSD 6.0 + jdk1.4.2. I think it was also reported some time ago by Rod Taylor. Switch to protocol-http. SG> Hi there, SG> is there someone out there that can confirm a problem we discovered? SG> We were wondering why not all pages of a generated segment were SG> fetched.

Re: problems http-client

2005-12-19 Thread Stefan Groschupf
OK, I will do that tomorrow! However, in case it is known to be buggy, we probably should not set it up as the default HTTP protocol plugin, as it is today. Newbies checking out Nutch will use the version that does not fetch all pages, since most people start with the standard configuration. Am 19.12.2005 u

Re: problems http-client

2005-12-19 Thread Andrzej Bialecki
Stefan Groschupf wrote: Anyway, today we noticed that when fetching with http-client the sum of errors and fetched pages is much less than the size defined when generating the segment. Changing to protocol-http solves the problem. Has anyone else noticed this behavior? I haven't, but this plugi

problems http-client

2005-12-19 Thread Stefan Groschupf
Hi there, is there someone out there who can confirm a problem we discovered? We were wondering why not all pages of a generated segment were fetched. The strangest thing was that the sum of errors and successfully fetched pages was never the same as what we defined in topN when generating a segment. F

Re: Latest version of Mapred

2005-12-19 Thread Stefan Groschupf
mapred is now trunk... On 19.12.2005 at 18:46, Rafi Iz wrote: Hi all, I am currently working with Nutch 0.7.1 and want to start using mapred. Any ideas where I can find the latest version? B.T.W. I looked at the path: http://svn.apache.org/repos/asf/lucene/nutch/branches/ but the only d

Latest version of Mapred

2005-12-19 Thread Rafi Iz
Hi all, I am currently working with Nutch 0.7.1 and want to start using mapred. Any ideas where I can find the latest version? B.T.W. I looked at the path: http://svn.apache.org/repos/asf/lucene/nutch/branches/ but the only directory that exists there is branch-0.7/ Thanks, Raffi