I'm on break right now, and I'm hoping to have a chance to get some
stuff done. One thing I would like to do for Nutch is to use GNU Getopt
(should be familiar for C coders out there) to make the command-line
utilities behave properly. I'm especially thinking of NDFS.
Would there be any objectio
Rafi,
Based on what you're saying, this tool splits a fetchlist into several
fetchlists so that we can crawl/fetch the URLs from different fetchers, right?
If so, that is not what I'm after. I'm trying to split an existing index
into smaller partitions, so that I can make those partitions se
I have the book so I'll check what I can do with the API.
Thanks Stefan,
Ledio
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Monday, December 19, 2005 3:38 PM
To: nutch-dev@lucene.apache.org
Subject: Re: [Nutch-dev] distributed search
> By the way, is there
check the next command
FetchListTool (-local | -ndfs )
[-refetchonly] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers]
[-adddays numDays]
This command calls a function named emitMultipleLists, which emits
several fetchlists so that you can fetch across several machines.
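
To illustrate the idea of splitting one fetchlist across several fetchers, here is a minimal sketch in Python. It is not Nutch's emitMultipleLists; the function name split_fetchlist, the (url, score) input shape, and the round-robin distribution are all assumptions made for illustration. The top_n, cutoff and num_fetchers parameters only loosely mirror the -topN, -cutoff and -numFetchers options in the usage line above.

```python
from typing import List, Optional, Tuple


def split_fetchlist(
    scored_urls: List[Tuple[str, float]],
    num_fetchers: int,
    top_n: Optional[int] = None,
    cutoff: Optional[float] = None,
) -> List[List[str]]:
    """Hypothetical helper: split scored URLs into num_fetchers lists.

    This is an illustrative sketch, not the logic Nutch actually uses.
    """
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")

    # Drop URLs scoring below the cutoff, if a cutoff was given.
    candidates = [
        (url, score)
        for url, score in scored_urls
        if cutoff is None or score >= cutoff
    ]

    # Highest-scoring URLs first; keep only the top N if requested.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        candidates = candidates[:top_n]

    # Deal the URLs out round-robin, one list per fetcher.
    fetchlists: List[List[str]] = [[] for _ in range(num_fetchers)]
    for index, (url, _score) in enumerate(candidates):
        fetchlists[index % num_fetchers].append(url)
    return fetchlists
```

Each returned list could then be handed to a separate fetcher machine. A real splitter would probably also group URLs by host for politeness reasons, which this sketch does not attempt.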
[ http://issues.apache.org/jira/browse/NUTCH-145?page=all ]
KuroSaka TeruHiko updated NUTCH-145:
Summary: build of war file fails on Chinese (zh) .xml files due to UTF-8
BOM (was: ant build of the war fie fails on Chinese (zh) .xml files due to
UTF
[ http://issues.apache.org/jira/browse/NUTCH-145?page=all ]
KuroSaka TeruHiko updated NUTCH-145:
Attachment: NUTCH-145-fix.zip
header.xml should go to src/web/include/zh/header.xml, and other *.xml should
go to src/web/pages/zh/
> ant build of the
[
http://issues.apache.org/jira/browse/NUTCH-145?page=comments#action_12360876 ]
Stefan Groschupf commented on NUTCH-145:
Patch files are always welcome, even if it takes some time for them to be committed. :)
However, just create a patch file like below and atta
ant build of the war fie fails on Chinese (zh) .xml files due to UTF-8 BOM
--
Key: NUTCH-145
URL: http://issues.apache.org/jira/browse/NUTCH-145
Project: Nutch
Type: Bug
Components: web gui
By the way, is there an easy way to split the index I already
have?
I would hate to recrawl all of the 1.9MM URLs again and waste
bandwidth.
Well, I do not know of any tool, bundled with Nutch or otherwise,
that does it; maybe there is one.
But to write a Java class that creates two smal
I tried moving Tomcat to a separate machine, and bingo:
performance went up by 30%. Right now I only have two machines
with 900K URLs each that act as Nutch servers and one machine that hosts
Tomcat.
At this point I no longer suspect that Tomcat is synchronously requesting
> Thanks for the fast response,
> Do you know where I can find a compressed version?
Here are the nightly builds:
http://cvs.apache.org/dist/lucene/nutch/nightly/
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Thanks for the fast response,
Do you know where I can find a compressed version?
Thanks,
Rafi
From: Stefan Groschupf <[EMAIL PROTECTED]>
Reply-To: nutch-dev@lucene.apache.org
To: nutch-dev@lucene.apache.org
Subject: Re: Latest version of Mapred
Date: Mon, 19 Dec 2005 19:00:29 +0100
mapred is
+1 - especially for amount of support Stefan gives to nutch users.
P.
Andrzej Bialecki wrote:
Hi,
During the past year and more Stefan participated actively in the
development, and contributed many high-quality patches. He's been
spending considerable effort on addressing many issues in JIRA, an
Stefan Groschupf wrote:
OK I will do that tomorrow!
However, if it is known to be buggy, we should perhaps not set it up as
the default HTTP protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard config
The same problem on FreeBSD 6.0 + jdk1.4.2
I think it was also reported some time ago by Rod Taylor.
Switch to protocol-http.
SG> Hi there,
SG> is there someone out there that can confirm a problem we discovered?
SG> We were wondering why not all pages of a generated segment were
SG> fetched.
OK I will do that tomorrow!
However, if it is known to be buggy, we should perhaps not set it up as
the default HTTP protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard configuration.
Am 19.12.2005 u
Stefan Groschupf wrote:
Anyway, today we noticed that when fetching with http-client the sum of
errors and fetched pages is much less than the size defined when
generating the segment.
Changing to protocol-http solves the problem.
Has anyone else noticed this behavior?
I haven't, but this plugi
Hi there,
is there someone out there that can confirm a problem we discovered?
We were wondering why not all pages of a generated segment were
fetched. The strangest thing was that the sum of errors and
successful pages was never the same as the topN we defined when
generating a segment.
F
mapred is now trunk...
On 19.12.2005 at 18:46, Rafi Iz wrote:
Hi all,
I am currently working with Nutch 0.7.1 and want to start using
mapred. Any idea where I can find the latest version?
BTW, I looked at the path:
http://svn.apache.org/repos/asf/lucene/nutch/branches/
but the only d
Hi all,
I am currently working with Nutch 0.7.1 and want to start using
mapred. Any idea where I can find the latest version?
BTW, I looked at the path:
http://svn.apache.org/repos/asf/lucene/nutch/branches/
but the only directory that exists there is branch-0.7/
Thanks,
Raffi