(The message below was posted to nutch-dev a few days ago.) Can anyone
(anonymous or otherwise) confirm whether it's possible to use Nutch 0.7 for
a "4-6 billion page search engine"? Is this a typo or for real? Just curious
and if it's true what were the major issues e.g. time, RAM, (storage
presu
You can build an ngp profile for Chinese, but I think that in the language
identifier's current form it might not work that well.
We rewrote this plugin to use a more naive-Bayes-like identification
approach and got better results for Japanese. It wasn't proper Naive Bayes,
but it did work better.
I
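For anyone wondering what a naive-Bayes-like approach over character n-grams might look like, here is a minimal sketch. It is not the rewritten plugin mentioned above and does not use Nutch's ngp profile format; it is a generic textbook version with a uniform prior and add-one smoothing, and the names `char_ngrams` and `NgramNaiveBayes` are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the list of overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical illustration: multinomial naive Bayes over character n-grams."""

    def __init__(self, n=2):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram occurrences
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, lang, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(lang, Counter()).update(grams)
        self.totals[lang] = self.totals.get(lang, 0) + len(grams)
        self.vocab.update(grams)

    def classify(self, text):
        grams = char_ngrams(text, self.n)
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        best_lang, best_score = None, float("-inf")
        for lang, counter in self.counts.items():
            # Uniform prior over languages; add-one smoothed log-likelihood.
            score = sum(
                math.log((counter[g] + 1) / (self.totals[lang] + v))
                for g in grams
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

Training is just one `train(lang, text)` call per sample text, and `classify(text)` returns the language whose n-gram counts give the highest score. Real training data would need to be far larger than a sentence or two per language.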
Haven't seen anyone mention this on the lists yet, but it is probably of
interest to the community:
http://www.techcrunch.com/2006/12/23/wikipedia-to-launch-searchengine-exclusive-screenshot/
Hi,
What would be the best way to perform crawling with two different
user-agents so as to compare the pages (requested with the two different
agents) returned by a server and accept/reject the url (for subsequent
parsing/indexing etc.)?
I believe the Google crawler used to do (still does?) somet
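Outside of Nutch, the basic idea can be sketched as below: fetch the same URL once per user-agent and accept it only if the two bodies are similar enough. This is a standalone illustration, not a Nutch plugin or fetcher configuration. The function names, the `fetch` parameter and the 0.9 similarity threshold are all assumptions made for the example.

```python
import difflib
import urllib.request


def fetch_with_agent(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent header and return the body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def same_content(url, agent_a, agent_b, fetch=fetch_with_agent, threshold=0.9):
    """Return True if the pages served to both agents look alike.

    The threshold is an arbitrary assumption; dynamic pages (ads, timestamps)
    will never match exactly, so a strict equality check would reject too much.
    """
    body_a = fetch(url, agent_a)
    body_b = fetch(url, agent_b)
    ratio = difflib.SequenceMatcher(None, body_a, body_b).ratio()
    return ratio >= threshold
```

A crawler would call `same_content(url, crawler_agent, browser_agent)` before passing the page on to parsing and indexing, at the cost of fetching every URL twice.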
running everything on a single MP
machine w/o DFS.
Doug
e w wrote:
>
> Logging of the Fetcher output in 0.8-dev used to work (writing to the
> corresponding tasktracker output log) but doesn't appear to any more with
> the nightly build from a couple of weeks ago and also
Hi Sami,
In case it helps (since I've experienced the same issue): I'm running on a
multiple node setup and run dfs and the nutch commands the same way as Otis.
However, with my "fix" of hard-wiring the path of the hadoop.log file in
log4j.properties I get multiple machines and threads trying to write
sim
Thanks for pointing this out! I've sent 2 messages to the lists asking where
the Fetcher logs have disappeared to and no-one else seemed to be
experiencing this problem. Hardwiring the "log4j.appender.DRFA.File"
variable to a specified filename has solved this and the logs are back. If
anyone find
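For reference, the hardwiring described above amounts to a one-line change in conf/log4j.properties along these lines. The path shown is a hypothetical placeholder, not the one actually used.

```properties
# Assumption: /var/log/nutch/hadoop.log is a made-up example path.
# Point the DRFA appender at a fixed file instead of a variable-based one.
log4j.appender.DRFA.File=/var/log/nutch/hadoop.log
```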
Logging of the Fetcher output in 0.8-dev used to work (writing to the
corresponding tasktracker output log) but doesn't appear to any more with
the nightly build from a couple of weeks ago and also the one from last
night.
I've enabled DEBUG for the first 4 logging properties in
conf/log4j.proper