Re: Nutch Programmer Wanted

2007-01-07 Thread e w
(The message below was posted to nutch-dev a few days ago.) Can anyone (anonymous or otherwise) confirm whether it's possible to use Nutch 0.7 for a "4-6 billion page search engine"? Is this a typo or for real? Just curious and if it's true what were the major issues e.g. time, RAM, (storage presu

Re: Nutch .81: the process to add a new analyzer ?

2007-01-07 Thread e w
You can build an ngp profile for Chinese, but I think that in the language identifier's current form it might not work that well. We rewrote this plugin to do a more naive-Bayes-like identification approach and got better results for Japanese. It wasn't proper Naive Bayes but did work better. I

New Wikipedia search engine using Nutch

2006-12-25 Thread e w
Haven't seen anyone mention this on the lists yet, but it is probably of interest to the community: http://www.techcrunch.com/2006/12/23/wikipedia-to-launch-searchengine-exclusive-screenshot/

Fetching with two different user agents

2006-11-13 Thread e w
Hi, What would be the best way to perform crawling with two different user-agents so as to compare the pages (requested with the two different agents) returned by a server and accept/reject the URL (for subsequent parsing/indexing etc.)? I believe the Google crawler used to do (still does?) somet
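
For illustration only, the comparison described in this question could be sketched outside Nutch roughly as follows. This is not how Nutch's Fetcher works and not an answer taken from the thread; the function names, the 0.9 threshold, and the use of a plain text-similarity ratio are all assumptions made for the sketch.

```python
import difflib
import urllib.request


def fetch(url, user_agent, timeout=10):
    """Fetch a URL with the given User-Agent header and return the body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def accept_url(url, agent_a, agent_b, threshold=0.9, fetcher=fetch):
    """Accept the URL only if both agents are served near-identical pages.

    The threshold is a hypothetical value; the fetcher is injectable so the
    decision logic can be exercised without network access.
    """
    page_a = fetcher(url, agent_a)
    page_b = fetcher(url, agent_b)
    return similarity(page_a, page_b) >= threshold
```

A call such as `accept_url(url, "MyCrawler/0.1", "Mozilla/5.0")` would then reject URLs whose content differs substantially between the two agents. Pages with rotating ads or timestamps would need a more tolerant comparison than a raw character-level ratio.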

Re: Problem with logging of Fetcher output in 0.8-dev

2006-08-23 Thread e w
running everything on a single MP machine w/o DFS. Doug e w wrote: > > Logging of the Fetcher output in 0.8-dev used to work (writing to the > corresponding tasktracker output log) but doesn't appear to any more with > the nightly build from a couple of weeks ago and also

Re: [Nutch-general] log4j.properties bug(?)

2006-08-12 Thread e w
Hi Sami, In case it helps (since I've experienced the same issue) I'm running on a multiple node setup and run dfs and the nutch commands same as Otis. However, with my "fix" of hard-wiring the path of the hadoop.log file in log4j.properties I get multiple machines and threads trying to write sim

Re: log4j.properties bug(?)

2006-08-12 Thread e w
Thanks for pointing this out! I've sent 2 messages to the lists asking where the Fetcher logs have disappeared to and no-one else seemed to be experiencing this problem. Hardwiring the "log4j.appender.DRFA.File" variable to a specified filename has solved this and the logs are back. If anyone find
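
As a rough illustration of the workaround described above, the hard-wired entry in conf/log4j.properties might look like the fragment below. Only the property name comes from the message; the path is a hypothetical placeholder.

```
# Hypothetical path: substitute a location that is writable on your machine.
log4j.appender.DRFA.File=/var/log/nutch/hadoop.log
```

On a multi-node setup, every process configured this way writes to the same filename, so per-node paths may be needed.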

Problem with logging of Fetcher output in 0.8-dev

2006-07-25 Thread e w
Logging of the Fetcher output in 0.8-dev used to work (writing to the corresponding tasktracker output log) but doesn't appear to any more with the nightly build from a couple of weeks ago and also the one from last night. I've enabled DEBUG for the first 4 logging properties in conf/log4j.proper