[Nutch-dev] Problem with search engine

2005-03-08 Thread YourSoft
Dear List! I have a problem with the search engine. I have ~900 000 pages in my test db. When I search for "notebook", the first 4 hits are these pages: http://www.notebook.hu/catalogue/notebook/ http://www.notebook.hu/nb_search/ http://hp-renew.outlet.laptop.notebook.hu/ http://gigalan.notebook.hu

Re: [Nutch-dev] Plugins - sum up

2005-03-08 Thread John X
> Please confirm this little sum up and tutorial for plugins. > > (1) > PARSING plugins : allow to parse different kinds of mime types -> html, > text, pdf, msword, mp3, rtf > ** parse-ext ** is a wrapper ... what can it do ? Here is a description: http://nutch.neasys.com/patch/20040703/note.txt

RE: [Nutch-dev] Nutch Crawler !!!

2005-03-08 Thread Daniel Drazner
Thanks a lot. I also started to run Nutch in debug mode. It's an interesting experience, but any technical documentation would definitely save me some time. Will wait to see what others have to add here. Thanks, Daniel -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of

Re: [Nutch-dev] Re: NameNode scalibility

2005-03-08 Thread Michael Cafarella
Angel, Much of what you're seeing is part of the replication problem. 1) The "Replicated " message is when a successful replication happens. It's not surprising that you see a lot of them. 2) The "Block XX is valid, and cannot be written to" happens when one node tries to replicate

[Nutch-dev] getting a list of matching URLs from a start URL

2005-03-08 Thread Fabrice Estiévenart
Hello, From a list of start URLs (each associated with a regular expression), I'd like to get - for each start URL - all URLs that come from the same domain and that match the expression...I don't wanna analyse or index the URLs, just to write them down in a flat file. Example : start URL : htt

Re: [Nutch-dev] Re: NameNode scalibility

2005-03-08 Thread Angel Faus
Hi, Great. Thanks for the tips. I've tried the following startup sequences: * Start NameNode. Wait until CPU goes to 0. Wait 2 extra minutes. Start all DataNodes. * Start NameNode. Wait until CPU goes to 0. Wait 2 extra minutes. Start each DataNode with a 10-minute pause between them. * Star