[Nutch-dev] Re: NDFS Connection reset

2005-12-08 Thread Paul Baclace
Jack Tang wrote: It was odd that when I input every command, the NameNode would throw exception: 051206 003714 Server connection on port 9000 from 127.0.0.1: starting 051206 003715 Server connection on port 9000 from 127.0.0.1 caught: java.net.SocketException: Connection reset java.net.SocketExc

[Nutch-dev] Should nutch try to reduce first?

2005-12-08 Thread Rod Taylor
When you run multiple commands within Nutch, it seems to process the pending tasks in the order that they were added to the queue. In some cases this means you may be 50% through many jobs (map complete but not reduce) while it processes maps for yet more jobs. I think Nutch should prioritize a pendi

[Nutch-dev] nutch questions

2005-12-08 Thread Ken van Mulder
Hey folks, We're looking at launching a search engine in the beginning of the new year that will eventually grow to being a multi-billion page index. Three questions: First, and most important for now, does anyone have any useful numbers for what the hardware requirements are to run such an

[Nutch-dev] [jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359770 ] Jerome Charron commented on NUTCH-133: Doug, Oh, yes, I understand what you mean. Yes, that really makes sense. I will commit a patch of Content for this (and removing all

[Nutch-dev] Re: Lucene performance bottlenecks

2005-12-08 Thread Doug Cutting
Doug Cutting wrote: Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers. In particular, one could: 1. Create an array of int[ma
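
The renumbering idea in the snippet above can be illustrated with a minimal sketch. It is not Lucene code and not necessarily the procedure the truncated message goes on to describe; the function name `renumber_by_boost` and the sample boosts are hypothetical. It only shows how an old-to-new document number mapping could be built so that highly boosted documents get low numbers.

```python
def renumber_by_boost(boosts):
    """Return old_to_new, where old_to_new[old_doc] is the new document number.

    boosts[i] is the boost of the document currently numbered i
    (a hypothetical input, not something read from a real index).
    """
    # Order the old document numbers by boost, highest first.
    # sorted() is stable, so documents with equal boosts keep their relative order.
    order = sorted(range(len(boosts)), key=lambda doc: -boosts[doc])

    # Invert the ordering: position in `order` becomes the new document number.
    old_to_new = [0] * len(boosts)
    for new_doc, old_doc in enumerate(order):
        old_to_new[old_doc] = new_doc
    return old_to_new


boosts = [0.5, 2.0, 1.0, 2.0]
mapping = renumber_by_boost(boosts)
print(mapping)
```

With such a mapping, every posting list would have to be rewritten with the new numbers and re-sorted, which is the expensive part that this sketch leaves out.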

[Nutch-dev] Re: Lucene performance bottlenecks

2005-12-08 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when for any query the response time is well below 1 second. Otherwise the service seems sluggish. Response times over 3 seconds are normally not acceptable. It depends. Clearl

[Nutch-dev] [jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359753 ] Doug Cutting commented on NUTCH-133: Stefan, The primary reason to keep classes and method names the same is to simplify the evaluation of your patch. A good patch should

[Nutch-dev] Re: Lucene performance bottlenecks

2005-12-08 Thread Doug Cutting
Andrzej Bialecki wrote: Hmm... Please define what "adequate" means. :-) IMHO, "adequate" is when for any query the response time is well below 1 second. Otherwise the service seems sluggish. Response times over 3 seconds are normally not acceptable. It depends. Clearly an average response ti

[Nutch-dev] Re: Lucene performance bottlenecks

2005-12-08 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: Hi, Some time ago I started to think about implementing a special kind of Lucene Query (if I remember correctly I would have to write my own Scorer and probably a few other classes) optimized for Nutch. I assumed that with a specialized query I would be able to avoid accessing

[Nutch-dev] [jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359729 ] Jerome Charron commented on NUTCH-133: Stefan: Taking a closer look at the ParserFactory patch: 1. You can use the MimeType.clean(String) static method to clean the cont

[Nutch-dev] Re: Lucene performance bottlenecks

2005-12-08 Thread Piotr Kosiorowski
Hi, Some time ago I started to think about implementing a special kind of Lucene Query (if I remember correctly I would have to write my own Scorer and probably a few other classes) optimized for Nutch. I assumed that with a specialized query I would be able to avoid accessing some of the Lucene index structu

[Nutch-dev] [jira] Closed: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=all ] Stefan Groschupf closed NUTCH-133: Resolution: Won't Fix. We will split the problems described here into a set of bugs to fix things step by step. > ParserFactory does not work as expect

[Nutch-dev] [jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359725 ] Stefan Groschupf commented on NUTCH-133: Doug, OK, I will split things into different patches and open a set of new bugs. Jerome: If you take a careful look at my pat

[Nutch-dev] about the question of clustering-carrot2

2005-12-08 Thread charlie
Dear all, Currently I'm using the Nutch plug-in "clustering-carrot2" and would like to ask for some help. When I build the search result clusters, only the search results that occur twice or more are grouped into one cluster. At the same time, if some results (keywords) only occur once, i

[Nutch-dev] [jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359717 ] Jerome Charron commented on NUTCH-133: I think Doug's proposal is a good way of solving this content-type issue. This solution just misses the "guess mime-type from file ext

[Nutch-dev] Re: Lucene performance bottlenecks

2005-12-08 Thread Andrzej Bialecki
(Moving the discussion to nutch-dev, please drop the cc: when responding) Doug Cutting wrote: Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by som