Re: Configuration improvements to GeneratorJob

2013-02-24 Thread feng lu
@Tejas +1. I think: Keep properties: generate.max.count (keep it because it is still used by GeneratorJob, Reducer) and GENERATOR_MAX_COUNT. Deprecate properties: GENERATOR_MIN_SCORE and GENERATOR_COUNT_VALUE_IP. Add in nutch-default.xml -

Re: Configuration improvements to GeneratorJob

2013-02-24 Thread Tejas Patil
Hi Lewis, We have not come to a conclusion on this topic. Here is what I propose: 1. Keep "generate.max.count". 2. GENERATOR_MIN_SCORE and GENERATOR_MAX_COUNT: once we know whether they were kept back in 2.x for some valid reason, we can safely remove these params. These seem to do not

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread Tejas Patil
Hi Lufeng, On Wed, Feb 20, 2013 at 9:19 PM, feng lu wrote: > Hi Tejas > > Yes, you are right. I misread the description of property > "generate.count.mode". I'm so sorry, I also did not find any information > about why the IP based counting mode of "generate.count.mode" was disabled. > > Yes, i s

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread feng lu
Hi Tejas, Yes, you are right. I misread the description of property "generate.count.mode". I'm so sorry, I also did not find any information about why the IP based counting mode of "generate.count.mode" was disabled. Yes, I see that the FetchEntryPartitioner class (combination of URLPartitioner) is

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread Tejas Patil
Hi Lufeng, On Wed, Feb 20, 2013 at 7:16 PM, feng lu wrote: > Hi Lewis > > Sorry, I was wrong: the GeneratorJob is only used in Nutch 2.x, not 1.x. > > As for the property GENERATOR_COUNT_VALUE_IP, I think we can add a patch to > GeneratorJob instead of deprecating it. The patch may look like this. > > if (G

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread feng lu
Hi Lewis, Sorry, I was wrong: the GeneratorJob is only used in Nutch 2.x, not 1.x. As for the property GENERATOR_COUNT_VALUE_IP, I think we can add a patch to GeneratorJob instead of deprecating it. The patch may look like this. if (GENERATOR_COUNT_VALUE_HOST.equalsIgnoreCase(mode)) { getConf().set(URL

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread Tejas Patil
Hey Lewis, On Wed, Feb 20, 2013 at 1:05 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi, > Following on from a discussion on user@ I dived into the GeneratorJob > code and have the following general comment based on my observation... > Usage of configuration options is really un

Re: Configuration improvements to GeneratorJob

2013-02-20 Thread feng lu
Hi Lewis, I think generate.max.count is used by those who want to limit the number of URLs per domain (host). See http://wiki.apache.org/nutch/Nutch2Crawling#Reducer The generate.min.score property is already defined in nutch-default.xml. The generate.(filter|normalise|topN) can be passed through
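
For illustration, the per-host limit described in this message could be written as a nutch-site.xml override along the following lines. This is only a sketch: the property names are the ones mentioned in this thread, but the values (100, host, 0) are made-up examples, and whether GeneratorJob in 2.x actually honours each property is the open question being discussed here.

```xml
<?xml version="1.0"?>
<!-- Hypothetical nutch-site.xml override; all values are illustrative only. -->
<configuration>
  <property>
    <name>generate.max.count</name>
    <!-- Assumed example: cap each host (or domain) at 100 URLs per fetchlist. -->
    <value>100</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <!-- Assumed example: count URLs per host rather than per domain. -->
    <value>host</value>
  </property>
  <property>
    <name>generate.min.score</name>
    <!-- Assumed example: only select entries with a score of at least 0. -->
    <value>0</value>
  </property>
</configuration>
```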

Configuration improvements to GeneratorJob

2013-02-20 Thread Lewis John Mcgibbney
Hi, Following on from a discussion on user@ I dived into the GeneratorJob code and have the following general comment based on my observation... Usage of configuration options is really unstructured and loosely applied. This should not be the case. For example Observations === nutch-defau