Lewis,

The point is whether the filtering is done on the backend side (e.g. using queries, indices, etc.) and the results then passed on to MapReduce via GORA, or whether, as I assume from looking at the code, the filtering happens within the MapReduce job itself, which means that all the entries are pulled from the backend anyway. This makes quite a difference in terms of performance: think of a large webtable which would have to be passed to MapReduce in its entirety even if only a handful of entries are to be processed.
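To make the difference concrete, here is a self-contained sketch. Everything in it (FilterPlacementDemo, WebRow, the two scan methods, the row counts) is invented for illustration and is not the Gora or HBase API; the point is simply how many rows cross the wire in each approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: none of these names exist in Nutch, Gora or
// HBase. It contrasts backend-side filtering (only matching rows leave
// the store) with client-side filtering (every row is pulled, then most
// are discarded in the mapper).
public class FilterPlacementDemo {

    static class WebRow {
        final String url;
        final String generateMark; // null means "not yet generated"
        WebRow(String url, String generateMark) {
            this.url = url;
            this.generateMark = generateMark;
        }
    }

    // Simulated webtable: 1000 rows, only 3 of which still need generating.
    static List<WebRow> webtable() {
        List<WebRow> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String mark = (i < 3) ? null : "batch-1";
            rows.add(new WebRow("http://example.org/page" + i, mark));
        }
        return rows;
    }

    // Backend-side filtering: the predicate is evaluated by the store,
    // so only matching rows are handed to the MapReduce job.
    static int backendSideScan(Predicate<WebRow> filter) {
        int transferred = 0;
        for (WebRow row : webtable()) {
            if (filter.test(row)) transferred++; // only these cross the wire
        }
        return transferred;
    }

    // Client-side filtering: every row is pulled out of the store and the
    // mapper discards the ones it does not want.
    static int clientSideScan(Predicate<WebRow> filter) {
        int transferred = 0;
        for (WebRow row : webtable()) {
            transferred++;                 // every row crosses the wire ...
            if (!filter.test(row)) continue; // ... then most are dropped here
        }
        return transferred;
    }

    public static void main(String[] args) {
        Predicate<WebRow> needsGenerating = row -> row.generateMark == null;
        System.out.println("backend-side rows transferred: "
                + backendSideScan(needsGenerating));
        System.out.println("client-side rows transferred:  "
                + clientSideScan(needsGenerating));
    }
}
```

With a webtable of a billion rows and a handful of due URLs, that gap is exactly the performance difference described above.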
Makes sense?

Julien

On 21 February 2013 01:52, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Those filters are applied only to URLs which do not have a null
> GENERATE_MARK, e.g.
>
>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>       if (GeneratorJob.LOG.isDebugEnabled()) {
>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>       }
>       return;
>     }
>
> Therefore filters will be applied to all URLs which have a null
> GENERATE_MARK value.
>
> On Wed, Feb 20, 2013 at 2:45 PM, <alx...@aim.com> wrote:
>
> > Hi,
> >
> > Are those filters put on all data selected from hbase or sent to hbase as
> > filters to select a subset of all hbase records?
> >
> > Thanks.
> > Alex.
> >
> > -----Original Message-----
> > From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> > To: user <user@nutch.apache.org>
> > Sent: Wed, Feb 20, 2013 12:56 pm
> > Subject: Re: nutch with cassandra internal network usage
> >
> > Hi Alex,
> >
> > On Wed, Feb 20, 2013 at 11:54 AM, <alx...@aim.com> wrote:
> >
> > > The generator also does not have filters. Its mapper goes over all
> > > records as far as I know. If you use hadoop you can see how many
> > > records go as input to mappers. Also see this
> >
> > I don't think this is true. The GeneratorMapper filters URLs before
> > selecting them for inclusion based on the following
> > - distance
> > - URLNormalizer(s)
> > - URLFilter(s)
> > in that order.
> > I am going to start a new thread on improvements to the GeneratorJob
> > regarding better configuration as it is a crucial stage in the crawl
> > process.
> >
> > So the issue here, as you correctly explain, is with the Fetcher
> > obtaining the URLs which have been marked with a desired batchId. This
> > would be done via scanner in Gora.
>
> --
> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
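The filter ordering Lewis describes (distance check, then normalizer(s), then filter(s)) can be sketched as below. GeneratorFilterChain, MAX_DISTANCE and the sample rules are invented for illustration; the real GeneratorMapper wires in Nutch's URLNormalizers and URLFilters plugins instead, but the stages do share this contract of returning the (possibly rewritten) URL, or null to reject it.

```java
// Hypothetical sketch of the generator's per-URL selection ordering;
// not the actual Nutch implementation.
public class GeneratorFilterChain {

    static final int MAX_DISTANCE = 2; // sample limit, configurable in Nutch

    // Returns the possibly rewritten URL, or null if the URL is rejected.
    static String select(String url, int distance) {
        if (distance > MAX_DISTANCE) return null; // 1. distance check
        url = url.toLowerCase();                  // 2. normalizer (sample rule)
        if (url.endsWith(".jpg")) return null;    // 3. filter (sample rule)
        return url;
    }

    public static void main(String[] args) {
        System.out.println(select("HTTP://Example.org/A", 1));        // normalized, kept
        System.out.println(select("http://example.org/pic.jpg", 1)); // rejected by filter
        System.out.println(select("http://example.org/deep", 5));    // rejected by distance
    }
}
```

Note that, per the thread above, this whole chain runs inside the mapper: every row the scanner returns still reaches it, which is exactly why where the scan itself is filtered matters.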