Lewis,

The point is whether the filtering is done on the backend side (e.g. using queries, indices, etc.) and the results then passed on to MapReduce via GORA, or whether, as I assume from looking at the code, the filtering happens within the MapReduce job itself, which means that all the entries are pulled from the backend anyway. This makes quite a difference in terms of performance: think of a large webtable which would have to be passed to MapReduce in its entirety even if only a handful of entries are to be processed.
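To make the difference concrete, here is a self-contained sketch. Everything in it (FilterPlacementDemo, WebRow, the two scan methods, the row counts) is invented for illustration and is not the Gora or HBase API; the point is simply how many rows cross the wire in each approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: none of these names exist in Nutch, Gora or
// HBase. It contrasts backend-side filtering (only matching rows leave
// the store) with client-side filtering (every row is pulled, then most
// are discarded in the mapper).
public class FilterPlacementDemo {

    static class WebRow {
        final String url;
        final String generateMark; // null means "not yet generated"
        WebRow(String url, String generateMark) {
            this.url = url;
            this.generateMark = generateMark;
        }
    }

    // Simulated webtable: 1000 rows, only 3 of which still need generating.
    static List<WebRow> webtable() {
        List<WebRow> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String mark = (i < 3) ? null : "batch-1";
            rows.add(new WebRow("http://example.org/page" + i, mark));
        }
        return rows;
    }

    // Backend-side filtering: the predicate is evaluated by the store,
    // so only matching rows are handed to the MapReduce job.
    static int backendSideScan(Predicate<WebRow> filter) {
        int transferred = 0;
        for (WebRow row : webtable()) {
            if (filter.test(row)) transferred++; // only these cross the wire
        }
        return transferred;
    }

    // Client-side filtering: every row is pulled out of the store and the
    // mapper discards the ones it does not want.
    static int clientSideScan(Predicate<WebRow> filter) {
        int transferred = 0;
        for (WebRow row : webtable()) {
            transferred++;                 // every row crosses the wire ...
            if (!filter.test(row)) continue; // ... then most are dropped here
        }
        return transferred;
    }

    public static void main(String[] args) {
        Predicate<WebRow> needsGenerating = row -> row.generateMark == null;
        System.out.println("backend-side rows transferred: "
                + backendSideScan(needsGenerating));
        System.out.println("client-side rows transferred:  "
                + clientSideScan(needsGenerating));
    }
}
```

With a webtable of a billion rows and a handful of due URLs, that gap is exactly the performance difference described above.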
Makes sense?

Julien

On 21 February 2013 01:52, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:

> Those filters are applied only to URLs which do not have a null
> GENERATE_MARK, e.g.
>
>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>       if (GeneratorJob.LOG.isDebugEnabled()) {
>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>       }
>       return;
>     }
>
> Therefore filters will be applied to all URLs which have a null
> GENERATE_MARK value.
>
> On Wed, Feb 20, 2013 at 2:45 PM, <alx...@aim.com> wrote:
>
> > Hi,
> >
> > Are those filters put on all data selected from hbase or sent to hbase as
> > filters to select a subset of all hbase records?
> >
> > Thanks.
> > Alex.
> >
> > -----Original Message-----
> > From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> > To: user <user@nutch.apache.org>
> > Sent: Wed, Feb 20, 2013 12:56 pm
> > Subject: Re: nutch with cassandra internal network usage
> >
> > Hi Alex,
> >
> > On Wed, Feb 20, 2013 at 11:54 AM, <alx...@aim.com> wrote:
> >
> > > The generator also does not have filters. Its mapper goes over all
> > > records as far as I know. If you use hadoop you can see how many
> > > records go as input to mappers. Also see this
> >
> > I don't think this is true. The GeneratorMapper filters URLs before
> > selecting them for inclusion based on the following
> > - distance
> > - URLNormalizer(s)
> > - URLFilter(s)
> > in that order.
> > I am going to start a new thread on improvements to the GeneratorJob
> > regarding better configuration as it is a crucial stage in the crawl
> > process.
> >
> > So the issue here, as you correctly explain, is with the Fetcher
> > obtaining the URLs which have been marked with a desired batchId. This
> > would be done via scanner in Gora.
>
> --
> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
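The filter ordering Lewis describes (distance check, then normalizer(s), then filter(s)) can be sketched as below. GeneratorFilterChain, MAX_DISTANCE and the sample rules are invented for illustration; the real GeneratorMapper wires in Nutch's URLNormalizers and URLFilters plugins instead, but the stages do share this contract of returning the (possibly rewritten) URL, or null to reject it.

```java
// Hypothetical sketch of the generator's per-URL selection ordering;
// not the actual Nutch implementation.
public class GeneratorFilterChain {

    static final int MAX_DISTANCE = 2; // sample limit, configurable in Nutch

    // Returns the possibly rewritten URL, or null if the URL is rejected.
    static String select(String url, int distance) {
        if (distance > MAX_DISTANCE) return null; // 1. distance check
        url = url.toLowerCase();                  // 2. normalizer (sample rule)
        if (url.endsWith(".jpg")) return null;    // 3. filter (sample rule)
        return url;
    }

    public static void main(String[] args) {
        System.out.println(select("HTTP://Example.org/A", 1));        // normalized, kept
        System.out.println(select("http://example.org/pic.jpg", 1)); // rejected by filter
        System.out.println(select("http://example.org/deep", 5));    // rejected by distance
    }
}
```

Note that, per the thread above, this whole chain runs inside the mapper: every row the scanner returns still reaches it, which is exactly why where the scan itself is filtered matters.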