This is awesome, Julien :) Thanks for sharing!!

On Mon, Sep 16, 2013 at 9:43 AM, Julien Nioche <
[email protected]> wrote:

> Guys,
>
> Following the discussion we had some time ago about comparing 1.x with 2.x,
> we did some tests and put the results on
>
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
> Feel free to comment.
>
> Best,
>
> Julien
>
>
> On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:
>
> > I am sure that Renato (if he is watching) can maybe plug in as well.
> > We find in Gora that for native Hadoop stores such as Avro, HBase and
> > Accumulo, when we execute a query with GoraInputFormat via getPartitions
> > we retrieve GoraInputSplits natively, which means splits are obtained
> > for MapReduce jobs... such as many of the jobs we run in Nutch as well.
> > On the other hand (currently) stores such as Cassandra and web service
> > stores such as DynamoDB do not support Hadoop out of the box (the former
> > we are working on and hope to have implemented in Gora soon), therefore
> > it is not as simple to get partitions in the same way we would in a
> > Hadoop native store. We therefore obtain one partition to be used as an
> > InputSplit for the MR job. This is certainly an area for concern and
> > right now a bottleneck for some operations. We continue to work on this.
> >
> >
> > On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
> > > Hi Otis
> > >
> > > Definitely *not* the fetching speed. Actually everything but *not* the
> > > fetching speed. The fetcher is pretty much the same as 1.x and anyway
> > > the performance with fetching is pretty much always limited by the
> > > politeness settings, not the implementation.
> > >
> > > Re backends: some backend implementations are more mature than others.
> > > The one for HBase is probably the one most widely used, the Cassandra
> > > one has been greatly improved, in particular performance-wise, the SQL
> > > one is broken, etc. We need to measure this as this is just a gut
> > > feeling at this stage.
> > >
> > > Now for what is slower and why, again this has to be measured but I
> > > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > > entries is not done by the backends (some might provide a way of doing
> > > it) but on the client side, when we create the input for mapred. In
> > > other words we pull things from the backend just to discard them. Since
> > > 2.x does not have segments like 1.x (which the fetch + parse mapreduce
> > > jobs take as single input) we scan the whole table even if we want to
> > > fetch or parse a handful of entries.
> > >
> > > On the other hand, 2.x specifies what columns to retrieve for a given
> > > job, whereas 1.x will for instance deserialize the crawldatum entirely.
> > > The metadata objects are costly to read/write so 2.x might have the
> > > upper hand from that point of view since it pulls and deserializes only
> > > what it needs.
> > >
> > > Finally the most costly steps in a large crawl in 1.x are the
> > > generation and update as we have to read/write the crawldb entirely.
> > > The way the updates are done in 2.x is different and should be a lot
> > > faster.
> > >
> > > Please could anyone correct me if I am wrong. Some of this is based on
> > > my understanding of 2.x, which dates back quite a while, and some of
> > > the stuff might have changed in the meantime. The performance would
> > > probably vary a lot based on the fine tuning of each backend
> > > implementation but having some basic comparison would confirm some of
> > > the assertions above.
> > >
> > > Julien
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/GORA-119
> > >
> > >
> > >> Julien, could you please elaborate a bit about your comment about speed
> > >> depending on the backend used?
> > >>
> > >> Yes, you were the person I was referring to :)
> > >>
> > >> Oh, and I *believe* you said it was the fetching speed that was
> > >> different between 1.x and 2.x.  Is that right?  Or is some other phase
> > >> slower in 2.x?
> > >>
> > >> Thanks,
> > >> Otis
> > >> ----
> > >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> > >> http://sematext.com/spm
> > >>
> > >>
> > >>
> > >>
> > >> >________________________________
> > >> > From: Julien Nioche <[email protected]>
> > >> >To: "[email protected]" <[email protected]>
> > >> >Sent: Tuesday, August 6, 2013 10:54 AM
> > >> >Subject: Re: 2.x vs. 1.x speed
> > >> >
> > >> >
> > >> >Hi Otis,
> > >> >
> > >> >That certainly depends on the backend used but on the whole it
> > >> >wouldn't be surprising. Would be good to have some data to
> > >> >substantiate it. I am planning to put my intern on the case and have
> > >> >some basic comparison as soon as she gets a good grip of Hadoop /
> > >> >Nutch etc... but if someone else wants to do it please go ahead.
> > >> >
> > >> >In case I happen to be the person who told you that Otis, well at
> > >> >least I am consistent ;-)
> > >> >
> > >> >Julien
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> At some point earlier this year I spoke to a person who told me 2.x
> > >> >> is (a little?) slower than 1.x.  Is that still the case?
> > >> >>
> > >> >> Thanks,
> > >> >> Otis
> > >> >> --
> > >> >> Solr & ElasticSearch Support -- http://sematext.com/
> > >> >> Performance Monitoring -- http://sematext.com/spm
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> >--
> > >> >*
> > >> >*Open Source Solutions for Text Engineering
> > >> >
> > >> >http://digitalpebble.blogspot.com/
> > >> >http://www.digitalpebble.com
> > >> >http://twitter.com/digitalpebble
> > >> >
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
> > --
> > *Lewis*
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
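
To make the client-side filtering point from Julien's August 7 message
(GORA-119) concrete, here is a minimal toy sketch in Python. It is not Gora or
Nutch code: Row, ToyStore, scan, scan_filtered and rows_transferred are all
hypothetical names invented for illustration, and the in-memory list only
stands in for a backend table. It simply counts how many rows leave the
"backend" when the filter runs on the client versus inside the store.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    url: str
    status: str                      # e.g. "unfetched" or "fetched"
    metadata: dict = field(default_factory=dict)


class ToyStore:
    """Hypothetical in-memory stand-in for a backend table (not the Gora API)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.rows_transferred = 0    # rows handed over to the client

    def scan(self):
        # Full table scan: every row is shipped to the client.
        for row in self.rows:
            self.rows_transferred += 1
            yield row

    def scan_filtered(self, predicate):
        # Filter applied inside the store: only matching rows are shipped.
        for row in self.rows:
            if predicate(row):
                self.rows_transferred += 1
                yield row


def client_side_select(store, predicate):
    # Pull everything from the store, then discard what is not needed.
    return [row for row in store.scan() if predicate(row)]


def backend_side_select(store, predicate):
    # Let the store do the filtering before anything is transferred.
    return list(store.scan_filtered(predicate))
```

Both functions return the same rows; the difference is only in
rows_transferred, which grows with the table size in the first case and with
the number of matching entries in the second.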
