Re: 2.x vs. 1.x speed

Renato Marroquín Mogrovejo Mon, 16 Sep 2013 11:10:04 -0700

Thanks for sharing, Julien! These are indeed interesting results.
Just a quick question: did you run this on a single server, or did you
set up a minimum number of servers for it? I ask because the latency of
HBase or Cassandra improves when we scale them out.



Renato M.


2013/9/16 Markus Jelsma <[email protected]>

> Thanks! That was interesting.
>
> -----Original message-----
> From: Julien Nioche<[email protected]>
> Sent: Monday 16th September 2013 18:45
> To: [email protected]; [email protected]
> Cc: Otis Gospodnetic <[email protected]>
> Subject: Re: 2.x vs. 1.x speed
>
> Guys,
>
> Following the discussion we had some time ago about comparing 1.x with
> 2.x, we did some tests and put the results on
>
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
> Feel free to comment.
>
> Best,
>
> Julien
>
> On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:
>
> I am sure that Renato (if he is watching) can maybe plug in as well.
>
> In Gora, for native Hadoop stores such as Avro, HBase and Accumulo, when we
> execute a query with GoraInputFormat, getPartitions retrieves
> GoraInputSplits natively, which means splits are obtained for MapReduce
> jobs... such as many of the jobs we run in Nutch. On the other hand,
> stores such as Cassandra and web service stores such as DynamoDB
> (currently) do not support Hadoop out of the box (the former we are
> working on and hope to have implemented in Gora soon), so it is not as
> simple to get partitions in the same way we would in a Hadoop native
> store. We therefore obtain one partition to be used as an InputSplit for
> the MR job. This is certainly an area for concern and right now a
> bottleneck for some operations. We continue to work on this.
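
To make the partition point above concrete, here is a minimal sketch in plain Python. It is not Gora code: `Partition`, `partitions_from_regions`, `single_partition` and `map_tasks` are hypothetical names made up for illustration, and the only assumption is that a MapReduce job gets one input split, and hence one map task, per partition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    """A contiguous key range [start, end); None means unbounded."""
    start: Optional[str]
    end: Optional[str]


def partitions_from_regions(boundaries: List[str]) -> List[Partition]:
    """Store that knows its key ranges: one partition per range."""
    edges = [None, *sorted(boundaries), None]
    return [Partition(lo, hi) for lo, hi in zip(edges, edges[1:])]


def single_partition() -> List[Partition]:
    """Store that cannot report key ranges: one partition covers everything."""
    return [Partition(None, None)]


def map_tasks(partitions: List[Partition]) -> int:
    """Assumption: one input split, hence one map task, per partition."""
    return len(partitions)


# Three hypothetical range boundaries give four key ranges.
native = partitions_from_regions(["g", "n", "t"])
fallback = single_partition()
print(map_tasks(native), map_tasks(fallback))
```

With a single partition the whole table is read by one map task, however many nodes the cluster has, which is the bottleneck described above.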
>
> On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
>
> > Hi Otis
>
> >
>
> > Definitely *not* the fetching speed. Actually, everything but the
> > fetching speed. The fetcher is pretty much the same as in 1.x and anyway
> > the performance with fetching is pretty much always limited by the
> > politeness settings, not the implementation.
>
> >
>
> > Re backend: some backend implementations are more mature than others. The
> > one for HBase is probably the most widely used, the Cassandra one has
> > been greatly improved, in particular performance-wise, the SQL one is
> > broken, etc. We need to measure this, as this is just a gut feeling at
> > this stage.
>
> >
>
> > Now for what is slower and why: again this has to be measured, but I
> > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > entries is not done by the backends (some might provide a way of doing it)
> > but on the client side, when we create the input for mapred. In other
> > words, we pull things from the backend just to discard them. Since 2.x does
> > not have segments like 1.x (which the fetch + parse mapreduce jobs take as
> > single input), we scan the whole table even if we want to fetch or parse
> > only a handful of entries.
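
A small sketch of the difference described above, in plain Python rather than Nutch or Gora code. The table, the `status` field and both function names are hypothetical; the second function only models what a backend-side filter would transfer and says nothing about how any particular store implements it.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def scan_then_filter(table: Iterable[Row],
                     wanted: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Client-side filtering: every row is transferred, then most are dropped."""
    transferred = 0
    kept = []
    for row in table:
        transferred += 1
        if wanted(row):
            kept.append(row)
    return kept, transferred


def filter_in_backend(table: Iterable[Row],
                      wanted: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Hypothetical push-down: only matching rows leave the backend."""
    kept = [row for row in table if wanted(row)]
    return kept, len(kept)


# 1000 hypothetical rows, of which only 5 are due for fetching.
table = [{"url": f"http://example.com/{i}",
          "status": "generated" if i < 5 else "fetched"}
         for i in range(1000)]


def is_due(row: Row) -> bool:
    return row["status"] == "generated"


print(scan_then_filter(table, is_due)[1], filter_in_backend(table, is_due)[1])
```

Both calls select the same five rows; the difference is only in how many rows have to be pulled to find them.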
>
> >
>
> > On the other hand, 2.x specifies which columns to retrieve for a given job,
> > whereas 1.x will, for instance, deserialize the crawldatum entirely. The
> > metadata objects are costly to read/write, so 2.x might have the upper hand
> > from that point of view since it pulls and deserializes only what it needs.
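
The column point above can be sketched the same way. This is illustrative Python only: the row layout, the column names and `read_columns` are hypothetical, and JSON stands in for whatever serialization the real stores use.

```python
import json
from typing import Dict, Iterable, Tuple

# One stored row: each column is kept as its own serialized value.
StoredRow = Dict[str, str]


def read_columns(row: StoredRow, columns: Iterable[str]) -> Tuple[dict, int]:
    """Deserialize only the requested columns; return values and bytes parsed."""
    values = {}
    parsed = 0
    for name in columns:
        raw = row[name]
        values[name] = json.loads(raw)
        parsed += len(raw)
    return values, parsed


# Hypothetical row with two small fields and one bulky metadata map.
row = {
    "status": json.dumps("fetched"),
    "fetchTime": json.dumps(1379354400),
    "metadata": json.dumps({f"key{i}": "v" * 50 for i in range(100)}),
}

needed, small = read_columns(row, ["status", "fetchTime"])
everything, large = read_columns(row, row.keys())
print(small < large)
```

A job that only needs `status` and `fetchTime` never touches the metadata, whereas reading the whole record pays for it every time.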
>
> >
>
> > Finally the most costly steps in a large crawl in 1.x are the generation
> > and update as we have to read/write the crawldb entirely. The way the
> > updates are done in 2.x is different and should be a lot faster.
>
> >
>
> > Please could anyone correct me if I am wrong. Some of this is based on my
> > understanding of 2.x, which dates back quite a while, and some of the
> > stuff might have changed in the meantime. The performance would probably
> > vary a lot based on the fine tuning of each backend implementation but
> > having some basic comparison would confirm some of the assertions above.
>
> >
>
> > Julien
>
> >
>
> >
>
> > [1] https://issues.apache.org/jira/browse/GORA-119
>
> >
>
> >
>
> >> Julien, could you please elaborate a bit about your comment about speed
> >> depending on the backend used?
> >>
> >> Yes, you were the person I was referring to :)
> >>
> >> Oh, and I *believe* you said it was the fetching speed that was different
> >> between 1.x and 2.x.  Is that right?  Or is some other phase slower in 2.x?
>
> >>
>
> >> Thanks,
>
> >> Otis
>
> >> ----
>
> >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
>
> >> http://sematext.com/spm <http://sematext.com/spm>
>
> >>
>
> >>
>
> >>
>
> >>
>
> >> >________________________________
>
> >> > From: Julien Nioche <[email protected]>
>
> >> >To: "[email protected]" <[email protected]>
>
> >> >Sent: Tuesday, August 6, 2013 10:54 AM
>
> >> >Subject: Re: 2.x vs. 1.x speed
>
> >> >
>
> >> >
>
> >> >Hi Otis,
>
> >> >
>
> >> >That certainly depends on the backend used but on the whole it wouldn't be
> >> >surprising. Would be good to have some data to substantiate it. I am
> >> >planning to put my intern on the case and have some basic comparison as
> >> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
> >> >wants to do it please go ahead.
> >> >
> >> >In case I happen to be the person who told you that Otis, well at least I
> >> >am consistent ;-)
>
> >> >
>
> >> >Julien
>
>
> >> >On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
>
> >> >
>
> >> >> Hello,
>
> >> >>
>
> >> >> At some point earlier this year I spoke to a person who told me 2.x is
> >> >> (a little?) slower than 1.x.  Is that still the case?
>
> >> >>
>
> >> >> Thanks,
>
> >> >> Otis
>
> >> >> --
>
> >> >> Solr & ElasticSearch Support -- http://sematext.com/ <
> http://sematext.com/>
>
> >> >> Performance Monitoring -- http://sematext.com/spm <
> http://sematext.com/spm>
>
> >> >>
>
> >> >
>
> >> >
>
> >> >
>
> >> >--
>
> >> >*
>
> >> >*Open Source Solutions for Text Engineering
>
> >> >
>
> >> >http://digitalpebble.blogspot.com/ <http://digitalpebble.blogspot.com/
> >
>
> >> >http://www.digitalpebble.com <http://www.digitalpebble.com>
>
> >> >http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
>
> >> >
>
> >> >
>
> >> >
>
> >>
>
> >
>
> >
>
> >
>
> > --
>
> > *
>
> > *Open Source Solutions for Text Engineering
>
> >
>
> > http://digitalpebble.blogspot.com/ <http://digitalpebble.blogspot.com/>
>
> > http://www.digitalpebble.com <http://www.digitalpebble.com>
>
> > http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
>
> >
>
> --
>
> *Lewis*
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/ <http://digitalpebble.blogspot.com/>
> http://www.digitalpebble.com <http://www.digitalpebble.com>
> http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
>
>
>
