Thanks for sharing, Julien! These are indeed interesting results. Just a quick question: did you run this on a single server, or did you set up a minimum number of servers for it? I ask because the latency of HBase or Cassandra will improve if we scale them out.

Renato M.

2013/9/16 Markus Jelsma <[email protected]>

> Thanks! That was interesting.
>
> -----Original message-----
> From: Julien Nioche <[email protected]>
> Sent: Monday 16th September 2013 18:45
> To: [email protected]; [email protected]
> Cc: Otis Gospodnetic <[email protected]>
> Subject: Re: 2.x vs. 1.x speed
>
> Guys,
>
> Following the discussion we had some time ago about comparing 1.x with
> 2.x, we did some tests and put the results on
>
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
> Feel free to comment.
>
> Best,
>
> Julien
>
> On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:
>
> > I am sure that Renato (if he is watching) can maybe plug in as well.
> > We find in Gora that with stores that are Hadoop-native in every sense
> > of the word, such as Avro, HBase and Accumulo, when we execute a query
> > with GoraInputFormat via getPartitions we retrieve GoraInputSplits
> > natively, which means splits are obtained for MapReduce jobs... such as
> > many of the jobs we run in Nutch as well. On the other hand (currently)
> > stores such as Cassandra and Web service stores such as DynamoDB do not
> > support Hadoop out of the box (the former we are working on and hope to
> > have implemented in Gora soon), therefore it is not as simple to get
> > partitions in the same way we would in a Hadoop-native store. We
> > therefore obtain one partition to be used as an InputSplit for the MR
> > job. This is certainly an area for concern and right now a bottleneck
> > for some operations. We continue to work on this.
> >
> > On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
> >
> > > Hi Otis
> > >
> > > Definitely *not* the fetching speed. Actually everything but *not* the
> > > fetching speed. The fetcher is pretty much the same as 1.x and anyway
> > > the performance with fetching is pretty much always limited by the
> > > politeness settings, not the implementation.
> > >
> > > Re-backend: some backend implementations are more mature than others.
> > > The one for HBase is probably the one most widely used, the Cassandra
> > > one has been greatly improved, in particular performance-wise, the SQL
> > > one is broken etc... we need to measure this as this is just a gut
> > > feeling at this stage.
> > >
> > > Now for what is slower and why, again this has to be measured but I
> > > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > > entries is not done by the backends (some might provide a way of doing
> > > it) but on the client side, when we create the input for mapred. In
> > > other words we pull things from the backend just to discard them.
> > > Since 2.x does not have segments like 1.x (which the fetch + parse
> > > mapreduce jobs take as single input) we scan the whole table even if
> > > we want to fetch or parse a handful of entries.
> > >
> > > On the other hand, 2.x specifies what columns to retrieve for a given
> > > job, whereas 1.x will for instance deserialize the crawldatum
> > > entirely. The metadata objects are costly to read/write so 2.x might
> > > have the upper hand from that point of view since it pulls and
> > > deserializes only what it needs.
> > >
> > > Finally the most costly steps in a large crawl in 1.x are the
> > > generation and update as we have to read/write the crawldb entirely.
> > > The way the updates are done in 2.x is different and should be a lot
> > > faster.
> > >
> > > Please could anyone correct me if I am wrong. Some of this is based on
> > > my understanding of 2.x which dates back quite a while and some of the
> > > stuff might have changed in the meantime. The performance would
> > > probably vary a lot based on the fine tuning of each backend
> > > implementation but having some basic comparison would confirm some of
> > > the assertions above.
> > >
> > > Julien
> > >
> > > [1] https://issues.apache.org/jira/browse/GORA-119
> > >
> > >> Julien, could you please elaborate a bit about your comment about
> > >> speed depending on the backend used?
> > >>
> > >> Yes, you were the person I was referring to :)
> > >>
> > >> Oh, and I *believe* you said it was the fetching speed that was
> > >> different between 1.x and 2.x. Is that right? Or is some other phase
> > >> slower in 2.x?
> > >>
> > >> Thanks,
> > >> Otis
> > >> ----
> > >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> > >> http://sematext.com/spm
> > >>
> > >> >________________________________
> > >> > From: Julien Nioche <[email protected]>
> > >> >To: "[email protected]" <[email protected]>
> > >> >Sent: Tuesday, August 6, 2013 10:54 AM
> > >> >Subject: Re: 2.x vs. 1.x speed
> > >> >
> > >> >Hi Otis,
> > >> >
> > >> >That certainly depends on the backend used but on the whole it
> > >> >wouldn't be surprising. Would be good to have some data to
> > >> >substantiate it. I am planning to put my intern on the case and have
> > >> >some basic comparison as soon as she gets a good grip of Hadoop /
> > >> >Nutch etc... but if someone else wants to do it please go ahead.
> > >> >
> > >> >In case I happen to be the person who told you that Otis, well at
> > >> >least I am consistent ;-)
> > >> >
> > >> >Julien
> > >> >
> > >> >On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> At some point earlier this year I spoke to a person who told me
> > >> >> 2.x is (a little?) slower than 1.x. Is that still the case?
> > >> >>
> > >> >> Thanks,
> > >> >> Otis
> > >> >> --
> > >> >> Solr & ElasticSearch Support -- http://sematext.com/
> > >> >> Performance Monitoring -- http://sematext.com/spm
> > >> >
> > >> >--
> > >> >Open Source Solutions for Text Engineering
> > >> >
> > >> >http://digitalpebble.blogspot.com/
> > >> >http://www.digitalpebble.com
> > >> >http://twitter.com/digitalpebble
> >
> > --
> > *Lewis*
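
To make the quoted point about client-side filtering a bit more concrete, here is a toy sketch of the difference between scanning a whole table and reading a pre-built segment. It is only an illustration under stated assumptions: plain Python lists and dicts stand in for the storage, and every name in it (`make_table`, `select_client_side`, `select_from_segment`, the `due` flag) is hypothetical and not part of the Nutch or Gora APIs. It counts rows read, not real latency.

```python
# Toy model only: plain Python data structures stand in for the crawl
# storage. None of these names are Nutch or Gora APIs.

def make_table(n_urls, n_due):
    """Build a fake 2.x-style table; only the first n_due rows are due."""
    return [{"url": f"http://example.com/{i}", "due": i < n_due}
            for i in range(n_urls)]

def select_client_side(table):
    """2.x-style selection as described in the thread: every row is pulled
    from the backend and filtered on the client, so the number of rows
    read equals the size of the table."""
    rows_read = 0
    selected = []
    for row in table:
        rows_read += 1
        if row["due"]:
            selected.append(row["url"])
    return selected, rows_read

def select_from_segment(segment):
    """1.x-style selection: the segment already holds only the entries to
    process, so the number of rows read equals the size of the segment."""
    return list(segment), len(segment)

table = make_table(n_urls=100_000, n_due=10)
# In this toy model the segment is built once, at generate time.
segment = [row["url"] for row in table if row["due"]]

urls_2x, read_2x = select_client_side(table)
urls_1x, read_1x = select_from_segment(segment)
print(read_2x, read_1x)
```

Both paths select the same ten URLs; the only difference in this sketch is how many rows have to be touched to find them, which is the cost the thread attributes to the missing backend-side filtering.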

