This is awesome, Julien :) Thanks for sharing!!

On Mon, Sep 16, 2013 at 9:43 AM, Julien Nioche <[email protected]> wrote:

> Guys,
>
> Following the discussion we had some time ago about comparing 1.x with
> 2.x, we did some tests and put the results on
>
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
> Feel free to comment.
>
> Best,
>
> Julien
>
> On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:
>
> > I am sure that Renato (if he is watching) can maybe plug in as well.
> >
> > We find in Gora that, for native Hadoop stores such as Avro, HBase and
> > Accumulo, when we execute a query with GoraInputFormat via
> > getPartitions we retrieve GoraInputSplits natively, which means splits
> > are obtained for MapReduce jobs... such as many of the jobs we run in
> > Nutch as well.
> >
> > On the other hand (currently), stores such as Cassandra and web service
> > stores such as DynamoDB do not support Hadoop out of the box (the former
> > we are working on and hope to have implemented in Gora soon), therefore
> > it is not as simple to get partitions in the same way we would in a
> > Hadoop-native store. We therefore obtain one partition to be used as an
> > InputSplit for the MR job. This is certainly an area for concern and
> > right now a bottleneck for some operations. We continue to work on this.
> >
> > On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
> >
> > > Hi Otis
> > >
> > > Definitely *not* the fetching speed. Actually everything but *not*
> > > the fetching speed. The fetcher is pretty much the same as 1.x and
> > > anyway the performance with fetching is pretty much always limited by
> > > the politeness settings, not the implementation.
> > >
> > > Re backend: some backend implementations are more mature than others.
> > > The one for HBase is probably the one most widely used, the Cassandra
> > > one has been greatly improved, in particular performance-wise, the
> > > SQL one is broken, etc. We need to measure this, as this is just a gut
> > > feeling at this stage.
> > >
> > > Now for what is slower and why, again this has to be measured, but I
> > > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > > entries is not done by the backends (some might provide a way of
> > > doing it) but is done on the client side, when we create the input
> > > for mapred. In other words we pull things from the backend just to
> > > discard them. Since 2.x does not have segments like 1.x (which the
> > > fetch + parse mapreduce jobs take as single input), we scan the whole
> > > table even if we want to fetch or parse a handful of entries.
> > >
> > > On the other hand, 2.x specifies what columns to retrieve for a given
> > > job, whereas 1.x will for instance deserialize the crawldatum
> > > entirely. The metadata objects are costly to read/write, so 2.x might
> > > have the upper hand from that point of view since it pulls and
> > > deserializes only what it needs.
> > >
> > > Finally, the most costly steps in a large crawl in 1.x are the
> > > generation and update, as we have to read/write the crawldb entirely.
> > > The way the updates are done in 2.x is different and should be a lot
> > > faster.
> > >
> > > Please could anyone correct me if I am wrong. Some of this is based
> > > on my understanding of 2.x, which dates back quite a while, and some
> > > of the stuff might have changed in the meantime. The performance
> > > would probably vary a lot based on the fine tuning of each backend
> > > implementation, but having some basic comparison would confirm some
> > > of the assertions above.
> > >
> > > Julien
> > >
> > > [1] https://issues.apache.org/jira/browse/GORA-119
> > >
> > > > Julien, could you please elaborate a bit about your comment about
> > > > speed depending on the backend used?
> > > >
> > > > Yes, you were the person I was referring to :)
> > > >
> > > > Oh, and I *believe* you said it was the fetching speed that was
> > > > different between 1.x and 2.x. Is that right? Or is some other
> > > > phase slower in 2.x?
> > > >
> > > > Thanks,
> > > > Otis
> > > > ----
> > > > Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> > > > http://sematext.com/spm
> > > >
> > > > > From: Julien Nioche <[email protected]>
> > > > > To: "[email protected]" <[email protected]>
> > > > > Sent: Tuesday, August 6, 2013 10:54 AM
> > > > > Subject: Re: 2.x vs. 1.x speed
> > > > >
> > > > > Hi Otis,
> > > > >
> > > > > That certainly depends on the backend used, but on the whole it
> > > > > wouldn't be surprising. Would be good to have some data to
> > > > > substantiate it. I am planning to put my intern on the case and
> > > > > have some basic comparison as soon as she gets a good grip of
> > > > > Hadoop / Nutch etc., but if someone else wants to do it please
> > > > > go ahead.
> > > > >
> > > > > In case I happen to be the person who told you that, Otis, well,
> > > > > at least I am consistent ;-)
> > > > >
> > > > > Julien
> > > > >
> > > > > On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > At some point earlier this year I spoke to a person who told
> > > > > > me 2.x is (a little?) slower than 1.x. Is that still the case?
> > > > > >
> > > > > > Thanks,
> > > > > > Otis
> > > > > > --
> > > > > > Solr & ElasticSearch Support -- http://sematext.com/
> > > > > > Performance Monitoring -- http://sematext.com/spm
> > > > >
> > > > > --
> > > > > Open Source Solutions for Text Engineering
> > > > >
> > > > > http://digitalpebble.blogspot.com/
> > > > > http://www.digitalpebble.com
> > > > > http://twitter.com/digitalpebble
> >
> > --
> > *Lewis*

