Finally, someone posted some metrics. Thanks, Julien. In addition to Renato's question, I just need to point out that the size of the data you chose for the test is not really fair. IMHO, for 2.x to be somewhat realistic, you're going to want a crawldb with at least a few hundred million links and a fetch list of, again, at least 1 or 2 million. What do you guys think?

On 09/16/2013 10:42 AM, Renato Marroquín Mogrovejo wrote:
Thanks for sharing, Julien! These are indeed interesting results.
Just a quick question: did you use a single server to run this, or did you
set up a minimum number of servers for it? I ask because HBase or
Cassandra will improve their latency if we scale them out.


Renato M.


2013/9/16 Markus Jelsma <[email protected]>

Thanks! That was interesting.

-----Original message-----
From: Julien Nioche<[email protected]>
Sent: Monday 16th September 2013 18:45
To: [email protected]; [email protected]
Cc: Otis Gospodnetic <[email protected]>
Subject: Re: 2.x vs. 1.x speed

Guys,

Following the discussion we had some time ago about comparing 1.x with
2.x, we did some tests and put the results on

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

Feel free to comment.

Best,

Julien

On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:

I am sure that Renato (if he is watching) can maybe chime in as well.

We find in Gora that, for stores that are Hadoop-native in every sense of
the word, such as Avro, HBase and Accumulo, when we execute a query with
GoraInputFormat via getPartitions we retrieve GoraInputSplits natively,
which means splits are obtained for MapReduce jobs, such as many of the
jobs we run in Nutch. On the other hand, (currently) stores such as
Cassandra and web service stores such as DynamoDB do not support Hadoop
out of the box (the former we are working on and hope to have implemented
in Gora soon), so it is not as simple to get partitions in the same way we
would in a Hadoop-native store. We therefore obtain one partition to be
used as an InputSplit for the MR job. This is certainly an area for concern
and right now a bottleneck for some operations. We continue to work on this.
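
To make the single-partition point above concrete, here is a minimal toy
sketch. It is plain Python, not Gora or Hadoop code: the function names are
made up for illustration, and it assumes one map task per input split with
rows spread evenly over the splits. The row count and the figure of 8 splits
are arbitrary.

```python
# Toy model only: this is plain Python, not Gora or Hadoop code.
# Assumption: one map task per input split, rows spread evenly over splits.

def rows_per_map_task(total_rows, partitions):
    """Return the number of rows each map task would scan."""
    if partitions < 1:
        raise ValueError("need at least one partition")
    base, extra = divmod(total_rows, partitions)
    # The first `extra` tasks take one additional row each.
    return [base + (1 if i < extra else 0) for i in range(partitions)]


def largest_task(total_rows, partitions):
    """Rows scanned by the largest (and so slowest) map task."""
    return max(rows_per_map_task(total_rows, partitions))


single = largest_task(1_000_000, 1)  # store that yields a single partition
multi = largest_task(1_000_000, 8)   # store that yields 8 splits (arbitrary)
```

Under these assumptions, a store that returns a single partition leaves one
map task to scan the whole table, however many nodes the cluster has.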

On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:

Hi Otis



Definitely *not* the fetching speed. Actually everything *but* the
fetching speed. The fetcher is pretty much the same as 1.x and anyway the
performance with fetching is pretty much always limited by the politeness
settings, not the implementation.



Re backend: some backend implementations are more mature than others. The
one for HBase is probably the one most widely used, the Cassandra one has
been greatly improved, in particular performance-wise, the SQL one is
broken, etc. We need to measure this, as it is just a gut feeling at this
stage.



Now for what is slower and why: again, this has to be measured, but I
expect 2.x to be slower partly because of [1], i.e. the filtering of
entries is not done by the backends (some might provide a way of doing it)
but on the client side, when we create the input for mapred. In other
words, we pull things from the backend just to discard them. Since 2.x does
not have segments like 1.x (which the fetch + parse mapreduce jobs take as
single input), we scan the whole table even if we want to fetch or parse a
handful of entries.
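
The client-side filtering described above can be illustrated with a small
toy sketch. This is plain Python rather than Nutch or Gora code; the table
layout, the `batch` field and the function name are hypothetical and exist
only to show that every row is pulled even when only a few are kept.

```python
# Toy illustration only: not Nutch or Gora code.
# Hypothetical table: 1000 rows, of which 5 belong to the batch we want.
table = [
    {"url": f"http://example.com/{i}", "batch": "b1" if i < 5 else "old"}
    for i in range(1000)
]


def client_side_filter(rows, wanted_batch):
    """Pull every row from the store, then discard non-matching rows locally."""
    pulled = 0
    kept = []
    for row in rows:  # full table scan: every row crosses the wire
        pulled += 1
        if row["batch"] == wanted_batch:
            kept.append(row["url"])
    return kept, pulled


kept, pulled = client_side_filter(table, "b1")
```

In this toy table all 1000 rows are pulled to keep 5 of them. A 1.x-style
segment would correspond to reading only those 5 entries.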



On the other hand, 2.x specifies what columns to retrieve for a given job,
whereas 1.x will for instance deserialize the crawldatum entirely. The
metadata objects are costly to read/write so 2.x might have the upper hand
from that point of view since it pulls and deserializes only what it
needs.
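
As a rough sketch of why retrieving only selected columns can help, the toy
example below serializes each column of a row separately and counts how many
bytes have to be deserialized. It is plain Python with JSON standing in for
the real serialization; the field names (`status`, `fetchTime`, `metadata`)
and both helper functions are assumptions made for illustration, not the
Nutch or Gora schema or API.

```python
import json

# Toy illustration only: JSON stands in for the real serialization format.


def store_row(row):
    """Serialize each column separately, as a column-oriented store might."""
    return {name: json.dumps(value) for name, value in row.items()}


def read_columns(stored, columns):
    """Deserialize only the requested columns; return (values, bytes_read)."""
    values = {c: json.loads(stored[c]) for c in columns}
    bytes_read = sum(len(stored[c]) for c in columns)
    return values, bytes_read


# Hypothetical row with a small status part and a bulky metadata part.
row = {
    "status": "fetched",
    "fetchTime": 1379349900,
    "metadata": {f"k{i}": "v" * 50 for i in range(20)},
}
stored = store_row(row)

full, full_bytes = read_columns(stored, list(stored))
partial, partial_bytes = read_columns(stored, ["status", "fetchTime"])
```

A job that only needs `status` and `fetchTime` never touches the metadata
column here, so `partial_bytes` is a small fraction of `full_bytes`.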



Finally, the most costly steps in a large crawl in 1.x are the generation
and update, as we have to read/write the crawldb entirely. The way the
updates are done in 2.x is different and should be a lot faster.



Please could anyone correct me if I am wrong. Some of this is based on my
understanding of 2.x, which dates back quite a while, and some of the
stuff might have changed in the meantime. The performance would probably
vary a lot based on the fine tuning of each backend implementation, but
having some basic comparison would confirm some of the assertions above.



Julien





[1] https://issues.apache.org/jira/browse/GORA-119





Julien, could you please elaborate a bit on your comment about speed
depending on the backend used?



Yes, you were the person I was referring to :)



Oh, and I *believe* you said it was the fetching speed that was different
between 1.x and 2.x.  Is that right?  Or is some other phase slower in
2.x?



Thanks,

Otis

----

Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
http://sematext.com/spm









________________________________

From: Julien Nioche <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, August 6, 2013 10:54 AM
Subject: Re: 2.x vs. 1.x speed





Hi Otis,



That certainly depends on the backend used, but on the whole it wouldn't be
surprising. It would be good to have some data to substantiate it. I am
planning to put my intern on the case and have some basic comparison as
soon as she gets a good grip of Hadoop / Nutch etc., but if someone else
wants to do it, please go ahead.



In case I happen to be the person who told you that, Otis, well, at least I
am consistent ;-)



Julien





















On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:



Hello,



At some point earlier this year I spoke to a person who told me 2.x is
(a little?) slower than 1.x.  Is that still the case?



Thanks,

Otis

--

Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm









--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


















--

*Lewis*






--
Kaveh Minooie
