Re: 2.x vs. 1.x speed

kaveh minooie Mon, 16 Sep 2013 16:18:14 -0700

:) Believe me, whatever attitude you might have seen in that sentence was just my own guilty conscience manifesting itself. Nevertheless, you are right and I absolutely apologize for that.

Now I have to say that the reason I haven't really posted anything is not just because I am lazy, but because I am not sure how to go about it in a way that would be meaningful to whoever is going to read it. Performance in a distributed environment is affected by many things, of which few are directly related to Nutch. A lot of it has to do with how Hadoop is set up (how many map or reduce tasks are being run per core? what is the replication factor in Hadoop? whether and what kind of compression is being used, etc.), the hardware that is being used, and, if we are using Gora, the performance of the storage backend and how it has been set up is also going to be a big factor. Not to mention that, at least for the current version of Gora, the storage backends that run on top of Hadoop have fundamentally different characteristics from the ones that do not, so I am not sure whether a head-to-head comparison of just the numbers would be informative or just misleading.

What I am trying to say, I guess, is that if people who have more experience in creating these kinds of reports could suggest some sort of guideline, it would be very helpful to me and, I am sure, to other people as well in posting these kinds of numbers. I think the best possible outcome would be to have some sort of 'zoo' section on the site that would hold all these reports for different scenarios. From my own experience, I can say that one of the biggest problems I had when I started using Nutch, and still have to some degree, was that I was never sure whether what I was doing was right, because there was never a reference point against which I could compare my own results. If it weren't for this fantastic mailing list, I would have been dead.

Also, "realistic" was definitely the wrong word to use. I do agree with you, based on what I have seen on the list, that too many people start using the 2.x version without having enough data to justify it. This would definitely be a very good point to mention, especially on the web site: if you don't have more than x number of links to work with, do not use the 2.x version, at least not yet.

That being said, I'll start keeping track of my results and I'll share them with everyone, hopefully in the near future.


Again, thanks though for posting those numbers.


On 09/16/2013 12:06 PM, Julien Nioche wrote:

Hi Kaveh

Finally, someone posted some metrics, thanks Julian.


No probs. You could have done the same experiment since you felt it was
needed ;-)

I just need to point out, in addition to Renato's question, the size of
the data that you chose to use for the test is not really fair. IMHO, for
2.x to be somewhat realistic, you're gonna want to have a crawldb with at
least a few hundreds of millions of links and a fetch list of, again, at
least 1 or 2 million. What do you guys think?



If realistic means close to real usage then you'll find that most people
use Nutch on dbs smaller than 3M urls. From that point of view, this
experiment is realistic. It is also realistic in the sense that it can
be reproduced easily: fetching millions of urls would take a lot of time
and having hundreds of millions of pages requires a larger cluster ($$$$).

Again, I mentioned in my post that it would be interesting to do it with a
larger cluster but at least we can discuss the limitations in design and
implementation that Nutch 2 currently has.

The main point is that this test was a relative comparison between 2
versions, not an absolute benchmark of how long it takes to run a crawl.
Knowing how Nutch 2 fares in relation to Nutch 1 is quite useful,
especially with new users expecting a more recent version to perform better
than the old one.

Feel free to try on a larger cluster and dataset and share your results, it
will be interesting to see if there is a difference from what I measured on
a single machine

Thanks

Julien



On 09/16/2013 10:42 AM, Renato Marroquín Mogrovejo wrote:

Thanks for sharing, Julien! These are indeed interesting results.
Just a quick question: did you use a single server to run this, or did you
set up a minimum number of servers for it? I ask because HBase or
Cassandra will improve their latency if we scale them out.


Renato M.


2013/9/16 Markus Jelsma <[email protected]>

  Thanks! That was interesting.


-----Original message-----
From: Julien Nioche <[email protected]>

Sent: Monday 16th September 2013 18:45
To: [email protected]; [email protected]
Cc: Otis Gospodnetic <[email protected]>
Subject: Re: 2.x vs. 1.x speed

Guys,

Following the discussion we had some time ago about comparing 1.x with
2.x, we did some tests and put the results on

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html


Feel free to comment.

Best,

Julien

On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:

I am sure that Renato (if he is watching) can maybe plug in as well.

We find in Gora that, for stores that are Hadoop-native in every sense of
the word, such as Avro, HBase and Accumulo, when we execute a query with
GoraInputFormat via getPartitions we retrieve GoraInputSplits natively,
which means splits are obtained for MapReduce jobs... such as many of the
jobs we run in Nutch as well. On the other hand, (currently) stores such as
Cassandra and web service stores such as DynamoDB do not support Hadoop out
of the box (the former we are working on and hope to have implemented in
Gora soon), therefore it is not as simple to get partitions in the same way
we would in a Hadoop-native store. We therefore obtain one partition to be
used as an InputSplit for the MR job. This is certainly an area for concern
and right now a bottleneck for some operations. We continue to work on this.
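
To make the single-partition bottleneck concrete, here is a rough sketch of how the number of input splits bounds map-side parallelism. It is a toy model, not Gora or Hadoop code: the function names, the slot count and the per-task throughput are all assumptions chosen for illustration, and it assumes rows are spread evenly across splits.

```python
import math


def map_waves(num_splits: int, map_slots: int) -> int:
    """Number of sequential waves of map tasks needed to cover all splits."""
    return math.ceil(num_splits / map_slots)


def estimated_map_phase_seconds(total_rows: int, num_splits: int,
                                map_slots: int,
                                rows_per_second_per_task: float) -> float:
    """Crude estimate of map-phase time when rows are evenly split.

    One map task handles one split, so a store that exposes a single
    partition gives a single task, whatever the size of the cluster.
    """
    rows_per_split = math.ceil(total_rows / num_splits)
    seconds_per_task = rows_per_split / rows_per_second_per_task
    return map_waves(num_splits, map_slots) * seconds_per_task


if __name__ == "__main__":
    # Hypothetical numbers: 1M rows, 8 map slots, 1000 rows/s per task.
    for splits in (1, 8):
        t = estimated_map_phase_seconds(1_000_000, splits, 8, 1000.0)
        print(f"{splits} split(s): ~{t:.0f} s")
```

Under these made-up numbers the single-split case takes eight times as long as the eight-split case, because seven of the eight map slots sit idle.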

On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:

  Hi Otis

  Definitely *not* the fetching speed. Actually everything but *not* the
  fetching speed. The fetcher is pretty much the same as 1.x and anyway the
  performance with fetching is pretty much always limited by the politeness
  settings, not the implementation.
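
As a back-of-the-envelope illustration of why politeness dominates, the sketch below computes the throughput ceiling that a per-host delay imposes. It is only a model: the function names are made up, the 5-second delay and the host count are illustrative values (in Nutch the delay is set through properties such as `fetcher.server.delay`), and it assumes URLs are spread evenly across hosts.

```python
def max_pages_per_second(num_hosts: int, delay_seconds: float,
                         threads_per_host: int = 1) -> float:
    """Upper bound on fetch rate imposed by politeness alone.

    Each host is hit at most `threads_per_host` times per `delay_seconds`,
    however fast the fetcher implementation itself is.
    """
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return num_hosts * threads_per_host / delay_seconds


def min_fetch_time_seconds(num_urls: int, num_hosts: int,
                           delay_seconds: float) -> float:
    """Lower bound on the time needed to fetch `num_urls` politely."""
    return num_urls / max_pages_per_second(num_hosts, delay_seconds)


if __name__ == "__main__":
    # Hypothetical crawl: 1M URLs over 1000 hosts, 5 s between requests.
    print(max_pages_per_second(1000, 5.0))
    print(min_fetch_time_seconds(1_000_000, 1000, 5.0))
```

With these assumed values the ceiling is 200 pages per second, so the fetch cannot finish in under 5000 seconds regardless of whether 1.x or 2.x is doing the fetching.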

  Re-backend: some backend implementations are more mature than others. The
  one for HBase is probably the one most widely used, the Cassandra one has
  been greatly improved, in particular performance-wise, the SQL one is
  broken, etc. We need to measure this, as this is just a gut feeling at
  this stage.

  Now for what is slower and why: again this has to be measured, but I
  expect 2.x to be slower partly because of [1], i.e. the filtering of
  entries is not done by the backends (some might provide a way of doing it)
  but is done on the client side, when we create the input for mapred. In
  other words we pull things from the backend just to discard them. Since
  2.x does not have segments like 1.x (which the fetch + parse mapreduce
  jobs take as single input), we scan the whole table even if we want to
  fetch or parse a handful of entries.
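
The cost of client-side filtering can be sketched with a small in-memory model. This is not Nutch or Gora code: `Row`, `batch_id` and both functions are hypothetical names, and the table is just a Python list standing in for the backend.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    url: str
    batch_id: Optional[str]  # hypothetical marker for "selected for this batch"


def client_side_filter(table: List[Row],
                       wanted_batch: str) -> Tuple[int, List[Row]]:
    """Scan every row and discard non-matching ones on the client.

    Returns (rows pulled from the backend, rows actually kept).
    """
    pulled = 0
    kept = []
    for row in table:
        pulled += 1
        if row.batch_id == wanted_batch:
            kept.append(row)
    return pulled, kept


def segment_read(segment: List[Row]) -> Tuple[int, List[Row]]:
    """Read a pre-built segment that holds only the selected rows."""
    return len(segment), list(segment)


if __name__ == "__main__":
    # 100,000 rows in the table, of which every 1000th is in batch "b1".
    table = [Row(f"http://example.com/{i}", "b1" if i % 1000 == 0 else None)
             for i in range(100_000)]
    pulled, kept = client_side_filter(table, "b1")
    print(f"table scan: pulled {pulled}, kept {len(kept)}")
    pulled_seg, _ = segment_read(kept)
    print(f"segment read: pulled {pulled_seg}")
```

In this toy setup the table scan pulls 100,000 rows to keep 100, while a segment-style input would pull only those 100; the real ratio depends entirely on the crawl.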

  On the other hand, 2.x specifies what columns to retrieve for a given job,
  whereas 1.x will for instance deserialize the crawldatum entirely. The
  metadata objects are costly to read/write so 2.x might have the upper hand
  from that point of view since it pulls and deserializes only what it
  needs.
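
A minimal sketch of the column-projection argument, assuming a record is a set of named fields with known serialized sizes. The field names and byte sizes are invented for illustration and do not reflect the actual Nutch schema.

```python
from typing import Dict, Iterable, Optional


def bytes_read(field_sizes: Dict[str, int],
               fields: Optional[Iterable[str]] = None) -> int:
    """Bytes deserialized for one record.

    With `fields=None` the whole record is read (the 1.x-style case);
    otherwise only the requested columns are read (the 2.x-style case).
    """
    if fields is None:
        return sum(field_sizes.values())
    return sum(field_sizes[name] for name in fields)


if __name__ == "__main__":
    # Hypothetical per-field sizes in bytes; metadata dominates.
    record = {"status": 1, "fetch_time": 8, "score": 4, "metadata": 400}
    print("whole record:", bytes_read(record))
    print("projected   :", bytes_read(record, ["status", "fetch_time"]))
```

The saving only materializes when the job really needs few columns and the backend can serve them without reading the rest, which is one of the things a measurement would have to confirm.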

  Finally the most costly steps in a large crawl in 1.x are the generation
  and update, as we have to read/write the crawldb entirely. The way the
  updates are done in 2.x is different and should be a lot faster.

  Please could anyone correct me if I am wrong. Some of this is based on my
  understanding of 2.x, which dates back quite a while, and some of the
  stuff might have changed in the meantime. The performance would probably
  vary a lot based on the fine tuning of each backend implementation but
  having some basic comparison would confirm some of the assertions above.

  Julien

  [1] https://issues.apache.org/jira/browse/GORA-119

  Julien, could you please elaborate a bit about your comment about speed
  depending on the backend used?

  Yes, you were the person I was referring to :)

  Oh, and I *believe* you said it was the fetching speed that was different
  between 1.x and 2.x.  Is that right?  Or is some other phase slower in
  2.x?

  Thanks,

  Otis

  ----

  Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
  http://sematext.com/spm

  ________________________________
  From: Julien Nioche <[email protected]>
  To: "[email protected]" <[email protected]>
  Sent: Tuesday, August 6, 2013 10:54 AM
  Subject: Re: 2.x vs. 1.x speed

  Hi Otis,

  That certainly depends on the backend used but on the whole it wouldn't be
  surprising. Would be good to have some data to substantiate it. I am
  planning to put my intern on the case and have some basic comparison as
  soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
  wants to do it please go ahead.

  In case I happen to be the person who told you that Otis, well at least I
  am consistent ;-)

  Julien

  On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:

  Hello,

  At some point earlier this year I spoke to a person who told me 2.x is
  (a little?) slower than 1.x.  Is that still the case?

  Thanks,

  Otis

--

  Solr & ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm

--

  Open Source Solutions for Text Engineering
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble

--

*Lewis*

--
Kaveh Minooie


