Hi Raymond,

Don't know Bixo - how does it compare to Nutch?

It's a very early project to create a web crawler toolkit (versus a system like Nutch).

What do you mean by "copy to S3"?

Amazon's S3 (Simple Storage Service - http://s3.amazonaws.com/) is what I use for persistent storage, since the clusters go up/down as needed. The cost of storage in S3 is almost negligible, though each transfer has a small price.
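
In case it helps make that concrete: the copy is just moving the crawl output from HDFS up to a bucket once the job finishes. Below is a minimal sketch using Hadoop's FileSystem API with the s3n (native S3) scheme - the bucket name, paths, and credential values are made up, so substitute your own:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CopyCrawlToS3 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the s3n filesystem - fill in your own keys.
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

            FileSystem hdfs = FileSystem.get(URI.create("hdfs://master:9000/"), conf);
            FileSystem s3 = FileSystem.get(URI.create("s3n://my-crawl-bucket/"), conf);

            // Copy the finished crawl output up to S3 (false = keep the source).
            FileUtil.copy(hdfs, new Path("/user/hadoop/crawl"),
                          s3, new Path("/crawls/2009-06-05"),
                          false, conf);
        }
    }

In practice you'd more likely just run "hadoop distcp" with an s3n:// destination; the code above is what that amounts to for a single directory.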

A potential major drawback is the time & cost of getting content out of S3 to some other, non-AWS set of servers. E.g. if you want to run your search servers in your own data center, but crawl using EC2, the time & cost of moving the content could be excessive. Though Amazon recently introduced AWS Import/Export to help address this issue.

-- Ken


2009/6/5 Ken Krugler <[email protected]>

 how long does it take for your 6 million URLs to be crawled/parsed/indexed? I'm curious to know because I'm about to start working in this area, but I have no idea how long it will take.


 Not sure this helps, but below are some general stats from a fresh crawl I
 just did using Bixo, which should be similar to Nutch times.

 This is for a 5-server cluster (1 master, 4 slaves) of lowest-end boxes on Amazon's EC2.

 * 1.2M URLs from 42K domains
 * Fetch took 3.5 hours, mostly due to long tail
 * Parse took 2.5 hours
 * Index took 25 minutes
 * Copy to S3 took 1 hour

 Adding more servers would have made some things faster, but it wouldn't
 have significantly increased the speed of the fetch, due to the limited
 number of domains.
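
 The reason extra servers don't help much there: a polite fetcher issues at most one request per domain per crawl-delay interval, so fetch throughput is bounded by the number of domains rather than by hardware. A toy illustration of such a per-domain gate (just a sketch, not Bixo's actual code):

    import java.util.HashMap;
    import java.util.Map;

    // Toy sketch of per-domain politeness. With a fixed crawl delay of D ms,
    // peak throughput is roughly (number of domains) * 1000 / D fetches per
    // second, no matter how many fetcher threads or servers you add.
    public class PolitenessGate {
        private final long crawlDelayMs;
        private final Map<String, Long> nextAllowedFetch = new HashMap<String, Long>();

        public PolitenessGate(long crawlDelayMs) {
            this.crawlDelayMs = crawlDelayMs;
        }

        // True if a fetch from this domain is allowed now; if so, the next
        // slot is reserved. Otherwise the caller should requeue the URL.
        public synchronized boolean tryAcquire(String domain) {
            long now = System.currentTimeMillis();
            Long next = nextAllowedFetch.get(domain);
            if (next != null && now < next) {
                return false; // too soon - stay polite
            }
            nextAllowedFetch.put(domain, now + crawlDelayMs);
            return true;
        }
    }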

 So total time was 7.5 hours. EC2 bills by the full instance-hour, so that's 8 hours * 5 servers = 40 compute hours, or $4 using EC2 pricing of $0.10/hr.

 Amazon's Elastic MapReduce (EMR) is slightly more expensive, but you could
 avoid dealing with Hadoop config/setup time.

 -- Ken

  2009/6/5 John Martyniak <[email protected]>

  Arkadi,

  I think that is the beauty of Nutch. I have built an index of a little more than 6 million URLs with "out of the box" Nutch. I would say that is pretty good for most situations before you have to start getting into Hadoop and multiple machines.

  -John


  On Jun 4, 2009, at 5:19 PM, <[email protected]> wrote:

  Hi Andrzej,


  -----Original Message-----

  From: Andrzej Bialecki [mailto:[email protected]]
  Sent: Thursday, June 04, 2009 9:47 PM
  To: [email protected]
  Subject: Re: Merge taking forever

  Bartosz Gadzimski wrote:

  As Arkadi said, your HDD is too slow for 2 x quad-core processors. I have the same problem and am now thinking of using more boxes or very fast drives (15K SAS).

  Raymond Balmãs writes:


  Well, I suspect the sort function is mono-threaded, as they usually are, so only one core is used; 25% is the max you will get.

  I have a dual core and it only goes to 50% CPU in many of the steps... I assumed that some phases are mono-threaded.



   Folks,

  From your conversation I suspect that you are running Hadoop with LocalJobTracker, i.e. in a single JVM - correct?
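
  One quick way to check (a sketch, assuming the 0.19/0.20-era mapred API): if mapred.job.tracker is unset or "local", jobs run in-process.

      import org.apache.hadoop.mapred.JobConf;

      public class CheckJobTracker {
          public static void main(String[] args) {
              JobConf conf = new JobConf();
              // "local" (the default) means the in-process local job runner;
              // anything else is the host:port of a real JobTracker.
              System.out.println("mapred.job.tracker = "
                      + conf.get("mapred.job.tracker", "local"));
          }
      }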

  While this works ok for small datasets, you don't really benefit from map-reduce parallelism (and you still pay the penalty for the overheads). As your dataset grows, you will quickly reach the scalability limits - in this case, the limit of IO throughput of a single drive, during the sort phase of a large dataset. The excessive IO demands can be solved by distributing the load (over many drives, and over many machines), which is what HDFS is designed to do well.

  Hadoop tasks are usually single-threaded, and additionally LocalJobTracker implements only a primitive non-parallel model of task execution - i.e. each task is scheduled to run sequentially in turn. If you run the regular distributed JobTracker, Hadoop splits the load among many tasks running in parallel.

  So, the solution is this: set up a distributed Hadoop cluster, even if it's going to consist of a single node - because then the data will be split and processed in parallel by several JVM instances. This will also help the operating system to schedule these processes over multiple CPUs. Additionally, if you still experience IO contention, consider moving to HDFS as the filesystem, and spread it over more than 1 machine and more than 1 disk in each machine.
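
  For example, a minimal hadoop-site.xml for a pseudo-distributed (single-node, multiple-JVM) setup might look like the following - the property names are the 0.19-era ones, and the host/ports are just typical values:

      <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:9000</value>
        </property>
        <property>
          <name>mapred.job.tracker</name>
          <value>localhost:9001</value>
        </property>
        <property>
          <!-- single node, so keep just one copy of each block -->
          <name>dfs.replication</name>
          <value>1</value>
        </property>
      </configuration>

  Then format the namenode (bin/hadoop namenode -format) and start the daemons (bin/start-all.sh).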

  Thank you for these recommendations.

  I think that there is a large group of users (perhaps limited by the budget or time they are willing to spend) who will give up on trying to use Nutch unless they can run it on a single box with a simple configuration.

  Regards,

  Arkadi



 --
 Ken Krugler
 +1 530-210-6378


--
Ken Krugler
+1 530-210-6378
