Re: hadoop jobs take long time to setup

2009-06-29 Thread Marcus Herou
Of course... Thanks for the help!

Cheers

//Marcus
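A minimal sketch of the idea behind the patch discussed in the quoted thread
below: read a comma-separated list of locally available (e.g. NFS-mounted) jar
paths from a configuration property and append them to the classpath handed to
the child JVM. The property name "mapred.child.additional.classpath" and the
helper class are illustrative assumptions; the actual 8-line TaskRunner.java
patch was attached to the original mail and is not reproduced in this archive.

    // Sketch only: the property name is an assumption, not necessarily the one
    // used by the actual (unattached) TaskRunner.java patch.
    import java.io.File;
    import org.apache.hadoop.mapred.JobConf;

    public class ChildClasspathHelper {
      /** Appends locally available jars (e.g. on NFS) to the child JVM classpath string. */
      public static String appendExtraJars(JobConf conf, String classPath) {
        String extra = conf.get("mapred.child.additional.classpath", "");
        StringBuilder sb = new StringBuilder(classPath);
        for (String jar : extra.split(",")) {
          jar = jar.trim();
          if (jar.length() > 0 && new File(jar).exists()) {
            // Use the same path separator as the rest of the child classpath.
            sb.append(File.pathSeparator).append(jar);
          }
        }
        return sb.toString();
      }
    }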

On Mon, Jun 29, 2009 at 12:32 AM, Mikhail Bautin mbau...@gmail.com wrote:

 Marcus,

 The code that needs to be patched is in the tasktracker, because the
 tasktracker is what starts the child JVM that runs user code.

 Thanks,
 Mikhail

 On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou marcus.he...@tailsweep.com wrote:

  Hi.
 
  Just to be clear. It is the jobtracker that needs the patched code, right ?
  Or is it the tasktrackers ?
 
  Kindly
 
  //Marcus
 
  On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin mbau...@gmail.com wrote:
 
   Marcus,

   We currently use 0.20.0, but this patch just inserts 8 lines of code into
   TaskRunner.java, which could certainly be done with 0.18.3.

   Yes, this patch just appends additional jars to the child JVM classpath.

   I've never really used tmpjars myself, but if it involves uploading
   multiple jar files into HDFS every time a job is started, I see how it can
   be really slow. On our ~80-job workflow this would have really slowed
   things down.

   Thanks,
   Mikhail
  
   On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou marcus.he...@tailsweep.com wrote:
  
    Makes sense... I will try both rsync and NFS, but I think rsync will beat
    NFS since NFS can be slow as hell sometimes, but what the heck, we already
    have our maven2 repo on NFS so why not :)

    Are you saying that this patch makes the client able to configure which
    extra local jar files to add to the classpath when firing up the
    TaskTracker child ?

    To be explicit: Do you confirm that using tmpjars like I do is a costly,
    slow operation ?

    To what branch do you apply the patch (we use 0.18.3) ?

    Cheers

    //Marcus
   
   
    On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin mbau...@gmail.com wrote:

     This is the way we deal with this problem, too. We put our jar files on
     NFS, and the attached patch makes it possible to add those jar files to
     the tasktracker classpath through a configuration property.

     Thanks,
     Mikhail

     On Sun, Jun 28, 2009 at 5:21 PM, Stuart White stuart.whi...@gmail.com wrote:

     Although I've never done it, I believe you could manually copy your jar
     files out to your cluster somewhere in hadoop's classpath, and that would
     remove the need for you to copy them to your cluster at the start of each
     job.

     On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou marcus.he...@tailsweep.com
     wrote:

      Hi.

      Running without a jobtracker makes the job start almost instantly.
      I think it is due to something with the classloader. I use a huge amount
      of jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need
      to be loaded every time I guess.

      By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker
      child live forever then ?

      Cheers

      //Marcus

      On Sun, Jun 28, 2009 at 9:54 PM, tim robertson timrobertson...@gmail.com
      wrote:
 
       How long does it take to start the code locally in a single thread?

       Can you reuse the JVM so it only starts once per node per job?
       conf.setNumTasksToExecutePerJvm(-1)

       Cheers,
       Tim

       On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com
       wrote:
        Hi.

        Wonder how one should improve the startup times of a Hadoop job. Some
        of my jobs which have a lot of dependencies in terms of many jar files
        take a long time to start in Hadoop, up to 2 minutes sometimes.
        The data input amounts in these cases are negligible, so it seems that
        Hadoop has a really high setup cost, which I can live with, but this
        seems too much.

        Let's say a job takes 10 minutes to complete; then it is bad if it
        takes 2 mins to set it up... 20-30 sec max would be a lot more
        reasonable.

        Hints ?

        //Marcus


        --
        Marcus Herou CTO and co-founder Tailsweep AB
        +46702561312
        marcus.he...@tailsweep.com
        http://www.tailsweep.com/
   
  
 
 
 
  --
  Marcus Herou CTO and co-founder Tailsweep AB
  +46702561312
  marcus.he...@tailsweep.com
  http://www.tailsweep.com/
 



   
   
  
 
 
 
  --
  Marcus Herou CTO and co-founder Tailsweep AB
  +46702561312
  marcus.he...@tailsweep.com
  http://www.tailsweep.com/
 




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi.

Wonder how one should improve the startup times of a Hadoop job. Some of my
jobs which have a lot of dependencies in terms of many jar files take a long
time to start in Hadoop, up to 2 minutes sometimes.
The data input amounts in these cases are negligible, so it seems that Hadoop
has a really high setup cost, which I can live with, but this seems too much.

Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
mins to set it up... 20-30 sec max would be a lot more reasonable.

Hints ?

//Marcus


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Scaling out/up or a mix

2009-06-28 Thread Marcus Herou
Hi.

The crawlers are _very_ threaded, but no, we use our own threading framework
since it was not available at the time in hadoop-core.

Crawlers normally just wait a lot on clients, inducing very little CPU load
but consuming some memory due to the parallelism.

//Marcus
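A hedged sketch of the built-in multi-threaded map runner that jason mentions
below; the thread-count property name should be checked against the Hadoop
version in use, and the mapper must of course be thread-safe.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class CrawlJobThreading {
      public static void configure(JobConf job) {
        // Run the (thread-safe) map function in N parallel threads per task, so
        // one map slot can keep many slow remote HTTP connections in flight.
        job.setMapRunnerClass(MultithreadedMapRunner.class);
        job.setInt("mapred.map.multithreadedrunner.threads", 50);
      }
    }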

On Sat, Jun 27, 2009 at 6:10 PM, jason hadoop jason.had...@gmail.com wrote:

 How about multi-threaded mappers?
 Multi-threaded mappers are ideal for map tasks that are not locally I/O
 bound and have many distinct endpoints.
 You can also control the thread count on a per-job basis.

 On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou marcus.he...@tailsweep.com wrote:

  The argument currently against increasing num-mappers is that the machines
  will get into OOM, and since a lot of the jobs are crawlers I need more
  IP numbers so I don't get banned :)

  Thing is that we currently have Solr on the very same machines, and
  data-nodes as well, so I can only give the MR nodes about 1G memory since I
  need Solr to have 4G...

  Now I see that I should get some obvious and just critique about the layout
  of this arch, but I'm a little limited in budget and so, then, is the arch :)

  However, is it wise to have the MR tasks on the same nodes as the
  data-nodes, or should I split the arch ? I mean, the data-nodes perhaps need
  more disk I/O and the MR more memory and CPU ?

  Trying to find a sweet-spot hardware spec for those two roles.

  //Marcus
 
 
 
  On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman bbock...@cse.unl.edu wrote:
 
    Hey Marcus,

    Are you recording the data rates coming out of HDFS?  Since you have such
    a low CPU utilization, I'd look at boxes utterly packed with big hard
    drives (also, why are you using RAID1 for Hadoop??).

    You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
    Based on the data rates you see, make the call.

    On the other hand, what's the argument against running 3x more mappers per
    box?  It seems that your boxes still have more overhead to use -- there's
    no I/O wait.

    Brian
  
  
   On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
  
     Hi.

     We have a deployment of 10 Hadoop servers and I now need more mapping
     capability (no, not just add more mappers per instance) since I have so
     many jobs running. Now I am wondering what I should aim for...
     Memory, CPU or disk... How long is a rope, perhaps you would say ?

     A typical server is currently using about 15-20% CPU today on a quad-core
     2.4GHz 8GB RAM machine with 2 RAID1 SATA 500GB disks.

     Some specs below.

     mpstat 2 5

     Linux 2.6.24-19-server (mapreduce2) 06/26/2009

     11:36:13 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
     11:36:15 PM  all  22.82   0.00  3.24     1.37  0.62   2.49    0.00  69.45  8572.50
     11:36:17 PM  all  13.56   0.00  1.74     1.99  0.62   2.61    0.00  79.48  8075.50
     11:36:19 PM  all  14.32   0.00  2.24     1.12  1.12   2.24    0.00  78.95  9219.00
     11:36:21 PM  all  14.71   0.00  0.87     1.62  0.25   1.75    0.00  80.80  8489.50
     11:36:23 PM  all  12.69   0.00  0.87     1.24  0.50   0.75    0.00  83.96  5495.00
     Average:     all  15.62   0.00  1.79     1.47  0.62   1.97    0.00  78.53  7970.30

     What I am thinking is... Is it wiser to go for many of these cheap boxes
     with 8GB of RAM, or should I for instance focus on machines which can
     give more I/O throughput ?

     I know that these things are hard, but perhaps someone has drawn some
     conclusions before going the pragmatic way.

     Kindly

     //Marcus


     --
     Marcus Herou CTO and co-founder Tailsweep AB
     +46702561312
     marcus.he...@tailsweep.com
     http://www.tailsweep.com/
  
  
  
 
 
  --
  Marcus Herou CTO and co-founder Tailsweep AB
  +46702561312
  marcus.he...@tailsweep.com
  http://www.tailsweep.com/
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi.

Running without a jobtracker makes the job start almost instantly.
I think it is due to something with the classloader. I use a huge amount of
jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need to be
loaded every time I guess.

By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
live forever then ?

Cheers

//Marcus
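A minimal sketch of the two mechanisms discussed in this message; the jar paths
are placeholders, "tmpjars" is the property the -libjars option populates, and
JVM reuse (setNumTasksToExecutePerJvm) only exists from Hadoop 0.19 onwards.

    import org.apache.hadoop.mapred.JobConf;

    public class JobSetupSketch {
      public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);

        // Ship dependency jars with the job: each jar is uploaded to HDFS and
        // pulled onto the task classpath via the distributed cache. Convenient,
        // but paid for on every submission, which is what makes many-jar jobs
        // slow to start.
        conf.set("tmpjars", "file:///opt/app/lib/dep1.jar,file:///opt/app/lib/dep2.jar");

        // Reuse the child JVM for an unlimited number of tasks of the same job,
        // so JVM startup and classloading are paid once per node per job.
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
      }
    }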

On Sun, Jun 28, 2009 at 9:54 PM, tim robertson timrobertson...@gmail.com wrote:

 How long does it take to start the code locally in a single thread?

 Can you reuse the JVM so it only starts once per node per job?
 conf.setNumTasksToExecutePerJvm(-1)

 Cheers,
 Tim



 On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com wrote:
   Hi.

   Wonder how one should improve the startup times of a Hadoop job. Some of
   my jobs which have a lot of dependencies in terms of many jar files take a
   long time to start in Hadoop, up to 2 minutes sometimes.
   The data input amounts in these cases are negligible, so it seems that
   Hadoop has a really high setup cost, which I can live with, but this seems
   too much.

   Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
   mins to set it up... 20-30 sec max would be a lot more reasonable.

   Hints ?

   //Marcus


   --
   Marcus Herou CTO and co-founder Tailsweep AB
   +46702561312
   marcus.he...@tailsweep.com
   http://www.tailsweep.com/
 




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Makes sense... I will try both rsync and NFS, but I think rsync will beat NFS
since NFS can be slow as hell sometimes, but what the heck, we already have
our maven2 repo on NFS so why not :)

Are you saying that this patch makes the client able to configure which
extra local jar files to add to the classpath when firing up the
TaskTracker child ?

To be explicit: Do you confirm that using tmpjars like I do is a costly,
slow operation ?

To what branch do you apply the patch (we use 0.18.3) ?

Cheers

//Marcus


On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin mbau...@gmail.com wrote:

 This is the way we deal with this problem, too. We put our jar files on
 NFS, and the attached patch makes it possible to add those jar files to the
 tasktracker classpath through a configuration property.

 Thanks,
 Mikhail

 On Sun, Jun 28, 2009 at 5:21 PM, Stuart White stuart.whi...@gmail.com wrote:

 Although I've never done it, I believe you could manually copy your jar
 files out to your cluster somewhere in hadoop's classpath, and that would
 remove the need for you to copy them to your cluster at the start of each
 job.

 On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou marcus.he...@tailsweep.com
 wrote:

  Hi.
 
  Running without a jobtracker makes the job start almost instantly.
  I think it is due to something with the classloader. I use a huge amount of
  jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need to be
  loaded every time I guess.

  By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
  live forever then ?
 
  Cheers
 
  //Marcus
 
  On Sun, Jun 28, 2009 at 9:54 PM, tim robertson 
 timrobertson...@gmail.com
  wrote:
 
   How long does it take to start the code locally in a single thread?
  
   Can you reuse the JVM so it only starts once per node per job?
   conf.setNumTasksToExecutePerJvm(-1)
  
   Cheers,
   Tim
  
  
  
    On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com
    wrote:
     Hi.

     Wonder how one should improve the startup times of a Hadoop job. Some of
     my jobs which have a lot of dependencies in terms of many jar files take
     a long time to start in Hadoop, up to 2 minutes sometimes.
     The data input amounts in these cases are negligible, so it seems that
     Hadoop has a really high setup cost, which I can live with, but this
     seems too much.

     Let's say a job takes 10 minutes to complete; then it is bad if it takes
     2 mins to set it up... 20-30 sec max would be a lot more reasonable.

     Hints ?

     //Marcus


     --
     Marcus Herou CTO and co-founder Tailsweep AB
     +46702561312
     marcus.he...@tailsweep.com
     http://www.tailsweep.com/
   
  
 
 
 
  --
  Marcus Herou CTO and co-founder Tailsweep AB
  +46702561312
  marcus.he...@tailsweep.com
  http://www.tailsweep.com/
 





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi.

Just to be clear. It is the jobtracker that needs the patched code right ?
Or is it the tasktrackers ?

Kindly

//Marcus

On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin mbau...@gmail.com wrote:

 Marcus,

 We currently use 0.20.0, but this patch just inserts 8 lines of code into
 TaskRunner.java, which could certainly be done with 0.18.3.

 Yes, this patch just appends additional jars to the child JVM classpath.

 I've never really used tmpjars myself, but if it involves uploading multiple
 jar files into HDFS every time a job is started, I see how it can be really
 slow. On our ~80-job workflow this would have really slowed things down.

 Thanks,
 Mikhail

 On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou marcus.he...@tailsweep.com
 wrote:

  Makes sense... I will try both rsync and NFS, but I think rsync will beat
  NFS since NFS can be slow as hell sometimes, but what the heck, we already
  have our maven2 repo on NFS so why not :)

  Are you saying that this patch makes the client able to configure which
  extra local jar files to add to the classpath when firing up the
  TaskTracker child ?

  To be explicit: Do you confirm that using tmpjars like I do is a costly,
  slow operation ?

  To what branch do you apply the patch (we use 0.18.3) ?
 
  Cheers
 
  //Marcus
 
 
  On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin mbau...@gmail.com
  wrote:
 
   This is the way we deal with this problem, too. We put our jar files on
   NFS, and the attached patch makes it possible to add those jar files to
   the tasktracker classpath through a configuration property.
  
   Thanks,
   Mikhail
  
   On Sun, Jun 28, 2009 at 5:21 PM, Stuart White stuart.whi...@gmail.com
  wrote:
  
   Although I've never done it, I believe you could manually copy your jar
   files out to your cluster somewhere in hadoop's classpath, and that would
   remove the need for you to copy them to your cluster at the start of each
   job.
  
   On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou 
  marcus.he...@tailsweep.com
   wrote:
  
Hi.
   
 Running without a jobtracker makes the job start almost instantly.
 I think it is due to something with the classloader. I use a huge amount of
 jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need to be
 loaded every time I guess.

 By issuing conf.setNumTasksToExecutePerJvm(-1), will the TaskTracker child
 live forever then ?
   
Cheers
   
//Marcus
   
On Sun, Jun 28, 2009 at 9:54 PM, tim robertson 
   timrobertson...@gmail.com
wrote:
   
 How long does it take to start the code locally in a single
 thread?

 Can you reuse the JVM so it only starts once per node per job?
 conf.setNumTasksToExecutePerJvm(-1)

 Cheers,
 Tim



 On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou
   marcus.he...@tailsweep.com

 wrote:
   Hi.

   Wonder how one should improve the startup times of a Hadoop job. Some of
   my jobs which have a lot of dependencies in terms of many jar files take a
   long time to start in Hadoop, up to 2 minutes sometimes.
   The data input amounts in these cases are negligible, so it seems that
   Hadoop has a really high setup cost, which I can live with, but this seems
   too much.

   Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
   mins to set it up... 20-30 sec max would be a lot more reasonable.

   Hints ?

   //Marcus


   --
   Marcus Herou CTO and co-founder Tailsweep AB
   +46702561312
   marcus.he...@tailsweep.com
   http://www.tailsweep.com/
 

   
   
   
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
   
  
  
  
 
 




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Scaling out/up or a mix

2009-06-27 Thread Marcus Herou
The argument currently against increasing num-mappers is that the machines
will get into OOM, and since a lot of the jobs are crawlers I need more
IP numbers so I don't get banned :)

Thing is that we currently have Solr on the very same machines, and
data-nodes as well, so I can only give the MR nodes about 1G memory since I
need Solr to have 4G...

Now I see that I should get some obvious and just critique about the layout
of this arch, but I'm a little limited in budget and so, then, is the arch :)

However, is it wise to have the MR tasks on the same nodes as the data-nodes,
or should I split the arch ? I mean, the data-nodes perhaps need more disk
I/O and the MR more memory and CPU ?

Trying to find a sweet-spot hardware spec for those two roles.

//Marcus
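For the OOM concern above, the heap of each child task JVM can be capped per
job so that the map slots on a node fit next to the 4G Solr instance; the
number of concurrent slots per node is a tasktracker-side setting
(mapred.tasktracker.map.tasks.maximum in the node's configuration). A sketch,
with the -Xmx value as an example rather than a recommendation:

    import org.apache.hadoop.mapred.JobConf;

    public class CrawlerMemorySettings {
      public static void apply(JobConf job) {
        // Per-job child JVM options; the default in this era of Hadoop is -Xmx200m.
        job.set("mapred.child.java.opts", "-Xmx256m");
      }
    }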



On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman bbock...@cse.unl.edu wrote:

 Hey Marcus,

 Are you recording the data rates coming out of HDFS?  Since you have such a
 low CPU utilization, I'd look at boxes utterly packed with big hard drives
 (also, why are you using RAID1 for Hadoop??).

 You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
  Based on the data rates you see, make the call.

 On the other hand, what's the argument against running 3x more mappers per
 box?  It seems that your boxes still have more overhead to use -- there's no
 I/O wait.

 Brian


 On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:

  Hi.

 We have a deployment of 10 hadoop servers and I now need more mapping
 capability (no not just add more mappers per instance) since I have so
 many
 jobs running. Now I am wondering what I should aim on...
 Memory, cpu or disk... How long is a rope perhaps you would say ?

 A typical server is currently using about 15-20% cpu today on a quad-core
 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.

 Some specs below.

 mpstat 2 5

 Linux 2.6.24-19-server (mapreduce2) 06/26/2009

 11:36:13 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
 11:36:15 PM  all  22.82   0.00  3.24     1.37  0.62   2.49    0.00  69.45  8572.50
 11:36:17 PM  all  13.56   0.00  1.74     1.99  0.62   2.61    0.00  79.48  8075.50
 11:36:19 PM  all  14.32   0.00  2.24     1.12  1.12   2.24    0.00  78.95  9219.00
 11:36:21 PM  all  14.71   0.00  0.87     1.62  0.25   1.75    0.00  80.80  8489.50
 11:36:23 PM  all  12.69   0.00  0.87     1.24  0.50   0.75    0.00  83.96  5495.00
 Average:     all  15.62   0.00  1.79     1.47  0.62   1.97    0.00  78.53  7970.30

 What I am thinking is... Is it wiser to go for many of these cheap boxes
 with 8GB of RAM, or should I for instance focus on machines which can give
 more I/O throughput ?

 I know that these things are hard, but perhaps someone has drawn some
 conclusions before going the pragmatic way.

 Kindly

 //Marcus


 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Scaling out/up or a mix

2009-06-26 Thread Marcus Herou
Hi.

We have a deployment of 10 Hadoop servers and I now need more mapping
capability (no, not just add more mappers per instance) since I have so many
jobs running. Now I am wondering what I should aim for...
Memory, CPU or disk... How long is a rope, perhaps you would say ?

A typical server is currently using about 15-20% CPU today on a quad-core
2.4GHz 8GB RAM machine with 2 RAID1 SATA 500GB disks.

Some specs below.
 mpstat 2 5
Linux 2.6.24-19-server (mapreduce2) 06/26/2009

11:36:13 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
11:36:15 PM  all  22.82   0.00  3.24     1.37  0.62   2.49    0.00  69.45  8572.50
11:36:17 PM  all  13.56   0.00  1.74     1.99  0.62   2.61    0.00  79.48  8075.50
11:36:19 PM  all  14.32   0.00  2.24     1.12  1.12   2.24    0.00  78.95  9219.00
11:36:21 PM  all  14.71   0.00  0.87     1.62  0.25   1.75    0.00  80.80  8489.50
11:36:23 PM  all  12.69   0.00  0.87     1.24  0.50   0.75    0.00  83.96  5495.00
Average:     all  15.62   0.00  1.79     1.47  0.62   1.97    0.00  78.53  7970.30

What I am thinking is... Is it wiser to go for many of these cheap boxes
with 8GB of RAM, or should I for instance focus on machines which can give
more I/O throughput ?

I know that these things are hard, but perhaps someone has drawn some
conclusions before going the pragmatic way.

Kindly

//Marcus


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/


Re: Very asymmetric data allocation

2009-04-08 Thread Marcus Herou
Great thanks for the info!

Right after I finished my last question I started to think about how Hadoop
measures data allocation. Are the figures presented actually the size of
HDFS on each machine, or the amount of disk allocated as measured by issuing
something like df ?

The reason why I am asking is that df -h is quite close to the figures
presented in the GUI, but it could be a coincidence.

//Marcus

On Tue, Apr 7, 2009 at 4:02 PM, Koji Noguchi knogu...@yahoo-inc.com wrote:

 Marcus,

 One known issue in 0.18.3 is HADOOP-5465.

 CopyPaste from
 https://issues.apache.org/jira/browse/HADOOP-4489?focusedCommentId=12693956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12693956

 Hairong said:
  This bug might be caused by HADOOP-5465. Once a datanode hits
 HADOOP-5465, NameNode sends an empty replication request to the data
 node on every reply to a heartbeat, thus not a single scheduled block
 deletion request can be sent to the data node.

 (Also, if you're always writing from one of the nodes, that node is more
 likely to get full.)



 Nigel, not sure if this is the issue, but it would be nice to have
 0.18.4 out.


 Koji



 -Original Message-
 From: Marcus Herou [mailto:marcus.he...@tailsweep.com]
 Sent: Tuesday, April 07, 2009 12:45 AM
 To: hadoop-u...@lucene.apache.org
 Subject: Very assymetric data allocation

 Hi.

 We are running Hadoop 0.18.3 and noticed a strange issue when one of our
 machines went out of disk yesterday.
 If you can see the table below it would display that the server
 mapredcoord is 66.91% allocated and the others are almost empty.
 How can that be ?

 Any information about this would be very helpful.

 mapredcoord is as well our jobtracker.

 //Marcus

 Node         Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
 mapredcoord  2             In Service   416.69     66.91     90.94           19806
 mapreduce2   2             In Service   416.69     6.71      303.5           4456
 mapreduce3   2             In Service   416.69     0.44      351.69          3975
 mapreduce4   0             In Service   416.69     0.25      355.82          1549
 mapreduce5   2             In Service   416.69     0.42      347.68          3995
 mapreduce6   0             In Service   416.69     0.43      352.7           3982
 mapreduce7   0             In Service   416.69     0.5       351.91          4079
 mapreduce8   1             In Service   416.69     0.48      350.15          4169


 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/
 http://blogg.tailsweep.com/




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Very asymmetric data allocation

2009-04-07 Thread Marcus Herou
Hi.

We are running Hadoop 0.18.3 and noticed a strange issue when one of our
machines ran out of disk yesterday.
As you can see in the table below, the server mapredcoord is 66.91%
allocated while the others are almost empty.
How can that be ?

Any information about this would be very helpful.

mapredcoord is also our jobtracker.

//Marcus

Node         Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
mapredcoord  2             In Service   416.69     66.91     90.94           19806
mapreduce2   2             In Service   416.69     6.71      303.5           4456
mapreduce3   2             In Service   416.69     0.44      351.69          3975
mapreduce4   0             In Service   416.69     0.25      355.82          1549
mapreduce5   2             In Service   416.69     0.42      347.68          3995
mapreduce6   0             In Service   416.69     0.43      352.7           3982
mapreduce7   0             In Service   416.69     0.5       351.91          4079
mapreduce8   1             In Service   416.69     0.48      350.15          4169


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Lingering TaskTracker$Child

2009-01-25 Thread Marcus Herou
Hi.

Today I noticed, when I ran a Solr indexing job through our Hadoop cluster,
that the master MySQL database was screaming about Too Many Connections.

I wondered how that could happen so I logged into my Hadoop machines and
searched through the logs. Nothing strange there. Then I just did a jps:

r...@mapreduce1:~# jps
10701 TaskTracker$Child
9567 NameNode
5435 TaskTracker$Child
31801 Bootstrap
7349 TaskTracker$Child
6197 TaskTracker$Child
7761 TaskTracker$Child
10453 TaskTracker$Child
11232 TaskTracker$Child
3 TaskTracker$Child
9688 DataNode
10877 TaskTracker$Child
6504 TaskTracker$Child
10236 TaskTracker$Child
9852 TaskTracker
6515 TaskTracker$Child
11396 TaskTracker$Child
11741 Jps
6191 TaskTracker$Child
10981 TaskTracker$Child
7742 TaskTracker$Child
5946 TaskTracker$Child
11315 TaskTracker$Child
8112 TaskTracker$Child
11580 TaskTracker$Child
11490 TaskTracker$Child
5687 TaskTracker$Child
5927 TaskTracker$Child
27144 WrapperSimpleApp
7368 TaskTracker$Child

Damn! Each child has its own DataSource (DBCP pool), tweaked down so it can
only have one active connection to any shard at any time.
Background: I ran out of connections during the Christmas holidays, since I
have 60 shards (10 per MySQL machine) and each required a DB pool which
allowed too many active+idle connections.

Anyway, I have no active jobs at the moment, so the children should have died
by themselves.
Fortunately I have a nice little script which kills the bastards:
jps | egrep "TaskTracker.+" | awk '{print $1}' | xargs kill
I will probably put that in a cronjob which kills long-running children...

Anyway, how can this happen ? Am I doing something really stupid along the
way ?
Hard facts:
Ubuntu Hardy-Heron, 2.6.24-19-server
java version 1.6.0_06
Hadoop-0.18.2
It's my own classes which fire the jobs through JobClient
(JobClient.runJob(job)).
I feed the jar to Hadoop by issuing job.setJar(jarFile); (comes from a bash
script).
I feed deps into Hadoop by issuing job.set("tmpjars", jarFiles); (comes from
parsing the external CLASSPATH env in bash).

The client does not complain, see example output below (I write no data to
HDFS (HDFS bytes written=774), since I mostly use it for crawling and all
crawlers/indexers access my sharded DB structure directly without
intermediate storage):
2009-01-25 17:12:11.175 INFO main org.apache.hadoop.mapred.FileInputFormat -
Total input paths to process : 1
2009-01-25 17:12:11.176 INFO main org.apache.hadoop.mapred.FileInputFormat -
Total input paths to process : 1
2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
Running job: job_200901251629_0011
2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -  map
0% reduce 0%
2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -  map
6% reduce 0%
2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -  map
21% reduce 0%
2009-01-25 17:12:45.502 INFO main org.apache.hadoop.mapred.JobClient -  map
31% reduce 0%
2009-01-25 17:12:50.511 INFO main org.apache.hadoop.mapred.JobClient -  map
51% reduce 0%
2009-01-25 17:12:55.520 INFO main org.apache.hadoop.mapred.JobClient -  map
67% reduce 0%
2009-01-25 17:13:00.533 INFO main org.apache.hadoop.mapred.JobClient -  map
72% reduce 0%
2009-01-25 17:13:05.543 INFO main org.apache.hadoop.mapred.JobClient -  map
84% reduce 0%
2009-01-25 17:13:10.552 INFO main org.apache.hadoop.mapred.JobClient -  map
95% reduce 0%
2009-01-25 17:13:15.560 INFO main org.apache.hadoop.mapred.JobClient -  map
98% reduce 0%
2009-01-25 17:13:20.568 INFO main org.apache.hadoop.mapred.JobClient - Job
complete: job_200901251629_0011
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
Counters: 7
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
File Systems
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
HDFS bytes read=2741143
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
HDFS bytes written=774
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -   Job
Counters
2009-01-25 17:13:20.570 INFO main org.apache.hadoop.mapred.JobClient -
Rack-local map tasks=9
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Launched map tasks=9
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map-Reduce Framework
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map input records=48314
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map input bytes=2732424
2009-01-25 17:13:20.571 INFO main org.apache.hadoop.mapred.JobClient -
Map output records=0

Any suggestions or pointers would be greatly appreciated. Hmm, come to think
of it: I start X threads from inside Hadoop, almost cut-and-pasted from
Nutch.
If a thread somehow lingered, would Hadoop then not be able to shut down,
even though there is nothing more to read from the RecordReader ?

Kindly

//Marcus
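A defensive sketch related to the connection-pool symptom above (independent of
why the children linger): shut the per-JVM pool down when the map task
finishes, so a lingering child at least holds no database connections.
BasicDataSource is the commons-dbcp pool class; the field and its
initialisation in configure() are placeholders.

    import java.io.IOException;
    import org.apache.commons.dbcp.BasicDataSource;
    import org.apache.hadoop.mapred.MapReduceBase;

    public abstract class DbAwareMapperBase extends MapReduceBase {
      protected BasicDataSource dataSource;   // created in configure(), one per child JVM

      @Override
      public void close() throws IOException {
        if (dataSource != null) {
          try {
            dataSource.close();               // releases idle and active connections
          } catch (Exception e) {
            throw new IOException("Failed to close connection pool: " + e.getMessage());
          }
        }
      }
    }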

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312

Re: Lingering TaskTracker$Child

2009-01-25 Thread Marcus Herou
0.78037554% 4094 docs, 0 errors, 107.7 docs/s
2009-01-25 17:49:10.115 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_04_0
0.84660184% 4448 docs, 0 errors, 108.5 docs/s
2009-01-25 17:49:13.117 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_04_0
0.9146803% 4810 docs, 0 errors, 109.3 docs/s
2009-01-25 17:49:16.118 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_04_0
0.9521195% 5044 docs, 0 errors, 107.3 docs/s
2009-01-25 17:49:18.626 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_04_0
1.0% 5364 docs, 0 errors, 107.3 docs/s
2009-01-25 17:49:18.631 INFO IPC Server handler 0 on 32274
org.apache.hadoop.mapred.TaskTracker - Task
attempt_200901251629_0012_m_04_0 is done.
2009-01-25 17:49:18.631 INFO main org.apache.hadoop.mapred.TaskRunner - Task
'attempt_200901251629_0012_m_04_0' done.
2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_04_0
1.0% 5364 docs, 0 errors, 107.3 docs/s
2009-01-25 17:49:19.120 INFO IPC Server handler 1 on 32274
org.apache.hadoop.mapred.TaskTracker - attempt_200901251629_0012_m_04_0
Ignoring status-update since task is 'done'
2009-01-25 17:49:35.582 INFO taskCleanup
org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
job_200901251629_0012
2009-01-25 17:49:35.582 INFO taskCleanup org.apache.hadoop.mapred.TaskRunner
- attempt_200901251629_0012_m_04_0 done; removing files.

# Still processes left even though the TaskTracker said: Received
'KillJobAction' for job: job_200901251629_0012
r...@mapreduce2:~# jps
10732 Jps
10634 TaskTracker$Child
8660 DataNode
8824 TaskTracker
8730 SecondaryNameNode
25060 Bootstrap
r...@mapreduce2:~# date
Sun Jan 25 17:51:48 CET 2009
r...@mapreduce2:~#

On Sun, Jan 25, 2009 at 5:42 PM, Marcus Herou marcus.he...@tailsweep.com wrote:

 Hi.

 Today I noticed when I ran a Solr Indexing job through our Hadoop cluster
 that the master MySQL database where screaming about Too Many Connections.

 I wondered how that could happen so I logged into my Hadoop machines and
 searched through the logs. Nothing strange there. Then I just did a jps:

 r...@mapreduce1:~# jps
 10701 TaskTracker$Child
 9567 NameNode
 5435 TaskTracker$Child
 31801 Bootstrap
 7349 TaskTracker$Child
 6197 TaskTracker$Child
 7761 TaskTracker$Child
 10453 TaskTracker$Child
 11232 TaskTracker$Child
 3 TaskTracker$Child
 9688 DataNode
 10877 TaskTracker$Child
 6504 TaskTracker$Child
 10236 TaskTracker$Child
 9852 TaskTracker
 6515 TaskTracker$Child
 11396 TaskTracker$Child
 11741 Jps
 6191 TaskTracker$Child
 10981 TaskTracker$Child
 7742 TaskTracker$Child
 5946 TaskTracker$Child
 11315 TaskTracker$Child
 8112 TaskTracker$Child
 11580 TaskTracker$Child
 11490 TaskTracker$Child
 5687 TaskTracker$Child
 5927 TaskTracker$Child
 27144 WrapperSimpleApp
 7368 TaskTracker$Child

 Damn! Each Child have it's own DataSource (dbcp pool) tweaked down so it
 only can have one active connection to any shard at any time.
 Background: I ran out of connections during the Christmas holidays since I
 have 60 shards (10 per MySQL machine) and each required a DB-Pool which
 allowed too many active+idle connections.

 Anyway I have no active jobs at the moment so the children should have died
 by themselves.
 Fortunately I have a little nice script which kills the bastards: jps
 |egrep TaskTracker.+ | awk '{print $1}'|xargs kill
 I will probably put that in a cronjob which kills long running children...

 Anyway, how can this happen ? Am I doing something really stupid along the
 way ?
 Hard facts:
 Ubuntu Hardy-Heron, 2.6.24-19-server
 java version 1.6.0_06
 Hadoop-0.18.2
 It's my own classes which fires the jobs through JobClient
 (JobClient.runJob(job))
 I feed the jar to hadoop by issuing: job.setJar(jarFile); (comes from a
 bash script)
 I feed deps into hadoop by issuing: job.set(tmpjars, jarFiles); (comes by
 parsing external CLASSPATH ENV in bash)

 The client do not complain, se example output below (I write no data to
 HDFS ((HDFS bytes written=774)), since I mostly use it for crawling and all
 crawlers/indexers access my sharding db structure directly without
 intermediate storage):
 2009-01-25 17:12:11.175 INFO main org.apache.hadoop.mapred.FileInputFormat
 - Total input paths to process : 1
 2009-01-25 17:12:11.176 INFO main org.apache.hadoop.mapred.FileInputFormat
 - Total input paths to process : 1
 2009-01-25 17:12:11.437 INFO main org.apache.hadoop.mapred.JobClient -
 Running job: job_200901251629_0011
 2009-01-25 17:12:12.439 INFO main org.apache.hadoop.mapred.JobClient -  map
 0% reduce 0%
 2009-01-25 17:12:35.481 INFO main org.apache.hadoop.mapred.JobClient -  map
 6% reduce 0%
 2009-01-25 17:12:40.493 INFO main org.apache.hadoop.mapred.JobClient -  map
 21% reduce

Re: Lingering TaskTracker$Child

2009-01-25 Thread Marcus Herou
Thanks!

So by your experience, would this be good enough ? (Notice the System.exit(0).)


I implement the MapRunnable interface.

 public void run(RecordReader<LongWritable, Text> recordReader,
         OutputCollector<WritableComparable, WritableComparable> outputCollector,
         Reporter reporter) throws IOException
 {
     this.recordReader = recordReader;
     this.outputCollector = outputCollector;
     this.reporter = reporter;
     int threads =
         Integer.valueOf(this.getConf().get(getClass().getName() + ".threads", "10"));
     log.info("Starting with " + threads + " threads");
     long timeout =
         Long.valueOf(this.getConf().get(getClass().getName() + ".timeout", "60"));

     // spawn the fetcher threads
     for (int i = 0; i < threads; i++)
     {
         new FetcherThread().start();
     }
     // wait for the threads to exit
     do
     {
         try
         {
             Thread.sleep(1000);
         } catch (InterruptedException e) {}

         reportStatus();

         // some requests seem to hang, despite all intentions
         synchronized (this)
         {
             if ((System.currentTimeMillis() - lastRequestStart) > timeout)
             {
                 if (log.isWarnEnabled())
                 {
                     log.warn("Aborting with " + activeThreads + " hung threads.");
                 }
                 return;
             }
         }

     } while (activeThreads > 0);
     log.info("All threads seem to be done, exiting");
     System.exit(0);
 }
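Complementary to the explicit System.exit(0) above: threads that must never
block JVM shutdown can be marked as daemon threads before start(), which is
exactly the condition jason describes below. Fragment for illustration only,
using the FetcherThread class from the code above.

    // Fragment: FetcherThread is the thread class spawned in the run() method above.
    FetcherThread t = new FetcherThread();
    t.setDaemon(true);   // the JVM may now exit even if this thread hangs on a slow host
    t.start();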

On Sun, Jan 25, 2009 at 5:57 PM, jason hadoop jason.had...@gmail.com wrote:

 We had trouble like that with some jobs, when the child ran additional
 threads that were not set as daemon threads. These keep the Child JVM from
 exiting.
 JMX was the cause in our case, but we have seen our JNI jobs do it also.
 In the end we made a local mod that forced a System.exit in the finally
 block of the Child main.


 On Sun, Jan 25, 2009 at 8:53 AM, Marcus Herou marcus.he...@tailsweep.com
 wrote:

  Some extra info, apparently the child exits with a status of 143.
 
  2009-01-25 17:13:11.110 INFO IPC Server handler 0 on 32274
  org.apache.hadoop.mapred.TaskTracker -
 attempt_200901251629_0011_m_05_0
  1.0% 5364 docs, 0 errors, 124.7 docs/s
  2009-01-25 17:13:11.114 INFO IPC Server handler 1 on 32274
  org.apache.hadoop.mapred.TaskTracker - Task
  attempt_200901251629_0011_m_05_0 is done.
  2009-01-25 17:13:11.116 INFO main org.apache.hadoop.mapred.TaskRunner -
  Task
  'attempt_200901251629_0011_m_05_0' done.
  2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
  org.apache.hadoop.mapred.TaskTracker -
 attempt_200901251629_0011_m_05_0
  1.0% 5364 docs, 0 errors, 124.7 docs/s
  2009-01-25 17:13:12.644 INFO IPC Server handler 0 on 32274
  org.apache.hadoop.mapred.TaskTracker -
 attempt_200901251629_0011_m_05_0
  Ignoring status-update since task is 'done'
  2009-01-25 17:13:24.996 INFO taskCleanup
  org.apache.hadoop.mapred.TaskTracker - Received 'KillJobAction' for job:
  job_200901251629_0011
  2009-01-25 17:13:24.996 INFO taskCleanup
  org.apache.hadoop.mapred.TaskRunner
  - attempt_200901251629_0011_m_05_0 done; removing files.
  2009-01-25 17:47:22.668 WARN Thread-23
 org.apache.hadoop.mapred.TaskRunner
  -
  attempt_200901251629_0001_m_06_0 Child Error
  java.io.IOException: Task process exit with nonzero status of 143.
 at
 org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
  2009-01-25 17:47:22.669 WARN Thread-23
 org.apache.hadoop.mapred.TaskTracker
  - Error from unknown child task: attempt_200901251629_0001_m_06_0.
  Ignored.
  2009-01-25 17:47:22.671 WARN Thread-23
 org.apache.hadoop.mapred.TaskTracker
  - Unknown child task finshed: attempt_200901251629_0001_m_06_0.
  Ignored.
  2009-01-25 17:47:22.713 WARN Thread-79
 org.apache.hadoop.mapred.TaskRunner
  -
  attempt_200901251629_0002_m_07_0 Child Error
  java.io.IOException: Task process exit with nonzero status of 143.
 at
 org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
  2009-01-25 17:47:22.713 WARN Thread-159
 org.apache.hadoop.mapred.TaskRunner
  - attempt_200901251629_0011_m_05_0 Child Error
  java.io.IOException: Task process exit with nonzero status of 143.
 at
 org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
  2009-01-25 17:47:22.713 WARN Thread-79
 org.apache.hadoop.mapred.TaskTracker
  - Error from unknown child task: attempt_200901251629_0002_m_07_0.
  Ignored.
  2009-01-25 17:47:22.714 WARN Thread-159
  org.apache.hadoop.mapred.TaskTracker
  - Error from unknown child task: attempt_200901251629_0011_m_05_0.
  Ignored.
  2009

DataNode/TaskTracker memory constraints.

2008-12-15 Thread Marcus Herou
Hi.

All Hadoop components are started with -Xmx1000M by default. I am planning to
throw in some data/task nodes here and there in my arch. However, most
machines have only 4G physical RAM, so allocating 2G + overhead (~2.5G) to
Hadoop is a little risky, since they could very well become inaccessible if
it needs to compete with other processes for RAM. I have experienced this
many times with Java processes going haywire where I run other services in
parallel.
Anyway, I would like to understand the reasoning behind having 1G allocated
per process. I figure that the DataNode could survive with a little less, as
could the TaskTracker if the jobs running in it do not consume much memory.
Of course each process would like to have even more memory than 1G, but if I
need to cut down I would like to know which to cut and what I lose by doing
so.

Any thoughts ? Trial and error is of course an option, but I would like to
hear the basic thoughts about how memory should be utilized to get the most
out of the boxes.

Kindly

//Marcus





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


NumMapTasks and NumReduceTasks with MapRunnable

2008-12-13 Thread Marcus Herou
Hi.

We are finally in the beta stage with our crawler and have tested it with a
few hundred thousand URLs. However, it performs worse than if we would run it
on a local machine without connecting to a Hadoop JobTracker.
Each crawl job is fairly similar to a Nutch Fetcher job, which spawns X
threads that all read the same RecordReader and start to fetch the current
URL assigned.
However, I am not able to utilize all our 9 machines at the same time, which
is really preferable since this is an externally IO-bound job (remote
servers).

How can I, with a crawl list of just 9 URLs (stupidly small, I know), make
sure that all machines are used at least once ?
With a crawl list of 900, how can I make sure at least 100 are crawled at the
same time across all machines ?
And so on with much bigger crawl lists (which is why we need Hadoop anyway).

Just as I write this I launched a job where I manually set numMapTasks to 9,
and it seems to be fruitful, quite a fast crawl actually :) However, I wonder
if this is how I should think with all MapRunnables ?
The next job we call is PersistOutLinks, and yep, it goes through a massive
list of source-target links and saves them in a DB.

This list is of a magnitude of at least 100 times larger than the Fetcher
list. Is it still smart to hardcode a value of 9 for numMapTasks for this
MapRunnable job ? Or should I create some form of InputFormat.getInputSplits
based on the crawl/outlink sizes ? Of course the numMapTasks are not
hardcoded, but they are injected into the Configuration based on a properties
file.

Kindly

//Marcus
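A hedged sketch for the question above: setNumMapTasks() is only a hint that is
passed to InputFormat.getSplits(), so whether 9 URLs really become 9 map tasks
(one per machine) depends on the InputFormat producing that many splits, for
example via a custom getSplits() that chops the URL list into one piece per
node. The thread-count property name below is a placeholder.

    import org.apache.hadoop.mapred.JobConf;

    public class FetcherJobSetup {
      public static void configure(JobConf job, int numNodes) {
        // Ask for one split (and hence one map task) per machine; the InputFormat
        // still has the final word on how many splits are actually produced.
        job.setNumMapTasks(numNodes);

        // Threads per map task, read by the MapRunnable's run() method.
        job.setInt("crawler.fetcher.threads", 10);   // property name is a placeholder
      }
    }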





-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Parhely (ORM for HBase) released!

2008-09-07 Thread Marcus Herou
Hi guys.

Finally I released the first draft of an ORM for HBase named Parhely.
Check it out at http://dev.tailsweep.com/

Kindly

//Marcus


Perfect sysctl

2008-08-11 Thread Marcus Herou
Hi.

Just wondering if someone has found some good Linux settings for I/O-intensive
workloads with Hadoop.

Since most use cases with Hadoop are I/O-bound, and since it uses the network
frequently, I guess that the TCP buffers and kernel buffers should be tweaked
(even with CPU-bound load).

I also guess that you should choose an I/O scheduler like deadline or perhaps
cfq.

We will use many of the tricks found here:
http://www.gluster.org/docs/index.php/Guide_to_Optimizing_GlusterFS

Kindly

//Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/