Re: return a parameter using Map only

2010-09-23 Thread cliff palmer
Shi, you can certainly use an in-memory store like memcached, but it's a lot
of work to set that up just to avoid two I/O operations.
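If you do go that route, something along these lines might work: run a map-only
job over the reduce output, and whichever mapper finds the record drops it into
memcached for the driver to read back. This is only a rough sketch against the
0.20 mapreduce API; the spymemcached client, the cachehost:11211 address, the
"match:" key naming, the class names, and the assumption that each record is a
tab-separated "key<TAB>double" line are all mine, not anything from your setup.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import net.spy.memcached.MemcachedClient;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RecordLookup {

  // Map-only search: each mapper scans its split of the reduce output and,
  // if it sees the wanted key, stores the whole line in memcached.
  public static class SearchMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private String toSearch;
    private MemcachedClient cache;

    @Override
    protected void setup(Context context) throws IOException {
      toSearch = context.getConfiguration().get("lookup.key");
      // "cachehost", 11211 is a placeholder for your memcached server.
      cache = new MemcachedClient(new InetSocketAddress("cachehost", 11211));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
      // Assumes the reduce output is one "key<TAB>doubleValue" record per line.
      if (line.toString().startsWith(toSearch + "\t")) {
        cache.set("match:" + toSearch, 3600, line.toString());
      }
    }

    @Override
    protected void cleanup(Context context) {
      // Give the queued write a chance to flush before the task exits.
      cache.shutdown(10, TimeUnit.SECONDS);
    }
  }

  // Runs the search job, then reads the matching line back from memcached.
  public static String seekResultByMapper(String toSearch, Path reduceFile)
      throws Exception {
    Configuration conf = new Configuration();
    conf.set("lookup.key", toSearch);

    Job job = new Job(conf, "record lookup");
    job.setJarByClass(RecordLookup.class);
    job.setMapperClass(SearchMapper.class);
    job.setNumReduceTasks(0);                        // map-only, no reducer
    job.setOutputFormatClass(NullOutputFormat.class);
    FileInputFormat.addInputPath(job, reduceFile);
    job.waitForCompletion(true);

    MemcachedClient cache =
        new MemcachedClient(new InetSocketAddress("cachehost", 11211));
    try {
      return (String) cache.get("match:" + toSearch);
    } finally {
      cache.shutdown();
    }
  }
}

Your anotherClass would then just call RecordLookup.seekResultByMapper(c, d) and
split the returned line on the tab to get the String and the double.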
HTH
Cliff

On Wed, Sep 22, 2010 at 5:06 PM, Shi Yu sh...@uchicago.edu wrote:

 Dear Hadoopers,

 I am stuck on a probably very simple problem but can't figure it out. In
 the Hadoop Map/Reduce framework, I want to search a huge file (which is
 generated by another Reduce task) for a unique line of record (actually a
 String, double pair). That record is expected to be passed to another
 function. I have read the previous post about using Mapper-only output to
 HBase (
 http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg06579.html)
 and another post (
 http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg07337.html).
 They are both very interesting; however, I am still confused about how to
 avoid writing to HBase and instead use the returned record directly from
 memory. I guess my problem doesn't need a reducer, so basically I want to
 load-balance the search task across multiple Mappers. I want to have
 something like this:

 class myClass
   method seekResultbyMapper(String toSearch, Path reduceFile)
     call Map(a, b)
     do some simple calculation
     return (String, double) result

 class anotherClass
   (String, double) para = myClass.seekResultbyMapper(c, d)


 I don't know whether this is doable (maybe it is not a valid pattern in the
 Map/Reduce framework). How would I implement it using the Java API? Thanks
 in advance for any suggestions.


 Best Regards,

 Shi

 --
 Postdoctoral Scholar
 Institute for Genomics and Systems Biology
 Department of Medicine, the University of Chicago
 Knapp Center for Biomedical Discovery
 900 E. 57th St. Room 10148
 Chicago, IL 60637, US
 Tel: 773-702-6799




Re: Shuffle tasks getting killed

2010-09-23 Thread cliff palmer
Aniket, I wonder if these tasks were run as Speculative Execution.  Have you
been able to determine whether the job runs successfully?
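If it does turn out to be speculative execution and you want to compare runs
with it switched off, it can be disabled per job. A minimal sketch with the
old-style JobConf API (the class names here are just placeholders):

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculation {
  // Sketch only: build a JobConf with speculative execution disabled, so no
  // backup attempts (which later show up as killed tasks) are launched.
  public static JobConf configure(Class<?> driverClass) {
    JobConf conf = new JobConf(driverClass);
    conf.setMapSpeculativeExecution(false);
    conf.setReduceSpeculativeExecution(false);
    return conf;
  }
}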
HTH
Cliff

On Thu, Sep 23, 2010 at 12:52 AM, aniket ray aniket@gmail.com wrote:

 Hi,

 I continuously run a series of batch jobs using Hadoop MapReduce. I also
 have a managing daemon that moves data around on HDFS, making way for
 more jobs to be run.
 I use the capacity scheduler to schedule many jobs in parallel.

 I see an issue on the Hadoop web monitoring UI at port 50030 which I
 believe may be causing a performance bottleneck, and I wanted to get more
 information.

 Approximately 10% of the reduce tasks show up as Killed in the UI. The
 logs say that the killed tasks are in the shuffle phase when they are
 killed, but the logs don't show any exception.
 My understanding is that these killed tasks would be started again and
 this slows down the whole Hadoop job.
 I was wondering what the possible issues might be and how to debug this?

 I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
 Yahoo's GitHub.
 I've monitored the nodes and there is a lot of free disk space and memory
 on all nodes (more than 1 TB of free disk and 5 GB of free memory at all
 times on all nodes).

 Since there are no exceptions or other visible issues, I am finding it
 hard to figure out what the problem might be. Could anybody help?

 Thanks,
 -aniket



Re: Xcievers Load

2010-09-23 Thread Todd Lipcon
On Thu, Sep 23, 2010 at 7:05 AM, Michael Segel
michael_se...@hotmail.com wrote:


 4000 xcievers is a lot.

 I'm wondering if there's a correlation between the number of xcievers and 
 ulimit -n. Should they be configured on a 1 to 1 ratio?


2:1 ratio of file descriptors to xceivers. 4000 xceivers is quite
normal on a heavily loaded HBase cluster in my experience.

The cost is the RAM of the extra threads, but there's not much you can
do about it, given the current design of the datanode.
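For reference, the setting in question lives in hdfs-site.xml on each datanode
and looks roughly like this (8192 is only an example value, and the property
name really is spelled that way):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>

With the 2:1 rule of thumb above, ulimit -n for the datanode user should then
be at least twice whatever value you pick.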

-Todd

 -Mike

 Date: Thu, 23 Sep 2010 08:04:40 -0400
 Subject: Xcievers Load
 From: marnan...@gmail.com
 To: u...@hbase.apache.org; common-user@hadoop.apache.org

 Hi,
   We have a job that writes many small files (using MultipleOutputFormat)
  and it's exceeding the 4000 xcievers that we have configured. What is the
 effect on the cluster of increasing this count to some higher number?
 Many thanks,
    Martin

  P.S.: HBase is also running on the cluster.




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Xcievers Load

2010-09-23 Thread Andrew Purtell
 From: Todd Lipcon t...@cloudera.com
[...]
  4000 xcievers is a lot.
 
 2:1 ratio of file descriptors to xceivers. 4000 xceivers is
 quite normal on a heavily loaded HBase cluster in my experience.

We run with 10K xceivers...

The problem is the pain is not quite high enough to devote months to making 
what amounts to a new DataNode; though it is high. We'll reach a tipping point 
when someone sets up a 1000+ node HBase cluster I expect.

Best regards,

- Andy



  



Re: Xcievers Load

2010-09-23 Thread Martin Arnandze
Thanks everyone, we're currently testing with 10K and no issues so far.

On Thu, Sep 23, 2010 at 2:57 PM, Andrew Purtell apurt...@apache.org wrote:

  From: Todd Lipcon t...@cloudera.com
 [...]
   4000 xcievers is a lot.
 
  2:1 ratio of file descriptors to xceivers. 4000 xceivers is
  quite normal on a heavily loaded HBase cluster in my experience.

 We run with 10K xceivers...

 The problem is the pain is not quite high enough to devote months to making
 what amounts to a new DataNode; though it is high. We'll reach a tipping
 point when someone sets up a 1000+ node HBase cluster I expect.

 Best regards,

- Andy








Re: Xcievers Load

2010-09-23 Thread Allen Wittenauer

On Sep 23, 2010, at 11:57 AM, Andrew Purtell wrote:

 From: Todd Lipcon t...@cloudera.com
 [...]
 4000 xcievers is a lot.
 
 2:1 ratio of file descriptors to xceivers. 4000 xceivers is
 quite normal on a heavily loaded HBase cluster in my experience.
 
 We run with 10K xceivers...
 
 The problem is the pain is not quite high enough to devote months to making 
 what amounts to a new DataNode; though it is high. We'll reach a tipping 
 point when someone sets up a 1000+ node HBase cluster I expect.

I hope if/when the datanode is rewritten, they spell check the parameters.



Relation between number of map tasks and input splits

2010-09-23 Thread Farhan Husain
Hello,

Can a map task work on more than one input split? I am using hadoop-0.20.1
and in my map method I need to know the name of the file I am getting input
from. I use the following code to get that:

String inputFile = ((FileSplit)
context.getInputSplit()).getPath().getName();

If a map works on only one input split, then I can have that code in the
setup() method, which would be more efficient when handling a large amount
of data. Otherwise, I have to put the code in the map() method, but this
would slow me down because I have to do it for every input key-value pair. I
have gone through the following two pages but did not get a clear picture:

http://wiki.apache.org/hadoop/HadoopMapReduce
http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Thanks,
Farhan


Re: Relation between number of map tasks and input splits

2010-09-23 Thread Greg Roelofs
 Can a map task work on more than one input split?

As far as I can tell from reading the code, no (at least, not yet).  Code
such as createCache() in JobInProgress implicitly assumes a one-to-one mapping
between maps[] and splits[].

MR-1220 (small-jobs combo task optimization) will change that in some sense,
but fundamentally, the correspondence between maps and splits is pretty well
baked in, I believe.  (In fact, I'm pretty sure splits are created based on
some goal for the number of maps--i.e., maps and splits are one-to-one almost
by definition.)

I might be wrong about all this, of course. :-)
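
Assuming the one-to-one mapping does hold, caching the file name once per task
in setup() should be safe. A rough sketch against the 0.20 mapreduce API (the
class name, key/value types, and the output written here are just placeholders):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String inputFile;

  @Override
  protected void setup(Context context) {
    // One split per map task, so this is resolved exactly once.
    inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // inputFile is already known here; no per-record getInputSplit() call.
    context.write(new Text(inputFile), value);
  }
}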

Greg


Re: Questions about BN and CN

2010-09-23 Thread Konstantin Shvachko

Hi Shen,

Why do we need the CheckpointNode?
1. First of all, it is a compatible replacement for the SecondaryNameNode.
2. Checkpointing is also needed to periodically compact the edits log.
You can do it with a CN or a BN, but the CN is more lightweight.
I assume there could be cases where streaming edits to the BN over the network
is slower than writing them to disk, so you might want to turn the BN off for
performance reasons.
3. Also, in the current implementation the NN allows only one BN, but multiple
CNs. So if the single BN dies, checkpointing will stall.
You can prevent that by starting two CNs instead, or one BN and one CN.
But I agree with you that the CN is just a subset of the BN in terms of
functionality.

Thanks,
Konstantin

On 9/22/2010 5:50 PM, ChingShen wrote:

Thanks Konstantin,

   But my main question is: since the CN can only provide an old state of the
namespace, why do we need it? I think the BN is the best solution.

Shen

On Thu, Sep 23, 2010 at 5:20 AM, Konstantin Shvachkos...@yahoo-inc.comwrote:


The CheckpointNode creates checkpoints of the namespace, but does not keep
an up-to-date state of the namespace in memory.
If the primary NN fails, the CheckpointNode can only provide an old state of
the namespace, created during the latest checkpoint.
Also, the CheckpointNode is a replacement for the SecondaryNameNode of earlier
releases.

The BackupNode does checkpoints too, but in addition keeps an up-to-date state
of the namespace in its memory.
When the primary NN dies, you can ask the BackupNode to save its namespace,
which will create an up-to-date image. Then you can either start an NN instead
of the BN on the node the BN was running on, using that saved image directly,
or start an NN on a different node using importCheckpoint from the saved image
directory.

See the guide here.

http://hadoop.apache.org/hdfs/docs/r0.21.0/hdfs_user_guide.html#Checkpoint+Node

Thanks,
--Konstantin


On 9/8/2010 11:36 PM, ChingShen wrote:


Hi all,

  I understand that the Backup node (BN) takes on all the checkpoint
responsibilities and maintains an up-to-date namespace state, which is always
in sync with the active NN.

  Q1. In which situations do we need a CN?

  Q2. If the NameNode machine fails, how does the required manual
intervention differ between a BN and a CN?

Thanks.

Shen










Re: Shuffle tasks getting killed

2010-09-23 Thread aniket ray
Hi Cliff,

Thanks, it did turn out to be speculative execution. When I turned it off, no
more tasks were killed, but performance degraded.

So my initial assumptions were incorrect after all. I guess I'll have to
look at other ways to improve performance.

Thanks for the help.
-aniket

On Thu, Sep 23, 2010 at 5:14 PM, cliff palmer palmercl...@gmail.com wrote:

 Aniket, I wonder if these tasks were run as Speculative Execution.  Have
 you
 been able to determine whether the job runs successfully?
 HTH
 Cliff

 On Thu, Sep 23, 2010 at 12:52 AM, aniket ray aniket@gmail.com wrote:

  Hi,

  I continuously run a series of batch jobs using Hadoop MapReduce. I also
  have a managing daemon that moves data around on HDFS, making way for
  more jobs to be run.
  I use the capacity scheduler to schedule many jobs in parallel.

  I see an issue on the Hadoop web monitoring UI at port 50030 which I
  believe may be causing a performance bottleneck, and I wanted to get more
  information.

  Approximately 10% of the reduce tasks show up as Killed in the UI. The
  logs say that the killed tasks are in the shuffle phase when they are
  killed, but the logs don't show any exception.
  My understanding is that these killed tasks would be started again and
  this slows down the whole Hadoop job.
  I was wondering what the possible issues might be and how to debug this?

  I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
  Yahoo's GitHub.
  I've monitored the nodes and there is a lot of free disk space and memory
  on all nodes (more than 1 TB of free disk and 5 GB of free memory at all
  times on all nodes).

  Since there are no exceptions or other visible issues, I am finding it
  hard to figure out what the problem might be. Could anybody help?

  Thanks,
  -aniket