Re: Can I share datas for several map tasks?

2009-06-16 Thread Hello World
Thanks for your reply. Could you check whether I have set this up correctly?
I modified mapred-default.xml as follows:

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>

Then I ran bin/stop-all.sh; bin/start-all.sh to restart Hadoop.

This is my program:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Static, so the array belongs to the class and is shared within one JVM.
    public static int[] ToBeSharedData = new int[1024 * 1024 * 16];

    @Override
    protected void setup(Context context
                         ) throws IOException, InterruptedException {
      // Init shared data (this runs once per task, not once per JVM)
      ToBeSharedData[0] = 12345;
      System.out.println("setup shared data[0] = " + ToBeSharedData[0]);
    }

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Emit (token, 1) for every whitespace-separated token in the line.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
      System.out.println("read shared data[0] = " + ToBeSharedData[0]);
    }
  }
  // (rest of WordCount not shown)
}

First, how can I confirm that JVM reuse is taking effect? I don't see any
difference from before: top on Linux shows the same number of java
processes and the same memory usage.

Second, how can I make ToBeSharedData be initialized only once and then be
readable by other map tasks on the same node? Or is this not a suitable
programming style for MapReduce?

By the way, I'm using hadoop-0.20.0 in pseudo-distributed mode on a single
node.
Thanks in advance.

On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com wrote:

 snowloong wrote:
  Hi,
  I want to share some data structures among the map tasks on the same node
  (not through files). That is, if one map task has already initialized some
  data structures (e.g. an array or a list), can other map tasks share that
  memory and access it directly? I don't want to reinitialize the data, and
  I want to save some memory. Can Hadoop help me do this?

 You can enable JVM reuse across tasks. See mapred.job.reuse.jvm.num.tasks
 in mapred-default.xml for usage. Then you can cache the data in a static
 variable in your mapper.

 - Sharad



Re: Can I share datas for several map tasks?

2009-06-16 Thread jason hadoop
The example code for my book includes an example of JVM reuse, with static
data shared between tasks that run in the same JVM.

-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Can I share datas for several map tasks?

2009-06-16 Thread Hello World
I can't get your book, so could you describe the solution in a few more
words? I would very much appreciate it.

-snowloong



Re: Can I share datas for several map tasks?

2009-06-16 Thread Iman E
Thank you, Jason. I found the example. So, is there a way to share the same JVM 
between different jobs?





From: jason hadoop jason.had...@gmail.com
To: core-user@hadoop.apache.org
Sent: Tuesday, June 16, 2009 7:22:16 PM
Subject: Re: Can I share datas for several map tasks?

In the download bundle of example code, the package
com.apress.hadoopbook.examples.advancedtechniques contains the class
JVMReuseAndStaticInitializers.java, which demonstrates sharing data between
instances using JVM reuse.

I built this to prove to myself that it was possible.
It never got an actual write-up in the book itself.
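
The class named above is not reproduced here. As a separate, minimal sketch
(not the book's code) of one way to check whether tasks really share a JVM,
each task can log the JVM's name together with a static counter. The names
JvmReuseProbe and recordTaskStart are hypothetical, and no Hadoop classes are
used; in a real job the call would presumably go in the mapper's setup().

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Hadoop or of the book's example bundle.
class JvmReuseProbe {
    // Static, so the count lives as long as the JVM does, not the task.
    static final AtomicInteger tasksInThisJvm = new AtomicInteger();

    // Name reported by the runtime MXBean. It often looks like "pid@hostname",
    // but the exact format is implementation-specific.
    static String jvmName() {
        return ManagementFactory.getRuntimeMXBean().getName();
    }

    // Intended to be called once per task start; returns a line to log.
    static String recordTaskStart() {
        int n = tasksInThisJvm.incrementAndGet();
        return "JVM " + jvmName() + " is running its task #" + n;
    }

    public static void main(String[] args) {
        // Two simulated task starts inside one JVM.
        System.out.println(recordTaskStart());
        System.out.println(recordTaskStart());
    }
}
```

If the task logs only ever show task #1, each task got a fresh JVM. A count
above 1 under the same JVM name suggests the JVM was reused.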


Re: Can I share datas for several map tasks?

2009-06-15 Thread Sharad Agarwal

snowloong wrote:
 Hi,
 I want to share some data structures among the map tasks on the same node
 (not through files). That is, if one map task has already initialized some
 data structures (e.g. an array or a list), can other map tasks share that
 memory and access it directly? I don't want to reinitialize the data, and
 I want to save some memory. Can Hadoop help me do this?

You can enable JVM reuse across tasks. See mapred.job.reuse.jvm.num.tasks in
mapred-default.xml for usage. Then you can cache the data in a static variable
in your mapper.

- Sharad
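
The static-variable caching suggested above could be sketched roughly as
follows. This is a plain-Java illustration under stated assumptions, not
Hadoop code: SharedDataDemo stands in for a mapper, its setup() mimics the
per-task setup hook, and getSharedData and initCount are hypothetical names.
The guard ensures that only the first task in a given JVM pays for the
initialization.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a mapper class; no Hadoop classes are used in this sketch.
class SharedDataDemo {
    // Counts how often the initialization actually ran in this JVM.
    static final AtomicInteger initCount = new AtomicInteger();

    // Static fields belong to the class, so they outlive any single task instance.
    private static int[] sharedData;

    // Builds the array on first use only; synchronized so concurrent callers
    // cannot initialize it twice.
    static synchronized int[] getSharedData() {
        if (sharedData == null) {
            int[] data = new int[1024];
            data[0] = 12345;
            initCount.incrementAndGet();
            sharedData = data;
        }
        return sharedData;
    }

    // Mimics a per-task setup hook: asks for the data instead of rebuilding it.
    int setup() {
        return getSharedData()[0];
    }

    public static void main(String[] args) {
        for (int task = 0; task < 3; task++) {
            SharedDataDemo mapper = new SharedDataDemo(); // one instance per simulated task
            System.out.println("task " + task + " read shared data[0] = " + mapper.setup());
        }
        System.out.println("initializations in this JVM: " + initCount.get());
    }
}
```

Note that a static field is shared only among tasks that run in the same JVM.
Tasks that land in different JVMs on the same node each hold their own copy.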