Re: Can I share datas for several map tasks?
Thanks for your reply. Could you do me a favor and check my setup? I modified mapred-default.xml as follows:

    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value>
      <description>How many tasks to run per jvm. If set to -1, there is
      no limit.
      </description>
    </property>

Then I ran bin/stop-all.sh; bin/start-all.sh to restart Hadoop.

This is my program:

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public static int[] ToBeSharedData = new int[1024 * 1024 * 16];

        protected void setup(Context context
                             ) throws IOException, InterruptedException {
          // Init shared data
          ToBeSharedData[0] = 12345;
          System.out.println("setup shared data[0] = " + ToBeSharedData[0]);
        }

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
          System.out.println("read shared data[0] = " + ToBeSharedData[0]);
        }
      }

First, can you tell me how to make sure JVM reuse is taking effect? I didn't see anything different from before: the top command under Linux shows the same number of java processes and the same memory usage.

Second, can you tell me how to make ToBeSharedData be initialized only once and readable from other map tasks on the same node? Or is this not a suitable programming style for map-reduce?

By the way, I'm using hadoop-0.20.0 in pseudo-distributed mode on a single node.

Thanks in advance.

On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com wrote:

> snowloong wrote:
>> Hi,
>> I want to share some data structures among the map tasks on the same
>> node (not through files). I mean, if one map task has already
>> initialized some data structures (e.g. an array or a list), can other
>> map tasks share this memory and access it directly? I don't want to
>> reinitialize the data, and I want to save some memory. Can Hadoop help
>> me do this?
>
> You can enable JVM reuse across tasks. See mapred.job.reuse.jvm.num.tasks
> in mapred-default.xml for usage. Then you can cache the data in a static
> variable in your mapper.
>
> - Sharad
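On the first question, one possible way to check whether JVM reuse is taking effect is to have each task log an identifier for the JVM it runs in, together with a static counter. The sketch below is plain Java with no Hadoop dependency; JvmReuseProbe and its method names are hypothetical and do not come from Hadoop or from this thread. The assumption is that report() would be called once per task, for example from the mapper's setup().

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a Hadoop class): reports which JVM a task runs in. */
public class JvmReuseProbe {
    // Static, so the value survives for as long as this class stays loaded in the JVM.
    private static final AtomicInteger TASKS_SEEN = new AtomicInteger();

    /** Call once per task; returns how many tasks this JVM has seen so far. */
    public static int recordTask() {
        return TASKS_SEEN.incrementAndGet();
    }

    /** JVM name; commonly "pid@hostname", but the format is implementation-specific. */
    public static String jvmName() {
        return ManagementFactory.getRuntimeMXBean().getName();
    }

    /** One log line per task: the JVM identity plus the running task count. */
    public static String report() {
        return "jvm=" + jvmName() + " tasksSeenInThisJvm=" + recordTask();
    }

    public static void main(String[] args) {
        // Stand-in for two tasks executed one after another in the same JVM.
        System.out.println(report());
        System.out.println(report());
    }
}
```

If reuse is working, successive tasks of the same job should log the same JVM name with an increasing count; if every task logs a count of 1, each task is getting a fresh JVM.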
Re: Can I share datas for several map tasks?
The examples for my book include a JVM reuse example, in which static data is shared across tasks through JVM reuse.

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
Re: Can I share datas for several map tasks?
I can't get your book, so could you say a few more words to describe the solution? I would appreciate it very much.

-snowloong
Re: Can I share datas for several map tasks?
Thank you, Jason. I found the example. So, is there a way to share the same JVM between different jobs?

From: jason hadoop jason.had...@gmail.com
To: core-user@hadoop.apache.org
Sent: Tuesday, June 16, 2009 7:22:16 PM
Subject: Re: Can I share datas for several map tasks?

In the example code download bundle, in the package com.apress.hadoopbook.examples.advancedtechniques, is the class JVMReuseAndStaticInitializers.java, which demonstrates sharing data between instances using JVM reuse. I built this to prove to myself that it was possible. It never got an actual write-up in the book itself.
Re: Can I share datas for several map tasks?
snowloong wrote:
> Hi,
> I want to share some data structures among the map tasks on the same
> node (not through files). I mean, if one map task has already
> initialized some data structures (e.g. an array or a list), can other
> map tasks share this memory and access it directly? I don't want to
> reinitialize the data, and I want to save some memory. Can Hadoop help
> me do this?

You can enable JVM reuse across tasks. See mapred.job.reuse.jvm.num.tasks in mapred-default.xml for usage. Then you can cache the data in a static variable in your mapper.

- Sharad
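Caching the data in a static variable could look roughly like the following. This is a minimal sketch in plain Java rather than a real mapper: SharedDataCache, get() and initCount() are hypothetical names, and the array is deliberately smaller than the one in the thread to keep the demo light. The idea is that each task would call get() (for example from setup()) instead of filling the array itself, so the initialization runs only for the first task in a given JVM.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder class; the names are illustrative, not from Hadoop. */
public class SharedDataCache {
    // Shared by every task that runs inside this JVM.
    private static int[] data;
    // Counts how often the expensive initialization actually ran.
    private static final AtomicInteger INIT_COUNT = new AtomicInteger();

    /** Returns the shared array, building it only on the first call in this JVM. */
    public static synchronized int[] get() {
        if (data == null) {
            int[] fresh = new int[1024 * 1024]; // assumption: demo size, not the thread's 16M ints
            fresh[0] = 12345;
            INIT_COUNT.incrementAndGet();
            data = fresh;
        }
        return data;
    }

    /** Number of times the array was initialized in this JVM. */
    public static int initCount() {
        return INIT_COUNT.get();
    }

    public static void main(String[] args) {
        // Stand-in for three tasks running one after another in a reused JVM.
        for (int task = 0; task < 3; task++) {
            System.out.println("task " + task + " reads " + get()[0]);
        }
        System.out.println("initialized " + initCount() + " time(s)");
    }
}
```

A static field is only shared among tasks that run in the same JVM. Tasks that run in separate JVMs, on the same node or not, each get their own copy.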