Re: Doubt in DoubleWritable
Please try this:

for (DoubleArrayWritable avalue : values) {
    Writable[] value = avalue.get();
    // DoubleWritable[] value = new DoubleWritable[6];
    // for (int k = 0; k < 6; k++) {
    //     value[k] = new DoubleWritable(wvalue[k]);
    // }
    // parse accordingly
    if (Double.parseDouble(value[1].toString()) != 0) {
        total_records_Temp = total_records_Temp + 1;
        sumvalueTemp = sumvalueTemp + Double.parseDouble(value[0].toString());
    }
    if (Double.parseDouble(value[3].toString()) != 0) {
        total_records_Dewpoint = total_records_Dewpoint + 1;
        sumvalueDewpoint = sumvalueDewpoint + Double.parseDouble(value[2].toString());
    }
    if (Double.parseDouble(value[5].toString()) != 0) {
        total_records_Windspeed = total_records_Windspeed + 1;
        sumvalueWindspeed = sumvalueWindspeed + Double.parseDouble(value[4].toString());
    }
}

Attaching the code.

--
Thanks & Regards
Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Centre for Cyber Security | Amrita Vishwa Vidyapeetham
http://www.unmeshasreeveni.blogspot.in/

// cc MaxTemperature Application to find the maximum temperature in the weather dataset
// vv MaxTemperature
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduce {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        /*
         * Job job = new Job();
         * job.setJarByClass(MaxTemperature.class);
         * job.setJobName("Max temperature");
         */
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Job job = Job.getInstance(conf, "AverageTempValues");

        // Delete the output directory so the same dir can be reused
        Path dest = new Path(args[1]);
        if (fs.exists(dest)) {
            fs.delete(dest, true);
        }

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setNumReduceTasks(2);
        job.setMapperClass(NewMapper.class);
        job.setReducerClass(NewReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleArrayWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
// ^^ MaxTemperature

// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewMapper extends Mapper<LongWritable, Text, Text, DoubleArrayWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String Str = value.toString();
        String[] Mylist = new String[1000];
        int i = 0;
        for (String retval : Str.split("\\s+")) {
            System.out.println(retval);
            Mylist[i++] = retval;
        }
        String Val = Mylist[2];
        String Year = Val.substring(0, 4);
        String Month = Val.substring(5, 6);
        String[] Section = Val.split("_");
        // Map the hour-of-day section to a quarter-of-day code.
        String section_string = "0";
        if (Section[1].matches("^(0|1|2|3|4|5)$")) {
            section_string = "4";
        } else if (Section[1].matches("^(6|7|8|9|10|11)$")) {
            section_string = "1";
        } else if (Section[1].matches("^(12|13|14|15|16|17)$")) {
            section_string = "2";
        } else if (Section[1].matches("^(18|19|20|21|22|23)$")) {
            section_string = "3";
        }

        DoubleWritable[] array = new DoubleWritable[6];
        DoubleArrayWritable output = new DoubleArrayWritable();
        // Each element must be instantiated before calling set();
        // otherwise array[j] is null and set() throws a NullPointerException.
        for (int j = 0; j < 6; j++) {
            array[j] = new DoubleWritable();
        }
        array[0].set(Double.parseDouble(Mylist[3]));  // temperature
        array[2].set(Double.parseDouble(Mylist[4]));  // dew point
        array[4].set(Double.parseDouble(Mylist[12])); // wind speed
        // 999.9 marks a missing reading; the slot next to each reading
        // flags whether it is valid (1) or missing (0).
        for (int j = 0; j < 6; j = j + 2) {
            if (999.9 == array[j].get()) {
                array[j + 1].set(0);
            } else {
                array[j + 1].set(1);
            }
        }
        output.set(array);
        context.write(new Text(Year + section_string + Month), output);
    }
}
// ^^ MaxTemperatureMapper

// cc MaxTemperatureReducer Reducer for maximum temperature example
// vv MaxTemperatureReducer
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class NewReducer extends Reducer<Text, DoubleArrayWritable, Text, DoubleArrayWritable> {

    @Override
    public void reduce(Text key, Iterable<DoubleArrayWritable> values, Context context)
            throws IOException, InterruptedException {
        double sumvalueTemp = 0;
        double sumvalueDewpoint = 0;
        double sumvalueWindspeed = 0;
        double total_records_Temp = 0;
        double total_records_Dewpoint = 0;
        double total_records_Windspeed = 0;
        double average_Temp = Integer.MIN_VALUE;
        double average_Dewpoint = Integer.MIN_VALUE;
        double average_Windspeed = Integer.MIN_VALUE;
        DoubleWritable[] temp = new DoubleWritable[3];
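The code above references a DoubleArrayWritable class that is not included in the message. A minimal sketch of the usual pattern for such a class (my assumption, not the poster's actual code): ArrayWritable subclasses must declare their element type so Hadoop can deserialize the array without extra type information.

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;

// Hypothetical reconstruction of the DoubleArrayWritable used above.
public class DoubleArrayWritable extends ArrayWritable {
    // The no-arg constructor is required by Hadoop's serialization machinery.
    public DoubleArrayWritable() {
        super(DoubleWritable.class);
    }
}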
Re: Doubt Regarding QJM protocol - example 2.10.6 of Quorum-Journal Design document
Hi

A developer should answer that, but a quick look at an edits file with od suggests that records are not fixed length. So maybe the likelihood of the situation you suggest is so low that there is no need to check more than the file size.

Ulul

On 28/09/2014 11:17, Giridhar Addepalli wrote:

Hi All,

I am going through the Quorum Journal design document. It is mentioned in Section 2.8 - In Accept Recovery RPC section:

  If the current on-disk log is missing, or a different length than the
  proposed recovery, the JN downloads the log from the provided URI,
  replacing any current copy of the log segment.

I can see that the code follows the above design.

Source :: Journal.java

public synchronized void acceptRecovery(RequestInfo reqInfo,
    SegmentStateProto segment, URL fromUrl) throws IOException {
  ...
  if (currentSegment == null
      || currentSegment.getEndTxId() != segment.getEndTxId()) {
    // download the segment from fromUrl
  } else {
    LOG.info("Skipping download of log " +
        TextFormat.shortDebugString(segment) +
        ": already have up-to-date logs");
  }
  ...
}

My question is: what if the on-disk log is present and is of the same length as the proposed recovery? If the JournalNode is skipping the download because the logs are of the same length, then we could end up in a situation where finalized log segments contain different data!

This could happen if we follow example 2.10.6. As per that example, we write transactions (151-153) on JN1; then, when recovery proceeded with only JN2 and JN3, let us assume that we write different transactions again as (151-153). Then after the crash, when we run recovery, JN1 will skip downloading the correct segment from JN2/JN3 as it thinks it has the correct segment (as per the code pasted above). This will result in a situation where the finalized segment (edits_151-153) on JN1 is different from the finalized segment edits_151-153 on JN2/JN3.

Please let me know if I have gone wrong somewhere, and whether this situation is taken care of.

Thanks,
Giridhar.
RE: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
Thanks John for your comments. I believe MRv2 has support for both the old *mapred* APIs and the new *mapreduce* APIs. I see it this way:

[1.] One may have the binaries, i.e. the jar file, of an M\R program that used the old *mapred* APIs. This will work directly on MRv2 (YARN).

[2.] One may have the source code, i.e. the Java programs, of an M\R program that used the old *mapred* APIs. For this I need to recompile and generate the binaries, i.e. the jar file. Do I have to change the old *org.apache.hadoop.mapred* APIs to the new *org.apache.hadoop.mapreduce* APIs, or are no code changes needed?

-RR

Date: Mon, 14 Apr 2014 10:37:56 -0400
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
From: john.meag...@gmail.com
To: user@hadoop.apache.org

Also, Source Compatibility means ONLY a recompile is needed. No code changes should be needed.

On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:

Source Compatibility = you need to recompile and use the new version as part of the compilation.
Binary Compatibility = you can take something compiled against the old version and run it on the new version.

On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Hello People,

As per the Apache site http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

Binary Compatibility: First, we ensure binary compatibility to the applications that use old mapred APIs. This means that applications which were built against MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.

Source Compatibility: We cannot ensure complete binary compatibility with the applications that use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for mapreduce APIs that break binary compatibility. In other words, users should recompile their applications that use mapreduce APIs against MRv2 jars. One notable binary incompatibility break is Counter and CounterGroup.

For Binary Compatibility I understand that if I had built an MR job with the old *mapred* APIs then it can be run directly on YARN without any changes.

Can anybody explain what we mean by Source Compatibility here, and also a use case where one will need it? Does that mean that if I already have MR job source code written with the old *mapred* APIs and I need to make some changes to it, then I need to use the new *mapreduce* APIs and generate new binaries?

Thanks,
-RR
Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
1. If you have the binaries that were compiled against MRv1 *mapred* libs, it should just work with MRv2.

2. If you have the source code that refers to MRv1 *mapred* libs, it should be compilable without code changes. Of course, you're free to change your code.

3. If you have the binaries that were compiled against MRv1 *mapreduce* libs, it may not be executable directly with MRv2, but you should be able to compile it against MRv2 *mapreduce* libs without code changes, and execute it.

- Zhijie

On Tue, Apr 15, 2014 at 12:44 PM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Thanks John for your comments. I believe MRv2 has support for both the old *mapred* APIs and the new *mapreduce* APIs. I see it this way:

[1.] One may have the binaries, i.e. the jar file, of an M\R program that used the old *mapred* APIs. This will work directly on MRv2 (YARN).

[2.] One may have the source code, i.e. the Java programs, of an M\R program that used the old *mapred* APIs. For this I need to recompile and generate the binaries, i.e. the jar file. Do I have to change the old *org.apache.hadoop.mapred* APIs to the new *org.apache.hadoop.mapreduce* APIs, or are no code changes needed?

-RR

Date: Mon, 14 Apr 2014 10:37:56 -0400
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
From: john.meag...@gmail.com
To: user@hadoop.apache.org

Also, Source Compatibility means ONLY a recompile is needed. No code changes should be needed.

On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:

Source Compatibility = you need to recompile and use the new version as part of the compilation.
Binary Compatibility = you can take something compiled against the old version and run it on the new version.

On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Hello People,

As per the Apache site http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

Binary Compatibility: First, we ensure binary compatibility to the applications that use old mapred APIs. This means that applications which were built against MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.

Source Compatibility: We cannot ensure complete binary compatibility with the applications that use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for mapreduce APIs that break binary compatibility. In other words, users should recompile their applications that use mapreduce APIs against MRv2 jars. One notable binary incompatibility break is Counter and CounterGroup.

For Binary Compatibility I understand that if I had built an MR job with the old *mapred* APIs then it can be run directly on YARN without any changes.

Can anybody explain what we mean by Source Compatibility here, and also a use case where one will need it? Does that mean that if I already have MR job source code written with the old *mapred* APIs and I need to make some changes to it, then I need to use the new *mapreduce* APIs and generate new binaries?

Thanks,
-RR

--
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/
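For readers new to the distinction discussed in this thread, here is an illustrative side-by-side sketch (class names are made up, not from the thread) of what mappers look like in each API. The differing signatures are why source written against one API cannot simply be compiled against the other:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old MRv1 API: org.apache.hadoop.mapred (interface-based).
class OldApiMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
            org.apache.hadoop.mapred.OutputCollector<Text, LongWritable> output,
            org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        output.collect(value, key); // emit via OutputCollector
    }
}

// New API: org.apache.hadoop.mapreduce (abstract-class-based).
class NewApiMapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, key);  // emit via Context
    }
}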
RE: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
Thanks Zhijie for the explanation.

Regarding #3, if I have ONLY the binaries, i.e. the jar file (compiled\built against the old MRv1 *mapred* APIs), then how can I compile it, since I don't have the source code, i.e. the Java files? All I can do with the binaries, i.e. the jar file, is execute it.

-RR

Date: Tue, 15 Apr 2014 13:03:53 -0700
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
From: zs...@hortonworks.com
To: user@hadoop.apache.org

1. If you have the binaries that were compiled against MRv1 *mapred* libs, it should just work with MRv2.

2. If you have the source code that refers to MRv1 *mapred* libs, it should be compilable without code changes. Of course, you're free to change your code.

3. If you have the binaries that were compiled against MRv1 *mapreduce* libs, it may not be executable directly with MRv2, but you should be able to compile it against MRv2 *mapreduce* libs without code changes, and execute it.

- Zhijie

On Tue, Apr 15, 2014 at 12:44 PM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Thanks John for your comments. I believe MRv2 has support for both the old *mapred* APIs and the new *mapreduce* APIs. I see it this way:

[1.] One may have the binaries, i.e. the jar file, of an M\R program that used the old *mapred* APIs. This will work directly on MRv2 (YARN).

[2.] One may have the source code, i.e. the Java programs, of an M\R program that used the old *mapred* APIs. For this I need to recompile and generate the binaries, i.e. the jar file. Do I have to change the old *org.apache.hadoop.mapred* APIs to the new *org.apache.hadoop.mapreduce* APIs, or are no code changes needed?

-RR

Date: Mon, 14 Apr 2014 10:37:56 -0400
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
From: john.meag...@gmail.com
To: user@hadoop.apache.org

Also, Source Compatibility means ONLY a recompile is needed. No code changes should be needed.

On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:

Source Compatibility = you need to recompile and use the new version as part of the compilation.
Binary Compatibility = you can take something compiled against the old version and run it on the new version.

On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Hello People,

As per the Apache site http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

Binary Compatibility: First, we ensure binary compatibility to the applications that use old mapred APIs. This means that applications which were built against MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.

Source Compatibility: We cannot ensure complete binary compatibility with the applications that use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for mapreduce APIs that break binary compatibility. In other words, users should recompile their applications that use mapreduce APIs against MRv2 jars. One notable binary incompatibility break is Counter and CounterGroup.

For Binary Compatibility I understand that if I had built an MR job with the old *mapred* APIs then it can be run directly on YARN without any changes.

Can anybody explain what we mean by Source Compatibility here, and also a use case where one will need it? Does that mean that if I already have MR job source code written with the old *mapred* APIs and I need to make some changes to it, then I need to use the new *mapreduce* APIs and generate new binaries?

Thanks,
-RR

--
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/
Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
bq. Regarding #3, if I have ONLY the binaries, i.e. the jar file (compiled\built against the old MRv1 mapred APIs)

Which APIs are you talking about, *mapred* or *mapreduce*? In #3, I was talking about *mapreduce*. If this is the case, you may be in trouble, unfortunately, because MRv2 has evolved so much in the *mapreduce* APIs that it's difficult to ensure binary compatibility. Anyway, you should still try your luck, as your binaries may not use the incompatible APIs. On the other hand, if you meant the *mapred* APIs instead, your binaries should just work.

- Zhijie

On Tue, Apr 15, 2014 at 1:35 PM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Thanks Zhijie for the explanation.

Regarding #3, if I have ONLY the binaries, i.e. the jar file (compiled\built against the old MRv1 *mapred* APIs), then how can I compile it, since I don't have the source code, i.e. the Java files? All I can do with the binaries, i.e. the jar file, is execute it.

-RR

--
Date: Tue, 15 Apr 2014 13:03:53 -0700
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
From: zs...@hortonworks.com
To: user@hadoop.apache.org

1. If you have the binaries that were compiled against MRv1 *mapred* libs, it should just work with MRv2.

2. If you have the source code that refers to MRv1 *mapred* libs, it should be compilable without code changes. Of course, you're free to change your code.

3. If you have the binaries that were compiled against MRv1 *mapreduce* libs, it may not be executable directly with MRv2, but you should be able to compile it against MRv2 *mapreduce* libs without code changes, and execute it.

- Zhijie

On Tue, Apr 15, 2014 at 12:44 PM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Thanks John for your comments. I believe MRv2 has support for both the old *mapred* APIs and the new *mapreduce* APIs. I see it this way:

[1.] One may have the binaries, i.e. the jar file, of an M\R program that used the old *mapred* APIs. This will work directly on MRv2 (YARN).

[2.] One may have the source code, i.e. the Java programs, of an M\R program that used the old *mapred* APIs. For this I need to recompile and generate the binaries, i.e. the jar file. Do I have to change the old *org.apache.hadoop.mapred* APIs to the new *org.apache.hadoop.mapreduce* APIs, or are no code changes needed?

-RR

Date: Mon, 14 Apr 2014 10:37:56 -0400
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
From: john.meag...@gmail.com
To: user@hadoop.apache.org

Also, Source Compatibility means ONLY a recompile is needed. No code changes should be needed.

On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:

Source Compatibility = you need to recompile and use the new version as part of the compilation.
Binary Compatibility = you can take something compiled against the old version and run it on the new version.

On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Hello People,

As per the Apache site http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

Binary Compatibility: First, we ensure binary compatibility to the applications that use old mapred APIs. This means that applications which were built against MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.

Source Compatibility: We cannot ensure complete binary compatibility with the applications that use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for mapreduce APIs that break binary compatibility. In other words, users should recompile their applications that use mapreduce APIs against MRv2 jars. One notable binary incompatibility break is Counter and CounterGroup.

For Binary Compatibility I understand that if I had built an MR job with the old *mapred* APIs then it can be run directly on YARN without any changes.

Can anybody explain what we mean by Source Compatibility here, and also a use case where one will need it? Does that mean that if I already have MR job source code written with the old *mapred* APIs and I need to make some changes to it, then I need to use the new *mapreduce* APIs and generate new binaries?

Thanks,
-RR

--
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/
Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop
Also, Source Compatibility means ONLY a recompile is needed. No code changes should be needed.

On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:

Source Compatibility = you need to recompile and use the new version as part of the compilation.
Binary Compatibility = you can take something compiled against the old version and run it on the new version.

On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe radhe.krishna.ra...@live.com wrote:

Hello People,

As per the Apache site http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

Binary Compatibility: First, we ensure binary compatibility to the applications that use old mapred APIs. This means that applications which were built against MRv1 mapred APIs can run directly on YARN without recompilation, merely by pointing them to an Apache Hadoop 2.x cluster via configuration.

Source Compatibility: We cannot ensure complete binary compatibility with the applications that use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we ensure source compatibility for mapreduce APIs that break binary compatibility. In other words, users should recompile their applications that use mapreduce APIs against MRv2 jars. One notable binary incompatibility break is Counter and CounterGroup.

For Binary Compatibility I understand that if I had built an MR job with the old *mapred* APIs then it can be run directly on YARN without any changes.

Can anybody explain what we mean by Source Compatibility here, and also a use case where one will need it? Does that mean that if I already have MR job source code written with the old *mapred* APIs and I need to make some changes to it, then I need to use the new *mapreduce* APIs and generate new binaries?

Thanks,
-RR
Re: Doubt
Certainly it is, and it's quite common, especially if you have some high-performance machines: they can run as mapreduce slaves and also double as mongo hosts. The problem, of course, would be that when running mapreduce jobs you might have very slow network bandwidth at times, and if your front end needs fast response times from the mongo instances all the time, you could be in trouble.

On Wed, Mar 19, 2014 at 11:50 AM, praveenesh kumar praveen...@gmail.com wrote:

Why not? It's just a matter of installing 2 different packages. Depending on what you want to use it for, you need to take care of a few things, but as far as installation is concerned, it should be easily doable.

Regards
Prav

On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote:

Hi all,
Is it possible to install MongoDB on the same VM that hosts hadoop?

--
amiable harsha

--
Jay Vyas
http://jayunit100.blogspot.com
Re: Doubt
Thanks Jay and Praveen. I want to use both separately; I don't want to use mongodb in place of hbase.

On Wed, Mar 19, 2014 at 9:25 PM, Jay Vyas jayunit...@gmail.com wrote:

Certainly it is, and it's quite common, especially if you have some high-performance machines: they can run as mapreduce slaves and also double as mongo hosts. The problem, of course, would be that when running mapreduce jobs you might have very slow network bandwidth at times, and if your front end needs fast response times from the mongo instances all the time, you could be in trouble.

On Wed, Mar 19, 2014 at 11:50 AM, praveenesh kumar praveen...@gmail.com wrote:

Why not? It's just a matter of installing 2 different packages. Depending on what you want to use it for, you need to take care of a few things, but as far as installation is concerned, it should be easily doable.

Regards
Prav

On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote:

Hi all,
Is it possible to install MongoDB on the same VM that hosts hadoop?

--
amiable harsha

--
Jay Vyas
http://jayunit100.blogspot.com

--
amiable harsha
Re: Doubt
Why not? It's just a matter of installing 2 different packages. Depending on what you want to use it for, you need to take care of a few things, but as far as installation is concerned, it should be easily doable.

Regards
Prav

On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote:

Hi all,
Is it possible to install MongoDB on the same VM that hosts hadoop?

--
amiable harsha
Re: doubt
I've installed a hadoop single node cluster on a VirtualBox machine running ubuntu 12.04LTS (64-bit) with 512MB RAM and an 8GB HD. I haven't seen any errors in my testing yet. Is 1GB RAM required? Will I run into issues when I expand the cluster?

On Sat, Jan 18, 2014 at 11:24 PM, Alexander Pivovarov apivova...@gmail.com wrote:

It's enough. hadoop uses only 1GB RAM by default.

On Sat, Jan 18, 2014 at 10:11 PM, sri harsha rsharsh...@gmail.com wrote:

Hi,
I want to install a 4 node cluster on 64-bit LINUX. Is 4GB RAM and a 500GB HD enough for this, or will I need to expand? Please advise on my query.

Thanks

--
amiable harsha

--
-jblack
Re: doubt
It's enough. hadoop uses only 1GB RAM by default.

On Sat, Jan 18, 2014 at 10:11 PM, sri harsha rsharsh...@gmail.com wrote:

Hi,
I want to install a 4 node cluster on 64-bit LINUX. Is 4GB RAM and a 500GB HD enough for this, or will I need to expand? Please advise on my query.

Thanks

--
amiable harsha
Re: Doubt on Input and Output Mapper - Key value pairs
The answer (a) is correct, in general. On Wed, Nov 7, 2012 at 6:09 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote: Hi, Which of the following is correct w.r.t mapper. (a) It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero. (b) It accepts a single key-value pair as input and emits a single key and list of corresponding values as output regards, Rams -- Harsh J
Re: Doubt on Input and Output Mapper - Key value pairs
Hi Rams,

A mapper will accept a single key-value pair as input and can emit 0 or more key-value pairs based on what you want to do in the mapper function (I mean, based on your business logic in the mapper function). The framework will then aggregate the list of values associated with a given key and send the key and the list of values to the reducer function.

Best,
Mahesh Balija.

On Wed, Nov 7, 2012 at 6:09 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote:

Hi,

Which of the following is correct w.r.t. a mapper?

(a) It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
(b) It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.

regards,
Rams
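A small illustration of answer (a), a mapper that emits zero, one, or many pairs per input record (the class and the tokenizing logic are my own example, not from the thread):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
            return;  // zero output pairs for this input pair
        }
        for (String token : line.split("\\s+")) {
            context.write(new Text(token), ONE);  // one pair per token
        }
    }
}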
Re: doubt about reduce tasks and block writes
Harsh

I did leave an escape route open with a bit about corner cases :-) Anyway, I agree that HDFS has no notion of block 0. I just meant that if dfs.replication is 1 there will, under normal circumstances :-), be no blocks of the output file written to node A.

Raj

----- Original Message -----
From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com
Sent: Saturday, August 25, 2012 4:02 AM
Subject: Re: doubt about reduce tasks and block writes

Raj's almost right. In times of high load or space fill-up on a local DN, the NameNode may decide to instead pick a non-local DN for replica-writing. In this way, node A may get a copy 0 of a replica from a task. This is per the default block placement policy.

P.s. Note that HDFS hardly makes any difference between replicas; there is no hard concept of a copy 0 or copy 1 block. At the NN level, it treats all DNs in a pipeline equally, and the same goes for replicas.

On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan rajv...@yahoo.com wrote:

But since node A has no TT running, it will not run map or reduce tasks. When the reducer node writes the output file, the first block will be written on the local node and never on node A. So, to answer the question, node A will contain copies of blocks of all output files. It won't contain the copy 0 of any output file. I am reasonably sure about this, but there could be corner cases in case of node failure and such like! I need to look into the code.

Raj

From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org
Sent: Friday, August 24, 2012 1:09 PM
Subject: doubt about reduce tasks and block writes

Hey there,

I have a doubt about reduce tasks and block writes. Does a reduce task always first write to hdfs on the node where it is placed? (And then these blocks would be replicated to other nodes.) If so, and I have a cluster of 5 nodes, 4 of them running DN and TT and one (node A) just running DN, then when running MR jobs, would map tasks never read from node A? This would be because maps have data locality, and if the reduce tasks write first to the node where they live, one replica of the block would always be on a node that has a TT. Node A would just contain blocks created from replication by the framework, as no reduce task would write there directly. Is this correct?

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

--
Harsh J
Re: doubt about reduce tasks and block writes
Thanks, Raj, you got exactly my point. I wanted to confirm this assumption, as I was wondering whether a shared HDFS cluster with MR and HBase like this would make sense: http://old.nabble.com/HBase-User-f34655.html

--
View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185p4003211.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: doubt about reduce tasks and block writes
Raj's almost right. In times of high load or space fill-up on a local DN, the NameNode may decide to instead pick a non-local DN for replica-writing. In this way, node A may get a copy 0 of a replica from a task. This is per the default block placement policy.

P.s. Note that HDFS hardly makes any difference between replicas; there is no hard concept of a copy 0 or copy 1 block. At the NN level, it treats all DNs in a pipeline equally, and the same goes for replicas.

On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan rajv...@yahoo.com wrote:

But since node A has no TT running, it will not run map or reduce tasks. When the reducer node writes the output file, the first block will be written on the local node and never on node A. So, to answer the question, node A will contain copies of blocks of all output files. It won't contain the copy 0 of any output file. I am reasonably sure about this, but there could be corner cases in case of node failure and such like! I need to look into the code.

Raj

From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org
Sent: Friday, August 24, 2012 1:09 PM
Subject: doubt about reduce tasks and block writes

Hey there,

I have a doubt about reduce tasks and block writes. Does a reduce task always first write to hdfs on the node where it is placed? (And then these blocks would be replicated to other nodes.) If so, and I have a cluster of 5 nodes, 4 of them running DN and TT and one (node A) just running DN, then when running MR jobs, would map tasks never read from node A? This would be because maps have data locality, and if the reduce tasks write first to the node where they live, one replica of the block would always be on a node that has a TT. Node A would just contain blocks created from replication by the framework, as no reduce task would write there directly. Is this correct?

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

--
Harsh J
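For anyone who wants to verify empirically where the replicas of a reduce output actually landed, a small sketch using the HDFS client API (the path argument is a placeholder, e.g. a part-r-* file):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Ask the NameNode for the block locations of the whole file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Prints the datanodes holding each replica of each block.
            System.out.println(block.getOffset() + " -> "
                    + Arrays.toString(block.getHosts()));
        }
    }
}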
Re: doubt about reduce tasks and block writes
Marc, see my inline comments.

On Fri, Aug 24, 2012 at 4:09 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

Hey there,

I have a doubt about reduce tasks and block writes. Does a reduce task always first write to hdfs on the node where it is placed? (And then these blocks would be replicated to other nodes.)

Yes, if there is a DN running on that server (it's possible to be running a TT without a DN).

If so, and I have a cluster of 5 nodes, 4 of them running DN and TT and one (node A) just running DN, then when running MR jobs, would map tasks never read from node A? This would be because maps have data locality, and if the reduce tasks write first to the node where they live, one replica of the block would always be on a node that has a TT. Node A would just contain blocks created from replication by the framework, as no reduce task would write there directly. Is this correct?

I believe that it's possible that a map task would read from node A's DN. Yes, the JobTracker tries to schedule map tasks on nodes where the data would be local, but it can't always do so. If there's a node with a free map slot, but that node doesn't have the data blocks locally, the JobTracker will assign the map task to that free map slot. Some work done (albeit slower than the ideal case because of the increased network IO) is better than no work done.

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: doubt about reduce tasks and block writes
Assuming that node A only contains replicas, there is no guarantee that its data would never be read. First, you might lose a replica; the copy inside node A could be used to create the missing replica again. Second, data locality is best effort. If all the map slots are occupied except one on a node without a replica of the data, then your node A is as likely as any other to be chosen as a source.

Regards

Bertrand

On Fri, Aug 24, 2012 at 10:09 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

Hey there,

I have a doubt about reduce tasks and block writes. Does a reduce task always first write to hdfs on the node where it is placed? (And then these blocks would be replicated to other nodes.) If so, and I have a cluster of 5 nodes, 4 of them running DN and TT and one (node A) just running DN, then when running MR jobs, would map tasks never read from node A? This would be because maps have data locality, and if the reduce tasks write first to the node where they live, one replica of the block would always be on a node that has a TT. Node A would just contain blocks created from replication by the framework, as no reduce task would write there directly. Is this correct?

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

--
Bertrand Dechoux
Re: doubt about reduce tasks and block writes
But since node A has no TT running, it will not run map or reduce tasks. When the reducer node writes the output file, the first block will be written on the local node and never on node A. So, to answer the question, node A will contain copies of blocks of all output files. It won't contain the copy 0 of any output file. I am reasonably sure about this, but there could be corner cases in case of node failure and such like! I need to look into the code.

Raj

From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org
Sent: Friday, August 24, 2012 1:09 PM
Subject: doubt about reduce tasks and block writes

Hey there,

I have a doubt about reduce tasks and block writes. Does a reduce task always first write to hdfs on the node where it is placed? (And then these blocks would be replicated to other nodes.) If so, and I have a cluster of 5 nodes, 4 of them running DN and TT and one (node A) just running DN, then when running MR jobs, would map tasks never read from node A? This would be because maps have data locality, and if the reduce tasks write first to the node where they live, one replica of the block would always be on a node that has a TT. Node A would just contain blocks created from replication by the framework, as no reduce task would write there directly. Is this correct?

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: doubt on Hadoop job submission process
Hi Manoj,

Reply inline.

On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:

Hi All,

The normal Hadoop job submission process involves:

1. Checking the input and output specifications of the job.
2. Computing the InputSplits for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
5. Submitting the job to the JobTracker and optionally monitoring its status.

I have a doubt about the 4th point of the job execution flow; could any of you explain it? What is the job's jar?

The job.jar is the jar you supply via "hadoop jar <jar>". Technically, though, it is the jar pointed to by JobConf.getJar() (set via setJar or setJarByClass calls).

Is the job's jar the one we submitted to hadoop, or will hadoop build it based on the job configuration object?

It is the former, as explained above.

--
Harsh J
Re: doubt on Hadoop job submission process
Hi Harsh,

Thanks for your reply.

Consider that from my main program I am doing many activities (reading/writing/updating, non-hadoop activities) before invoking JobClient.runJob(conf). Is there any way to separate the process flow programmatically instead of going for a workflow engine?

Cheers!
Manoj.

On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

Hi Manoj,

Reply inline.

On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:

Hi All,

The normal Hadoop job submission process involves:

1. Checking the input and output specifications of the job.
2. Computing the InputSplits for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
5. Submitting the job to the JobTracker and optionally monitoring its status.

I have a doubt about the 4th point of the job execution flow; could any of you explain it? What is the job's jar?

The job.jar is the jar you supply via "hadoop jar <jar>". Technically, though, it is the jar pointed to by JobConf.getJar() (set via setJar or setJarByClass calls).

Is the job's jar the one we submitted to hadoop, or will hadoop build it based on the job configuration object?

It is the former, as explained above.

--
Harsh J
Re: doubt on Hadoop job submission process
Sure, you may separate the logic as you want it to be, but just ensure the configuration object has a proper setJar or setJarByClass done on it before you submit the job.

On Mon, Aug 13, 2012 at 4:43 PM, Manoj Babu manoj...@gmail.com wrote:

Hi Harsh,

Thanks for your reply.

Consider that from my main program I am doing many activities (reading/writing/updating, non-hadoop activities) before invoking JobClient.runJob(conf). Is there any way to separate the process flow programmatically instead of going for a workflow engine?

Cheers!
Manoj.

On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

Hi Manoj,

Reply inline.

On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:

Hi All,

The normal Hadoop job submission process involves:

1. Checking the input and output specifications of the job.
2. Computing the InputSplits for the job.
3. Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
4. Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
5. Submitting the job to the JobTracker and optionally monitoring its status.

I have a doubt about the 4th point of the job execution flow; could any of you explain it? What is the job's jar?

The job.jar is the jar you supply via "hadoop jar <jar>". Technically, though, it is the jar pointed to by JobConf.getJar() (set via setJar or setJarByClass calls).

Is the job's jar the one we submitted to hadoop, or will hadoop build it based on the job configuration object?

It is the former, as explained above.

--
Harsh J

--
Harsh J
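A minimal sketch of what Harsh describes, using the old JobClient API that this thread mentions (the pre/post-processing methods and the argument paths are placeholders of my own, not from the thread):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class Driver {
    public static void main(String[] args) throws Exception {
        preProcess();                      // non-Hadoop work before the job

        JobConf conf = new JobConf();
        conf.setJobName("example");
        conf.setJarByClass(Driver.class);  // tells the submitter which jar to ship
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        RunningJob job = JobClient.runJob(conf);  // blocks until the job finishes

        postProcess(job.isSuccessful());   // non-Hadoop work after the job
    }

    private static void preProcess() { /* e.g. read/update external systems */ }
    private static void postProcess(boolean ok) { /* e.g. report results */ }
}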
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote:

Hi Mohit,

What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still require reading data from other datanodes across the cluster.

The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go on HDFS instead of the local disk. I am new to hadoop, so I'm just asking questions to understand the rationale behind using the local disk for the final output.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am going through the chapter "How MapReduce Works" and have some confusion:

1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://?

The flow is as simple as this:

1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).

2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition each needs, which is done over the TaskTracker's HTTP service (50060).

So to clear things up - the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections. This would be much faster and more reliable.

2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS, then shouldn't reducers simply read it from hdfs instead of making http calls to the tasktracker's location?

A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to a reducer from a mapper goes to the local FS for multiple actions to be performed on it; other data may go directly to HDFS.

Reducers are currently scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it they can continue to work locally.

I hope this helps clear some things up for you.

--
Harsh J
Re: Doubt from the book Definitive Guide
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go on HDFS instead of the local disk. I am new to hadoop, so I'm just asking questions to understand the rationale behind using the local disk for the final output.

So basically it's a tradeoff here: you get more replicas to copy from, but you have 2 more copies to write. Considering that the data is very short-lived and that it doesn't need to be replicated (since if the machine fails the maps are replayed anyway), it seems that writing 2 replicas that are potentially unused would be hurtful. Regarding locality, it might make sense on a small cluster, but the more nodes you add, the smaller the chance of having local replicas for each block of data you're looking for.

J-D
Re: Doubt from the book Definitive Guide
Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am going through the chapter "How MapReduce Works" and have some confusion:

1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://?

Map output is written to the local FS, not HDFS.

2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS, then shouldn't reducers simply read it from hdfs instead of making http calls to the tasktracker's location?

Map output is sent to HDFS when a reducer is not used.

- from the book ---

Mapper: The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.

6.4.2. The Reduce Side: Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
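A small sketch of where the two properties quoted from the book would be set (a sketch only; the value 10 is an arbitrary example, and the property names are the Hadoop 1.x names used in the book):

import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Per-job: number of parallel copier threads each reduce task uses
        // to fetch map output (the book quotes a default of five).
        conf.setInt("mapred.reduce.parallel.copies", 10);
        // Note: tasktracker.http.threads is a per-tasktracker daemon setting;
        // it belongs in the tasktracker's mapred-site.xml and cannot be
        // changed per job.
    }
}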
Re: Doubt from the book Definitive Guide
Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am going through the chapter "How MapReduce Works" and have some confusion:

1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://?

The flow is as simple as this:

1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).

2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition each needs, which is done over the TaskTracker's HTTP service (50060).

So to clear things up - the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections. This would be much faster and more reliable.

2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS, then shouldn't reducers simply read it from hdfs instead of making http calls to the tasktracker's location?

A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to a reducer from a mapper goes to the local FS for multiple actions to be performed on it; other data may go directly to HDFS.

Reducers are currently scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean slowdown.

I hope this helps clear some things up for you.

--
Harsh J
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am going through the chapter "How MapReduce Works" and have some confusion:

1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://?

The flow is as simple as this:

1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).

2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition each needs, which is done over the TaskTracker's HTTP service (50060).

So to clear things up - the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections. This would be much faster and more reliable.

2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS, then shouldn't reducers simply read it from hdfs instead of making http calls to the tasktracker's location?

A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to a reducer from a mapper goes to the local FS for multiple actions to be performed on it; other data may go directly to HDFS.

Reducers are currently scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it they can continue to work locally.

I hope this helps clear some things up for you.

--
Harsh J
Re: Doubt from the book Definitive Guide
Hi Mohit,

What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still require reading data from other datanodes across the cluster.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I am going through the chapter "How MapReduce Works" and have some confusion:

1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer or from reducer -> mapper? And second, is the call http:// or hdfs://?

The flow is as simple as this:

1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).

2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition each needs, which is done over the TaskTracker's HTTP service (50060).

So to clear things up - the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections. This would be much faster and more reliable.

2) My understanding was that mapper output gets written to hdfs, since I've seen part-m-0 files in hdfs. If mapper output is written to HDFS, then shouldn't reducers simply read it from hdfs instead of making http calls to the tasktracker's location?

A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to a reducer from a mapper goes to the local FS for multiple actions to be performed on it; other data may go directly to HDFS.

Reducers are currently scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it they can continue to work locally.

I hope this helps clear some things up for you.

--
Harsh J
Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster
Narayanan, On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K knarayana...@gmail.com wrote: Hi all, We are basically working on a research project and I require some help regarding this. Always glad to see research work being done! What're you working on? :)

How do I submit a mapreduce job from outside the cluster, i.e. from a different machine outside the Hadoop cluster?

If you use the Java APIs, use the Job#submit(…) method and/or the JobClient.runJob(…) method. Basically, Hadoop will try to create a jar with all requisite classes within, and will push it out to the JobTracker's filesystem (HDFS, if you run HDFS). From there on, it's like a regular operation. This even happens on the Hadoop nodes themselves, so doing it from an external place, as long as that place has access to Hadoop's JT and HDFS, should be no different at all. If you are packing custom libraries along, don't forget to use the DistributedCache. If you are packing custom MR Java code, don't forget to use Job#setJarByClass/JobClient#setJarByClass and other appropriate API methods.

If the above can be done, how can I schedule mapreduce jobs to run in Hadoop, like crontab, from a different machine? Are there any webservice APIs that I can leverage to access a Hadoop cluster from outside and submit jobs, or read/write data from HDFS?

For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/ It is well supported and is very useful for writing MR workflows (which is a common requirement). You also get coordinator features and can schedule crontab-like functionality. For HDFS r/w over the web, I'm not sure of an existing web app specifically for this purpose without limitations, but there is a contrib/thriftfs you can leverage (if not writing your own webserver in Java, in which case it's as simple as using the HDFS APIs). Also have a look at the pretty mature Hue project, which aims to provide a great frontend that lets you design jobs, submit jobs, monitor jobs, and upload files or browse the filesystem (among several other things): http://cloudera.github.com/hue/ -- Harsh J
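For a concrete picture of the submission path Harsh describes, here is a hedged sketch (not from the thread; the /libs jar path is hypothetical). setJarByClass tells Hadoop which jar to ship, the DistributedCache carries extra libraries, and submit() pushes everything to the JobTracker's filesystem before scheduling; with no mapper or reducer set, the framework's identity classes run.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitFromOutside {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads the client's core-site/mapred-site
    // Optional: ship a custom library with the job (hypothetical path):
    DistributedCache.addFileToClassPath(new Path("/libs/extra.jar"), conf);

    Job job = Job.getInstance(conf, "submitted-from-outside");
    job.setJarByClass(SubmitFromOutside.class); // which jar to ship and classload from
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.submit(); // asynchronous; use waitForCompletion(true) to block and print progress
    System.out.println("Submitted " + job.getJobID());
  }
}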
Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster
Narayanan, On Fri, Jul 1, 2011 at 12:57 PM, Narayanan K knarayana...@gmail.com wrote: So the report will be run from a different machine outside the cluster. So we need a way to pass the parameters to the Hadoop cluster (master) and initiate a mapreduce job dynamically. Similarly, the output of the mapreduce job needs to be tunneled back to the machine from which the report was run. Some more clarification I need: does the machine (outside the cluster) that runs the report require something like a client installation that will talk to the Hadoop master server via TCP? Or can it run a job on the Hadoop server by using a passwordless scp to the master machine, or something of the like?

The regular way is to let the client talk to your nodes over TCP ports. This is what Hadoop's plain ol' submitter process does for you. Have you tried running a simple "hadoop jar <your jar>" from a remote client machine? If that works, so should invoking the same from your code (with appropriate configurations set), because it's basically the plain ol' runjar submission process either way. If not, maybe you need to think about opening ports to let things happen (if there's a firewall here). Hadoop does not use SSH/SCP to move code around. Please give this a read if you believe you're confused about how SSH and Hadoop are integrated (or not): http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F -- Harsh J
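In configuration terms, "the client talks to your nodes over TCP ports" reduces to the client knowing the NameNode and JobTracker endpoints. A sketch with hypothetical hostnames and ports; whatever you use must match the cluster's own config, and any firewall must allow these ports through:

import org.apache.hadoop.conf.Configuration;

public class RemoteClientConf {
  // Returns a client-side Configuration pointing at a (hypothetical) remote cluster.
  public static Configuration clusterConf() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master.example.com:8020"); // NameNode RPC
    conf.set("mapred.job.tracker", "master.example.com:8021");     // JobTracker RPC
    return conf;
  }
}

With those two properties set (or the cluster's *-site.xml files on the client's classpath), both "hadoop jar" on the command line and a programmatic Job#submit() go over plain TCP; no SSH or SCP is involved.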
Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster
Narayanan, Regarding the client installation, you should make sure that the client and server use the same Hadoop version for submitting jobs and transferring data. If you use a different user on the client than the one that runs the Hadoop jobs, configure the Hadoop UGI property (sorry, I forget the exact name).

On 1 Jul 2011 15:28, Narayanan K knarayana...@gmail.com wrote: Hi Harsh, Thanks for the quick response... Have a few clarifications regarding the 1st point. Let me tell you the background first. We have actually set up a Hadoop cluster with HBase installed. We are planning to load HBase with data and perform some computations with the data and show the data in a report format. The report should be accessible from outside the cluster, and the report accepts certain parameters to show data, which will in turn be passed on to the Hadoop master server, where a mapreduce job will be run that queries HBase to retrieve the data. So the report will be run from a different machine outside the cluster. So we need a way to pass the parameters to the Hadoop cluster (master) and initiate a mapreduce job dynamically. Similarly, the output of the mapreduce job needs to be tunneled back to the machine from which the report was run. Some more clarification I need: does the machine (outside the cluster) that runs the report require something like a client installation that will talk to the Hadoop master server via TCP? Or can it run a job on the Hadoop server by using a passwordless scp to the master machine, or something of the like? Regards, Narayanan

On Fri, Jul 1, 2011 at 11:41 AM, Harsh J ha...@cloudera.com wrote: [...]
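For what it's worth, the property the previous mail could not remember was, if I recall the pre-security (0.20-era) releases correctly, hadoop.job.ugi; treat this as a hedged sketch, since later Kerberos-secured releases ignore it entirely:

import org.apache.hadoop.conf.Configuration;

public class ClientUser {
  public static Configuration asClusterUser() {
    Configuration conf = new Configuration();
    // Pre-security releases take the submitting user and group list from this
    // property (comma-separated). User/group names here are hypothetical:
    conf.set("hadoop.job.ugi", "hadoopuser,hadoopgroup");
    return conf;
  }
}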
RE: Doubt: Regarding running Hadoop on a cluster with shared disk.
Udaya, You can use non-local disk for your Hadoop cluster; however, it will have sub-optimal performance, and you will have to tune accordingly. If it's a shared drive on all of your nodes, you need to create different directories for each machine. Suppose your shared drive is /foo; then you would need to set up a /foo/<name of node>/data directory for each machine in your cluster. The drawback is not only the I/O traffic and constraints, but you'll also have to tune ZK and watch out for timing issues, as your disk I/O is your constraint. Definitely not recommended.

Date: Wed, 5 May 2010 15:52:11 +0530 Subject: Doubt: Regarding running Hadoop on a cluster with shared disk. From: udaya...@gmail.com To: common-user@hadoop.apache.org Hi, I have an account on a cluster which has a file system similar to NFS. If I create a file on one machine, it is shown on all the machines in the cluster. But Hadoop works on a cluster of machines wherein each machine has a disk of its own. Can someone please help me use Hadoop on my cluster? Thanks, Udaya.
Re: Doubt: Using PBS to run mapreduce jobs.
HOD supports a PBS environment, namely Torque. Torque is the vastly improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or better still, persuade your cluster admins to upgrade to a more recent version of Torque (e.g. at least 2.1.x). Craig

On 04/05/2010, Udaya Lakshmi wrote: Hi, I am given an account on a cluster which uses OpenPBS as the cluster management software. The only way I can run a job is by submitting it to OpenPBS. How do I run mapreduce programs on it? Is there any possible workaround? Thanks, Udaya.
Re: Doubt: Using PBS to run mapreduce jobs.
Thank you Craig. My cluster has got Torque. Can you please point me to something with a detailed explanation of using HOD on Torque? On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.uk wrote: [...]
Re: Doubt: Using PBS to run mapreduce jobs.
Udaya, The following link will help you with HOD on Torque: http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html Thanks, --- Peeyush

On Tue, 2010-05-04 at 22:49 +0530, Udaya Lakshmi wrote: [...]
Re: Doubt: Using PBS to run mapreduce jobs.
On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote: [...] Take a look at Hadoop on Demand. It was built with Torque in mind, but any PBS system should work with few changes.
Re: Doubt: Using PBS to run mapreduce jobs.
Thank you. Udaya. On Wed, May 5, 2010 at 12:23 AM, Allen Wittenauer awittena...@linkedin.com wrote: [...]
Re: Doubt about SequenceFile.Writer
A SequenceFile is not a text file, so you cannot see its content by invoking the Unix command cat. But you can get the text content by using the Hadoop command: hadoop fs -text <src>

On Sun, Feb 7, 2010 at 8:51 AM, Andiana Squazo Ringa andriana.ri...@gmail.com wrote: Hi, I have written to a side-effect file using SequenceFile.Writer. But when I cat the file, it prints some unreadable characters. I did not use any compression codec. Why is this so? Thanks, Ringa. -- Best Regards Jeff Zhang
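The same content can be read programmatically; this is roughly what "hadoop fs -text" does for a SequenceFile. A sketch using the stock SequenceFile.Reader API (the input path comes from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // The key and value classes are recorded in the file's header:
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}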
Re: Doubt about SequenceFile.Writer
Thanks a lot Jeff. Ringa. On Sun, Feb 7, 2010 at 10:30 PM, Jeff Zhang zjf...@gmail.com wrote: [...]
Re: Re: Re: Re: Doubt in Hadoop
Hi, Actually, I just made the change suggested by Aaron and my code worked. But I would still like to know why the setJarByClass() method has to be called when the main class and the Map and Reduce classes are in the same package. Thank You Abhishek Agrawal SUNY- Buffalo (716-435-7122)

On Sun 11/29/09 10:39 AM, aa...@buffalo.edu sent: Hi, I don't set job.setJarByClass(Map.class). But my main class, the map class and the reduce class are all in the same package. Does this make any difference, or do I still have to call it? Thank You Abhishek Agrawal SUNY- Buffalo (716-435-7122)

On Fri 11/27/09 1:42 PM, Aaron Kimball aa...@cloudera.com sent: [...]
Re: Re: Doubt in Hadoop
When you set up the Job object, do you call job.setJarByClass(Map.class)? That will tell Hadoop which jar file to ship with the job and to use for classloading in your code. - Aaron

On Thu, Nov 26, 2009 at 11:56 PM, aa...@buffalo.edu wrote: [...]
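The reason the call matters even when all classes share one package: map and reduce tasks run in separate JVMs on other machines, so the job's jar has to be shipped to them, and setJarByClass is how Hadoop identifies that jar. A hedged sketch against the old mapred API the poster is using (test.Map and test.Reduce stand in for the poster's own classes; the output types are assumed):

package test;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class A {
  public static void main(String[] args) throws Exception {
    // JobConf(Class) calls setJarByClass for you; any class from the job jar works.
    JobConf jobConf = new JobConf(A.class);
    jobConf.setJobName("test-job");
    jobConf.setMapperClass(Map.class);       // the poster's test.Map
    jobConf.setReducerClass(Reduce.class);   // the poster's test.Reduce
    jobConf.setOutputKeyClass(Text.class);            // assumed types
    jobConf.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
    // Without the jar set, cluster-side tasks die with
    // java.lang.ClassNotFoundException: test.Map, as in the trace below.
    JobClient.runJob(jobConf);
  }
}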
Re: Doubt in Hadoop
Do you run the mapreduce job from the command line or an IDE? In mapreduce mode, you should put the jar containing the map and reduce classes in your classpath. Jeff Zhang

On Fri, Nov 27, 2009 at 2:19 PM, aa...@buffalo.edu wrote: Hello Everybody, I have a doubt in Hadoop and was wondering if anybody has faced a similar problem. I have a package called test. Inside that I have classes called A.java, Map.java, and Reduce.java. In A.java I have the main method, where I am trying to initialize the JobConf object. I have written jobConf.setMapperClass(Map.class) and similarly for the reduce class as well. The code works correctly when I run it locally via jobConf.set("mapred.job.tracker", "local"), but I get an exception when I try to run this code on my cluster. The stack trace of the exception is as under. I cannot understand the problem. Any help would be appreciated.

java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: test.Map
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
    at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Markowitz.covarMatrixMap
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
    ... 6 more
Caused by: java.lang.ClassNotFoundException: test.Map
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
    ... 7 more

Thank You Abhishek Agrawal SUNY- Buffalo (716-435-7122)
Re: Re: Doubt in Hadoop
Hi, I am running the job from the command line. The job runs fine in local mode, but something happens when I try to run the job in distributed mode. Abhishek Agrawal SUNY- Buffalo (716-435-7122)

On Fri 11/27/09 2:31 AM, Jeff Zhang zjf...@gmail.com sent: [...]
Re: Doubt in reducer
But the reduce side can do some preparation while the maps are running: it can copy map output across to the nodes that will run the reducers. Copying and sorting map output is also a time-consuming process (maybe more time-consuming than the reduce itself). For example, a piece of a job run log on a 40-node cluster could look like this:

09/08/27 11:08:24 INFO job.JobRunningListener: map 36% reduce 10%
09/08/27 11:08:28 INFO job.JobRunningListener: map 37% reduce 10%
09/08/27 11:08:29 INFO job.JobRunningListener: map 37% reduce 11%

But if you run the job on a single-node cluster, reduce will start only after map has finished.

On Aug 27, 2009, at 4:31 PM, Harish Mallipeddi wrote: On Thu, Aug 27, 2009 at 5:22 PM, Rakhi Khatwani rkhatw...@gmail.com wrote: but i want my reduce to run, that is, if 25% of the map is done, then I want the reduce to save that much data, so even if the 2nd map fails, I don't lose data. any pointers? Regards, Raakhi

What you're asking for will break the semantics of reduce(). Reduce can only proceed after receiving all the map outputs. -- Harish Mallipeddi http://blog.poundbang.in --- Vladimir Klimontovich, skype: klimontovich GoogleTalk/Jabber: klimontov...@gmail.com Cell phone: +7926 890 2349
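When that copy phase is allowed to begin is tunable. A hedged sketch, assuming the 0.20-era property name mapred.reduce.slowstart.completed.maps (the fraction of maps that must complete before reducers are scheduled; around 0.05 is the usual default):

import org.apache.hadoop.mapred.JobConf;

public class SlowstartTuning {
  public static void configure(JobConf conf) {
    // 0.05f starts the reduce-side copying early, overlapping with the maps;
    // 1.0f delays it until every map has finished, like the single-node case above.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
  }
}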
Re: Doubt in HBase
Well, the inputs to those reducers would be the empty set; they wouldn't have anything to do, and their output would be nil as well. If you are doing something like this, and your operation is commutative, consider using a combiner so that you don't shuffle as much data. A large amount of shuffled data can make map-reduces slower. While map-reduce is a sorter, shuffling 1500 GB just takes a little while, you know? You can also set the # of reducers as well, but the mapping of reduce keys to reducer instances is random/hashed, iirc. The normative case, however, is to have a large number of reduce keys, rather than only a small number. Generally speaking, use the combiner functionality; it keeps the data sizes low. High reduce counts are better for when you have to shuffle a lot of data with many distinct reduce keys. This is getting pretty OT; I suggest revisiting the map-reduce paper and the hadoop docs. -ryan

On Thu, Aug 20, 2009 at 9:24 PM, john smith js1987.sm...@gmail.com wrote: Thanks for all your replies guys. As bharath said, what is the case when the number of reducers becomes more than the number of distinct map output keys? On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: [...]
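A sketch of the combiner setup Ryan recommends, with made-up names and the new mapreduce API. For a plain aggregation like a sum, the reducer class usually doubles as the combiner, since summing is commutative and associative:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerSetup {

  // Hypothetical sum reducer, reused as the combiner:
  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void configure(Job job, int reducers) {
    job.setCombinerClass(SumReducer.class); // pre-aggregates map output before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setNumReduceTasks(reducers);        // the actual API call for "setting the # of reducers"
  }
}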
Re: Doubt in HBase
hey, Yes, the hadoop system attempts to assign map tasks data-local, but why would you be worried about this for 5 values? The max value size in hbase is Integer.MAX_VALUE, so it's not like you have much data to shuffle. Once your blobs are ~64 MB or so, it might make more sense to use HDFS directly and keep only the metadata in hbase (including things like the location of the data blob).

I think people are confused about how optimal map reduces have to be. Keeping all the data super-local on each machine is not always helping you, since you have to read via a socket anyway. Going remote doesn't actually make things that much slower, since on a modern LAN ping times are 0.1 ms. If your entire cluster is hanging off a single switch, there is nearly unlimited bandwidth between all nodes (certainly much higher than any single system could push). Only once you go multi-switch does switch-locality (aka rack locality) become important.

Remember, hadoop isn't about the instantaneous speed of any job, but about running jobs in a highly scalable manner that works on tens or tens of thousands of nodes. You end up blocking on single-machine limits anyway, and the r=3 of HDFS helps you transcend a single machine's read speed for large files. Keeping the data transfer local in this case results in lower performance. If you want max local speed, I suggest looking at CUDA.

On Thu, Aug 20, 2009 at 9:09 PM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: [...]
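Ryan's "blobs in HDFS, metadata in HBase" suggestion, as a hedged sketch. Table, family, and path names are invented, and the client calls (HBaseConfiguration.create, HTable, Put) are from somewhat later HBase client releases than the 0.20 RCs discussed in this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BlobStore {
  public static void storeBlob(String rowKey, byte[] blob) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 1. The large blob goes straight to HDFS:
    FileSystem fs = FileSystem.get(conf);
    Path blobPath = new Path("/blobs/" + rowKey);
    FSDataOutputStream out = fs.create(blobPath);
    try {
      out.write(blob);
    } finally {
      out.close();
    }

    // 2. Only the pointer (plus any other metadata) goes into HBase:
    HTable table = new HTable(conf, "blobs");
    try {
      Put put = new Put(Bytes.toBytes(rowKey));
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("hdfs_path"),
              Bytes.toBytes(blobPath.toString()));
      table.put(put);
    } finally {
      table.close();
    }
  }
}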
Re: Doubt in HBase
Thanks Ryan. I was just explaining with an example... I have TBs of data to work with. I just wanted to know whether the scheduler TRIES to assign the reduce phase so as to keep the data local (i.e., TRYING to assign it to the machine with the greater number of key values). I was just explaining it with an example. Thanks for your reply (following you on Twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote: [...]
Re: Doubt in HBase
Ryan, In older versions of HBase, when we did not attempt any data locality, we had a few users running jobs that became network-i/o bound. It wasn't a latency issue, it was a bandwidth issue. That's actually when/why an attempt at better data locality for HBase MR was made in the first place. I hadn't personally experienced it, but I recall two users who had. After they made a first-stab patch, I ran some comparisons and noticed a significant reduction in network i/o for data-intensive MR jobs. They also were no longer network-i/o bound on their jobs, if I recall, and became disk-i/o bound (as one would expect/hope). For a majority of use cases, it doesn't matter in a significant way at all. But I have seen it make a measurable difference for some. JG

bharath vissapragada wrote: [...]
Re: Doubt in HBase
JG, Can you please elaborate on that last statement ("...for some") by giving an example, or some kind of scenario in which it can take place, where MR jobs involve huge amounts of data? Thanks.

On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com wrote: [...]
Re: Doubt in HBase
I really couldn't be specific. The more data that has to be moved across the wire, the more network i/o. For example, if you have very large values, and a very large table, and you have that as the input to your MR, you could potentially be network-i/o bound. It should be very easy to test how your own jobs run on your own cluster using Ganglia and hadoop/mr logging/output.

bharath vissapragada wrote: [...]
Re: Doubt in HBase
JG, In one of your replies above, you said that data locality was not considered in older versions of HBase. Is there any development on the same in 0.20 RC1/2 or 0.19.x? If not, can you tell me where that patch is available, so that I can test my programs? Thanks in advance.

On Sat, Aug 22, 2009 at 12:12 AM, Jonathan Gray jl...@streamy.com wrote: [...]
Re: Doubt in HBase
On Thu, Aug 20, 2009 at 9:42 AM, john smith js1987.sm...@gmail.com wrote: Hi all, I have one small doubt. Kindly answer it even if it sounds silly.

No questions are silly... don't worry.

I am using MapReduce on HBase in distributed mode. I have a table which spans across 5 region servers. I am using TableInputFormat to read the data from the tables in the map. When I run the program, by default how many map tasks are created? Is it one per region server or more?

If you set the number of map tasks to a high number, it automatically spawns one map task for each region (not region server). Otherwise, it'll spawn the number you have explicitly specified in the job.

Also, after the map task is over, the reduce task is taking a bit more time. Is it due to moving the map output across the regionservers, i.e., moving the values of the same key to a particular reduce phase to start the reducer? Is there any way I can optimize the code (e.g. by storing data of the same reducer nearby)?

Increase the number of reducers. Each reducer will have less data to move.

Thanks :)
Re: Doubt in HBase
What Amandeep said. Also, one clarification for you: you mentioned the reduce task moving map output across regionservers. Remember, HBase is just a MapReduce input source or output sink. The sort/shuffle/reduce is a part of Hadoop MapReduce and has nothing to do with HBase directly. It is utilizing the JobTracker/TaskTrackers, not the RegionServers. Like AK said, you can increase the number of reducers, or reduce the amount of data you output from the maps. JG

Amandeep Khurana wrote: On Thu, Aug 20, 2009 at 9:42 AM, john smith js1987.sm...@gmail.com wrote: Hi all, I have one small doubt. Kindly answer it even if it sounds silly.

No questions are silly.. Don't worry.

I am using Map Reduce in HBase in distributed mode. I have a table which spans across 5 region servers. I am using TableInputFormat to read the data from the tables in the map. When I run the program, by default how many map tasks are created? Is it one per region server or more?

If you set the number of map tasks to a high number, it automatically spawns one map task for each region (not region server). Otherwise, it'll spawn the number you have explicitly specified in the job.

Also, after the map task is over, the reduce task is taking a bit more time. Is it due to moving the map output across the regionservers, i.e., moving the values of the same key to a particular reduce phase to start the reducer? Is there any way I can optimize the code (e.g. by storing data of the same reducer nearby)?

Increase the number of reducers. Each reducer will have less data to move.

Thanks :)
Re: Doubt in HBase
Amandeep, Gray and Purtell, thanks for your replies; I have found them very useful. You said to increase the number of reduce tasks. Suppose the number of reduce tasks is more than the number of distinct map output keys; will some of the reduce processes go to waste? Is that the case? Also, I have one more doubt: I have 5 values for a corresponding key on one region and the other 2 values on 2 different region servers. Does Hadoop MapReduce take care of moving these 2 values to the region with the 5 values, instead of moving those 5 values to the other system, to minimize the dataflow? Is this what is happening inside?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org wrote: The behavior of TableInputFormat is to schedule one mapper for every table region. In addition to what others have said already, if your reducer is doing little more than storing data back into HBase (via TableOutputFormat), then you can consider writing results back to HBase directly from the mapper to avoid incurring the overhead of the sort/shuffle/merge which happens within the Hadoop job framework as map outputs are input into reducers. For that type of use case -- using the Hadoop mapreduce subsystem as essentially a grid scheduler -- something like job.setNumReducers(0) will do the trick. Best regards, - Andy

From: john smith js1987.sm...@gmail.com To: hbase-user@hadoop.apache.org Sent: Friday, August 21, 2009 12:42:36 AM Subject: Doubt in HBase

Hi all, I have one small doubt. Kindly answer it even if it sounds silly. I am using Map Reduce in HBase in distributed mode. I have a table which spans across 5 region servers. I am using TableInputFormat to read the data from the tables in the map. When I run the program, by default how many map tasks are created? Is it one per region server or more? Also, after the map task is over, the reduce task is taking a bit more time. Is it due to moving the map output across the regionservers, i.e., moving the values of the same key to a particular reduce phase to start the reducer? Is there any way I can optimize the code (e.g. by storing data of the same reducer nearby)? Thanks :)
Re: Doubt in HBase
Thanks for all your replies, guys. As bharath said, what is the case when the number of reducers becomes more than the number of distinct map output keys?

On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: Amandeep, Gray and Purtell, thanks for your replies; I have found them very useful. You said to increase the number of reduce tasks. Suppose the number of reduce tasks is more than the number of distinct map output keys; will some of the reduce processes go to waste? Is that the case? Also, I have one more doubt: I have 5 values for a corresponding key on one region and the other 2 values on 2 different region servers. Does Hadoop MapReduce take care of moving these 2 values to the region with the 5 values, instead of moving those 5 values to the other system, to minimize the dataflow? Is this what is happening inside?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org wrote: The behavior of TableInputFormat is to schedule one mapper for every table region. In addition to what others have said already, if your reducer is doing little more than storing data back into HBase (via TableOutputFormat), then you can consider writing results back to HBase directly from the mapper to avoid incurring the overhead of the sort/shuffle/merge which happens within the Hadoop job framework as map outputs are input into reducers. For that type of use case -- using the Hadoop mapreduce subsystem as essentially a grid scheduler -- something like job.setNumReducers(0) will do the trick. Best regards, - Andy

From: john smith js1987.sm...@gmail.com To: hbase-user@hadoop.apache.org Sent: Friday, August 21, 2009 12:42:36 AM Subject: Doubt in HBase

Hi all, I have one small doubt. Kindly answer it even if it sounds silly. I am using Map Reduce in HBase in distributed mode. I have a table which spans across 5 region servers. I am using TableInputFormat to read the data from the tables in the map. When I run the program, by default how many map tasks are created? Is it one per region server or more? Also, after the map task is over, the reduce task is taking a bit more time. Is it due to moving the map output across the regionservers, i.e., moving the values of the same key to a particular reduce phase to start the reducer? Is there any way I can optimize the code (e.g. by storing data of the same reducer nearby)? Thanks :)
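A minimal sketch of the map-only setup Andy describes. Note the actual JobConf method is setNumReduceTasks, not setNumReducers; the rest is a generic skeleton with the HBase-specific TableInputFormat/TableOutputFormat wiring left out. With no mapper class set, the old API defaults to the identity mapper, so this skeleton runs as-is on plain files:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(MapOnlyJob.class);
            job.setJobName("map-only-write-back");
            // Zero reducers: the sort/shuffle/merge phase is skipped entirely
            // and each mapper's output goes directly to the output sink.
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            JobClient.runJob(job);
        }
    }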
Re: Doubt regarding mem cache.
Hi Rakhi! On Wed, Aug 12, 2009 at 11:49 AM, Rakhi Khatwani rkhatw...@gmail.com wrote: Hi, I am not very clear as to how the mem cache thing works.

MemCache was a name that was used and caused some confusion about what its purpose is. It has now been renamed to MemStore and is basically a write buffer that gets flushed to disk/HDFS when it gets too big.

1. When you set memcache to say 1MB, does hbase write all the table information into some cache memory, and when the size reaches 1MB, it writes into hadoop, and after that the replication takes place?

So yeah, kinda.

2. Is there any minimum limit on the mem cache property in hbase-site.xml?

Not sure if there is a minimum limit, but have a look at hbase-default.xml and you will see a bunch of settings for memStore. Regards, Raakhi Regards Erik
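For reference, the flush threshold is the kind of memStore setting meant here. A minimal hbase-site.xml override might look like the following (hedged: the property name follows the 0.20-era rename the reply describes, and 64MB is only an example value -- check hbase-default.xml in your release for the exact name and default, since earlier releases still said memcache):

    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>67108864</value>
      <description>Flush a region's write buffer to HDFS once it grows past 64MB.</description>
    </property>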
Re: Doubt regarding Replication Factor
You can try it: start a 3-node cluster and create a file with replication 5. The answer is that each data-node can store only one replica of a block. So in your case you will get an exception on close() saying the file cannot be fully replicated. Thanks, --Konstantin

Rakhi Khatwani wrote: Hi, I just wanted to know what happens if we set the replication factor greater than the number of nodes in the cluster. For example, I have only 3 nodes in my cluster but I set the replication factor to 5. Will it create 3 copies and save one on each node, or can it create more than one copy per node? Regards, Raakhi Khatwani
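Konstantin's experiment, roughly, in code -- a sketch only; the path is arbitrary and the create() overload shown is the standard FileSystem one that takes an explicit replication factor:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/tmp/replication-test.txt");
            // Ask for 5 replicas on a 3-node cluster; each datanode holds at
            // most one replica of a block, so full replication is impossible.
            FSDataOutputStream out = fs.create(p, true, 4096, (short) 5,
                    fs.getDefaultBlockSize());
            out.write("test".getBytes());
            out.close(); // per Konstantin, the failure surfaces here
        }
    }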
Re: Doubt regarding Replication Factor
A similar question: if in an N-node cluster a file's replication is set to N (replicate on each node) and later a node goes down, will HDFS throw an exception since the file's replication has gone below the specified number? Thanks, Tarandeep

On Wed, Aug 12, 2009 at 12:11 PM, Konstantin Shvachko s...@yahoo-inc.com wrote: You can try it: start a 3-node cluster and create a file with replication 5. The answer is that each data-node can store only one replica of a block. So in your case you will get an exception on close() saying the file cannot be fully replicated. Thanks, --Konstantin

Rakhi Khatwani wrote: Hi, I just wanted to know what happens if we set the replication factor greater than the number of nodes in the cluster. For example, I have only 3 nodes in my cluster but I set the replication factor to 5. Will it create 3 copies and save one on each node, or can it create more than one copy per node? Regards, Raakhi Khatwani
Re: Doubt in implementing TableReduce Interface
the method looks fine. Put some logging inside the reduce method to trace the inputs to the reduce. Here's an example... change IntWritable to Text in your case...

    static class ReadTableReduce2 extends MapReduceBase implements TableReduce<Text, IntWritable> {
        SortedMap<Text, Text> buzz = new TreeMap<Text, Text>();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                Reporter report) throws IOException {
            Integer sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            if (sum >= 3) {
                BatchUpdate outval = new BatchUpdate(rowCounter.toString());
                String keyStr = key.toString();
                String[] keyArr = keyStr.split(":");
                outval.put("buzzcount:" + keyArr[1], sum.toString().getBytes());
                report.incrCounter(Counters.REDUCE_LINES, 1);
                report.setStatus("sum:" + sum);
                rowCounter++;
                output.collect(new ImmutableBytesWritable(key.getBytes()), outval);
            }
        }
    }

On Fri, Jul 24, 2009 at 11:29 PM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi all, I have implemented the TableMap interface successfully to emit <Text, Text> pairs, so now I must implement the TableReduce interface to receive those <Text, Text> pairs correspondingly. Is the following code correct?

    public class MyTableReduce extends MapReduceBase implements TableReduce<Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
                @SuppressWarnings("unused") Reporter reporter) throws IOException {
        }
    }
Re: Doubt regarding permissions
Hi Amar, I just tried it. Everything worked as expected. I guess user A in your experiment was a superuser, so he could read anything. Nicholas Sze

    /// permission testing //
    drwx-wx-wx - nicholas supergroup 0 2009-04-13 10:55 /temp
    drwx-w--w- - tsz supergroup 0 2009-04-13 10:58 /temp/test
    -rw-r--r-- 3 tsz supergroup 1366 2009-04-13 10:58 /temp/test/r.txt

    // login as nicholas (non-superuser)
    $ whoami
    nicholas
    $ ./bin/hadoop fs -lsr /temp
    drwx-w--w- - tsz supergroup 0 2009-04-13 10:58 /temp/test
    lsr: could not get get listing for 'hdfs://:9000/temp/test' : org.apache.hadoop.security.AccessControlException: Permission denied: user=nicholas, access=READ_EXECUTE, inode=test:tsz:supergroup:rwx-w--w-
    $ ./bin/hadoop fs -cat /temp/test/r.txt
    cat: org.apache.hadoop.security.AccessControlException: Permission denied: user=nicholas, access=EXECUTE, inode=test:tsz:supergroup:rwx-w--w-

- Original Message From: Amar Kamat ama...@yahoo-inc.com To: core-user@hadoop.apache.org Sent: Monday, April 13, 2009 2:02:24 AM Subject: Doubt regarding permissions

Hey, I tried the following:
- created a dir temp for user A with permission 733
- created a dir temp/test for user B with permission 722
- created a file temp/test/test.txt for user B with permission 722
Now in HDFS, user A can list as well as read the contents of the file temp/test/test.txt, while on my RHEL box I can't. Is it a feature or a bug? Can someone please try this out and confirm? Thanks, Amar
Re: Doubt in MultiFileWordCount.java
On Sep 29, 2008, at 3:11 AM, Geethajini C wrote: Hi everyone, In the example MultiFileWordCount.java (hadoop-0.17.0), what happens when the statement JobClient.runJob(job); is executed? What methods will be called in sequence?

This might help: http://hadoop.apache.org/core/docs/r0.17.2/api/org/apache/hadoop/mapred/JobClient.html Arun
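In outline -- a hedged sketch of the general client-side flow rather than a trace of the 0.17 source -- runJob(job) submits the job to the JobTracker and then polls it to completion, roughly like:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class RunJobSketch {
        public static void run(JobConf job) throws IOException, InterruptedException {
            JobClient client = new JobClient(job);
            RunningJob running = client.submitJob(job); // hand the job to the JobTracker
            while (!running.isComplete()) {             // poll for progress
                Thread.sleep(1000);
            }
            if (!running.isSuccessful()) {
                throw new IOException("Job failed!");
            }
        }
    }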
Re: Doubt in RegExpRowFilter and RowFilters in general
Have you tried enabling DEBUG-level logging? Filters have lots of logging around state changes. Might help figure out this issue. You might need to add extra logging around line #2401 in HStore. (I just spent some time trying to bend my head around what's going on. Filters are run at the Store level. It looks like that in RegExpRowFilter, a map is made on construction of column to value. If the value matches, the filter returns false, so the cell should be added in each family. I don't see anything obviously wrong in here.) St.Ack

David Alves wrote: St.Ack, Thanks for your reply. When I use RegExpRowFilter with only one (either one) of the conditions it works (the rows are passed on to the Map/Reduce task), but there is still a problem because only one of the columns is present in the resulting MapWritable (I'm using my own tableinputformat) from the scanner. So I still use the filter to check for more rows (build a scanner with one of the conditions, the rarest one, and iterate through to try to find the other), but not in the tableinputformat itself (I just discard the unwanted values in the Mapper), which is a performance hit (if it were done in the scanner, the row simply wouldn't be sent to the client, and therefore less traffic in distributed mode?), but no big deal. It seems to me that when the filter is applied, only the column that matches (or the one that doesn't match, I'm not sure at the moment) is passed to the scanner result. As to the second point, I'm running HBase in local mode for development and the DEBUG log for the HMaster shows nothing; my process simply hangs indefinitely. When I have some free time I'll try to look into the sources and pinpoint the problem more accurately. David

On Mon, 2008-02-11 at 10:36 -0800, stack wrote: David: <disclaimer>IMO, filters are a bit of sweet functionality but they are not easy to use. They also have seen little exercise, so you are probably tripping over bugs. That said, I know they basically work.</disclaimer> I'd suggest you progress from basic filtering toward the filter you'd like to implement. Does the RegExpRowFilter do the right thing when filtering one column only? On the ClassNotFoundException, yeah, it should be coming out on the client. Can you see it in the server logs? Do you get any exceptions client-side? St.Ack

David Alves wrote: Hi Again, In my previous example I seem to have misplaced a new keyword (new "myvalue1".getBytes() where it should have been "myvalue1".getBytes()). On another note, my program hangs when I supply my own filter to the scanner (I suppose it's clear that the nodes don't know my class, so there should be a ClassNotFoundException, right?). Regards, David Alves

On Mon, 2008-02-11 at 16:51 +0000, David Alves wrote: Hi Guys, In my previous email I might have misunderstood the roles of the RowFilterInterfaces, so I'll pose my question more clearly (since the last one wasn't in question form :)). I have a setup where a table has two columns belonging to different column families (Table A: cf1:a, cf2:b). I'm trying to build a filter so that a scanner only returns the rows where cf1:a = "myvalue1" and cf2:b = "myvalue2". I've built a RegExpRowFilter like this:

    Map<Text, byte[]> conditionalsMap = new HashMap<Text, byte[]>();
    conditionalsMap.put(new Text("cf1:a"), new "myvalue1".getBytes());
    conditionalsMap.put(new Text("cf2:b"), "myvalue2".getBytes());
    return new RegExpRowFilter(".*", conditionalsMap);

My problem is that this filter always fails when I know for sure that there are rows whose columns match my values.
I'm building the scanner like this (the purpose in this case is to find whether there are more values that match my filter):

    final Text startKey = this.htable.getStartKeys()[0];
    HScannerInterface scanner = htable.obtainScanner(
            new Text[] { new Text("cf1:a"), new Text("cf2:b") },
            startKey, rowFilterInterface);
    return scanner.iterator().hasNext();

Can anyone give me a hand, please? Thanks in advance, David Alves
Re: Doubt in RegExpRowFilter and RowFilters in general
Hi Again, In my previous example I seem to have misplaced a new keyword (new "myvalue1".getBytes() where it should have been "myvalue1".getBytes()). On another note, my program hangs when I supply my own filter to the scanner (I suppose it's clear that the nodes don't know my class, so there should be a ClassNotFoundException, right?). Regards, David Alves

On Mon, 2008-02-11 at 16:51 +0000, David Alves wrote: Hi Guys, In my previous email I might have misunderstood the roles of the RowFilterInterfaces, so I'll pose my question more clearly (since the last one wasn't in question form :)). I have a setup where a table has two columns belonging to different column families (Table A: cf1:a, cf2:b). I'm trying to build a filter so that a scanner only returns the rows where cf1:a = "myvalue1" and cf2:b = "myvalue2". I've built a RegExpRowFilter like this:

    Map<Text, byte[]> conditionalsMap = new HashMap<Text, byte[]>();
    conditionalsMap.put(new Text("cf1:a"), new "myvalue1".getBytes());
    conditionalsMap.put(new Text("cf2:b"), "myvalue2".getBytes());
    return new RegExpRowFilter(".*", conditionalsMap);

My problem is that this filter always fails when I know for sure that there are rows whose columns match my values. I'm building the scanner like this (the purpose in this case is to find whether there are more values that match my filter):

    final Text startKey = this.htable.getStartKeys()[0];
    HScannerInterface scanner = htable.obtainScanner(
            new Text[] { new Text("cf1:a"), new Text("cf2:b") },
            startKey, rowFilterInterface);
    return scanner.iterator().hasNext();

Can anyone give me a hand, please? Thanks in advance, David Alves