Re: A couple of Questions on InputFormat
Hi, (I'm assuming 1.0~ MR here) On Sun, Sep 22, 2013 at 1:00 AM, Steve Lewis lordjoe2...@gmail.com wrote: Classes implementing InputFormat implement public List<InputSplit> getSplits(JobContext job), which returns a List of InputSplits. For FileInputFormat the splits have a Path, start and end. 1) When is this method called, on which JVM on which machine, and is it called only once? Called only at the client, i.e. your "hadoop jar" JVM. Called only once. 2) Does the number of map tasks correspond to the number of splits returned by getSplits? Yes, number of split objects == number of mappers. 3) InputFormat implements a method RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context). Is this executed within the JVM of the Mapper on the slave machine, and does the RecordReader run within that JVM? RecordReaders are not created on the client-side JVM. RecordReaders are created on the map task JVMs, and run inside them. 4) The default RecordReaders read a file from the start position to the end position, emitting values in the order read. With such a reader, assuming it is reading lines of text, is it reasonable to assume that the values the mapper receives are in the same order they were found in the file? Would it, for example, be possible for WordCount to see a word that was hyphenated at the end of one line and append the first word of the next line it sees (ignoring the case where the word is at the end of a split)? If you speak of the LineRecordReader, each map() will simply read a line, i.e. until \n. It is not language-aware enough to understand the meaning of hyphens, etc. You can implement a custom reader to do this, however - there should be no problems so long as your logic covers the case of not having any duplicate reads across multiple maps. -- Harsh J
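For readers following this thread, here is a minimal sketch of the kind of custom reader Harsh mentions, wrapping LineRecordReader so that a line ending in a hyphen is joined with the following line. The class name is made up and the end-of-split case is deliberately not handled:

// Hypothetical sketch (not from the thread): wraps LineRecordReader and joins
// a line ending in '-' with the following line(s) from the same split.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class HyphenJoiningRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader lines = new LineRecordReader();
  private LongWritable key;
  private Text value;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    lines.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!lines.nextKeyValue()) {
      return false;
    }
    key = lines.getCurrentKey();
    StringBuilder joined = new StringBuilder(lines.getCurrentValue().toString());
    // While the accumulated line ends with a hyphen, pull in the next line.
    // Note: a hyphenated word at the very end of a split is not handled here.
    while (joined.length() > 0 && joined.charAt(joined.length() - 1) == '-'
        && lines.nextKeyValue()) {
      joined.deleteCharAt(joined.length() - 1);
      joined.append(lines.getCurrentValue().toString());
    }
    value = new Text(joined.toString());
    return true;
  }

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() throws IOException { return lines.getProgress(); }
  @Override public void close() throws IOException { lines.close(); }
}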
Re: A couple of Questions on InputFormat
Thank you for your thorough answer. The last question is essentially this: while I can write a custom input format to handle things like hyphens, I could do almost the same thing in the mapper by saving any hyphenated words from the last line (ignoring hyphenated words that cross a split boundary), as long as LineRecordReader guarantees that each line in the split is sent to the same mapper in the order read. This seems to be the case - right? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
Re: A couple of Questions on InputFormat
Hi, Yes, that is right. On Mon, Sep 23, 2013 at 9:04 PM, Steve Lewis lordjoe2...@gmail.com wrote: Thank you for your thorough answer. The last question is essentially this: while I can write a custom input format to handle things like hyphens, I could do almost the same thing in the mapper by saving any hyphenated words from the last line (ignoring hyphenated words that cross a split boundary), as long as LineRecordReader guarantees that each line in the split is sent to the same mapper in the order read. This seems to be the case - right? -- Harsh J
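A minimal mapper-side sketch of the alternative Steve describes, assuming the in-order delivery Harsh confirms above; names are illustrative and a fragment left at the end of a split is simply dropped:

// Hypothetical variant (not from the thread): the mapper carries the hyphenated
// tail of the previous line and prepends it to the next line it receives.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HyphenAwareWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private String carriedFragment = "";  // hyphenated tail of the previous line

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = carriedFragment + value.toString();
    carriedFragment = "";
    if (line.endsWith("-")) {
      int lastSpace = line.lastIndexOf(' ');
      // Carry the partial word (minus the hyphen) into the next map() call.
      carriedFragment = line.substring(lastSpace + 1, line.length() - 1);
      line = line.substring(0, Math.max(lastSpace, 0));
    }
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE);
      }
    }
  }
}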
Re: Lifecycle dates for Apache Hadoop versions
Hi Arun, On Sun, Sep 22, 2013 at 8:04 AM, H G, Arun arun@hp.com wrote: Hi, are there end-of-life / support dates published for Apache Hadoop versions? Please direct me to a link if there is such information. I don't think anybody can answer this question once and for all. Apache Hadoop is an ASF project, not a software product. As such, the longevity of every particular branch of development is only governed by the level of interest that the community at large has in it. For example, if there was somebody interested in continuing development of the Hadoop 0.20.x branch, there would be nothing (short of 3 PMC votes) stopping releases on that branch. That said, one broad observation that may be helpful to you is that most of the community these days seems to be investing in the Hadoop 2.x codeline over anything else. Thanks, Roman.
RE: Problem building hadoop 2 native
The log looks to me like you need to install cmake, or it doesn't appear in PATH. From: Alon Grinshpoon [alo...@mellanox.com] Sent: September 23, 2013 14:49 To: user@hadoop.apache.org Cc: Avner Ben Hanoch; Elad Itzhakian Subject: Problem building hadoop 2 native Hello, I am trying to build hadoop 2.0.5/2.1.0 with native libraries (so compression can be used): mvn package -Pnative -Pdist -DskipTests -Dtar but it fails. The problem occurs during the build of "Hadoop-Common" and the error is: [ERROR] Could not find goal 'protoc' in plugin org.apache.hadoop:hadoop-maven-plugins:2.1.0-beta among available goals - [Help 1] org.apache.maven.plugin.MojoNotFoundException: Could not find goal 'protoc' in plugin org.apache.hadoop:hadoop-maven-plugins:2.1.0-beta among available goals (I'm attaching the full build log in case you want to review it.) Building failed on both Hadoop 2.1.0 and 2.0.5, using protobuf 2.4.1 and 2.5.0. Please help! Thank you
Queues in Fair Scheduler
Hi, We are trying to implement queues in the fair scheduler using mapred ACLs. If I set up queues and try to use them in the fair scheduler, the job fails unless I add the following two properties to the job: -Dmapred.job.queue.name=<queue name> -Dmapreduce.job.acl-view-job=* Is that correct? If yes, is there a way to set a default queue per user, and similarly a better way to set the acl-view property? Thanks, Anurag Tangri
RE: Problem building hadoop 2 native
You are right, it works! Thank you very much! -----Original Message----- From: 谢良 [mailto:xieli...@xiaomi.com] Sent: Monday, September 23, 2013 10:17 AM To: user@hadoop.apache.org Cc: Avner Ben Hanoch; Elad Itzhakian Subject: RE: Problem building hadoop 2 native The log looks to me like you need to install cmake, or it doesn't appear in PATH.
Re: How to best decide mapper output/reducer input for a huge string?
@Rahul, Yes, you are right. 21 mappers are spawned, and all 21 mappers are functional at the same time. Although, @Pradeep, I should do the compression like you say. I'll give it a shot. As far as I can see, I think I'll need to implement Writable and write out the key of the mapper using the specific data types instead of writing it out as a string, which might slow the operation down. On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Pavan, It's hard to tell whether there's anything wrong with your design or not since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only one mapper might be doing the heavy lifting. A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice and you should almost always have one. Compression can be turned on by setting the following properties in your job config:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc. depending on your use case. Gzip is really slow but gets the best compression ratios. Snappy/Lzo are a lot faster but don't have as good a compression ratio. If your computations are CPU bound, then you'd probably want to use Snappy/Lzo. If your computations are I/O bound and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks that you can try to get the best performance out of your cluster. One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor usage of resources while your job is running. This will tell you if your job is I/O bound or CPU bound. Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment: http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm and it might warrant a redesign. Hope this helps! - Pradeep On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: One mapper is spawned per HBase table region. You can use the admin UI of HBase to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done. If that is the case then you can recreate the tables with predefined splits to create more regions. Thanks, Rahul On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote: Pavan, How large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)? John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Saturday, September 21, 2013 2:17 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed. Although there's no real bottleneck, the time it takes to process the entire table is huge. I have to constantly check if I can optimize it somehow. Oh okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of re-work? Thank you so much for helping. On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com wrote: One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get a lot better performance if you write a custom Writable for your key object using the appropriate data types. For example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster. If HouseHoldId is an integer, your speed of comparisons for sorting will also go up. Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner? Have you been able to profile your code to see where the bottlenecks are? On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com wrote: Hi Pradeep, Yes.. Basically i'm only writing the key part as the map output.. The V of K,V is
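A minimal sketch of setting the same map-output compression switches through the Java job configuration rather than XML (the job name is a placeholder; the property names are the ones Pradeep lists above):

// Hypothetical helper that builds a Job with compressed map output.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutputJob {
  public static Job create() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        GzipCodec.class, CompressionCodec.class);
    return Job.getInstance(conf, "compressed-map-output");  // placeholder job name
  }
}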
Re: How to best decide mapper output/reducer input for a huge string?
@John, to be really frank I don't know what the limiting factor is. It might be all of them or a subset of them. I cannot tell. On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra pavan0...@gmail.com wrote: @Rahul, Yes, you are right. 21 mappers are spawned, and all 21 mappers are functional at the same time. Although, @Pradeep, I should do the compression like you say. I'll give it a shot. As far as I can see, I think I'll need to implement Writable and write out the key of the mapper using the specific data types instead of writing it out as a string, which might slow the operation down.
hadoop-2.1.0-beta - Namenode federation Configuration
Guys, Some time back I installed the hadoop-2.1.0-beta version to try out the YARN features. Now I want to configure HDFS federation on this existing cluster setup. Is that possible? Please clarify this and help me find some tutorials for it. Thanks, Manickam P
RE: HDFS federation Configuration
Hi Suresh, I'm not able to follow the page completely. Can you please help me with some clear step-by-step instructions, or a little more detail on the configuration side? I am trying this on 3 machines with the 2.1.0-beta version of Hadoop. My intention is to have 2 federated name nodes and 2 data nodes. Thanks, Manickam P Date: Thu, 19 Sep 2013 12:28:04 -0700 Subject: Re: HDFS federation Configuration From: sur...@hortonworks.com To: user@hadoop.apache.org Have you looked at http://hadoop.apache.org/docs/r2.1.0-beta/hadoop-project-dist/hadoop-hdfs/Federation.html ? Let me know if the document is not clear or needs improvements. Regards, Suresh On Thu, Sep 19, 2013 at 12:01 PM, Manickam P manicka...@outlook.com wrote: Guys, I need some tutorials to configure federation. Can you please suggest some? Thanks, Manickam P
Help running Hadoop 2.0.5 with Snappy compression
Hi, I'm trying to run some MapReduce jobs on the Hadoop 2.0.5 framework using Snappy compression. I built Hadoop with -Pnative, installed it and Snappy on all 3 machines (master + 2 slaves) and copied the .so files as required to $HADOOP_HOME/lib/native. Also I added the following to $HADOOP_CONF_DIR/mapred-site.xml:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
And this I added to core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec,
    org.apache.hadoop.io.compress.SnappyCodec
  </value>
</property>
I then ran the following jobs: Pi: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar pi 8 2000 Teragen/Terasort: bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar teragen 10 /in bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar terasort /in /out But when grepping the logs I can't seem to find any sign of Snappy: [eladi@r-zorro003 hadoop-2.0.5-alpha]$ grep -r Snappy /data1/elad/logs/* [eladi@r-zorro003 hadoop-2.0.5-alpha]$ grep -r compress /data1/elad/logs/* /data1/elad/logs/application_1379926544427_0001/container_1379926544427_0001_01_10/syslog:2013-09-23 11:56:24,182 INFO [fetcher#5] org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded initialized native-zlib library /data1/elad/logs/application_1379926544427_0001/container_1379926544427_0001_01_10/syslog:2013-09-23 11:56:24,183 INFO [fetcher#5] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.deflate] /data1/elad/logs/application_1379926544427_0001/container_1379926544427_0001_01_10/syslog:2013-09-23 11:56:24,331 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate] It seems as if Zlib is being loaded instead of Snappy. What am I missing? Thanks, Elad Itzhakian
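One quick check (a hedged suggestion, not from the thread) is to confirm from Java that native Snappy support actually loads on the task nodes; note also that the map-output codec property is spelled mapreduce.map.output.compress.codec elsewhere in this digest, whereas the snippet above uses the older mapred.* spelling.

// Prints whether the Hadoop native library and its Snappy support are loadable
// in this JVM; run it on a worker node with the same lib/native directory.
import org.apache.hadoop.util.NativeCodeLoader;

public class SnappyCheck {
  public static void main(String[] args) {
    boolean nativeLoaded = NativeCodeLoader.isNativeCodeLoaded();
    System.out.println("native hadoop loaded: " + nativeLoaded);
    // Only safe to query Snappy support once the native library itself is loaded.
    if (nativeLoaded) {
      System.out.println("build supports snappy: " + NativeCodeLoader.buildSupportsSnappy());
    }
  }
}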
Distributed cache in command line
Hi, Is it possible to access the distributed cache from the command line? I have written a custom InputFormat implementation which I want to add to the distributed cache. Using libjars is not an option for me, as I am not running the Hadoop job from the command line; I am running it using the RHadoop package in R, which internally uses Hadoop streaming. Please help. Thanks. Regards, Anand.C
Error while configuring HDFS federation
Guys, I'm trying to configure HDFS federation with the 2.1.0-beta version. I have 3 machines; I want two name nodes and one data node. I have done the other things like passwordless ssh and host entries properly. When I start the cluster I'm getting the errors below. On node one I'm getting this error: java.net.BindException: Port in use: lab-hadoop.eng.com:50070 On another node I'm getting this error: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/lab/hadoop-2.1.0-beta/tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. My core-site.xml has the below:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.101.89.68:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/lab/hadoop-2.1.0-beta/tmp</value>
  </property>
</configuration>
My hdfs-site.xml has the below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.federation.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>10.101.89.68:9001</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1</name>
    <value>10.101.89.68:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns1</name>
    <value>10.101.89.68:50090</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>10.101.89.69:9001</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns2</name>
    <value>10.101.89.69:50070</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address.ns2</name>
    <value>10.101.89.69:50090</value>
  </property>
</configuration>
Please help me to fix this error. Thanks, Manickam P
RE: Error while configuring HDFS federation
Ports in use may result from actual processes using them, or just ghost processes. The second error may be caused by inconsistent permissions on different nodes, and/or a format is needed on DFS. I suggest the following:
1. sbin/stop-dfs.sh and sbin/stop-yarn.sh
2. sudo killall java (on all nodes)
3. sudo chmod -R 755 /home/lab/hadoop-2.1.0-beta/tmp/dfs (on all nodes)
4. sudo rm -rf /home/lab/hadoop-2.1.0-beta/tmp/dfs/* (on all nodes)
5. bin/hdfs namenode -format -force
6. sbin/start-dfs.sh and sbin/start-yarn.sh
Then see if you get that error again. From: Manickam P [mailto:manicka...@outlook.com] Sent: Monday, September 23, 2013 4:44 PM To: user@hadoop.apache.org Subject: Error while configuring HDFS federation
RE: Error while configuring HDFS federation
Hi, I followed your steps. The bind error got resolved, but I'm still getting the second exception. I've given the complete stack below. 2013-09-23 10:26:01,887 INFO org.mortbay.log: Stopped selectchannelconnec...@lab2-hadoop2-vm1.eng.com:50070 2013-09-23 10:26:01,988 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system... 2013-09-23 10:26:01,989 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped. 2013-09-23 10:26:01,990 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete. 2013-09-23 10:26:01,991 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/lab/hadoop-2.1.0-beta/tmp/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:292) at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:200) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:777) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:558) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:418) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:466) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:659) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:644) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1221) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1287) 2013-09-23 10:26:02,001 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 2013-09-23 10:26:02,018 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: Thanks, Manickam P
RE: Error while configuring HDFS federation
Hi, Thanks for your inputs. I fixed the issue. Thanks, Manickam P From: el...@mellanox.com To: user@hadoop.apache.org Subject: RE: Error while configuring HDFS federation Date: Mon, 23 Sep 2013 14:05:47 +
RE: HDFS federation Configuration
Hi, Thanks. I have done that. Thanks, Manickam P Date: Mon, 23 Sep 2013 10:02:42 -0700 Subject: Re: HDFS federation Configuration From: sur...@hortonworks.com To: user@hadoop.apache.org
RE: HDFS federation Configuration
Hi, I have a couple of doubts while doing this. Please help me understand. How do I generate or find out the cluster id for the step below? I saw that for one node we can format without this cluster id, but we need to pass the same cluster id that got generated to the other name node while formatting, or else we do not get a federated cluster. $HADOOP_PREFIX_HOME/bin/hdfs namenode -format -clusterId <cluster_id> In my case I want to have two name nodes and 3 data nodes. Do I need to mention all the data node IP addresses in the slaves files on both name nodes? Thanks, Manickam P Date: Mon, 23 Sep 2013 10:02:42 -0700 Subject: Re: HDFS federation Configuration From: sur...@hortonworks.com To: user@hadoop.apache.org
RE: How to best decide mapper output/reducer input for a huge string?
You might try creating a "stub" MR job in which the mappers produce no output; that would isolate the time spent reading from HBase without the trouble of instrumenting your code. John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Monday, September 23, 2013 3:31 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? @John, to be really frank I don't know what the limiting factor is. It might be all of them or a subset of them. Cannot tell.
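A minimal sketch of the kind of stub John describes, assuming the job scans an HBase table through the TableMapper API; the class name and counter group are placeholders:

// Hypothetical "stub" mapper: reads every HBase row but emits nothing, so the
// job's runtime roughly isolates the cost of the HBase scan itself.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

public class ScanOnlyMapper extends TableMapper<NullWritable, NullWritable> {
  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // Touch the row via a counter so the read is observable, but emit no output.
    context.getCounter("stub", "rowsRead").increment(1);
  }
}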
Re: How to make hadoop use all nodes?
Hi Omkar, "(which has 40 container slots.)" - for the total cluster? Yes, though that was just a hypothetical value. Below are my real configurations. 1) yarn-site.xml - what is the resource memory configured per node? 12288 MB. 2) yarn-site.xml - what is the minimum resource allocation for the cluster? 1024 MB minimum, 12288 MB maximum. I also have these memory configurations in mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>5000</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx4g -Djava.awt.headless=true</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>5000</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx4g -Djava.awt.headless=true</value>
</property>
3) yarn-resource-manager-log (while starting resource manager, export YARN_ROOT_LOGGER=DEBUG,RFA); I am looking for debug logs. The resulting log is really verbose. Are you searching for something in particular? 4) On the RM UI, how much total cluster memory is reported (how many total nodes)? (RM UI, click on Cluster.) I have 58 active nodes and the total memory reported is 696 GB, which is 58 x 12 as expected. I have 93 containers running instead of the 116 I would expect (my job has 2046 maps so it could use all 116 containers). Here is a copy-paste of what I have in the scheduler tab: Queue State: RUNNING, Used Capacity: 99.4%, Absolute Capacity: 100.0%, Absolute Max Capacity: 100.0%, Used Resources:, Num Active Applications: 1, Num Pending Applications: 0, Num Containers: 139, Max Applications: 1, Max Applications Per User: 1, Max Active Applications: 70, Max Active Applications Per User: 70, Configured Capacity: 100.0%, Configured Max Capacity: 100.0%, Configured Minimum User Limit Percent: 100%, Configured User Limit Factor: 1.0, Active users: xxx, Memory: 708608 (100.00%), vCores: 139 (100.00%), Active Apps: 1, Pending Apps: 0. I don't know where the 139 containers value is coming from. 5) Which scheduler are you using? Capacity/Fair/FIFO? I did not set yarn.resourcemanager.scheduler.class, so apparently the default is Capacity. 6) Have you configured any user limits / queue capacity? (Please add details.) No. 7) Are all requests you are making at the same priority or different priorities? (Ideally it will not matter, but I want to know.) I don't set any priority. Thanks for your help. Antoine Vandecreme On Friday, September 20, 2013 12:20:38 PM Omkar Joshi wrote: Hi, a few more questions: "(which has 40 container slots.)" for the total cluster? Please give the below details for the cluster: 1) yarn-site.xml - what is the resource memory configured per node? 2) yarn-site.xml - what is the minimum resource allocation for the cluster? 3) yarn-resource-manager-log (while starting resource manager, export YARN_ROOT_LOGGER=DEBUG,RFA); I am looking for debug logs. 4) On the RM UI, how much total cluster memory is reported (how many total nodes)? (RM UI, click on Cluster.) 5) Which scheduler are you using? Capacity/Fair/FIFO? 6) Have you configured any user limits / queue capacity? (Please add details.) 7) Are all requests you are making at the same priority or with different priorities? (Ideally it will not matter, but I want to know.) Please let us know all the above details. Thanks. Thanks, Omkar Joshi Hortonworks Inc. http://www.hortonworks.com On Fri, Sep 20, 2013 at 6:55 AM, Antoine Vandecreme antoine.vandecr...@nist.gov wrote: Hello Omkar, Thanks for your reply. Yes, all 4 points are correct.
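For reference, a small back-of-the-envelope sketch of where the 116 figure comes from under these settings, assuming the capacity scheduler rounds each request up to a multiple of the minimum allocation:

// Rough container-count estimate using the numbers quoted in this thread.
public class ContainerEstimate {
  public static void main(String[] args) {
    int nodeMemoryMb = 12288;   // yarn.nodemanager.resource.memory-mb
    int mapMemoryMb = 5000;     // mapreduce.map.memory.mb
    int minAllocMb = 1024;      // scheduler minimum allocation
    // Requests are rounded up to a multiple of the minimum allocation: 5000 -> 5120.
    int normalized = ((mapMemoryMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    int perNode = nodeMemoryMb / normalized;          // 12288 / 5120 = 2
    System.out.println(perNode + " map containers per node, "
        + (perNode * 58) + " across 58 nodes");       // 116, matching the thread
  }
}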
Re: Distributed cache in command line
Hi, I have no idea about RHadoop, but in general in YARN we do create symlinks for the files in the distributed cache in the current working directory of every container. You may be able to use that somehow. Thanks, Omkar Joshi Hortonworks Inc. http://www.hortonworks.com On Mon, Sep 23, 2013 at 6:28 AM, Chandra Mohan, Ananda Vel Murugan ananda.muru...@honeywell.com wrote: Hi, Is it possible to access the distributed cache from the command line? I have written a custom InputFormat implementation which I want to add to the distributed cache. Using libjars is not an option for me, as I am not running the Hadoop job from the command line; I am running it using the RHadoop package in R, which internally uses Hadoop streaming. Please help. Thanks. Regards, Anand.C
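To illustrate the symlink behavior Omkar describes, here is a hypothetical sketch: a file added to the distributed cache shows up under its symlink name in each container's working directory. Paths and names are placeholders, and this uses the plain Java MapReduce API rather than RHadoop/streaming:

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-demo");
    // "#lookup.txt" sets the symlink name created in the task working directory.
    job.addCacheFile(new URI("hdfs:///user/demo/lookup.txt#lookup.txt"));
    // ... set mapper/reducer, input/output paths, then submit ...
  }

  // Inside a task, the cached file can then be opened as a local file:
  static File cachedCopy() {
    return new File("lookup.txt");
  }
}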
Re: HDFS federation Configuration
I'm not able to follow the page completely. Can you pls help me to get some clear step by step or little bit more details in the configuration side? Have you set up a non-federated cluster before? If you have, the page should be easy to follow. If you have not set up a non-federated cluster before, I suggest doing so before looking at this document. I think the document already has step-by-step instructions.
mapred configuration vs mapreduce configuration in Oozie
I am implementing an Oozie workflow MR job. I was not sure whether I have to use the mapred properties or the mapreduce properties. I have seen that if the mapreduce properties are to be used, it has to be set explicitly, like JobConf.setUseNewMapper(boolean flag). How do I do the same in Oozie? Also another question: if I were to use the new properties, where do I find the property names, like mapreduce.job.inputformat.class? These seem to be hidden in some private API. Thanks, your answers are appreciated.
Re: mapred configuration vs mapreduce configuration in Oozie
Hi, You can use the new API in your action if you set these two properties in the action's configuration section:
<property>
  <name>mapred.mapper.new-api</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reducer.new-api</name>
  <value>true</value>
</property>
The JobConf.setUseNewMapper(boolean flag) simply sets the same property; same with JobConf.setUseNewReducer(boolean flag). - Robert
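For comparison, the same switch expressed through the Java API (just a sketch; in an Oozie action you would normally set the two XML properties above instead):

import org.apache.hadoop.mapred.JobConf;

public class NewApiFlags {
  public static JobConf withNewApi(JobConf conf) {
    // Equivalent to setting mapred.mapper.new-api / mapred.reducer.new-api to true.
    conf.setUseNewMapper(true);
    conf.setUseNewReducer(true);
    return conf;
  }
}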
Re: mapred configuration vs mapreduce configuration in Oozie
Hi Kay, Do you want to use the new Hadoop API? If yes, did you check this link? https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook (CASE-2: RUNNING MAPREDUCE USING THE NEW HADOOP API). For your second question, I don't know a good link for that, but this could be helpful: http://stackoverflow.com/questions/10986633/hadoop-configuration-mapred-vs-mapreduce Regards, Mohammad From: KayVajj vajjalak...@gmail.com To: cdh-u...@cloudera.org; common-u...@hadoop.apache.org; user@hadoop.apache.org Sent: Monday, September 23, 2013 2:39 PM Subject: mapred configuration vs mapreduce configuration in Oozie
Re: mapred configuration vs mapreduce configuration in Oozie
+user@oozie, -cdh-user (can't send) From: Mohammad Islam misla...@yahoo.com To: user@hadoop.apache.org; cdh-u...@cloudera.org Sent: Monday, September 23, 2013 3:24 PM Subject: Re: mapred configuration vs mapreduce configuration in Oozie
Re: Task status query
Yep, typically the AM should pass its host:port to the task as part of either the cmd-line for the task or in its env. That is what is done by the MR AM. hth, Arun On Sep 21, 2013, at 6:52 AM, John Lilley john.lil...@redpoint.net wrote: Thanks Harsh! The data-transport format is pretty easy, but how is the RPC typically set up? Does the AM open a listen port to accept the RPC from the tasks, and then pass the port/URI to the tasks when they are spawned, as command-line or environment? john -----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Friday, September 20, 2013 7:47 AM To: user@hadoop.apache.org Subject: Re: Task status query Right now it's MR specific (TaskUmbilicalProtocol) - YARN doesn't have any reusable items here yet, but there are easy-to-use RPC libs such as Avro and Thrift out there that make it easy to do such things once you define what you want in a schema/spec form. On Fri, Sep 20, 2013 at 5:32 PM, John Lilley john.lil...@redpoint.net wrote: Thanks Harsh. Is this protocol something that is available to all AMs/tasks? Or is it up to each AM/task pair to develop their own protocol? john -----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Thursday, September 19, 2013 9:20 PM To: user@hadoop.apache.org Subject: Re: Task status query Hi John, YARN tasks can be more than simple executables. In the case of MR, for example, tasks talk to the AM and report their individual progress and counters back to it, via a specific protocol (over the network), giving the AM more data to compute a near-accurate global progress. On Fri, Sep 20, 2013 at 12:18 AM, John Lilley john.lil...@redpoint.net wrote: How does a YARN application master typically query ongoing status (like percentage completion) of its tasks? I would like to be able to ultimately relay information to the user like: 100 tasks are scheduled, 10 tasks are complete, 4 tasks are running and they are (4%, 10%, 50%, 70%) complete. But, given that YARN tasks are simply executables, how can the AM even get at this information? Can the AM get access to stdout/stderr? Thanks, John -- Harsh J -- Harsh J -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
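A hypothetical sketch of the pattern Arun describes: the AM listens on a socket and hands its address to each task through the container's environment. The environment variable name and the status protocol itself are made up for illustration; a real AM would accept and handle connections on the socket:

import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class StatusEndpoint {
  public static ContainerLaunchContext launchContextWithCallback() throws Exception {
    ServerSocket statusServer = new ServerSocket(0);   // AM-side listener on any free port
    InetSocketAddress addr = new InetSocketAddress(
        java.net.InetAddress.getLocalHost().getHostName(), statusServer.getLocalPort());

    Map<String, String> env = new HashMap<>();
    env.put("AM_STATUS_ADDRESS", addr.getHostName() + ":" + addr.getPort());

    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);   // the task reads System.getenv("AM_STATUS_ADDRESS")
    return ctx;
  }
}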
Re: Help writing a YARN application
I looked at your code; it's using really old APIs/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest APIs/protocols. Also, you should really be using the yarn client module rather than the raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that since my code isn't complete yet). But I have no logs that I can examine. So I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
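A minimal sketch of the "yarn client module" route Arun mentions, loosely following the simple-yarn-app example; the application name, AM command, and resource sizes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApp {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();

    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("sample-app");

    // The AM container: in a real app this also needs local resources (the AM jar).
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(java.util.Collections.singletonList(
        "$JAVA_HOME/bin/java MyAppMaster 1>/dev/null 2>&1"));  // placeholder command
    appContext.setAMContainerSpec(amContainer);

    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(512);
    amResource.setVirtualCores(1);
    appContext.setResource(amResource);

    client.submitApplication(appContext);
  }
}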
Re: Queues in Fair Scheduler
Please don't cross-post. On Sep 22, 2013, at 11:19 PM, Anurag Tangri tangri.anu...@gmail.com wrote: Hi, We are trying to implement queues in the fair scheduler using mapred ACLs. If I set up queues and try to use them in the fair scheduler, then unless I add the following two properties to the job, the job fails: -Dmapred.job.queue.name=<queue name> -D mapreduce.job.acl-view-job=* Is that correct? If yes, is there a way to set a default queue per user, and similarly a better way to set the acl-view property? Thanks, Anurag Tangri -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
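One way to avoid passing those flags on every job submission is to give them defaults in the client-side mapred-site.xml. This is only a sketch: the queue name is a placeholder, and opening acl-view-job to * may be too permissive for your site.

    <property>
      <name>mapred.job.queue.name</name>
      <value>default_queue</value>   <!-- placeholder: queue jobs fall into when none is given -->
    </property>
    <property>
      <name>mapreduce.job.acl-view-job</name>
      <value>*</value>               <!-- who may view job details; '*' means everyone -->
    </property>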
hadoop infinispan view not detected inside Map/Reduce
Hi all, I want to use Infinispan and Hadoop. I made a simple test (the wordcount example). I have this type of mode in my Infinispan XML: <clustering mode="distribution"> <hash numOwners="2"/> </clustering> and also this: <clustering mode="replication"/> First: I create a folder and upload a file X to my Infinispan cluster (on each machine I can access the grid and the file is found). Second: in my Hadoop job, in the main class I can access the grid and see the X file that I uploaded. Third: I want to get the X file in my Map and/or Reduce class of the job, but the X file is not found. As I understand it, the Hadoop TaskTracker runs the maps and reduces as child tasks in separate JVMs. Is that why I can't access the Infinispan grid from the Map or Reduce class? If so, any ideas how I can get the file in the map/reduce? Do I need a special Infinispan configuration for this? Thanks -- Cornelio
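The separate-JVM observation is the key point: a cache manager created in the job's driver/main JVM is not visible to the map and reduce task JVMs, so each task has to join (or connect to) the grid itself, typically in setup(). Below is only a sketch of that idea, assuming an embedded DefaultCacheManager and that the Infinispan XML file has been shipped to the task working directory (e.g. via the distributed cache); the file name, cache name, and key are placeholders, and depending on your topology a client/server (Hot Rod) connection may be preferable to joining the cluster from short-lived task JVMs.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.infinispan.Cache;
    import org.infinispan.manager.DefaultCacheManager;

    public class GridAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
      private DefaultCacheManager cacheManager;
      private Cache<String, byte[]> grid;

      @Override
      protected void setup(Context context) throws IOException {
        // Each task JVM starts its own cache manager and joins the grid here,
        // instead of relying on the manager created in the job driver.
        cacheManager = new DefaultCacheManager("infinispan.xml");  // placeholder path
        grid = cacheManager.getCache("files");                     // placeholder cache name
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        byte[] fileX = grid.get("X");  // now resolvable from inside the task JVM
        // ... use fileX ...
      }

      @Override
      protected void cleanup(Context context) {
        if (cacheManager != null) {
          cacheManager.stop();
        }
      }
    }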
RE: How to best decide mapper output/reducer input for a huge string?
No, I'm pretty sure the job is executing fine. It's just that the time it takes to complete the whole process is too much, that's all. I didn't mean to say the mapper or the reducer doesn't work, just that it's very slow and I'm trying to figure out where that's happening in my code. Regards, Pavan On Sep 23, 2013 11:49 PM, John Lilley john.lil...@redpoint.net wrote: You might try creating a "stub" MR job in which the mappers produce no output; that would isolate the time spent reading from HBase without the trouble of instrumenting your code. John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Monday, September 23, 2013 3:31 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? @John, to be really frank I don't know what the limiting factor is. It might be all of them or a subset of them; I cannot tell. On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra pavan0...@gmail.com wrote: @Rahul, yes you are right. 21 mappers are spawned and all 21 mappers are functional at the same time. Although, @Pradeep, I should do the compression like you say; I'll give it a shot. As far as I can see, I think I'll need to implement Writable and write out the mapper key using specific data types instead of writing it out as a string, which might be slowing the operation down. On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Pavan, it's hard to tell whether there's anything wrong with your design or not, since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking a long time. Rahul mentioned a problem that I myself have seen before, with only one region (or a couple) having any data. So even if you have 21 regions, only one mapper might be doing the heavy lifting. A combiner is hugely helpful in terms of reducing the data output of mappers. Writing a combiner is a best practice and you should almost always have one. Compression can be turned on by setting the following properties in your job config: <property><name>mapreduce.map.output.compress</name><value>true</value></property> <property><name>mapreduce.map.output.compress.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property> You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc., depending on your use cases. Gzip is really slow but gets the best compression ratios. Snappy/Lzo are a lot faster but don't have as good a compression ratio. If your computations are CPU bound, then you'd probably want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks that you can try to get the best performance out of your cluster. One of the best things you can do is to install Ganglia (or some other similar tool) on your cluster and monitor usage of resources while your job is running. This will tell you if your job is I/O bound or CPU bound. Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment: http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm and it might warrant a redesign. Hope this helps!
- Pradeep On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: One mapper is spawned per HBase table region. You can use the admin UI of HBase to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done. If that is the case then you can recreate the tables with predefined splits to create more regions. Thanks, Rahul On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote: Pavan, how large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)? John From: Pavan Sudheendra [mailto:pavan0...@gmail.com] Sent: Saturday, September 21, 2013 2:17 AM To: user@hadoop.apache.org Subject: Re: How to best decide mapper output/reducer input for a huge string? No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed. Although, there's no real
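On the "implement Writable instead of concatenating everything into one string" idea mentioned above, a composite map-output key might look like the sketch below; the field names are made up for illustration, and a real key should be tailored to whatever the job actually groups on.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: (customerId, itemId) serialized in binary form
    // instead of being written out as one concatenated string.
    public class CustomerItemKey implements WritableComparable<CustomerItemKey> {
      private long customerId;
      private int itemId;

      public CustomerItemKey() {}  // Hadoop needs the no-arg constructor for deserialization

      public CustomerItemKey(long customerId, int itemId) {
        this.customerId = customerId;
        this.itemId = itemId;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeLong(customerId);
        out.writeInt(itemId);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        customerId = in.readLong();
        itemId = in.readInt();
      }

      @Override
      public int compareTo(CustomerItemKey o) {  // drives the shuffle sort order
        if (customerId != o.customerId) {
          return customerId < o.customerId ? -1 : 1;
        }
        return itemId < o.itemId ? -1 : (itemId == o.itemId ? 0 : 1);
      }

      @Override
      public int hashCode() {  // used by the default HashPartitioner
        return 31 * (int) (customerId ^ (customerId >>> 32)) + itemId;
      }

      @Override
      public boolean equals(Object other) {
        if (!(other instanceof CustomerItemKey)) return false;
        CustomerItemKey k = (CustomerItemKey) other;
        return customerId == k.customerId && itemId == k.itemId;
      }
    }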
Mapreduce jobtracker recover property
Hi, I'm using Hadoop 1.2.1 in a production HDFS cluster, and I can see the following properties with these values in the jobtracker job.xml: mapreduce.job.restart.recover - true, mapred.jobtracker.restart.recover - false. What is the difference, and which property will be taken by the jobtracker? Do we really need these properties? Also, I didn't set a value for those properties in any of the configurations (e.g., mapred-site.xml). Please help on this. -- Regards, Viswa.J
Re: Mapreduce jobtracker recover property
mapred.jobtracker.restart.recover is the old API name, while the other one is the new one. It is used to specify whether jobs should try to resume (be recovered) when the jobtracker restarts. If you don't set it, the default value of false is used (specified in the config file already packaged/bundled in the distribution jars). https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-5/running-on-a-cluster Regards, Shahab On Mon, Sep 23, 2013 at 10:31 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, I'm using Hadoop 1.2.1 in a production HDFS cluster, and I can see the following properties with these values in the jobtracker job.xml: mapreduce.job.restart.recover - true, mapred.jobtracker.restart.recover - false. What is the difference, and which property will be taken by the jobtracker? Do we really need these properties? Also, I didn't set a value for those properties in any of the configurations (e.g., mapred-site.xml). Please help on this. -- Regards, Viswa.J
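For reference, if you'd rather pin the behavior explicitly instead of relying on the bundled defaults, these are the entries you would add to mapred-site.xml; the values shown are just an example of enabling recovery, not a recommendation for your cluster.

    <property>
      <name>mapred.jobtracker.restart.recover</name>
      <value>true</value>   <!-- ask the jobtracker to recover jobs when it restarts -->
    </property>
    <property>
      <name>mapreduce.job.restart.recover</name>
      <value>true</value>   <!-- job-level recovery flag, as discussed above -->
    </property>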
Re: Help writing a YARN application
Hi Arun, Thanks so much for your reply. I looked at the app you linked; it's way easier to understand. Unfortunately, I can't use the YarnClient class. It doesn't seem to be in the 2.0.0 branch. Our production environment is using cdh-4.x and it looks like they're tracking the 2.0.x branch. However, I have looked at other implementations (notably Kitten and Giraph) for inspiration and was able to get my containers launching successfully. Thanks for the help! - Pradeep On Mon, Sep 23, 2013 at 4:46 PM, Arun C Murthy a...@hortonworks.com wrote: I looked at your code, it's using really old APIs/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest APIs/protocols. Also, you should really be using the yarn client module rather than raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that since my code isn't complete yet). But I have no logs that I can examine. So I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Mapreduce jobtracker recover property
Hi Shahab, Thanks for the detailed response. So the new property will have the default value of true in the package? Thanks, Viswa.J On Sep 24, 2013 8:33 AM, Shahab Yunus shahab.yu...@gmail.com wrote: mapred.jobtracker.restart.recover is the old API name, while the other one is the new one. It is used to specify whether jobs should try to resume (be recovered) when the jobtracker restarts. If you don't set it, the default value of false is used (specified in the config file already packaged/bundled in the distribution jars). https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-5/running-on-a-cluster Regards, Shahab On Mon, Sep 23, 2013 at 10:31 PM, Viswanathan J jayamviswanat...@gmail.com wrote: Hi, I'm using Hadoop 1.2.1 in a production HDFS cluster, and I can see the following properties with these values in the jobtracker job.xml: mapreduce.job.restart.recover - true, mapred.jobtracker.restart.recover - false. What is the difference, and which property will be taken by the jobtracker? Do we really need these properties? Also, I didn't set a value for those properties in any of the configurations (e.g., mapred-site.xml). Please help on this. -- Regards, Viswa.J
Re: Help writing a YARN application
Hi Pradeep, You can also look at Weave - a simpler thread abstraction over YARN: http://github.com/continuuity/weave Runs on multiple HDs. Nitin On Mon, Sep 23, 2013 at 10:20 PM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi Arun, Thanks so much for your reply. I looked at the app you linked; it's way easier to understand. Unfortunately, I can't use the YarnClient class. It doesn't seem to be in the 2.0.0 branch. Our production environment is using cdh-4.x and it looks like they're tracking the 2.0.x branch. However, I have looked at other implementations (notably Kitten and Giraph) for inspiration and was able to get my containers launching successfully. Thanks for the help! - Pradeep On Mon, Sep 23, 2013 at 4:46 PM, Arun C Murthy a...@hortonworks.com wrote: I looked at your code, it's using really old APIs/protocols which are significantly different now. See the hadoop-2.1.0-beta release for the latest APIs/protocols. Also, you should really be using the yarn client module rather than raw protocols. See https://github.com/hortonworks/simple-yarn-app for a simple example. thanks, Arun On Sep 20, 2013, at 11:24 AM, Pradeep Gollakota pradeep...@gmail.com wrote: Hi All, I've been trying to write a YARN application and I'm completely lost. I'm using Hadoop 2.0.0-cdh4.4.0 (Cloudera distribution). I've uploaded my sample code to github at https://github.com/pradeepg26/sample-yarn The problem is that my application master is exiting with a status of 1 (I'm expecting that since my code isn't complete yet). But I have no logs that I can examine. So I'm not sure if the error I'm getting is the error I'm expecting. I've attached the nodemanager and resourcemanager logs for your reference as well. How can I get started on writing YARN applications beyond the initial tutorial? Thanks for any help/pointers! Pradeep yarn-yarn-nodemanager-pradeep-gollakota.vm.lithium.com.log yarn-yarn-resourcemanager-pradeep-gollakota.vm.lithium.com.log -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/