Book - Pro Hadoop 2?
http://www.amazon.com/Pro-Hadoop-2-Jason-Venner/dp/1430248637/ Anyone have any inside information on this book? Amazon has a date of March 2013, while Apress doesn't list the 2nd edition at all... Marco
Re: How to learn Mapreduce
In addition to the book, you might want to take the Hortonworks Sandbox for a spin. This is a single-node instance of Hadoop designed to help you take the initial steps towards learning Hadoop and related projects. It comes as a self-contained VM, including tutorials. http://hortonworks.com/products/hortonworks-sandbox/ Regards, Olivier On 29 January 2013 04:54, Harsh J ha...@cloudera.com wrote: I'd recommend discovering a need first - attempt to solve a problem and then write a program around it. Hadoop is data-driven, so it's rather hard if you try to learn it without working with any data. A book may interest you as well: http://wiki.apache.org/hadoop/Books On Tue, Jan 29, 2013 at 10:17 AM, abdul wajeed abdul54waj...@gmail.com wrote: Hi Sir/Madam, I am very new to Hadoop technology, so can anyone help me with how to write a simple MapReduce program other than the wordcount program? I would like to write our own programs in MapReduce, as well as in Pig Latin, so please can anyone help me... Thanks Regards Abdul Wajeed -- Harsh J -- Olivier Renault Solution Engineer - Big Data - Hortonworks, Inc. +44 7500 933 036 orena...@hortonworks.com www.hortonworks.com
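As a concrete starting point beyond wordcount, a minimal sketch of the kind of job Abdul asks about could look like the following; the class names, the input layout, and the choice of the third CSV column as the key are invented for illustration, not taken from any project mentioned in the thread.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts how many records share the value found in one CSV column.
public class ColumnValueCount {

    public static class ColumnMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length > 2) {            // hypothetical: key on the third column
                outKey.set(fields[2].trim());
                context.write(outKey, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "column-value-count");
        job.setJarByClass(ColumnValueCount.class);
        job.setMapperClass(ColumnMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it with the usual hadoop jar invocation against a small CSV sample is a reasonable first exercise; the structure (a Mapper, a Reducer, and a driver) is the same for most beginner jobs.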
Re: number of mapper tasks
Hello, I have been able to make this work. I don't know why, but when the input file is zipped (read as an input stream) it creates only 1 mapper. However, when it's not zipped, it creates more mappers (running 3 instances it created 4 mappers and running 5 instances, it created 8 mappers). I would really like to know why this happens and, even with this number of mappers, why more mappers aren't created. I was reading part of the book Hadoop - The Definitive Guide ( https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats) which says: The JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument. This number is treated as a hint, as InputFormat implementations are free to return a different number of splits to the number specified in numSplits. Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers. ... I am not sure how to get more info. Would you recommend me to try to find the answer in the book? Or should I read the Hadoop source code directly? Best regards, Marcelo. 2013/1/29 Marcelo Elias Del Valle mvall...@gmail.com I implemented my custom input format. Here is how I used it: https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java As you can see, I do: importerJob.setInputFormatClass(CSVNLineInputFormat.class); And here are the input format and the line reader: https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java In this input format, I completely ignore these other parameters and get the splits by the number of lines. The number of lines per map can be controlled by the same parameter used in NLineInputFormat: public static final String LINES_PER_MAP = "mapreduce.input.lineinputformat.linespermap"; However, it really has no effect on the number of maps. 2013/1/29 Vinod Kumar Vavilapalli vino...@hortonworks.com Regarding your original question, you can use the min and max split settings to control the number of maps: http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html. See #setMinInputSplitSize and #setMaxInputSplitSize. Or use mapred.min.split.size directly. W.r.t. your custom input format, are you sure your job is using this InputFormat and not the default one? HTH, +Vinod Kumar Vavilapalli Hortonworks Inc. http://hortonworks.com/ On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote: Just to complement the last question, I have implemented the getSplits method in my input format: https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java However, it still doesn't create more than 2 map tasks. Is there something I could do about it to ensure more map tasks are created? Thanks Marcelo. 2013/1/28 Marcelo Elias Del Valle mvall...@gmail.com Sorry for asking too many questions, but the answers are really helping. 2013/1/28 Harsh J ha...@cloudera.com This seems CPU-oriented. You probably want the NLineInputFormat? See http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html .
This should let you spawn more maps as well, based on your N factor. Indeed, CPU is my bottleneck. That's why I want more things in parallel. Actually, I wrote my own InputFormat, to be able to process multiline CSVs: https://github.com/mvallebr/CSVInputFormat I could change it to read several lines at a time, but would this alone allow more tasks running in parallel? Not really - Slots are capacities, rather than split factors themselves. You can have N slots always available, but your job has to supply as many map tasks (based on its input/needs/etc.) to use them up. But how can I do that (supply map tasks) in my job? By changing its code? Hadoop config? Unless your job sets the number of reducers to 0 manually, 1 default reducer is always run that waits to see if it has any outputs from maps. If it does not receive any outputs after maps have all completed, it dies out with behavior equivalent to a NOP. OK, I did job.setNumReduceTasks(0); I guess this will solve this part, thanks! -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
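Pulling together the knobs mentioned in this thread, a driver sketch along these lines would wire up the custom input format, the NLineInputFormat-style lines-per-map property, and zero reducers. The paths, the 1000-line value, and the omitted mapper class are placeholders rather than anything from Marcelo's actual code; CSVNLineInputFormat is assumed to come from the CSVInputFormat project linked above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat; // from the CSVInputFormat project
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImporterDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NLineInputFormat-style knob discussed above; 1000 lines per split is an arbitrary example.
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 1000);

        Job importerJob = new Job(conf, "csv-import");
        importerJob.setJarByClass(ImporterDriver.class);
        // The custom format from the GitHub links in this thread.
        importerJob.setInputFormatClass(CSVNLineInputFormat.class);
        // importerJob.setMapperClass(...);  // the actual mapper is omitted in this sketch
        // Map-only job: no default reducer waits around after the maps finish.
        importerJob.setNumReduceTasks(0);

        FileInputFormat.addInputPath(importerJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(importerJob, new Path(args[1]));
        System.exit(importerJob.waitForCompletion(true) ? 0 : 1);
    }
}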
Using distcp with Hadoop HA
Hello everyone, I am trying to use distcp with a Hadoop HA configuration (using CDH4.0.0 at the moment). Here is my problem: - I am trying to do a distcp from cluster A to cluster B. Since no operations are supported on the standby namenode, I need to either specify the active namenode while using distcp or use the failover proxy provider (dfs.client.failover.proxy.provider.clusterA) where I can specify the two namenodes for cluster B, and the failover code inside HDFS will figure it out. - If I use the failover proxy provider, some of my datanodes on cluster A would connect to the namenode on cluster B and vice versa. I am assuming that is because I have configured both nameservices in my hdfs-site.xml for distcp to work. I have configured dfs.nameservice.id to be the right one but the datanodes do not seem to respect that. What is the best way to use distcp with a Hadoop HA configuration without having the datanodes connect to the remote namenode? Thanks Regards, Dhaval
Re: Multiple reduce task retries running at same time
Hi Ben, Take a look at the Speculative Execution feature of MR, which should answer your question. See its section under 'Fault Tolerance' here: http://developer.yahoo.com/hadoop/tutorial/module4.html#tolerence. On Tue, Jan 29, 2013 at 1:08 PM, Ben Kim benkimkim...@gmail.com wrote: Attached a screenshot showing the retries On Tue, Jan 29, 2013 at 4:35 PM, Ben Kim benkimkim...@gmail.com wrote: Hi! I have come across a situation where I found a single reducer task executing with multiple retries simultaneously, which has the potential to slow down the whole reduce process for large data sets. Is this pretty normal to y'all for Hadoop 1.0.3? -- Benjamin Kim benkimkimben at gmail -- Benjamin Kim benkimkimben at gmail -- Harsh J
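If the goal is to stop those extra attempts rather than just explain them, speculative execution can be switched off per job. A small sketch against the Hadoop 1.x old (mapred) API, where the incoming JobConf stands in for whatever job configuration the actual code builds:

import org.apache.hadoop.mapred.JobConf;

public class SpeculationSettings {
    // Returns the given JobConf with speculative attempts disabled for both phases.
    public static JobConf withoutSpeculation(JobConf conf) {
        conf.setMapSpeculativeExecution(false);     // mapred.map.tasks.speculative.execution=false
        conf.setReduceSpeculativeExecution(false);  // mapred.reduce.tasks.speculative.execution=false
        return conf;
    }
}

The trade-off is that a genuinely slow tasktracker is no longer papered over by a backup attempt, so the job may take longer when a node misbehaves.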
Re: Tricks to upgrading Sequence Files?
This is a pretty interesting question, but unfortunately there isn't an inbuilt way in SequenceFiles itself to handle this. However, your key/value classes can be made to handle versioning perhaps - detecting if what they've read is of an older time and decoding it appropriately (while handling newer encoding separately, in the normal fashion). This would be much better than going down the classloader hack paths I think? On Tue, Jan 29, 2013 at 1:11 PM, David Parks davidpark...@yahoo.com wrote: Anyone have any good tricks for upgrading a sequence file. We maintain a sequence file like a flat file DB and the primary object in there changed in recent development. It’s trivial to write a job to read in the sequence file, update the object, and write it back out in the new format. But since sequence files read and write the key/value class I would either need to rename the model object with a version number, or change the header of each sequence file. Just wondering if there are any nice tricks to this. -- Harsh J
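One way to act on the versioning idea Harsh describes is to let the value class carry its own format version. A rough sketch, with the field layout and version constants invented for the example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch of a value class that embeds its own format version so old
// records in a SequenceFile can still be read after the model changes.
public class VersionedRecord implements Writable {
    private static final byte VERSION_1 = 1;   // original layout: just a name
    private static final byte VERSION_2 = 2;   // new layout: name + score

    private String name = "";
    private long score = 0;                    // field added in version 2

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(VERSION_2);              // always write the newest layout
        out.writeUTF(name);
        out.writeLong(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        name = in.readUTF();
        if (version >= VERSION_2) {
            score = in.readLong();             // only present in newer records
        } else {
            score = 0;                         // sensible default for old records
        }
    }
}

The catch is that records written before any version byte existed cannot be distinguished this way, so the very first migration still needs a rewrite job of the kind discussed below.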
Re: number of mapper tasks
Tried looking at your code; it's a bit involved. Instead of trying to run the job, try unit-testing your input format. Test getSplits(): whatever number of splits that method returns, that will be the number of mappers that will run. You can also use LocalJobRunner for this - set mapred.job.tracker to local and run your job locally on your machine instead of trying on a cluster. HTH, +Vinod
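Along the lines of Vinod's suggestion, a quick way to check the split count without submitting anything is to call getSplits() directly. The 100-line value and the input path are placeholders, and CSVNLineInputFormat is assumed to be the custom format from the GitHub links earlier in the thread. (Incidentally, one likely explanation for the single mapper with zipped input is that gzip is not a splittable codec, so a gzip-compressed file normally becomes a single split.)

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat; // from the CSVInputFormat project

// Rough sketch of checking how many splits (and hence mappers) an
// input format would produce, without running a full job.
public class SplitCountCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 100); // example value
        // conf.set("mapred.job.tracker", "local");  // LocalJobRunner, if a job is actually submitted

        Job job = new Job(conf, "split-count-check");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        CSVNLineInputFormat format = new CSVNLineInputFormat();
        List<InputSplit> splits = format.getSplits(job);
        System.out.println("Number of splits: " + splits.size());
    }
}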
Re: Initial Permission Settings
Please check your dfs umask (the dfs.umask configuration property). HTH, +Vinod On Tue, Jan 29, 2013 at 12:02 PM, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: Hi all, Quick question about the hadoop dfs -put local_file hdfs_file command. It seems that regardless of the permissions for local_file, the initial permissions for hdfs_file are -rw-r--r--, which allow the file to be read by other users. Is there a way to change the initial file settings? Thanks in advance. Serge -- +Vinod Hortonworks Inc. http://hortonworks.com/
Re: Issue with Reduce Side join using datajoin package
Seems like a bug in your code, can you share the source here? +Vinod On Tue, Jan 29, 2013 at 4:00 AM, Vikas Jadhav vikascjadha...@gmail.com wrote: I am using Hadoop 1.0.3 and I am getting the following error: 13/01/29 06:55:19 INFO mapred.JobClient: Task Id : attempt_201301290120_0006_r_00_0, Status : FAILED java.lang.NullPointerException at MyJoin$TaggedWritable.readFields(MyJoin.java:101) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67) at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40) at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1271) at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1211) at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:249) at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:245) at org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.regroup(DataJoinReducerBase.java:106) at org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.reduce(DataJoinReducerBase.java:129) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) It is pointing to the line String dataClz = in.readUTF(); in readFields:

public void readFields(DataInput in) throws IOException {
    this.tag.readFields(in);
    String dataClz = in.readUTF();   // the error log shows this line is the culprit
    try {
        // try-catch is needed because the error "unreported exception
        // ClassNotFoundException; must be caught or declared to be thrown"
        // is raised by the compiler otherwise
        if (this.data == null || !this.data.getClass().getName().equals(dataClz)) {
            this.data = (Writable) ReflectionUtils.newInstance(Class.forName(dataClz), null);
        }
        this.data.readFields(in);
    } catch (ClassNotFoundException cnfe) {
        System.out.println("Problem in TaggedWritable class, method readFields.");
    }
} // end readFields

-- Thanks and Regards, Vikas Jadhav -- +Vinod Hortonworks Inc. http://hortonworks.com/
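For what it's worth, NPEs in this area often mean some field used inside readFields() was never initialized on the code path the framework takes: reduce-side values are created through the no-argument constructor (and instances may be reused between values), so anything that line can touch needs to be non-null by then, and that is worth double-checking. Purely as an illustration, and not a reconstruction of the MyJoin class above, a TaggedWritable arranged along these lines (assuming the usual TaggedMapOutput base class from the datajoin examples) initializes everything readFields() touches:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only: the framework instantiates this class reflectively via the
// no-arg constructor before calling readFields(), so every field touched
// there must already be usable.
public class TaggedWritable extends TaggedMapOutput {
    private Writable data;

    public TaggedWritable() {
        this.tag = new Text();          // keeps this.tag.readFields(in) safe
        this.data = null;               // lazily created from the class name read below
    }

    public TaggedWritable(Writable data) {
        this.tag = new Text("");
        this.data = data;
    }

    @Override
    public Writable getData() {
        return data;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        this.tag.write(out);
        out.writeUTF(this.data.getClass().getName()); // record the concrete value class
        this.data.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.tag.readFields(in);
        String dataClz = in.readUTF();
        try {
            if (this.data == null || !this.data.getClass().getName().equals(dataClz)) {
                this.data = (Writable) ReflectionUtils.newInstance(Class.forName(dataClz), null);
            }
            this.data.readFields(in);
        } catch (ClassNotFoundException cnfe) {
            throw new IOException("Unknown value class in TaggedWritable: " + dataClz, cnfe);
        }
    }
}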
Re: ClientProtocol Version mismatch. (client = 69, server = 1)
Please take this up on the CDH mailing list. Most likely you are using a client that is not from the 2.0 release of Hadoop. On Tue, Jan 29, 2013 at 12:33 PM, Kim Chew kchew...@gmail.com wrote: I am using a CDH4 (2.0.0-mr1-cdh4.1.2) vm running on my mbp. I was trying to invoke a remote method in the ClientProtocol via RPC, however I am getting this exception. 2013-01-29 11:20:45,810 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:training (auth:SIMPLE) cause:org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 69, server = 1) 2013-01-29 11:20:45,810 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 192.168.140.1:50597: error: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 69, server = 1) org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 69, server = 1) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine.java:400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:435) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687) I could understand if the Server's ClientProtocol had a version number of 60 or something else, but how could it have a version number of 1? Thanks. Kim -- http://hortonworks.com/download/
Re: Initial Permission Settings
Thanks for the response. I am still a bit confused. There are two parameters: dfs.umask and dfs.umaskmode, and both of them seem to be deprecated in different Hadoop versions. Does anybody know what's going on with those two parameters? Thanks Serge On Jan 29, 2013, at 12:10 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Please check your dfs umask (dfs.umask configuration property). HTH, +Vinod On Tue, Jan 29, 2013 at 12:02 PM, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: Hi all, Quick question about hadoop dfs -put local_file hdfs_file command It seems that regardless of permissions for local_file the initial permissions for hdfs_file are -rw-r--r--, which allow file to be read by other users. Is there are way to change the initial file settings? Thanks in advance. Serge -- +Vinod Hortonworks Inc. http://hortonworks.com/
Re: Using distcp with Hadoop HA
Currently, as you have pointed out, client side configuration based failover is used in HA setup. The configuration must define namenode addresses for the nameservices of both the clusters. Are the datanodes belonging to the two clusters running on the same set of nodes? Can you share the configuration you are using, to diagnose the problem? - I am trying to do a distcp from cluster A to cluster B. Since no operations are supported on the standby namenode, I need to specify either the active namenode while using distcp or use the failover proxy provider (dfs.client.failover.proxy.provider.clusterA) where I can specify the two namenodes for cluster B and the failover code inside HDFS will figure it out. - If I use the failover proxy provider, some of my datanodes on cluster A would connect to the namenode on cluster B and vice versa. I am assuming that is because I have configured both nameservices in my hdfs-site.xml for distcp to work.. I have configured dfs.nameservice.id to be the right one but the datanodes do not seem to respect that. What is the best way to use distcp with Hadoop HA configuration without having the datanodes to connect to the remote namenode? Thanks Regards, Dhaval -- http://hortonworks.com/download/
Re: Initial Permission Settings
dfs.umask has been deprecated in favor of dfs.umaskmode. Typically a umask is represented as an octal, whereas dfs.umask was represented as a decimal. dfs.umaskmode accepts octal notation, as is preferred. For details, feel free to read up on https://issues.apache.org/jira/browse/HADOOP-6234 -Michael On Tue, Jan 29, 2013 at 5:47 PM, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: Thanks for response. I am still a bit confused. There are two parameters: dfs.umask and dfs.umaskmode and both of them seems to be deprecated in different hadoop versions. Does anybody know whats going on with those two parameters? Thanks Serge On Jan 29, 2013, at 12:10 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Please check your dfs umask (dfs.umask configuration property). HTH, +Vinod On Tue, Jan 29, 2013 at 12:02 PM, Serge Blazhiyevskyy serge.blazhiyevs...@nice.com wrote: Hi all, Quick question about hadoop dfs -put local_file hdfs_file command It seems that regardless of permissions for local_file the initial permissions for hdfs_file are -rw-r--r--, which allow file to be read by other users. Is there are way to change the initial file settings? Thanks in advance. Serge -- +Vinod Hortonworks Inc. http://hortonworks.com/
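If the underlying goal is simply different default permissions for files a client creates, the umask can be overridden in that client's configuration. A small sketch of the -put equivalent through the Java API follows; the 002 value is only an example, and the key name varies between releases (dfs.umask on older ones, dfs.umaskmode here, fs.permissions.umask-mode on newer 2.x releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithUmask {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Octal umask; 002 would make new files group-writable (example value only).
        conf.set("dfs.umaskmode", "002");

        FileSystem fs = FileSystem.get(conf);
        // Rough equivalent of "hadoop dfs -put local_file hdfs_file" for this client.
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        fs.close();
    }
}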
Re: Using distcp with Hadoop HA
No the datanodes are running on different sets of machines. The configuration looks like this: The problem is that datanodes in clusterA are trying to connect to namenodes in clusterB (and this seems random.. like it trying to randomly select from the 4 namenodes)

<property>
  <name>dfs.nameservices</name>
  <value>clusterA,clusterB</value>
  <description>Comma-separated list of nameservices.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.nameservice.id</name>
  <value>clusterA</value>
  <description>The ID of this nameservice. If the nameservice ID is not configured or more than one nameservice is configured for dfs.nameservices it is determined automatically by matching the local node's address with the configured address.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.ha.namenodes.clusterA</name>
  <value>clusterAnn1,clusterAnn2</value>
  <description>The prefix for a given nameservice, contains a comma-separated list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterA.clusterAnn1</name>
  <value>clusterAnn1:8000</value>
  <description>Set the full address and IPC port of the NameNode process</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterA.clusterAnn2</name>
  <value>clusterAnn1:8000</value>
  <description>Set the full address and IPC port of the NameNode process</description>
  <final>true</final>
</property>
<property>
  <name>dfs.ha.namenodes.clusterB</name>
  <value>clusterBnn1,clusterBnn2</value>
  <description>The prefix for a given nameservice, contains a comma-separated list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.clusterBnn1</name>
  <value>clusterBnn1:8000</value>
  <description>Set the full address and IPC port of the NameNode process</description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.clusterBnn2</name>
  <value>clusterBnn2:8000</value>
  <description>Set the full address and IPC port of the NameNode process</description>
  <final>true</final>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterA</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Configure the name of the Java class which the DFS Client will use to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests.</description>
  <final>true</final>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Configure the name of the Java class which the DFS Client will use to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests.</description>
  <final>true</final>
</property>

Regards, Dhaval From: Suresh Srinivas sur...@hortonworks.com To: hdfs-u...@hadoop.apache.org user@hadoop.apache.org; Dhaval Shah prince_mithi...@yahoo.co.in Sent: Tuesday, 29 January 2013 6:03 PM Subject: Re: Using distcp with Hadoop HA Currently, as you have pointed out, client side configuration based failover is used in HA setup. The configuration must define namenode addresses for the nameservices of both the clusters. Are the datanodes belonging to the two clusters running on the same set of nodes? Can you share the configuration you are using, to diagnose the problem? - I am trying to do a distcp from cluster A to cluster B.
Since no operations are supported on the standby namenode, I need to specify either the active namenode while using distcp or use the failover proxy provider (dfs.client.failover.proxy.provider.clusterA) where I can specify the two namenodes for cluster B and the failover code inside HDFS will figure it out. - If I use the failover proxy provider, some of my datanodes on cluster A would connect to the namenode on cluster B and vice versa. I am assuming that is because I have configured both nameservices in my hdfs-site.xml for distcp to work.. I have configured dfs.nameservice.id to be the right one but the datanodes do not seem to respect that. What is the best way to use distcp with Hadoop HA configuration without having the datanodes to connect to the remote namenode? Thanks Regards, Dhaval -- http://hortonworks.com/download/
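On the client side, which is what distcp ultimately goes through, the logical nameservice name (not a namenode hostname) is what belongs in the URI, and the ConfiguredFailoverProxyProvider from the configuration above resolves it to whichever namenode is active. A minimal sketch, assuming the clusterB nameservice name from Dhaval's configuration:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up hdfs-site.xml with both nameservices

        // "clusterB" is the logical nameservice, not a hostname; the configured
        // failover proxy provider resolves it to whichever namenode is active.
        FileSystem remote = FileSystem.get(URI.create("hdfs://clusterB/"), conf);
        for (FileStatus s : remote.listStatus(new Path("/"))) {
            System.out.println(s.getPath());
        }
        remote.close();
    }
}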
Oozie workflow error - renewing token issue
Oozie question I'm trying to run an Oozie workflow (sqoop action) from the Hue console and it fails every time. No exception in the oozie log but I see this in the Job Tracker log file. Two primary issues seem to be 1. Client mapred tries to renew a token with renewer specified as mr token And 2. Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN Any ideas how to get past this? Full Stacktrace: 2013-01-29 17:11:28,860 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088860, maxDate=136010560, sequenceNumber=75, masterKeyId=8 2013-01-29 17:11:28,871 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088871, maxDate=136010571, sequenceNumber=76, masterKeyId=8 2013-01-29 17:11:29,202 INFO org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal: registering token for renewal for service =10.204.12.62:8021 and jobID = job_201301231648_0029 2013-01-29 17:11:29,211 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Token renewal requested for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088871, maxDate=136010571, sequenceNumber=76, masterKeyId=8 2013-01-29 17:11:29,211 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as mr token 2013-01-29 17:11:29,211 WARN org.apache.hadoop.security.token.Token: Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN 2013-01-29 17:11:29,211 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8021, call renewDelegationToken(Kind: MAPREDUCE_DELEGATION_TOKEN, Service: 10.204.12.62:8021, Ident: 00 04 68 64 66 73 08 6d 72 20 74 6f 6b 65 6e 05 6f 6f 7a 69 65 8a 01 3c 88 94 58 67 8a 01 3c ac a0 dc 67 4c 08), rpc version=2, client version=28, methodsFingerPrint=1830206421 from 10.204.12.62:9706: error: org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as mr token org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as mr token at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274) at org.apache.hadoop.mapred.JobTracker.renewDelegationToken(JobTracker.java:3738) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:474) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687) 2013-01-29 17:11:29,212 ERROR 
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Client mapred tries to renew a token with renewer specified as mr token at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274) at org.apache.hadoop.mapred.JobTracker.renewDelegationToken(JobTracker.java:3738) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:474) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689) at java.security.AccessController.doPrivileged(Native Method) at
RE: Tricks to upgrading Sequence Files?
I'll consider a patch to SequenceFile; if we could manually override the key and value classes that are read from the sequence file headers, we'd have a clean solution. I don't like versioning my Model object because it's used by tens of other classes and I don't want to risk less maintained classes continuing to use an old version. For the time being I just used 2 jobs. First I renamed the old Model Object to the original name, read it in, upgraded it, and wrote the new version with a different class name. Then I renamed the classes again so the new model object used the original name and read in the altered name and cloned it into the original name. All in all only an hour's work, but having a cleaner process would be better. I'll add the request to JIRA at a minimum. Dave -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Wednesday, January 30, 2013 2:32 AM To: user@hadoop.apache.org Subject: Re: Tricks to upgrading Sequence Files? This is a pretty interesting question, but unfortunately there isn't an inbuilt way in SequenceFiles itself to handle this. However, your key/value classes can be made to handle versioning perhaps - detecting if what they've read is of an older time and decoding it appropriately (while handling newer encoding separately, in the normal fashion). This would be much better than going down the classloader hack paths I think? On Tue, Jan 29, 2013 at 1:11 PM, David Parks davidpark...@yahoo.com wrote: Anyone have any good tricks for upgrading a sequence file. We maintain a sequence file like a flat file DB and the primary object in there changed in recent development. It's trivial to write a job to read in the sequence file, update the object, and write it back out in the new format. But since sequence files read and write the key/value class I would either need to rename the model object with a version number, or change the header of each sequence file. Just wondering if there are any nice tricks to this. -- Harsh J
what will happen when HDFS restarts but with some dead nodes
Hi, all I'm wondering: if HDFS is stopped, and some of the machines of the cluster are moved away, some of the block replicas are definitely lost with those machines. When I restart the system, will the namenode recalculate the data distribution? Best, -- Nan Zhu School of Computer Science, McGill University
eclipse plugin
Hi, I wonder whether there is an Eclipse plugin for Hadoop development; I'm using CDH4u1 and Eclipse Indigo. Thank you...
[ANN] First 2013 Munich OpenHUG Meeting
I am pleased to invite you to our next Munich Open Hadoop User Group Meeting! As always, we are looking forward to seeing everyone again and welcome new attendees to join our group. We are enthusiastic about all things related to scalable, distributed storage and database systems. We are not limiting ourselves to a particular system but appreciate anyone who would like to share their experiences. When: Friday, February 22nd, 2013 from 4PM to 8PM Where: T-Systems International GmbH, Dachauer Straße 651, 80995 München, Room EG-148E Thanks to T-Systems (http://www.t-systems.com/) for helping to organize the event and for providing the infrastructure, as well as food and drinks. We have quite a few very interesting talks scheduled: - Ending Confusion on Big Data Technologies: How Use Case Segmentation drives Target Architectures and Technology Selection at Deutsche Telekom, by Juergen Urbanski, Chief Technologist at T-Systems - Low latency data processing with Impala, by Lars George, Director EMEA Services at Cloudera We are looking for further volunteers to submit talks, so if you are working in the Big Data or NoSQL space and would like to give a presentation, please let me know (via email to l...@cloudera.com). Looking forward to seeing you there! Please RSVP here: Xing: https://www.xing.com/events/2013-munich-openhug-meeting-1198136 Best regards, Lars
Re: what will happen when HDFS restarts but with some dead nodes
Hi Nan, The namenode will stay in safemode until all blocks are replicated. During this time, the jobtracker cannot see any tasktrackers (MRv1). Chen On Tue, Jan 29, 2013 at 9:04 PM, Nan Zhu zhunans...@gmail.com wrote: Hi, all I'm wondering if HDFS is stopped, and some of the machines of the cluster are moved, some of the block replication are definitely lost for moving machines when I restart the system, will the namenode recalculate the data distribution? Best, -- Nan Zhu School of Computer Science, McGill University
Re: what will happen when HDFS restarts but with some dead nodes
So, we can assume that all blocks are fully replicated at the start point of HDFS? Best, -- Nan Zhu School of Computer Science, McGill University On Tuesday, 29 January, 2013 at 10:50 PM, Chen He wrote: Hi Nan Namenode will stay in safemode before all blocks are replicated. During this time, the jobtracker can not see any tasktrackers. (MRv1). Chen On Tue, Jan 29, 2013 at 9:04 PM, Nan Zhu zhunans...@gmail.com wrote: Hi, all I'm wondering if HDFS is stopped, and some of the machines of the cluster are moved, some of the block replication are definitely lost for moving machines when I restart the system, will the namenode recalculate the data distribution? Best, -- Nan Zhu School of Computer Science, McGill University
Re: eclipse plugin
Hi YouPeng, I am also wondering the same thing. Does anybody know about an Eclipse plugin for Hadoop? Thanks. On Wed, Jan 30, 2013 at 11:19 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi I wonder whether there is eclipse-plugin for hadoop development,i'm using CDH4u1 eclipse indigo. thank you...