Book - Pro Hadoop 2?

2013-01-29 Thread Marco Shaw
http://www.amazon.com/Pro-Hadoop-2-Jason-Venner/dp/1430248637/

Anyone have any inside information on this book?

Amazon has a date of March 2013, while Apress doesn't list the 2nd edition
at all...

Marco


Re: How to learn Mapreduce

2013-01-29 Thread Olivier Renault
In addition to the book, you might want to take the Hortonworks Sandbox for a
spin. This is a single-node instance of Hadoop designed to help you take
the initial steps towards learning Hadoop and related projects. It comes as
a self-contained VM, including tutorials.

http://hortonworks.com/products/hortonworks-sandbox/

Regards,
Olivier


On 29 January 2013 04:54, Harsh J ha...@cloudera.com wrote:

 I'd recommend discovering a need first - attempt to solve a problem
 and then write a program around it. Hadoop is data-driven, so it's
 rather hard if you try to learn it without working with any data.

 A book may interest you as well: http://wiki.apache.org/hadoop/Books

 On Tue, Jan 29, 2013 at 10:17 AM, abdul wajeed abdul54waj...@gmail.com
 wrote:
  Hi Sir/Madam,
   I am very new to Hadoop technology, so can anyone help me learn
  how to write a simple MapReduce program other than the word count example?
  I would like to write my own programs in MapReduce, as well as in Pig
  Latin, so please, can anyone help me?
 
 Thanks
 
  Regards
  Abdul Wajeed



 --
 Harsh J




-- 
Olivier Renault
Solution Engineer - Big Data - Hortonworks, Inc.
+44 7500 933 036
orena...@hortonworks.com
www.hortonworks.com


Re: number of mapper tasks

2013-01-29 Thread Marcelo Elias Del Valle
Hello,

    I have been able to make this work. I don't know why, but when the
input file is zipped (read as an input stream) it creates only 1 mapper.
However, when it's not zipped, it creates more mappers (running 3 instances
it created 4 mappers, and running 5 instances it created 8 mappers).
    I really would like to know why this happens and, even with this number
of mappers, why more mappers aren't created. I was
reading part of the book Hadoop: The Definitive Guide (
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
which says:

The JobClient calls the getSplits() method, passing the desired number of
map tasks as the numSplits argument. This number is treated as a hint, as
InputFormat implementations are free to return a different number of splits
to the number specified in numSplits. Having calculated the splits, the
client sends them to the jobtracker, which uses their storage locations to
schedule map tasks to process them on the tasktrackers. ...

 I am not sure how to get more info.

 Would you recommend trying to find the answer in the book? Or
should I read the Hadoop source code directly?

Best regards,
Marcelo.


2013/1/29 Marcelo Elias Del Valle mvall...@gmail.com

 I implemented my custom input format. Here is how I used it:

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java

 As you can see, I do:
 importerJob.setInputFormatClass(CSVNLineInputFormat.class);

 And here is the Input format and the linereader:

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

 In this input format, I completely ignore these other parameters and get
 the splits by the number of lines. The number of lines per map can be
 controlled by the same parameter used in NLineInputFormat:

 public static final String LINES_PER_MAP =
 "mapreduce.input.lineinputformat.linespermap";

 However, it has really no effect on the number of maps.
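 For context, a minimal sketch of how that key is normally set on a job (the
 value below is a placeholder); per the observation above it did not change
 the split count with the custom format:

     // Ask for splits of ~500 lines each; CSVNLineInputFormat is assumed to
     // honour the same key as NLineInputFormat.
     job.getConfiguration().setInt(
         "mapreduce.input.lineinputformat.linespermap", 500);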



 2013/1/29 Vinod Kumar Vavilapalli vino...@hortonworks.com


 Regarding your original question, you can use the min and max split
 settings to control the number of maps:
 http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html.
  See #setMinInputSplitSize and #setMaxInputSplitSize. Or
 use mapred.min.split.size directly.
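 For reference, a minimal sketch of those split-size settings (new-API
 FileInputFormat; the byte values and job name are placeholders, not
 recommendations from this thread):

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

     Job job = new Job(new Configuration(), "csv-import");
     // Splits will fall between ~32 MB and ~64 MB; a FileInputFormat-based
     // job launches one map task per split.
     FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
     FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);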

 W.r.t. your custom input format, are you sure your job is using this
 InputFormat and not the default one?

  HTH,
 +Vinod Kumar Vavilapalli
 Hortonworks Inc.
 http://hortonworks.com/

 On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:

 Just to complement the last question, I have implemented the getSplits
 method in my input format:

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

 However, it still doesn't create more than 2 map tasks. Is there
 something I could do about it to assure more map tasks are created?

 Thanks
 Marcelo.


 2013/1/28 Marcelo Elias Del Valle mvall...@gmail.com

 Sorry for asking so many questions, but the answers are really
 helping.


 2013/1/28 Harsh J ha...@cloudera.com

 This seems CPU-oriented. You probably want the NLineInputFormat? See

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
 .
 This should let you spawn more maps as well, based on your N factor.


 Indeed, CPU is my bottleneck. That's why I want more things in parallel.
 Actually, I wrote my own InputFormat, to be able to process multiline
 CSVs: https://github.com/mvallebr/CSVInputFormat
 I could change it to read several lines at a time, but would this alone
 allow more tasks running in parallel?


 Not really - Slots are capacities, rather than split factors
 themselves. You can have N slots always available, but your job has to
 supply as many map tasks (based on its input/needs/etc.) to use them
 up.


 But how can I do that (supply map tasks) in my job? changing its code?
 hadoop config?


 Unless your job sets the number of reducers to 0 manually, 1 default
 reducer is always run that waits to see if it has any outputs from
 maps. If it does not receive any outputs after maps have all
 completed, it dies out with behavior equivalent to a NOP.

 Ok, I did job.setNumReduceTasks(0); , guess this will solve this part,
 thanks!


 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr





 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr




-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Using distcp with Hadoop HA

2013-01-29 Thread Dhaval Shah
Hello everyone. I am trying to use distcp with a Hadoop HA configuration (using
CDH4.0.0 at the moment). Here is my problem:
- I am trying to do a distcp from cluster A to cluster B. Since no operations
are supported on the standby namenode, I need to specify either the active
namenode while using distcp or use the failover proxy provider
(dfs.client.failover.proxy.provider.clusterA), where I can specify the two
namenodes for cluster B and the failover code inside HDFS will figure it out.
- If I use the failover proxy provider, some of my datanodes on cluster A would
connect to the namenode on cluster B and vice versa. I am assuming that is
because I have configured both nameservices in my hdfs-site.xml for distcp to
work. I have configured dfs.nameservice.id to be the right one, but the
datanodes do not seem to respect that.

What is the best way to use distcp with a Hadoop HA configuration without having
the datanodes connect to the remote namenode? Thanks
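For reference, a minimal sketch of the first option (pointing distcp at the
currently active namenode of the remote cluster explicitly; host name, port
and paths below are placeholders, not taken from this thread):

    hadoop distcp hdfs://clusterA/src/data \
        hdfs://active-nn-of-clusterB:8020/dst/data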
 
Regards,
Dhaval

Re: Multiple reduce task retries running at same time

2013-01-29 Thread Harsh J
Hi Ben,

Take a look at the Speculative Execution feature of MR which should
answer your question. See its section under 'Fault Tolerance' here:
http://developer.yahoo.com/hadoop/tutorial/module4.html#tolerence.
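If the duplicate attempts are not wanted, a minimal sketch of turning
speculative execution off (Hadoop 1.x property names, set on the job's
configuration; whether this is a good idea depends on your workload):

    // Disable speculative (duplicate) attempts for reduces and, optionally, maps.
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);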

On Tue, Jan 29, 2013 at 1:08 PM, Ben Kim benkimkim...@gmail.com wrote:
 Attached a screenshot showing the retries


 On Tue, Jan 29, 2013 at 4:35 PM, Ben Kim benkimkim...@gmail.com wrote:

 Hi!

 I have come across a situation where I found a single reducer task
 executing with multiple retries simultaneously, which has the potential to
 slow down the whole reduce process for large data sets.

 Is this normal behaviour for Hadoop 1.0.3?

 --

 Benjamin Kim
 benkimkimben at gmail




 --

 Benjamin Kim
 benkimkimben at gmail



-- 
Harsh J


Re: Tricks to upgrading Sequence Files?

2013-01-29 Thread Harsh J
This is a pretty interesting question, but unfortunately there isn't
an inbuilt way in SequenceFiles itself to handle this. However, your
key/value classes can be made to handle versioning perhaps - detecting
if what they've read is of an older time and decoding it appropriately
(while handling newer encoding separately, in the normal fashion).
This would be much better than going down the classloader hack paths I
think?
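A minimal sketch of that idea, assuming the value class can be evolved and
that it already writes a marker the new code can branch on (class name and
fields below are illustrative, not from this thread):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class MyRecord implements Writable {
      // Bump this whenever the serialized layout changes.
      private static final byte CURRENT_VERSION = 2;

      private String name;     // present since v1
      private long timestamp;  // added in v2

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeByte(CURRENT_VERSION);
        out.writeUTF(name);
        out.writeLong(timestamp);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        name = in.readUTF();
        // v1 records never wrote the timestamp, so default it instead of
        // reading past the end of the value bytes.
        timestamp = version >= 2 ? in.readLong() : 0L;
      }
    }

If the very first version never reserved such a marker, the rewrite job
described in the quoted message below is probably still the simpler route.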

On Tue, Jan 29, 2013 at 1:11 PM, David Parks davidpark...@yahoo.com wrote:
 Anyone have any good tricks for upgrading a sequence file.



 We maintain a sequence file like a flat file DB and the primary object in
 there changed in recent development.



 It’s trivial to write a job to read in the sequence file, update the object,
 and write it back out in the new format.



 But since sequence files read and write the key/value class I would either
 need to rename the model object with a version number, or change the header
 of each sequence file.



 Just wondering if there are any nice tricks to this.



-- 
Harsh J


Re: number of mapper tasks

2013-01-29 Thread Vinod Kumar Vavilapalli
Tried looking at your code; it's a bit involved. Instead of trying to run
the job, try unit-testing your input format. Test getSplits(): whatever
number of splits that method returns will be the number of mappers
that run.

You can also use LocalJobRunner for this - set mapred.job.tracker to
local and run your job locally on your machine instead of trying it on a
cluster.
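A minimal sketch of such a test, assuming the CSVNLineInputFormat from the
repository linked in the quoted message below and a small local sample file
(path, line count and package are assumptions):

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class CSVNLineInputFormatTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same key NLineInputFormat uses; the value is an example only.
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 100);
        // "local" keeps everything in-process (LocalJobRunner), no cluster needed.
        conf.set("mapred.job.tracker", "local");

        Job job = new Job(conf);
        FileInputFormat.addInputPath(job, new Path("/tmp/sample.csv"));

        CSVNLineInputFormat format = new CSVNLineInputFormat();
        List<InputSplit> splits = format.getSplits(job);
        // One map task will be launched per split.
        System.out.println("splits = " + splits.size());
      }
    }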

HTH,
+Vinod



On Tue, Jan 29, 2013 at 4:53 AM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 Hello,

 I have been able to make this work. I don't know why, but when the
 input file is zipped (read as an input stream) it creates only 1 mapper.
 However, when it's not zipped, it creates more mappers (running 3 instances
 it created 4 mappers, and running 5 instances it created 8 mappers).
 I really would like to know why this happens and, even with this number
 of mappers, why more mappers aren't created. I was
 reading part of the book Hadoop: The Definitive Guide (
 https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
 which says:

 The JobClient calls the getSplits() method, passing the desired number
 of map tasks as the numSplits argument. This number is treated as a hint,
 as InputFormat implementations are free to return a different number of
 splits to the number specified in numSplits. Having calculated the
 splits, the client sends them to the jobtracker, which uses their storage
 locations to schedule map tasks to process them on the tasktrackers. ...

  I am not sure how to get more info.

  Would you recommend trying to find the answer in the book? Or
 should I read the Hadoop source code directly?

 Best regards,
 Marcelo.


 2013/1/29 Marcelo Elias Del Valle mvall...@gmail.com

 I implemented my custom input format. Here is how I used it:

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java

 As you can see, I do:
 importerJob.setInputFormatClass(CSVNLineInputFormat.class);

 And here is the Input format and the linereader:

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java

 In this input format, I completely ignore these other parameters and get
 the splits by the number of lines. The number of lines per map can be
 controlled by the same parameter used in NLineInputFormat:

 public static final String LINES_PER_MAP =
 "mapreduce.input.lineinputformat.linespermap";

 However, it has really no effect on the number of maps.



 2013/1/29 Vinod Kumar Vavilapalli vino...@hortonworks.com


 Regarding your original question, you can use the min and max split
 settings to control the number of maps:
 http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html.
  See #setMinInputSplitSize and #setMaxInputSplitSize. Or
 use mapred.min.split.size directly.

 W.r.t. your custom input format, are you sure your job is using this
 InputFormat and not the default one?

  HTH,
 +Vinod Kumar Vavilapalli
 Hortonworks Inc.
 http://hortonworks.com/

 On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:

 Just to complement the last question, I have implemented the getSplits
 method in my input format:

 https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java

 However, it still doesn't create more than 2 map tasks. Is there
 something I could do about it to assure more map tasks are created?

 Thanks
 Marcelo.


 2013/1/28 Marcelo Elias Del Valle mvall...@gmail.com

 Sorry for asking so many questions, but the answers are really
 helping.


 2013/1/28 Harsh J ha...@cloudera.com

 This seems CPU-oriented. You probably want the NLineInputFormat? See

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
 .
 This should let you spawn more maps as well, based on your N factor.


 Indeed, CPU is my bottleneck. That's why I want more things in parallel.
 Actually, I wrote my own InputFormat, to be able to process multiline
 CSVs: https://github.com/mvallebr/CSVInputFormat
 I could change it to read several lines at a time, but would this alone
 allow more tasks running in parallel?


 Not really - Slots are capacities, rather than split factors
 themselves. You can have N slots always available, but your job has to
 supply as many map tasks (based on its input/needs/etc.) to use them
 up.


 But how can I do that (supply map tasks) in my job? changing its code?
 hadoop config?


 Unless your job sets the number of reducers to 0 manually, 1 default
 reducer is always run that waits to see if it has any outputs from
 maps. If it does not receive any outputs after maps have all
 completed, it dies out with behavior equivalent to a NOP.

Re: Initial Permission Settings

2013-01-29 Thread Vinod Kumar Vavilapalli
Please check your dfs umask (dfs.umask configuration property).

HTH,
+Vinod


On Tue, Jan 29, 2013 at 12:02 PM, Serge Blazhiyevskyy 
serge.blazhiyevs...@nice.com wrote:

 Hi all,

 Quick question about hadoop dfs -put local_file hdfs_file command


 It seems that regardless of the permissions of local_file, the initial
 permissions for hdfs_file are -rw-r--r--, which allows the file to be read by
 other users.

 Is there a way to change the initial permission settings?

 Thanks in advance.
 Serge





-- 
+Vinod
Hortonworks Inc.
http://hortonworks.com/


Re: Issue with Reduce Side join using datajoin package

2013-01-29 Thread Vinod Kumar Vavilapalli
Seems like a bug in your code, can you share the source here?

+Vinod


On Tue, Jan 29, 2013 at 4:00 AM, Vikas Jadhav vikascjadha...@gmail.comwrote:

 I am using Hadoop 1.0.3

 I am getting the following error:


 13/01/29 06:55:19 INFO mapred.JobClient: Task Id :
 attempt_201301290120_0006_r_00_0, Status : FAILED
 java.lang.NullPointerException
 at MyJoin$TaggedWritable.readFields(MyJoin.java:101)
 at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
 at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
 at
 org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1271)
 at
 org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1211)
 at
 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:249)
 at
 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:245)
 at
 org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.regroup(DataJoinReducerBase.java:106)
 at
 org.apache.hadoop.contrib.utils.join.DataJoinReducerBase.reduce(DataJoinReducerBase.java:129)
 at
 org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)




 The error log points to the line "String dataClz = in.readUTF();" in readFields:


 public void readFields(DataInput in) throws IOException {
     this.tag.readFields(in);

     // String dataClz = in.readUTF();
     String dataClz = in.readUTF();  // error log shows this line is the culprit
     try {
         // try-catch is needed because otherwise the compiler raises the
         // error: unreported exception ClassNotFoundException; must be
         // caught or declared to be thrown
         if (this.data == null
                 || !this.data.getClass().getName().equals(dataClz)) {
             // this line of code raises the compile error mentioned above
             this.data = (Writable) ReflectionUtils.newInstance(
                     Class.forName(dataClz), null);
         }
         this.data.readFields(in);
     } catch (ClassNotFoundException cnfe) {
         System.out.println("Problem in TaggedWritable class, method readFields.");
     }
 }  // end readFields


 --
 Thanks and Regards,
 Vikas Jadhav




-- 
+Vinod
Hortonworks Inc.
http://hortonworks.com/


Re: ClientProtocol Version mismatch. (client = 69, server = 1)

2013-01-29 Thread Suresh Srinivas
Please take this up on the CDH mailing list. Most likely you are using a client
that is not from the 2.0 release of Hadoop.


On Tue, Jan 29, 2013 at 12:33 PM, Kim Chew kchew...@gmail.com wrote:

 I am using a CDH4 (2.0.0-mr1-cdh4.1.2) VM running on my MBP.

 I was trying to invoke a remote method in the ClientProtocol via RPC;
 however, I am getting this exception:

 2013-01-29 11:20:45,810 ERROR
 org.apache.hadoop.security.UserGroupInformation:
 PriviledgedActionException as:training (auth:SIMPLE)
 cause:org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
 org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
 (client = 69, server = 1)
 2013-01-29 11:20:45,810 INFO org.apache.hadoop.ipc.Server: IPC Server
 handler 6 on 8020, call
 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from
 192.168.140.1:50597: error: org.apache.hadoop.ipc.RPC$VersionMismatch:
 Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version
 mismatch. (client = 69, server = 1)
 org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol
 org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
 (client = 69, server = 1)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine.java:400)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:435)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

 I could understand it if the server's ClientProtocol had version number
 60 or something else, but how could it have a version number of 1?

 Thanks.

 Kim




-- 
http://hortonworks.com/download/


Re: Initial Permission Settings

2013-01-29 Thread Serge Blazhiyevskyy
Thanks for response.


I am still a bit confused.

There are two parameters, dfs.umask and dfs.umaskmode, and both of them seem to
be deprecated in different Hadoop versions.

Does anybody know what's going on with those two parameters?


Thanks
Serge




On Jan 29, 2013, at 12:10 PM, Vinod Kumar Vavilapalli
vino...@hortonworks.com wrote:


Please check your dfs umask (dfs.umask configuration property).

HTH,
+Vinod


On Tue, Jan 29, 2013 at 12:02 PM, Serge Blazhiyevskyy
serge.blazhiyevs...@nice.com wrote:
Hi all,

Quick question about hadoop dfs -put local_file hdfs_file command


It seems that regardless of the permissions of local_file, the initial permissions
for hdfs_file are -rw-r--r--, which allows the file to be read by other users.

Is there a way to change the initial permission settings?

Thanks in advance.
Serge





--
+Vinod
Hortonworks Inc.
http://hortonworks.com/



Re: Using distcp with Hadoop HA

2013-01-29 Thread Suresh Srinivas
Currently, as you have pointed out, client-side configuration-based
failover is used in an HA setup. The configuration must define the namenode
addresses for the nameservices of both clusters. Are the datanodes
belonging to the two clusters running on the same set of nodes? Can you
share the configuration you are using, to diagnose the problem?

- I am trying to do a distcp from cluster A to cluster B. Since no
 operations are supported on the standby namenode, I need to specify either
 the active namenode while using distcp or use the failover proxy provider
 (dfs.client.failover.proxy.provider.clusterA) where I can specify the two
 namenodes for cluster B and the failover code inside HDFS will figure it
 out.



 - If I use the failover proxy provider, some of my datanodes on cluster A
 would connect to the namenode on cluster B and vice versa. I am assuming
 that is because I have configured both nameservices in my hdfs-site.xml for
 distcp to work. I have configured dfs.nameservice.id to be the right one,
 but the datanodes do not seem to respect that.

 What is the best way to use distcp with a Hadoop HA configuration without
 having the datanodes connect to the remote namenode? Thanks

 Regards,
 Dhaval




-- 
http://hortonworks.com/download/


Re: Initial Permission Settings

2013-01-29 Thread Michael Katzenellenbogen
dfs.umask has been deprecated in favor of dfs.umaskmode

Typically a umask is represented as an octal, whereas dfs.umask was
represented as a decimal. dfs.umaskmode accepts octal notation, as is
preferred. For details, feel free to read up on
https://issues.apache.org/jira/browse/HADOOP-6234
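For reference, a minimal hdfs-site.xml sketch of setting it (the octal value
below is only an example, not a recommendation from this thread):

    <property>
      <name>dfs.umaskmode</name>
      <value>077</value>
      <description>Umask applied when files and directories are created;
      077 strips all group and other permissions.</description>
    </property>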

-Michael

On Tue, Jan 29, 2013 at 5:47 PM, Serge Blazhiyevskyy 
serge.blazhiyevs...@nice.com wrote:

 Thanks for response.


 I am still a bit confused.

 There are two parameters, dfs.umask and dfs.umaskmode, and both of them
 seem to be deprecated in different Hadoop versions.

 Does anybody know what's going on with those two parameters?


 Thanks
 Serge




 On Jan 29, 2013, at 12:10 PM, Vinod Kumar Vavilapalli
 vino...@hortonworks.com wrote:


 Please check your dfs umask (dfs.umask configuration property).

 HTH,
 +Vinod


 On Tue, Jan 29, 2013 at 12:02 PM, Serge Blazhiyevskyy
 serge.blazhiyevs...@nice.com wrote:
 Hi all,

 Quick question about hadoop dfs -put local_file hdfs_file command


 It seems that regardless of the permissions of local_file, the initial
 permissions for hdfs_file are -rw-r--r--, which allows the file to be read by
 other users.

 Is there a way to change the initial permission settings?

 Thanks in advance.
 Serge





 --
 +Vinod
 Hortonworks Inc.
 http://hortonworks.com/




Re: Using distcp with Hadoop HA

2013-01-29 Thread Dhaval Shah
No, the datanodes are running on different sets of machines. The problem is that
datanodes in clusterA are trying to connect to namenodes in clusterB (and this
seems random - it looks like they randomly select from the 4 namenodes). The
configuration looks like this:

<property>
  <name>dfs.nameservices</name>
  <value>clusterA,clusterB</value>
  <description>
    Comma-separated list of nameservices.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.nameservice.id</name>
  <value>clusterA</value>
  <description>
    The ID of this nameservice. If the nameservice ID is not
    configured or more than one nameservice is configured for
    dfs.nameservices it is determined automatically by
    matching the local node's address with the configured address.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.ha.namenodes.clusterA</name>
  <value>clusterAnn1,clusterAnn2</value>
  <description>
    The prefix for a given nameservice, contains a comma-separated
    list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterA.clusterAnn1</name>
  <value>clusterAnn1:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterA.clusterAnn2</name>
  <value>clusterAnn1:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.ha.namenodes.clusterB</name>
  <value>clusterBnn1,clusterBnn2</value>
  <description>
    The prefix for a given nameservice, contains a comma-separated
    list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.clusterBnn1</name>
  <value>clusterBnn1:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.rpc-address.clusterB.clusterBnn2</name>
  <value>clusterBnn2:8000</value>
  <description>
    Set the full address and IPC port of the NameNode process
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterA</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>
    Configure the name of the Java class which the DFS Client will
    use to determine which NameNode is the current Active,
    and therefore which NameNode is currently serving client requests.
  </description>
  <final>true</final>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.clusterB</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>
    Configure the name of the Java class which the DFS Client will
    use to determine which NameNode is the current Active,
    and therefore which NameNode is currently serving client requests.
  </description>
  <final>true</final>
</property>
 
Regards,
Dhaval



 From: Suresh Srinivas sur...@hortonworks.com
To: hdfs-u...@hadoop.apache.org user@hadoop.apache.org; Dhaval Shah 
prince_mithi...@yahoo.co.in 
Sent: Tuesday, 29 January 2013 6:03 PM
Subject: Re: Using distcp with Hadoop HA
 

Currently, as you have pointed out, client-side configuration-based failover is
used in an HA setup. The configuration must define the namenode addresses for the
nameservices of both clusters. Are the datanodes belonging to the two
clusters running on the same set of nodes? Can you share the configuration you
are using, to diagnose the problem?

- I am trying to do a distcp from cluster A to cluster B. Since no operations 
are supported on the standby namenode, I need to specify either the active 
namenode while using distcp or use the failover proxy provider 
(dfs.client.failover.proxy.provider.clusterA) where I can specify the two 
namenodes for cluster B and the failover code inside HDFS will figure it out.
 
- If I use the failover proxy provider, some of my datanodes on cluster A would 
connect to the namenode on cluster B and vice versa. I am assuming that is 
because I have configured both nameservices in my hdfs-site.xml for distcp to 
work. I have configured dfs.nameservice.id to be the right one, but the
datanodes do not seem to respect that.


What is the best way to use distcp with a Hadoop HA configuration without having
the datanodes connect to the remote namenode? Thanks
 
Regards,
Dhaval


-- 
http://hortonworks.com/download/

Oozie workflow error - renewing token issue

2013-01-29 Thread Corbett Martin
Oozie question

I'm trying to run an Oozie workflow (Sqoop action) from the Hue console and it
fails every time. There is no exception in the Oozie log, but I see this in the
JobTracker log file.

The two primary issues seem to be:

1. Client mapred tries to renew a token with renewer specified as mr token

2. Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN

Any ideas on how to get past this?

Full Stacktrace:

2013-01-29 17:11:28,860 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Creating password for identifier: owner=hdfs, renewer=mr token, 
realUser=oozie, issueDate=1359501088860, maxDate=136010560, 
sequenceNumber=75, masterKeyId=8
2013-01-29 17:11:28,871 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Creating password for identifier: owner=hdfs, renewer=mr token, 
realUser=oozie, issueDate=1359501088871, maxDate=136010571, 
sequenceNumber=76, masterKeyId=8
2013-01-29 17:11:29,202 INFO 
org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal: registering 
token for renewal for service =10.204.12.62:8021 and jobID = 
job_201301231648_0029
2013-01-29 17:11:29,211 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Token renewal requested for identifier: owner=hdfs, renewer=mr token, 
realUser=oozie, issueDate=1359501088871, maxDate=136010571, 
sequenceNumber=76, masterKeyId=8
2013-01-29 17:11:29,211 ERROR org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:mapred (auth:SIMPLE) 
cause:org.apache.hadoop.security.AccessControlException: Client mapred tries to 
renew a token with renewer specified as mr token
2013-01-29 17:11:29,211 WARN org.apache.hadoop.security.token.Token: Cannot 
find class for token kind MAPREDUCE_DELEGATION_TOKEN
2013-01-29 17:11:29,211 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 
on 8021, call renewDelegationToken(Kind: MAPREDUCE_DELEGATION_TOKEN, Service: 
10.204.12.62:8021, Ident: 00 04 68 64 66 73 08 6d 72 20 74 6f 6b 65 6e 05 6f 6f 
7a 69 65 8a 01 3c 88 94 58 67 8a 01 3c ac a0 dc 67 4c 08), rpc version=2, 
client version=28, methodsFingerPrint=1830206421 from 10.204.12.62:9706: error: 
org.apache.hadoop.security.AccessControlException: Client mapred tries to renew 
a token with renewer specified as mr token
org.apache.hadoop.security.AccessControlException: Client mapred tries to renew 
a token with renewer specified as mr token
  at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274)
  at 
org.apache.hadoop.mapred.JobTracker.renewDelegationToken(JobTracker.java:3738)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at 
org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:474)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
2013-01-29 17:11:29,212 ERROR org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:mapred (auth:SIMPLE) 
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
 Client mapred tries to renew a token with renewer specified as mr token
  at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:274)
  at 
org.apache.hadoop.mapred.JobTracker.renewDelegationToken(JobTracker.java:3738)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at 
org.apache.hadoop.ipc.WritableRpcEngine$Server$WritableRpcInvoker.call(WritableRpcEngine.java:474)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
  at java.security.AccessController.doPrivileged(Native Method)
  at 

RE: Tricks to upgrading Sequence Files?

2013-01-29 Thread David Parks
I'll consider a patch to SequenceFile; if we could manually override the
key and value classes that are read from the sequence file headers, we'd have
a clean solution.

I don't like versioning my model object because it's used by tens of other
classes and I don't want to risk less-maintained classes continuing to use
an old version.

For the time being I just used two jobs. First I renamed the old model object
to the original name, read it in, upgraded it, and wrote the new version out
with a different class name.

Then I renamed the classes again so the new model object used the original
name, read in the data under the altered name, and cloned it into the original
name.

All in all only an hour's work, but having a cleaner process would be better.
I'll add the request to JIRA at a minimum.

Dave


-Original Message-
From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Wednesday, January 30, 2013 2:32 AM
To: user@hadoop.apache.org
Subject: Re: Tricks to upgrading Sequence Files?

This is a pretty interesting question, but unfortunately there isn't an
inbuilt way in SequenceFiles itself to handle this. However, your key/value
classes can be made to handle versioning perhaps - detecting if what they've
read is of an older time and decoding it appropriately (while handling newer
encoding separately, in the normal fashion).
This would be much better than going down the classloader hack paths I
think?

On Tue, Jan 29, 2013 at 1:11 PM, David Parks davidpark...@yahoo.com wrote:
 Anyone have any good tricks for upgrading a sequence file.



 We maintain a sequence file like a flat file DB and the primary object 
 in there changed in recent development.



 It's trivial to write a job to read in the sequence file, update the 
 object, and write it back out in the new format.



 But since sequence files read and write the key/value class I would 
 either need to rename the model object with a version number, or 
 change the header of each sequence file.



 Just wondering if there are any nice tricks to this.



--
Harsh J



what will happen when HDFS restarts but with some dead nodes

2013-01-29 Thread Nan Zhu
Hi, all 

I'm wondering: if HDFS is stopped and some of the machines of the cluster are
moved, some of the block replicas are definitely lost with the moved machines.

When I restart the system, will the namenode recalculate the data distribution?

Best, 

-- 
Nan Zhu
School of Computer Science,
McGill University




eclipse plugin

2013-01-29 Thread YouPeng Yang
Hi

 I wonder whether there is an Eclipse plugin for Hadoop development; I'm using
CDH4u1 and Eclipse Indigo.

 thank you...


[ANN] First 2013 Munich OpenHUG Meeting

2013-01-29 Thread Lars George
I am pleased to invite you to our next Munich Open Hadoop User Group Meeting!

As always, we are looking forward to seeing everyone again, and we welcome new
attendees to join our group. We are enthusiastic about all things related to
scalable, distributed storage and database systems. We are not limiting ourselves
to a particular system and appreciate anyone who would like to share their
experiences.

When: Friday February 22nd, 2013 from 4PM to 8PM
Where: T-Systems International GmbH, Dachauer Straße 651, 80995 München, Room 
EG-148E

Thanks to T-Systems (http://www.t-systems.com/) for helping to organize the 
event and for providing the infrastructure, as well as food and drinks.

We have quite a few very interesting talks scheduled:

- Ending Confusion on Big Data Technologies: How Use Case Segmentation drives 
Target Architectures and Technology Selection at Deutsche Telekom by Juergen 
Urbanski, Chief Technologist at T-Systems

- Low latency data processing with Impala by Lars George, Director EMEA 
Services at Cloudera

We are looking for further volunteers to submit talks, so if you are working in 
the new Big Data or NoSQL space, and would like to give a presentation, 
please let me know (via email to l...@cloudera.com).

Looking forward to seeing you there!

Please RSVP here:

Xing: https://www.xing.com/events/2013-munich-openhug-meeting-1198136

Best regards,
Lars

Re: what will happen when HDFS restarts but with some dead nodes

2013-01-29 Thread Chen He
Hi Nan

The namenode will stay in safemode until all blocks are replicated. During
this time, the jobtracker cannot see any tasktrackers (MRv1).

Chen

On Tue, Jan 29, 2013 at 9:04 PM, Nan Zhu zhunans...@gmail.com wrote:

  Hi, all

 I'm wondering if HDFS is stopped, and some of the machines of the cluster
 are moved,  some of the block replication are definitely lost for moving
 machines

 when I restart the system, will the namenode recalculate the data
 distribution?

 Best,

 --
 Nan Zhu
 School of Computer Science,
 McGill University





Re: what will happen when HDFS restarts but with some dead nodes

2013-01-29 Thread Nan Zhu
So we can assume that all blocks are fully replicated at the point HDFS
starts up?

Best, 

-- 
Nan Zhu
School of Computer Science,
McGill University



On Tuesday, 29 January, 2013 at 10:50 PM, Chen He wrote:

 Hi Nan
 
 The namenode will stay in safemode until all blocks are replicated. During this
 time, the jobtracker cannot see any tasktrackers (MRv1).
 
 Chen
 
 On Tue, Jan 29, 2013 at 9:04 PM, Nan Zhu zhunans...@gmail.com 
 (mailto:zhunans...@gmail.com) wrote:
  Hi, all 
  
  I'm wondering: if HDFS is stopped and some of the machines of the cluster
  are moved, some of the block replicas are definitely lost with the moved
  machines.
  
  when I restart the system, will the namenode recalculate the data 
  distribution? 
  
  Best, 
  
  -- 
  Nan Zhu
  School of Computer Science,
  McGill University
  
  
 



Re: eclipse plugin

2013-01-29 Thread Martinus Martinus
Hi YouPeng,

I am also wondering the same thing. Does anybody know about an Eclipse plugin
for Hadoop?

Thanks.

On Wed, Jan 30, 2013 at 11:19 AM, YouPeng Yang yypvsxf19870...@gmail.comwrote:

 Hi

  I wonder whether there is an Eclipse plugin for Hadoop development; I'm
 using CDH4u1 and Eclipse Indigo.

  thank you...