Re: no jobtracker to stop, no namenode to stop
Hi Nikhil,

Appreciate your quick response on this, but the issue still continues. I believe I have covered all the pointers you mentioned. I am pasting the relevant portions of the files below so that you can verify.

1. /etc/hosts file: localhost is not commented out, and the IP address is added. The entry looks like this:

# localhost name resolution is handled within DNS itself.
127.0.0.1 localhost

2. core-site.xml (hdfs://localhost:<port number>):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

3. mapred-site.xml (localhost:<port number>):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

4. hdfs-site.xml (replication factor set to one, with the dfs.name.dir and dfs.data.dir properties included):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>c:/Hadoop/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>c:/Hadoop/data</value>
  </property>
</configuration>

I am getting stuck at:

13/08/30 11:39:26 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/08/30 11:39:26 INFO input.FileInputFormat: Total input paths to process : 1
13/08/30 11:39:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/08/30 11:39:26 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/30 11:39:27 INFO mapred.JobClient: Running job: job_201308301135_0002
13/08/30 11:39:28 INFO mapred.JobClient: map 0% reduce 0%

My JobTracker UI Cluster Summary (heap size 120.06 MB/888.94 MB) shows 1 total submission and 0 nodes, with every other column (running map/reduce tasks, occupied and reserved slots, map/reduce task capacity, blacklisted/graylisted/excluded nodes) at 0.

I have a feeling that the JobTracker is not able to find the TaskTracker, as there is a 0 in the Nodes column. Does this ring any bells for you?

Thanks,
Nitesh Jain

On Thu, Aug 29, 2013 at 5:51 PM, Nikhil2405 [via Hadoop Common] ml-node+s472056n4024848...@n3.nabble.com wrote:

Hi Nitesh,

I think your problem may be in your configuration, so check your files as follows:
1. /etc/hosts file: localhost should not be commented out; add the IP address.
2. core-site.xml: hdfs://localhost:<port number>
3. mapred-site.xml: localhost:<port number>, mapred.local.dir
4. hdfs-site.xml: the replication factor should be one; include the dfs.name.dir and dfs.data.dir properties (for both properties, check on the net).

Thanks,
Nikhil
Re: Sqoop issue related to Hadoop
Hi Raj,

The easiest approach to pull up a task log is the JT web UI. Go to the JT web UI and drill down into the Sqoop job. You'll get a list of failed/killed tasks; your failed task should be in there. Clicking on that task will give you the logs for the same.

Regards,
Bejoy KS

Sent from remote device. Please excuse typos.

-----Original Message-----
From: Hadoop Raj hadoop...@yahoo.com
Date: Thu, 29 Aug 2013 00:43:59
To: user@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: Re: Sqoop issue related to Hadoop

Hi Kate,

Where can I find the task attempt log? Can you specify the location please?

Thanks,
Raj

On Aug 28, 2013, at 7:13 PM, Kathleen Ting kathl...@apache.org wrote:

Raj, in addition to what Abe said, please also send the failed task attempt log attempt_201307041900_0463_m_00_0 as well. Thanks, Kate

On Wed, Aug 28, 2013 at 2:25 PM, Abraham Elmahrek a...@cloudera.com wrote:

Hey Raj,

It seems like the number of fields you have in your data doesn't match the number of fields in your RAJ.CUSTOMERS table. Could you please add --verbose to the beginning of your argument list and provide the entire contents here?

-Abe

On Wed, Aug 28, 2013 at 9:36 AM, Raj Hadoop hadoop...@yahoo.com wrote:

Hello all,

I am getting an error while using sqoop export (load HDFS file to Oracle). I am not sure whether the issue is a Sqoop or a Hadoop related one, so I am sending it to both the dist lists.

I am using:

sqoop export --connect jdbc:oracle:thin:@//dbserv:9876/OKI --table RAJ.CUSTOMERS --export-dir /user/hive/warehouse/web_cust --input-null-string '\\N' --input-null-non-string '\\N' --username --password -m 1 --input-fields-terminated-by '\t'

I am getting the following error:

Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
13/08/28 09:42:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/08/28 09:42:36 INFO manager.SqlManager: Using default fetchSize of 1000
13/08/28 09:42:36 INFO tool.CodeGenTool: Beginning code generation
13/08/28 09:42:38 INFO manager.OracleManager: Time zone has been set to GMT
13/08/28 09:42:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM RAJ.CUSTOMERS t WHERE 1=0
13/08/28 09:42:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /software/hadoop/hadoop/hadoop-1.1.2
Note: /tmp/sqoop-hadoop/compile/c1376f66d2151b48024c54305377c981/RAJ_CUSTOMERS.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/08/28 09:42:40 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/c1376f66d2151b48024c54305377c981/RAJ.CUSTOMERS.jar
13/08/28 09:42:40 INFO mapreduce.ExportJobBase: Beginning export of RAJ.CUSTOMERS
13/08/28 09:42:41 INFO manager.OracleManager: Time zone has been set to GMT
13/08/28 09:42:43 INFO input.FileInputFormat: Total input paths to process : 1
13/08/28 09:42:43 INFO input.FileInputFormat: Total input paths to process : 1
13/08/28 09:42:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/28 09:42:43 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/28 09:42:43 INFO mapred.JobClient: Running job: job_201307041900_0463
13/08/28 09:42:44 INFO mapred.JobClient: map 0% reduce 0%
13/08/28 09:42:56 INFO mapred.JobClient: map 1% reduce 0%
13/08/28 09:43:00 INFO mapred.JobClient: map 2% reduce 0%
13/08/28 09:43:03 INFO mapred.JobClient: map 4% reduce 0%
13/08/28 09:43:10 INFO mapred.JobClient: map 5% reduce 0%
13/08/28 09:43:13 INFO mapred.JobClient: map 6% reduce 0%
13/08/28 09:43:17 INFO mapred.JobClient: Task Id : attempt_201307041900_0463_m_00_0, Status : FAILED
java.io.IOException: Can't export data, please check task tracker logs
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.util.NoSuchElementException
        at java.util.ArrayList$Itr.next(ArrayList.java:794)
        at RAJ_CUSTOMERS.__loadFromFields(RAJ_CUSTOMERS.java:1057)
        at RAJ_CUSTOMERS.parse(RAJ_CUSTOMERS.java:876)
        at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
        ... 10 more

Thanks,
Raj
reading input stream
Hi,

Probably a very stupid question. I have this data in binary format, and the following piece of code works for me in plain Java:

import java.io.File;
import java.io.FileInputStream;
import org.bson.BSONDecoder; // from the MongoDB Java driver

public class Parser {
    public static void main(String[] args) throws Exception {
        String filename = "sample.txt";
        File file = new File(filename);
        FileInputStream fis = new FileInputStream(filename);
        System.out.println("Total file size to read (in bytes) : " + fis.available());
        BSONDecoder bson = new BSONDecoder();
        System.out.println(bson.readObject(fis));
    }
}

The last line prints the answer. Now I want to implement this on Hadoop, but the challenge (I think) is that I am not reading or parsing the data line by line; rather it's a stream of data, right? How do I replicate the above code logic, but in Hadoop?
how to find process under node
Hi All,

What I'm trying to do here is capture which process is running under which node. This is the Unix script I tried:

#!/bin/ksh
Cnt=`cat /users/hadoop/unixtest/nodefilename.txt | wc -l`
cd /users/hadoop/unixtest/
ls -ltr | awk '{print $9}' > list_of_scripts.txt
split -l $Cnt list_of_scripts.txt node_scripts
ls -ltr node_scripts* | awk '{print $9}' > list_of_node_scripts.txt
for i in nodefilename.txt
do
  for j in list_of_node_scripts.txt
  do
    node=$i
    script_file=$j
    cat $node\n $script_file > $script_file
  done
done
exit 0;

But my result should look like below:

node1       node2
-----       -----
process1    process3
process2    process4

Can someone please help with this? Thanks in advance.
Re: how to find process under node
Are you trying to find the Java processes under a node? Then the simple thing would be to ssh in and run the jps command to get the list of Java processes.

Regards,
Som Shekhar Sharma
+91-8197243810

On Thu, Aug 29, 2013 at 12:27 PM, suneel hadoop suneel.bigd...@gmail.com wrote:

Hi All,

What I'm trying to do here is capture which process is running under which node. This is the Unix script I tried:

#!/bin/ksh
Cnt=`cat /users/hadoop/unixtest/nodefilename.txt | wc -l`
cd /users/hadoop/unixtest/
ls -ltr | awk '{print $9}' > list_of_scripts.txt
split -l $Cnt list_of_scripts.txt node_scripts
ls -ltr node_scripts* | awk '{print $9}' > list_of_node_scripts.txt
for i in nodefilename.txt
do
  for j in list_of_node_scripts.txt
  do
    node=$i
    script_file=$j
    cat $node\n $script_file > $script_file
  done
done
exit 0;

But my result should look like below:

node1       node2
-----       -----
process1    process3
process2    process4

Can someone please help with this? Thanks in advance.
Re: reading input stream
FileSystem.open() is an instance method, so get a FileSystem first:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("path of the file which you would like to read from HDFS");
FSDataInputStream iStream = fs.open(p);
String str;
while ((str = iStream.readLine()) != null) {
    System.out.println(str);
}

Regards,
Som Shekhar Sharma
+91-8197243810

On Thu, Aug 29, 2013 at 12:15 PM, jamal sasha jamalsha...@gmail.com wrote:

Hi,

Probably a very stupid question. I have this data in binary format, and the following piece of code works for me in plain Java:

import java.io.File;
import java.io.FileInputStream;
import org.bson.BSONDecoder; // from the MongoDB Java driver

public class Parser {
    public static void main(String[] args) throws Exception {
        String filename = "sample.txt";
        File file = new File(filename);
        FileInputStream fis = new FileInputStream(filename);
        System.out.println("Total file size to read (in bytes) : " + fis.available());
        BSONDecoder bson = new BSONDecoder();
        System.out.println(bson.readObject(fis));
    }
}

The last line prints the answer. Now I want to implement this on Hadoop, but the challenge (I think) is that I am not reading or parsing the data line by line; rather it's a stream of data, right? How do I replicate the above code logic, but in Hadoop?
Re: Hadoop client user
Put that user in the hadoop group. And if the user wants to act as a Hadoop client, then the user should be aware of two properties: fs.default.name, which is the address of the NameNode, and mapred.job.tracker, which is the address of the JobTracker.

Regards,
Som Shekhar Sharma
+91-8197243810

On Thu, Aug 29, 2013 at 10:55 AM, Harsh J ha...@cloudera.com wrote:

The user1 will mainly require a home directory on HDFS, created by the HDFS administrator user ('hadoop' in your case):

sudo -u hadoop hadoop fs -mkdir /user/user1
sudo -u hadoop hadoop fs -chown user1:user1 /user/user1

After this, the user should be able to run jobs and manipulate files in their own directory.

On Thu, Aug 29, 2013 at 10:21 AM, Hadoop Raj hadoop...@yahoo.com wrote:

Hi,

I have a hadoop learning environment in pseudo-distributed mode. It is owned by the user 'hadoop'. I am trying to understand how another user on this box can act as a Hadoop client, able to create HDFS files and run Map Reduce jobs.

Say I have a Linux user 'user1'. What permissions, privileges and configuration settings are required for 'user1' to act as a Hadoop client?

Thanks,
Raj

--
Harsh J
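For reference, a minimal client-side sketch of those two properties in action. The hostnames, ports and paths below are placeholders, not values from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster (placeholder hostnames).
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        conf.set("mapred.job.tracker", "jobtracker-host:9001");
        FileSystem fs = FileSystem.get(conf);
        // Once /user/user1 exists and is owned by user1, this works
        // when run as user1.
        fs.mkdirs(new Path("/user/user1/test"));
        System.out.println("Created /user/user1/test as " + System.getProperty("user.name"));
    }
}

In practice the same effect is usually achieved by shipping core-site.xml and mapred-site.xml to the client box instead of setting the properties in code.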
Re: Sqoop issue related to Hadoop
Go inside the $HADOOP_HOME/log/user/history...

Regards,
Som Shekhar Sharma
+91-8197243810

On Thu, Aug 29, 2013 at 10:13 AM, Hadoop Raj hadoop...@yahoo.com wrote:

Hi Kate,

Where can I find the task attempt log? Can you specify the location please?

Thanks,
Raj

On Aug 28, 2013, at 7:13 PM, Kathleen Ting kathl...@apache.org wrote:

Raj, in addition to what Abe said, please also send the failed task attempt log attempt_201307041900_0463_m_00_0 as well. Thanks, Kate

On Wed, Aug 28, 2013 at 2:25 PM, Abraham Elmahrek a...@cloudera.com wrote:

Hey Raj,

It seems like the number of fields you have in your data doesn't match the number of fields in your RAJ.CUSTOMERS table. Could you please add --verbose to the beginning of your argument list and provide the entire contents here?

-Abe
Re: how to find process under node
Hi Suneel,

Please provide more details, like what you want to print and what files you are using within the script, so that I can help. Maybe something is wrong in your script, so I want to check it from my end and help you with this case.

On Thu, Aug 29, 2013 at 1:10 PM, Shekhar Sharma shekhar2...@gmail.com wrote:

Are you trying to find the Java processes under a node? Then the simple thing would be to ssh in and run the jps command to get the list of Java processes.

Regards,
Som Shekhar Sharma
+91-8197243810

--
Pavan Kumar Polineni
HBase client with security
Hi all,

I set up Hadoop (1.2.0), Zookeeper (3.4.5) and HBase (0.94.8-security) with security. HBase works if I launch the shell from the node running the master, but I'd like to use it from an external machine. I prepared one, copying the Hadoop and HBase installation folders and adapting the paths (indeed I can use the same client to run MR jobs and interact with HDFS). Regarding the HBase client configuration:

- hbase-site.xml specifies:

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master.hadoop.local,host49.hadoop.local</value>
</property>

where the zookeeper hosts are reachable and can be resolved via DNS. I had to specify them, otherwise the shell complains about:

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid

- I have a keytab for the principal I want to use (user running hbase/my client hostname@MYREALM), correctly addressed by the file hbase/conf/zk-jaas.conf. In hbase-env.sh, the variable HBASE_OPTS points to zk-jaas.conf.

Nonetheless, when I issue a command from an HBase shell on the client machine, I get an error in the HBase master log:

2013-08-29 10:11:30,890 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 6: readAndProcess threw exception org.apache.hadoop.security.AccessControlException: Authentication is required. Count of bytes read: 0
org.apache.hadoop.security.AccessControlException: Authentication is required
        at org.apache.hadoop.hbase.ipc.SecureServer$SecureConnection.readAndProcess(SecureServer.java:435)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:748)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:539)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:514)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

It looks like there's a mismatch between the client and the master regarding the authentication mechanism. Note that from the same client machine I can launch and use a Zookeeper shell. What am I missing in the client configuration? Does /etc/krb5.conf play any role in this?

Thanks,

Matteo

Matteo Lanati
Distributed Resources Group
Leibniz-Rechenzentrum (LRZ)
Boltzmannstrasse 1
85748 Garching b. München (Germany)
Phone: +49 89 35831 8724
Re: Tutorials that work with modern Hadoop (v1.x.y)?
Have you tried the Hortonworks Sandbox? It's a self-contained Hadoop environment with datasets + tutorials (10ish) on Hive and Pig.

Thanks,
Olivier

On 27 Aug 2013 15:53, Andrew Pennebaker apenneba...@42six.com wrote:

There are a number of Hadoop tutorials and textbooks available, but they always seem to target older versions of Hadoop. Does anyone know of good tutorials that work with modern Hadoop versions (v1.x.y)?
Fwd: Pig GROUP operator - Data is shuffled and wind up together for the same grouping key
Would appreciate a response. I'm facing this issue in prod.

-- Forwarded message --
From: Viswanathan J jayamviswanat...@gmail.com
Date: Thu, Aug 29, 2013 at 2:00 PM
Subject: Pig GROUP operator - Data is shuffled and wind up together for the same grouping key
To: u...@pig.apache.org

Hi,

I'm using Pig version 0.11.0. While using the GROUP operator in Pig, all the data is shuffled so that rows in different partitions that have the same grouping key wind up together, and I get wrong results for the grouping. While storing the result data, it shares work between multiple calculations.

How can I solve this? Please advise.

--
Regards,
Viswa.J
Re: How to pass parameter to mappers
@rab ra: Last line of the para. An example:

Job setup:

Configuration conf = new Configuration();
conf.set("param", "value");
Job job = new Job(conf);

Inside the mapper:

Configuration conf = context.getConfiguration();
String paramValue = conf.get("param");

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com

On Wed, Aug 28, 2013 at 7:05 PM, Shahab Yunus shahab.yu...@gmail.com wrote:

See here: http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Job+Configuration

Regards,
Shahab

On Wed, Aug 28, 2013 at 7:59 AM, rab ra rab...@gmail.com wrote:

Hello,

Any hint on how to pass parameters to mappers in the 1.2.1 Hadoop release?
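For completeness, a minimal end-to-end sketch of the same pattern. The property name my.param and the class names are invented for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamExample {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String paramValue;

        @Override
        protected void setup(Context context) {
            // Reads back the value the driver placed in the job configuration.
            paramValue = context.getConfiguration().get("my.param", "default");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(paramValue), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("my.param", "value"); // must happen before the Job wraps the conf
        Job job = new Job(conf);
        job.setJarByClass(ParamExample.class);
        job.setMapperClass(MyMapper.class);
        // ... input/output paths and formats as usual ...
    }
}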
Re: Simplifying MapReduce API
Just to add to the above comments: you just have to extend the classes Mapper and Reducer as per the new API.

Warm Regards,
Tariq
cloudfront.blogspot.com

On Wed, Aug 28, 2013 at 1:26 AM, Don Nelson dieseld...@gmail.com wrote:

I agree with @Shahab - it's simple enough to declare both interfaces in one class if that's what you want to do. But given the distributed behavior of Hadoop, it's likely that your mappers will be running on different nodes than your reducers anyway - why ship around duplicate code?

On Tue, Aug 27, 2013 at 9:48 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

For starters (experts might have more complex reasons), what if your respective map and reduce logic becomes complex enough to demand separate classes? Why tie clients to implementing both by moving these into one Job interface? In the current design you can always implement both (map and reduce) interfaces if your logic is simple enough, and go the other route of separate classes if that is required. I think it is more flexible this way (you can always build up from and on top of a granular design, rather than the other way around). I hope I understood your concern correctly...

Regards,
Shahab

On Tue, Aug 27, 2013 at 11:35 AM, Andrew Pennebaker apenneba...@42six.com wrote:

There seems to be an abundance of boilerplate patterns in MapReduce:

* Write a class extending Map (1), implementing Mapper (2), with a map method (3)
* Write a class extending Reduce (4), implementing Reducer (5), with a reduce method (6)

Could we achieve the same behavior with a single Job interface requiring map() and reduce() methods?

--
A child of five could understand this. Fetch me a child of five.
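To make the trade-off concrete, here is a minimal word-count-style sketch (new API): both roles can at least live in one enclosing class as nested static types, even though Mapper and Reducer remain separate classes underneath:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountJob {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                context.write(new Text(token), ONE);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Since the framework can run the two halves on different nodes, keeping them as distinct classes (nested or not) is what lets it ship only the code each node needs.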
Re: Tutorials that work with modern Hadoop (v1.x.y)?
In the meantime, I was able to cobble together a working wordcount job. The hardest parts were installing Hadoop and configuring the classpath.

https://github.com/mcandre/hadoop-docs-tutorial#hadoop-docs-tutorial---distributed-wc

On Thu, Aug 29, 2013 at 4:44 AM, Olivier Renault orena...@hortonworks.com wrote:

Have you tried the Hortonworks Sandbox? It's a self-contained Hadoop environment with datasets + tutorials (10ish) on Hive and Pig.

Thanks,
Olivier

On 27 Aug 2013 15:53, Andrew Pennebaker apenneba...@42six.com wrote:

There are a number of Hadoop tutorials and textbooks available, but they always seem to target older versions of Hadoop. Does anyone know of good tutorials that work with modern Hadoop versions (v1.x.y)?
RE: HBase client with security
Please ask this question on u...@hbase.apache.org; you would get a better response there.

Thanks,
Devaraj K

-----Original Message-----
From: Lanati, Matteo [mailto:matteo.lan...@lrz.de]
Sent: 29 August 2013 14:03
To: user@hadoop.apache.org
Subject: HBase client with security

Hi all,

I set up Hadoop (1.2.0), Zookeeper (3.4.5) and HBase (0.94.8-security) with security. HBase works if I launch the shell from the node running the master, but I'd like to use it from an external machine. I prepared one, copying the Hadoop and HBase installation folders and adapting the paths (indeed I can use the same client to run MR jobs and interact with HDFS). Regarding the HBase client configuration:

- hbase-site.xml specifies:

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master.hadoop.local,host49.hadoop.local</value>
</property>

where the zookeeper hosts are reachable and can be resolved via DNS. I had to specify them, otherwise the shell complains about:

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid

- I have a keytab for the principal I want to use (user running hbase/my client hostname@MYREALM), correctly addressed by the file hbase/conf/zk-jaas.conf. In hbase-env.sh, the variable HBASE_OPTS points to zk-jaas.conf.

Nonetheless, when I issue a command from an HBase shell on the client machine, I get an error in the HBase master log:

2013-08-29 10:11:30,890 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 6: readAndProcess threw exception org.apache.hadoop.security.AccessControlException: Authentication is required. Count of bytes read: 0
org.apache.hadoop.security.AccessControlException: Authentication is required
        at org.apache.hadoop.hbase.ipc.SecureServer$SecureConnection.readAndProcess(SecureServer.java:435)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:748)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:539)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:514)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

It looks like there's a mismatch between the client and the master regarding the authentication mechanism. Note that from the same client machine I can launch and use a Zookeeper shell. What am I missing in the client configuration? Does /etc/krb5.conf play any role in this?

Thanks,

Matteo

Matteo Lanati
Distributed Resources Group
Leibniz-Rechenzentrum (LRZ)
Boltzmannstrasse 1
85748 Garching b. München (Germany)
Phone: +49 89 35831 8724
Re: HBase client with security
Hi Devaraj,

you're right, I just subscribed. Sorry for the spam.

Matteo

On Aug 29, 2013, at 3:31 PM, Devaraj k devara...@huawei.com wrote:

Please ask this question on u...@hbase.apache.org; you would get a better response there.

Thanks,
Devaraj K

Matteo Lanati
Distributed Resources Group
Leibniz-Rechenzentrum (LRZ)
Boltzmannstrasse 1
85748 Garching b. München (Germany)
Phone: +49 89 35831 8724
Re: Hadoop Yarn - samples
Take a look at the dist-shell example in http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/

I recently wrote up another, simplified version of it for illustration purposes here: https://github.com/hortonworks/simple-yarn-app

Arun

On Aug 28, 2013, at 4:47 AM, Manickam P manicka...@outlook.com wrote:

Hi,

I have just installed the Hadoop 2.0.5 alpha version. I want to analyse how the YARN ResourceManager and NodeManagers work. I executed the MapReduce examples, but I want to execute samples in YARN. I have been searching for them but am unable to find any. Please help me.

Thanks,
Manickam P

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
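For a taste of the client-side API those examples are built on, a minimal sketch follows (API names as in the 2.1 line; 2.0.5-alpha differs slightly). It only obtains an application id, with no ApplicationMaster or container logic and no error handling:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getNewApplicationResponse().getApplicationId();
        System.out.println("Got application id: " + appId);

        // A real client would now fill in an ApplicationSubmissionContext
        // (AM command, local resources, environment) and call
        // yarnClient.submitApplication(...), as distributed-shell does.
        yarnClient.stop();
    }
}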
Re: Hadoop Yarn - samples
Is there an example of running a sample YARN application that will only allow one container to start per host?

Punnoose, Roshan
rashan.punnr...@merck.com

On Aug 29, 2013, at 10:08 AM, Arun C Murthy a...@hortonworks.com wrote:

Take a look at the dist-shell example in http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/

I recently wrote up another, simplified version of it for illustration purposes here: https://github.com/hortonworks/simple-yarn-app

Arun

On Aug 28, 2013, at 4:47 AM, Manickam P manicka...@outlook.com wrote:

Hi,

I have just installed the Hadoop 2.0.5 alpha version. I want to analyse how the YARN ResourceManager and NodeManagers work. I executed the MapReduce examples, but I want to execute samples in YARN. I have been searching for them but am unable to find any. Please help me.

Thanks,
Manickam P

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Is hadoop thread safe?
Hi all,

Is hadoop thread safe? Do mappers make use of threads by any chance? A little bit of information on the way they execute in parallel would help me out. Thanks.

Regards,
Pavan
Re: Is hadoop thread safe?
Mappers don't communicate with each other in traditional MapReduce. If you need something more MPI-ish, then look to MPI over YARN or write your own YARN app. If you need multi-threading within the mapper, then it is up to you as the Java developer to make it thread safe. Use the concurrent libraries like anything else, and Bob's your uncle. Having overly-complicated mappers can be difficult to manage, however, and it kind of misses the mark for MapReduce problems.

Maybe if you expand on your use case a bit, someone here can provide specific advice.

On Thu, Aug 29, 2013 at 10:33 AM, Pavan Sudheendra pavan0...@gmail.com wrote:

Hi all,

Is hadoop thread safe? Do mappers make use of threads by any chance? A little bit of information on the way they execute in parallel would help me out. Thanks.

Regards,
Pavan

--
Adam Muise
Solution Engineer
Hortonworks
amu...@hortonworks.com
416-417-4037
Re: Is hadoop thread safe?
No. I had written a huge MapReduce program which talks to HBase and does a lot of computing, using it as a source as well as a sink. One of my colleagues saw my code and noticed that I had used a lot of static functions instead of making use of proper OOP concepts. He was telling me that it shouldn't be the way I go about doing it. But my code works fine, so I was wondering whether I will face any problems in the future because of this. That's all.

Regards,
Pavan

On Aug 29, 2013 8:11 PM, Adam Muise amu...@hortonworks.com wrote:

Mappers don't communicate with each other in traditional MapReduce. If you need something more MPI-ish, then look to MPI over YARN or write your own YARN app. If you need multi-threading within the mapper, then it is up to you as the Java developer to make it thread safe. Use the concurrent libraries like anything else, and Bob's your uncle. Having overly-complicated mappers can be difficult to manage, however, and it kind of misses the mark for MapReduce problems.

Maybe if you expand on your use case a bit, someone here can provide specific advice.

--
Adam Muise
Solution Engineer
Hortonworks
amu...@hortonworks.com
416-417-4037
Re: reading input stream
Wait.. this is something new to me. Does this go in the driver setup? The mapper? Can you elaborate a bit on this?

On Thu, Aug 29, 2013 at 12:43 AM, Shekhar Sharma shekhar2...@gmail.com wrote:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path p = new Path("path of the file which you would like to read from HDFS");
FSDataInputStream iStream = fs.open(p);
String str;
while ((str = iStream.readLine()) != null) {
    System.out.println(str);
}

Regards,
Som Shekhar Sharma
+91-8197243810
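One common placement, sketched below under the assumption that the binary file is side data every mapper needs: open it from HDFS in the mapper's setup(), and keep map() for the per-record work. The path and class names are placeholders:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String sideData = "";

    @Override
    protected void setup(Context context) throws IOException {
        // Placeholder path; in practice pass it in via the job configuration.
        Path p = new Path("/user/hadoop/sample.bson");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream in = fs.open(p);
        try {
            // FSDataInputStream is a plain InputStream, so the standalone
            // decoder logic carries over, e.g. new BSONDecoder().readObject(in)
            // for the BSON case discussed above.
            sideData = String.valueOf(in.available());
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(sideData), value);
    }
}

If the binary files are themselves the job input, the cleaner route is a custom InputFormat/RecordReader rather than line-oriented input.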
[yarn] job is not getting assigned
Hi,

I am in the middle of setting up a Hadoop 2 cluster. I am using the hadoop 2.1-beta tarball. My cluster has 1 master node running the HDFS namenode, the resourcemanager and the job history server. Next to that I have 3 nodes acting as datanodes and nodemanagers. In order to test if everything is working, I submitted the teragen job from the hadoop-examples jar like this:

$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.1.0-beta.jar teragen 1000 /user/vagrant/teragen

The job starts up and I get the following output:

13/08/29 14:42:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/08/29 14:42:47 INFO client.RMProxy: Connecting to ResourceManager at master.local/192.168.7.10:8032
13/08/29 14:42:48 INFO terasort.TeraSort: Generating 1000 using 2
13/08/29 14:42:48 INFO mapreduce.JobSubmitter: number of splits:2
13/08/29 14:42:48 WARN conf.Configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
13/08/29 14:42:48 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/08/29 14:42:48 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/08/29 14:42:48 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/08/29 14:42:48 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
13/08/29 14:42:48 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/08/29 14:42:48 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
13/08/29 14:42:48 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/08/29 14:42:48 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
13/08/29 14:42:48 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/08/29 14:42:48 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/08/29 14:42:48 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/08/29 14:42:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1377787324271_0001
13/08/29 14:42:50 INFO impl.YarnClientImpl: Submitted application application_1377787324271_0001 to ResourceManager at master.local/192.168.7.10:8032
13/08/29 14:42:50 INFO mapreduce.Job: The url to track the job: http://master.local:8088/proxy/application_1377787324271_0001/
13/08/29 14:42:50 INFO mapreduce.Job: Running job: job_1377787324271_0001

and then it stops. If I check the UI, I see application_1377787324271_0001 listed as: user vagrant, name TeraGen, type MAPREDUCE, queue default, start time Thu, 29 Aug 2013 14:42:49 GMT, state ACCEPTED, final status UNDEFINED, tracking UI UNASSIGNED.

I have no idea why it is not starting, nor what to look for. Any pointers are more than welcome!

Thanks!

- André

--
André Kelpe
an...@concurrentinc.com
http://concurrentinc.com
RE: Hadoop Yarn - samples
Hi Arun,

Thanks for your reply. Actually I've installed Apache Hadoop. The samples you shared look like Hortonworks ones, so will they work fine for me? I got a doubt on this, so I'm asking here.

Thanks,
Manickam P

From: a...@hortonworks.com
Subject: Re: Hadoop Yarn - samples
Date: Thu, 29 Aug 2013 07:08:00 -0700
To: user@hadoop.apache.org

Take a look at the dist-shell example in http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/

I recently wrote up another, simplified version of it for illustration purposes here: https://github.com/hortonworks/simple-yarn-app

Arun

On Aug 28, 2013, at 4:47 AM, Manickam P manicka...@outlook.com wrote:

Hi,

I have just installed the Hadoop 2.0.5 alpha version. I want to analyse how the YARN ResourceManager and NodeManagers work. I executed the MapReduce examples, but I want to execute samples in YARN. I have been searching for them but am unable to find any. Please help me.

Thanks,
Manickam P

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
Re: Is hadoop thread safe?
Map tasks run in separately spawned JVMs in parallel, so they are isolated from one another at runtime. Use of static functions shouldn't affect you generally. Default map I/O is single-threaded. If you plan to use multiple threads, use MultithreadedMapper for proper thread-safety.

On Thu, Aug 29, 2013 at 8:15 PM, Pavan Sudheendra pavan0...@gmail.com wrote:

No. I had written a huge MapReduce program which talks to HBase and does a lot of computing, using it as a source as well as a sink. One of my colleagues saw my code and noticed that I had used a lot of static functions instead of making use of proper OOP concepts. He was telling me that it shouldn't be the way I go about doing it. But my code works fine, so I was wondering whether I will face any problems in the future because of this. That's all.

Regards,
Pavan

On Aug 29, 2013 8:11 PM, Adam Muise amu...@hortonworks.com wrote:

Mappers don't communicate with each other in traditional MapReduce. If you need something more MPI-ish, then look to MPI over YARN or write your own YARN app. If you need multi-threading within the mapper, then it is up to you as the Java developer to make it thread safe. Use the concurrent libraries like anything else, and Bob's your uncle. Having overly-complicated mappers can be difficult to manage, however, and it kind of misses the mark for MapReduce problems.

Maybe if you expand on your use case a bit, someone here can provide specific advice.

--
Harsh J
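For reference, a minimal sketch of wiring MultithreadedMapper into a job; the class names are placeholders, and per its javadoc the wrapped mapper must be thread-safe:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedDriver {

    // The real work; must be thread-safe, since several threads run it
    // against a shared input split.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(Thread.currentThread().getName()), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration());
        // The framework runs MultithreadedMapper, which fans records
        // out to a pool of threads running MyMapper.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, MyMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 4);
        // ... input/output formats and paths as usual ...
    }
}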
Re: Hadoop client user
Thanks Harsh. That is a very good explanation.

I am trying to understand how, in a production cluster, the hadoop user and Hadoop clients would be set up. What users should exist on the NN, JT and DNs?

Regards,
Rajendra

From: Harsh J ha...@cloudera.com
To: user@hadoop.apache.org
Sent: Thursday, August 29, 2013 1:25 AM
Subject: Re: Hadoop client user

The user1 will mainly require a home directory on HDFS, created by the HDFS administrator user ('hadoop' in your case):

sudo -u hadoop hadoop fs -mkdir /user/user1
sudo -u hadoop hadoop fs -chown user1:user1 /user/user1

After this, the user should be able to run jobs and manipulate files in their own directory.

On Thu, Aug 29, 2013 at 10:21 AM, Hadoop Raj hadoop...@yahoo.com wrote:

Hi,

I have a hadoop learning environment in pseudo-distributed mode. It is owned by the user 'hadoop'. I am trying to understand how another user on this box can act as a Hadoop client, able to create HDFS files and run Map Reduce jobs.

Say I have a Linux user 'user1'. What permissions, privileges and configuration settings are required for 'user1' to act as a Hadoop client?

Thanks,
Raj

--
Harsh J
Re: [yarn] job is not getting assigned
This usually means there are no available resources as seen by the ResourceManager. Do you see Active Nodes on the RM web UI first page? If not, you'll have to check the NodeManager logs to see if they crashed for some reason.

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Aug 29, 2013, at 7:52 AM, Andre Kelpe wrote:

Hi,

I am in the middle of setting up a Hadoop 2 cluster. I am using the hadoop 2.1-beta tarball. My cluster has 1 master node running the HDFS namenode, the resourcemanager and the job history server. Next to that I have 3 nodes acting as datanodes and nodemanagers. In order to test if everything is working, I submitted the teragen job from the hadoop-examples jar like this:

$ hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.1.0-beta.jar teragen 1000 /user/vagrant/teragen

...

13/08/29 14:42:50 INFO mapreduce.Job: Running job: job_1377787324271_0001

and then it stops. If I check the UI, I see application_1377787324271_0001 listed as: user vagrant, name TeraGen, type MAPREDUCE, queue default, start time Thu, 29 Aug 2013 14:42:49 GMT, state ACCEPTED, final status UNDEFINED, tracking UI UNASSIGNED.

I have no idea why it is not starting, nor what to look for. Any pointers are more than welcome!

Thanks!

- André
Multidata center support
We have a need to set up Hadoop across data centers. Does Hadoop support a multi data center configuration? I searched through the archives and found that Hadoop did not support multi data center configurations some time back. Just wanted to see whether the situation has changed. Please help.
Hadoop HA error JOURNAL is not supported in state standby
Hi,

I'm facing an error while starting Hadoop in an HA (2.0.5) cluster: both NameNodes started in standby mode and are not changing state. When I tried to do a health check through "hdfs haadmin -checkHealth <serviceId>", it gave me the error below:

Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: clone2/XX.XX.XX.XX; destination host is: clone1:8020;

I checked the logs on the NN side:

2013-08-30 00:49:16,074 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException: Operation category JOURNAL is not supported in state standby
2013-08-30 00:49:16,074 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 192.168.126.31:48266: error: org.apache.hadoop.ipc.StandbyException: Operation category JOURNAL is not supported in state standby
2013-08-30 00:49:32,391 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode clone2:8020
2013-08-30 00:49:32,403 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1411)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:859)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4445)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:766)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:8758)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
        at org.apache.hadoop.ipc.Client.call(Client.java:1235)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy11.rollEditLog(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:268)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:310)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292) Did I missed something? Thanks
copy files from hdfs to local fs
OK, a very stupid question... I have a large file at /user/input/foo.txt and I want to copy the first 100 lines from this location to the local filesystem. The data is very sensitive, so I am a bit hesitant to experiment. What is the right way to copy sample data from HDFS to the local fs?
Re: Hadoop HA error JOURNAL is not supported in state standby
Thanks Harsh, I don't have automatic failover configured, but I have also tried to do this manually and didn't get it to work:

hdfs haadmin -transitionToActive node1
Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: clone2/XX.XX.XX.XX; destination host is: clone1:8020;

So is there any alternative way to resolve this issue? Thanks

On 8/30/13, Harsh J ha...@cloudera.com wrote: On the actual issue though: Do you also have auto-failover configured? On Fri, Aug 30, 2013 at 1:39 AM, orahad bigdata oracle...@gmail.com wrote: <original message with logs and stack trace quoted in full above; snipped> -- Harsh J
Re: Hadoop HA error JOURNAL is not supported in state standby
Looks like you have some incompatibility between your client side and the server side. Are you also running 2.0.5 on the client side? As Harsh mentioned, the NN-side warning message is not related to your InvalidProtocolBufferException; the warning indicates that both of your NNs are in the Standby state. Thanks, -Jing

On Thu, Aug 29, 2013 at 1:36 PM, orahad bigdata oracle...@gmail.com wrote: <previous messages, including the full logs and stack trace, quoted above; snipped>
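A quick way to rule out a client/server version mismatch like this is to compare what each host reports; a minimal sketch (the host names follow this thread):

# On the machine running the haadmin command (clone2):
hadoop version

# On the NameNode host (clone1):
hadoop version

# The reported versions should match. A mismatched client (e.g. a pre-protobuf
# 1.x client) talking to a 2.x NameNode can produce exactly this kind of
# "Message missing required fields: callId, status" error, since the RPC wire
# format changed between the two release lines.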
Re: copy files from hdfs to local fs
hadoop fs -copyToLocal or hadoop fs -get. Either copies the whole file; neither can copy just part of the file. What is interesting is that there is a -tail command but no -head. Kim On Thu, Aug 29, 2013 at 1:35 PM, Chengi Liu chengi.liu...@gmail.com wrote: <original question quoted above; snipped>
Re: copy files from hdfs to local fs
tail will work as well... but I want to extract just (say) the first n lines out of this file? On Thu, Aug 29, 2013 at 1:43 PM, Kim Chew kchew...@gmail.com wrote: <previous message quoted above; snipped>
Hadoop Yarn
I have some JVM options which I want to configure for only a few nodes in the cluster using Hadoop YARN. How do I do it? If I edit mapred-site.xml, it gets applied to all the task JVMs. I just want a handful of map JVMs to have that option and the other map JVMs not to have it. Thanks Rajesh Sent from my iPhone
TB per core sweet spot
Hi, I realize there is no perfect spec for data nodes, as a lot depends on use cases and workloads, but I am curious whether there are any rules of thumb or no-go zones in terms of how many terabytes per core are OK. So, a few questions, assuming 1 core per HDD holds: Is there a no-go zone in terms of TB/core? I ask because I am seeing 4 TB/core nodes in some clusters and wondering if that's too much. Does TB/core depend on the core speed? For example, while a 1.8 GHz core might be able to handle 1 TB, does going to 4 TB require a 3.6 GHz E5 Xeon core? Is the difference between Xeon E3 and E5 dramatic or incremental? Any comments on disk choice - SATA vs. SAS, 5.9k vs. 7.2k vs. 10k RPM, SATA2 vs. SATA3? Again, I realize there is a huge YMMV factor here, but I would love to hear about experiences or research people have done before picking specs for their nodes, including vendors/models. Thanks, Xuri
Hadoop Clients (Hive,Pig) and Hadoop Cluster
Hi, I am trying to set up a multi-node Hadoop cluster, and I am trying to understand where Hadoop clients like Hive, Pig, and Sqoop would be installed in the cluster. Say I have three Linux machines: Node 1 - Master (Name Node, Job Tracker, and Secondary Name Node); Node 2 - Slave (Task Tracker, Data Node); Node 3 - Slave (Task Tracker, Data Node). On which machines should I install Hive, Pig, and Sqoop? Must each be installed on a cluster node, or can it be installed on a separate machine? What user and privileges are required for each? Thanks, Raj
Re: Hadoop Clients (Hive,Pig) and Hadoop Cluster
Yes, ideally you want to set up a fourth, gateway node to run the clients. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Security-Guide/AppxG-Setting-Up-Gateway.html On Thu, Aug 29, 2013 at 3:11 PM, Raj Hadoop hadoop...@yahoo.com wrote: <original question quoted above; snipped>
Re: Hadoop Clients (Hive,Pig) and Hadoop Cluster
Regarding Sqoop: you can install it wherever you have access to both your database and the HDFS cluster. You could, for example, install it on the namenode if you want, as long as that machine has access to the database that is the source or target of your data transfer. On Thu, Aug 29, 2013 at 3:11 PM, Raj Hadoop hadoop...@yahoo.com wrote: <original question quoted above; snipped>
Re: Hadoop Yarn
You'll have to change the MapReduce code. What options exactly are you looking for, and why should they only be applied on some nodes? Some kind of sampling? More details can help us help you. Thanks, +Vinod Kumar Vavilapalli Hortonworks Inc. http://hortonworks.com/ On Aug 29, 2013, at 1:59 PM, Rajesh Jain wrote: <original question quoted above; snipped>
Re: Hadoop Yarn
Hi Vinod, These are JVM parameters to inject an agent on only some nodes, for sampling. Is there a property for this? A code change is not an option. Second, is there a way to tell the JVMs how much data to process? Thanks Sent from my iPhone On Aug 29, 2013, at 6:37 PM, Vinod Kumar Vavilapalli vino...@apache.org wrote: <previous messages quoted above; snipped>
secondary sort - number of reducers
I have implemented a secondary sort in my MR job, and for some reason, if I don't specify the number of reducers, it uses 1, which doesn't seem right, because I'm working with 800M+ records and one reducer slows things down significantly. Is this some kind of limitation of secondary sort, that it has to use a single reducer? That would kind of defeat the purpose of a scalable solution such as secondary sort. I would appreciate any help. Thanks Adeel
RE: copy files from hdfs to local fs
What's wrong with using an old Unix pipe?

hadoop fs -cat /user/input/foo.txt | head -100 > local_file

Yong Date: Thu, 29 Aug 2013 13:50:37 -0700 Subject: Re: copy files from hdfs to local fs From: chengi.liu...@gmail.com To: user@hadoop.apache.org <previous messages quoted above; snipped>
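Since -cat only reads from HDFS, the pipe cannot modify the source file, which should address the sensitivity concern; a minimal sketch of the full round trip (the HDFS path is the one from this thread, the local sample path is illustrative):

# Stream from HDFS and keep only the first 100 lines locally; the source file is untouched.
hadoop fs -cat /user/input/foo.txt | head -100 > /tmp/foo_sample.txt

# Sanity-check the sample.
wc -l /tmp/foo_sample.txt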
Re: Issue with fs.delete
Wow, this is one helluva forum, where people needing help leave the problem to the experts' imagination. Even paid support would close a ticket like that without looking twice. On Wed, Aug 28, 2013 at 4:40 AM, Harsh J ha...@cloudera.com wrote: Please also try to share your errors/stack traces when you post a question. All I can suspect is that your URI is malformed and is missing the authority component. That is, it should be hdfs://host:port/path/to/file and not hdfs:/path/to/file. On Wed, Aug 28, 2013 at 1:44 PM, rab ra rab...@gmail.com wrote: -- Forwarded message -- From: rab ra rab...@gmail.com Date: 28 Aug 2013 13:26 Subject: Issue with fs.delete To: us...@hadoop.apache.org Hello, I am having trouble deleting a file from HDFS. I am using the Hadoop 1.2.1 stable release. I use the following code segment in my program:

fs.delete(new Path("hdfs:/user/username/input/input.txt"))
fs.copyFromLocalFile(false, false, new Path("input.txt"), new Path("hdfs:/user/username/input/input.txt"))

Any hint? -- Harsh J
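Following Harsh's suggestion, the simplest fix is usually to drop the scheme entirely and let fs.defaultFS (fs.default.name on 1.x) supply it, or to spell out the full authority; a minimal sketch, where "namenodehost:9000" is a placeholder for the real NameNode address:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteThenCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml

        // Option 1: no scheme at all; the configured default filesystem is used.
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/user/username/input/input.txt");
        if (fs.exists(target)) {
            fs.delete(target, false); // false = do not delete recursively
        }
        fs.copyFromLocalFile(false, false, new Path("input.txt"), target);

        // Option 2: a fully qualified URI including the authority component.
        Path qualified = new Path("hdfs://namenodehost:9000/user/username/input/input.txt");
        qualified.getFileSystem(conf).delete(qualified, false);
    }
}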
Cache file conflict
Hi... After updating the source JARs of an application that launches a second job from within a running MR job, the following error keeps occurring:

org.apache.hadoop.mapred.InvalidJobConfException: cache file (mapreduce.job.cache.files) scheme: hdfs, host: server, port: 9000, file: /tmp/hadoop-yarn/staging/root/.staging/job_1367474197612_0887/libjars/Some.jar, conflicts with cache file (mapreduce.job.cache.files) hdfs://server:9000/tmp/hadoop-yarn/staging/root/.staging/job_1367474197612_0888/libjars/Some.jar
 at org.apache.hadoop.mapreduce.v2.util.MRApps.parseDistributedCacheArtifacts(MRApps.java:338)
 at org.apache.hadoop.mapreduce.v2.util.MRApps.setupDistributedCache(MRApps.java:273)
 at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:419)
 at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:288)
 at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
 at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1215)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1367)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:1215)

where job_1367474197612_0887 is the ID of the initial job, job_1367474197612_0888 is the ID of the subsequent job, and Some.jar is a JAR file specific to the application. Any ideas as to how the above error could be eliminated? Thanks!
Re: Cache file conflict
You should check this: https://issues.apache.org/jira/browse/MAPREDUCE-4493?focusedCommentId=13713706&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13713706 Thanks, Omkar Joshi Hortonworks Inc. http://www.hortonworks.com On Thu, Aug 29, 2013 at 5:06 PM, Public Network Services publicnetworkservi...@gmail.com wrote: <original message quoted above; snipped>
Re: secondary sort - number of reducers
So it can't figure out an appropriate number of reducers, as it does for mappers? In my case Hadoop is using 2100+ mappers and then only 1 reducer. Since I'm overriding the partitioner class, shouldn't that decide how many reducers there should be, based on how many different partition values are returned by the custom partitioner? On Thu, Aug 29, 2013 at 7:38 PM, Ian Wrigley i...@cloudera.com wrote: If you don't specify the number of Reducers, Hadoop will use the default -- which, unless you've changed it, is 1. Regards Ian. On Aug 29, 2013, at 4:23 PM, Adeel Qureshi adeelmahm...@gmail.com wrote: <original question quoted above; snipped> --- Ian Wrigley Sr. Curriculum Manager Cloudera, Inc Cell: (323) 819 4075
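For reference, the reducer count is a job-level setting that the partitioner does not influence; the partitioner only routes keys among however many reducers the job was given. A minimal sketch of setting it explicitly, assuming the Hadoop 1.x mapreduce API used elsewhere in this thread (the job name and count are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "secondary-sort"); // Hadoop 1.x style constructor
        // Default is 1; set it explicitly, or pass -D mapred.reduce.tasks=N
        // on the command line when the job implements Tool.
        job.setNumReduceTasks(32); // illustrative value
        // job.setPartitionerClass(GroupKeyPartitioner.class); // hypothetical custom partitioner
        return job;
    }
}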
Re: secondary sort - number of reducers
Okay, so when I specify the number of reducers - e.g., in my example I'm using 4, for a much smaller data set - it works if I use a single column in my composite key. But if I add multiple columns to the composite key, separated by a delimiter, it then throws the illegal-partition error. (The keys before the pipe are group keys, the ones after the pipe are sort keys, and my partitioner only uses the group keys.)

java.io.IOException: Illegal partition for Atlanta:GA|Atlanta:GA:1:Adeel (-1)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1073)
 at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
 at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
 at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:39)
 at com.att.hadoop.hivesort.HSMapper.map(HSMapper.java:1)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)

public int getPartition(Text key, HCatRecord record, int numParts) {
 // extract the group key from the composite key
 String groupKey = key.toString().split("\\|")[0];
 return groupKey.hashCode() % numParts;
}

On Thu, Aug 29, 2013 at 8:31 PM, Shekhar Sharma shekhar2...@gmail.com wrote: No... the partitioner decides which keys should go to which reducer, and you need to decide the number of reducers. The number of reducers depends on factors like the number of key-value pairs, the use case, etc. Regards, Som Shekhar Sharma +91-8197243810 On Fri, Aug 30, 2013 at 5:54 AM, Adeel Qureshi adeelmahm...@gmail.com wrote: <previous messages quoted above; snipped>
RE: secondary sort - number of reducers
The method getPartition() needs to return a non-negative number, and simply using the hashCode() method is not enough, since hashCode() can be negative. See the Hadoop HashPartitioner implementation:

return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

When I first read this code, I wondered why it doesn't use Math.abs. Is (& Integer.MAX_VALUE) faster? Yong Date: Thu, 29 Aug 2013 20:55:46 -0400 Subject: Re: secondary sort - number of reducers From: adeelmahm...@gmail.com To: user@hadoop.apache.org <previous messages quoted above; snipped>
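(On the Math.abs question: Math.abs(Integer.MIN_VALUE) overflows and returns Integer.MIN_VALUE, which is still negative, so masking off the sign bit is the safer choice, not just a faster one.) Applied to the partitioner from this thread, a minimal sketch - the class name and Text value type are illustrative, not the original poster's code:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records by the "group" half of a composite key of the form
// groupKey|sortKey, so every record of a group reaches the same reducer.
public class GroupKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numParts) {
        String groupKey = key.toString().split("\\|")[0];
        // Clear the sign bit so the result is always in [0, numParts).
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numParts;
    }
}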
Re: Hadoop Yarn
Hi Rajesh, Have you looked at re-using the profiling options to inject the JVM options into a defined range of tasks? http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Profiling -- Hitesh On Aug 29, 2013, at 3:51 PM, Rajesh Jain wrote: <previous messages quoted above; snipped>
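A minimal sketch of that approach in mapred-site.xml (or per-job configuration); the property names are the MR1 profiling knobs documented at the link above, and the agent path is a placeholder for whatever agent you want to inject. Note this targets a range of task attempts (here maps 0-2), not specific nodes, which is the closest the profiling hook gets to this requirement without code changes:

<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <!-- only map task attempts 0-2 get the extra JVM options -->
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <!-- replaces the default hprof agent string; %s becomes the profile output file -->
  <name>mapred.task.profile.params</name>
  <value>-agentpath:/path/to/your/agent.so=file=%s</value>
</property>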
Re: Multidata center support
My take on this: why does Hadoop have to know about the data-center boundary at all? I think it can be installed across multiple data centers; however, topology configuration would be required to tell which node belongs to which data center and switch, for block placement. Thanks, Rahul On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu baskar.duraika...@outlook.com wrote: <original question quoted above; snipped>
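A minimal sketch of such a topology configuration; the script path and the subnet-to-location mapping are illustrative (the property is topology.script.file.name on Hadoop 1.x, net.topology.script.file.name on 2.x). In core-site.xml:

<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>

And /etc/hadoop/conf/topology.sh, which must print one /datacenter/rack path per host argument:

#!/bin/bash
# Maps node addresses to /dc/rack locations; unknown hosts get a default rack.
for host in "$@"; do
  case "$host" in
    10.1.*) echo "/dc1/rack1" ;;
    10.2.*) echo "/dc2/rack1" ;;
    *)      echo "/default/rack" ;;
  esac
done

That said, the usual caveat stands: HDFS replication and MapReduce shuffle traffic assume LAN-class latency between racks, which is the main reason stretching a single cluster across data centers has been discouraged.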
RE: Hadoop Yarn - samples
Perhaps you can try writing the same YARN application using these steps: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html Thanks, Devaraj K From: Punnoose, Roshan [mailto:rashan.punnr...@merck.com] Sent: 29 August 2013 19:43 To: user@hadoop.apache.org Subject: Re: Hadoop Yarn - samples Is there an example of running a sample YARN application that will only allow one container to start per host? Punnoose, Roshan rashan.punnr...@merck.com On Aug 29, 2013, at 10:08 AM, Arun C Murthy a...@hortonworks.com wrote: Take a look at the dist-shell example in http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/ I recently wrote up another simplified version of it for illustration purposes here: https://github.com/hortonworks/simple-yarn-app Arun On Aug 28, 2013, at 4:47 AM, Manickam P manicka...@outlook.com wrote: Hi, I have just installed the Hadoop 2.0.5 alpha version. I want to analyse how the YARN resource manager and node managers work. I executed the MapReduce examples, but I want to execute samples in YARN. I have been searching for these but am unable to find any. Please help me. Thanks, Manickam P -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/