Hadoop automated tests
Hi all,

Are there automated tests available for checking the sanity of the Hadoop layer, and also for negative tests, i.e. a DataNode going down, an HBase RegionServer going down, the NameNode, the JobTracker, etc.? By "Hadoop layer" I mean Hadoop, MapReduce, HBase, and ZooKeeper. What does the Hadoop dev team use for this? Any pointers or documentation articles would help a lot.

Thanks,
Harshad
Re: Hadoop automated tests
[Cc bigtop-dev@]

We have stack tests as part of the Bigtop project. We don't do fault-injection tests like you describe just yet, but that would be a great contribution to the project.

Cos

On Wed, Oct 16, 2013 at 02:12PM, hdev ml wrote: [quoted text trimmed]
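Pending such a contribution, the simplest form of the fault injection described (take a daemon down, then assert on cluster behavior) can be scripted generically. A sketch that kills a stand-in process; the `sleep` here is hypothetical and stands in for a real DataNode daemon, whose pid you would normally read from its pidfile:

```shell
#!/bin/sh
# Stand-in for a datanode daemon: a background sleep. On a real cluster you
# would read the DataNode pid from its pidfile under the Hadoop pid directory.
sleep 300 &
PID=$!

# Inject the fault: hard-kill the "daemon".
kill -9 "$PID"
wait "$PID" 2>/dev/null || true

# A test harness would now assert the process is gone, then go on to check
# cluster-level behavior (re-replication, region reassignment, etc.).
if kill -0 "$PID" 2>/dev/null; then
  echo "still running"
else
  echo "process gone"
fi
```

The hard part of real fault-injection testing is not the kill but the follow-up assertions against HDFS/HBase state, which is exactly what a Bigtop contribution would need to add.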
Re: is jdk required to run hadoop or jre alone is sufficient
You will need a JDK. Certain tools, like Sqoop, depend on a JDK to compile generated code at runtime and will not function without one.

On Wed, Oct 16, 2013 at 10:38 AM, oc tsdb oc.t...@gmail.com wrote: Hi, I would like to know if a JRE alone is sufficient to run Hadoop services, or if a JDK is required? We are planning to install the latest stable version of Hadoop. Thanks, Oc.tsdb

-- Harsh J
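A quick way to check whether a box has a full JDK rather than just a JRE is to look for the compiler (a shell sketch; the output obviously depends on the machine):

```shell
#!/bin/sh
# A JRE ships `java` but not `javac`; a full JDK ships both, which is what
# tools like Sqoop need to compile generated code at runtime.
if command -v javac >/dev/null 2>&1; then
  echo "JDK present: $(javac -version 2>&1)"
else
  echo "no javac on PATH: JRE only (or no Java at all)"
fi
```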
How to execute wordcount with compression?
Hi,

I want to execute wordcount on YARN with compression enabled, over a directory with several files, but for that I must compress the input first:

dir1/file1.txt
dir1/file2.txt
dir1/file3.txt
dir1/file4.txt
dir1/file5.txt

1 - Should I compress the whole directory, or each file in the directory?
2 - Should I use gzip or bzip2?
3 - Do I need to set up anything in the YARN configuration files?
4 - While the job is running, are the files decompressed before the mappers run, and compressed again after the reducers execute?

-- Thanks,
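The usual answer to 1 and 2 is: compress each file individually (HDFS has no notion of a compressed directory), and prefer bzip2 for large files, since bzip2 is splittable (one file can feed several mappers) while gzip is not. A local sketch of the preparation, with the file names above; the hadoop commands at the end are commented out because they assume a running cluster and example paths:

```shell
#!/bin/sh
# Create sample input matching the listing above, then compress each file
# individually rather than the directory as a whole.
set -e
mkdir -p dir1
for i in 1 2 3 4 5; do echo "hello world $i" > "dir1/file$i.txt"; done

# gzip shown here for portability; on a real cluster prefer bzip2 for large
# files, since bzip2 output is splittable and gzip output is not.
for f in dir1/file*.txt; do gzip -c "$f" > "$f.gz"; done

ls dir1/*.gz

# No extra YARN configuration is needed to *read* compressed input: the codec
# is picked from the file extension, and mappers see the decompressed text.
#   hadoop fs -put dir1 /user/me/dir1
#   hadoop jar hadoop-mapreduce-examples-*.jar wordcount /user/me/dir1 /user/me/out
```

On question 4: input files are decompressed transparently as the mappers read them; reducer output is only compressed if you enable output compression explicitly, it is not recompressed by default.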
Re: HDFS / Federated HDFS - Doubts
I have a couple of questions about HDFS federation:

Can I set different block store directories for each namespace on a datanode?
Can I have some datanodes dedicated to a particular namespace only?

This seems quite interesting. Way to go!

On Tue, Oct 1, 2013 at 9:52 PM, Krishna Kumaar Natarajan natar...@umn.edu wrote: Hi All, while trying to understand federated HDFS in detail I had a few doubts, and I am listing them here for your help.

1. In the case of *HDFS (without federation)*, is the metadata about the blocks belonging to files in HDFS maintained in the main memory of the namenode, or is it stored on the namenode's permanent storage and brought into main memory on demand? [Krishna] Based on my understanding, I assume the entire metadata is in main memory, which is an issue by itself. Please correct me if my understanding is wrong.
2. In the case of *federated HDFS*, is the metadata about the blocks belonging to files in a particular namespace maintained in the main memory of that namenode, or is it stored on the namenode's permanent storage and brought into main memory on demand?
3. Is the metadata stored on separate cluster nodes (block management layer separation) as discussed in Appendix B of this document? https://issues.apache.org/jira/secure/attachment/12453067/high-level-design.pdf
4. I would like to know if the following proposals are already implemented in federated HDFS (http://www.slideshare.net/hortonworks/hdfs-futures-namenode-federation-for-improved-efficiency-and-scalability slide 17):
- Separation of the namespace and block management layers (same as qn. 3)
- Partial namespace in memory for further scalability
- Moving a partial namespace from one namenode to another

Thanks, Krishna
Re: Hosting Hadoop
Hi Dhaval,

Sorry, just saw this email (oops) so it might not be relevant, but: we didn't encounter too many of the funky issues we were worried about regarding random resource constraints or random outages that might happen when sharing a physical box with unknown neighbors. But overall we feel the virtualization is robbing us of significant CPU, and more importantly they don't have ideal instance types. The m1.xlarges are too small storage-wise (we ended up paying for more CPU than we needed to get the amount of storage we needed), and the hs1.8xlarges are too big: they have 24 drives, it feels like we lose a good amount of CPU controlling IO across all those drives, and we now have significantly more storage than we need in order to get enough CPU to keep our SLAs.

For the initial set-up, AWS is way quicker than owning hardware. But if you already have hardware, I think moving to AWS will increase your monthly bills for comparable performance.

On Wed, Aug 21, 2013 at 11:36 AM, Dhaval Shah prince_mithi...@yahoo.co.in wrote: Alex, did you run into funky issues with EC2/EMR? The kind of issues that would come up because it's a virtualized environment? We currently own our hardware and are just trying to do an ROI analysis on whether moving to Amazon can reduce our admin costs. Currently administering a Hadoop cluster is a bit expensive (in terms of man-hours spent trying to replace disks and so on) and we are exploring whether it's possible to avoid some of those costs. Regards, Dhaval

*From:* alex bohr alexjb...@gmail.com *To:* user@hadoop.apache.org *Cc:* Dhaval Shah prince_mithi...@yahoo.co.in *Sent:* Monday, 12 August 2013 1:41 PM *Subject:* Re: Hosting Hadoop

I've had a good experience running a large Hadoop cluster on EC2 instances. After almost a year we haven't had any significant downtime, just lost a small number of data nodes. I don't think EMR is an ideal solution if your cluster will be running 24/7. But for running a large cluster, I don't see how it's more cost-efficient to run in the cloud than to own the hardware, and we're trying to move off the cloud onto our own hardware. Can I ask why you're looking to move to the cloud?

On Fri, Aug 9, 2013 at 10:42 AM, Nitin Pawar nitinpawar...@gmail.com wrote: check Altiscale as well

On Fri, Aug 9, 2013 at 3:05 AM, Dhaval Shah prince_mithi...@yahoo.co.in wrote: Thanks for the list Marcos. I will go through the slides/links. I think that's helpful. Regards, Dhaval

*From:* Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com *To:* Dhaval Shah prince_mithi...@yahoo.co.in *Cc:* user@hadoop.apache.org *Sent:* Thursday, 8 August 2013 4:50 PM *Subject:* Re: Hosting Hadoop

Well, it all depends, because many companies use cloud computing platforms like Amazon EMR, VMware, or Rackspace Cloud for Hadoop hosting:
http://aws.amazon.com/elasticmapreduce
http://www.vmware.com/company/news/releases/vmw-mapr-hadoop-062013.html
http://bitrefinery.com/services/hadoop-hosting
http://www.joyent.com/products/compute-service/features/hadoop

There are a lot of companies using HBase hosted in the cloud. The last HBaseCon was full of great use cases:
HBase at Pinterest: http://www.hbasecon.com/sessions/apache-hbase-operations-at-pinterest/
HBase at Groupon: http://www.hbasecon.com/sessions/apache-hbase-at-groupon/
A great talk by Benoit on network design for HBase: http://www.hbasecon.com/sessions/scalable-network-designs-for-apache-hbase/
Using Coprocessors to Index Columns in an Elasticsearch Cluster: http://www.hbasecon.com/sessions/using-coprocessors-to-index-columns/

2013/8/8, Dhaval Shah prince_mithi...@yahoo.co.in: We are exploring the possibility of hosting Hadoop outside of our data centers. I am aware that Hadoop in general isn't exactly designed to run on virtual hardware. So a few questions: 1. Are there any providers out there who would host Hadoop on dedicated physical hardware? 2. Has anyone had success hosting Hadoop on virtualized hardware where 100% uptime and performance/stability are very important (we use HBase as a real-time database and it needs to be up all the time)?

Thanks, Dhaval

-- Marcos Ortiz Valmaseda, Product Manager at PDVSA, http://about.me/marcosortiz

-- Nitin Pawar
Re: HDFS / Federated HDFS - Doubts
On Wed, Oct 16, 2013 at 9:22 AM, Steve Edison sediso...@gmail.com wrote: I have a couple of questions about HDFS federation: Can I set different block store directories for each namespace on a datanode?

No. The main idea of federation was not to physically partition the storage across namespaces, but to use all the available storage across the namespaces, to ensure better utilization.

Can I have some datanodes dedicated to a particular namespace only?

As I said earlier, all the datanodes are shared across namespaces. If you want to dedicate datanodes to a particular namespace, you might as well create two separate clusters with different sets of datanodes and separate namespaces.

On Tue, Oct 1, 2013 at 9:52 PM, Krishna Kumaar Natarajan natar...@umn.edu wrote: [quoted text trimmed]

-- http://hortonworks.com/download/
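The shared-datanode model described in the reply above shows up directly in the configuration: a federated cluster lists all of its nameservices in one hdfs-site.xml, and every datanode registers with each of them. A minimal sketch, using Hadoop 2.x property names; the nameservice ids and hostnames here are made up for illustration:

```xml
<!-- Sketch only: ns1/ns2 and the hostnames are hypothetical. -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <!-- Each datanode uses the same dfs.datanode.data.dir for all namespaces;
       there is no per-namespace data directory setting. -->
</configuration>
```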
Re: Hosting Hadoop
Thanks for sharing the experience, Alex. I kind of anticipated the kind of issues you mention here, but I just wanted to make sure I explore all possible options.

Regards, Dhaval

On Wednesday, 16 October 2013 1:34 PM, alex bohr alexjb...@gmail.com wrote: [quoted text trimmed]
even possible?
The question is on CDH3u4. The cluster was set up before I owned it, and somehow every server process (namenode/jobtracker/datanode/tasktracker) is run by a user named foo, all jobs are launched and run by the foo user, and the HDFS directory/file ownership is foo as well; basically, foo is everywhere. Today I started thinking about correcting this by having:

- the namenode + datanode run by the hdfs user
- the jobtracker + tasktracker run by the mapred user

So far I have a very short list of things that need to be changed, which I will try out in the test cluster, e.g.:

- create the hdfs and mapred users everywhere
- change ownership of dfs.name.dir, dfs.data.dir, and fs.checkpoint.dir to hdfs
- change ownership of mapred.local.dir to mapred
- restart the cluster with hdfs on the HDFS side and mapred on the MapReduce side

I am 100% sure that I missed certain things that have to be taken care of, and I would really appreciate all input. However, the original question I would love to ask is: is it even feasible, and does it make sense, to try to change this?

Thanks, P
Re: even possible?
Don't fix it if it ain't broken =P

There shouldn't be any reason why you couldn't change it (back) to the standard way that Cloudera distributions are set up. Off the top of my head, I can't think of anything that you're missing. But at the same time, if your cluster is working as is, why change it?

On Wed, Oct 16, 2013 at 2:24 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: [quoted text trimmed]
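For anyone attempting the switch, the checklist in the original question boils down to a handful of ownership changes. A sketch using a throwaway directory in place of the real dfs.name.dir / dfs.data.dir / mapred.local.dir; the paths and group name are hypothetical, and the privileged commands are shown as comments because they require root on the actual cluster:

```shell
#!/bin/sh
# Stand-in layout for the directories whose ownership would change.
set -e
BASE=$(mktemp -d)
mkdir -p "$BASE/dfs/name" "$BASE/dfs/data" "$BASE/dfs/checkpoint" "$BASE/mapred/local"

# On the real cluster, as root, after creating the users on every node:
#   useradd -r hdfs && useradd -r mapred
#   chown -R hdfs:hadoop   <dfs.name.dir> <dfs.data.dir> <fs.checkpoint.dir>
#   chown -R mapred:hadoop <mapred.local.dir>
# then restart the HDFS daemons as hdfs and the MapReduce daemons as mapred.

# Here we only verify the directory layout that would be handed over:
ls "$BASE/dfs"
ls "$BASE/mapred"
```

Doing this one node at a time in a test cluster first, as the original poster plans, is the sensible approach: a missed directory (e.g. the log or pid directories) shows up immediately as a daemon failing to start.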
Re: Hadoop automated tests
Thanks Konstantin. Will take a look at Bigtop and see if it fits our scenario.

Harshad

On Wed, Oct 16, 2013 at 2:16 PM, Konstantin Boudnik c...@apache.org wrote: [quoted text trimmed]