Hadoop automated tests

2013-10-16 Thread hdev ml
Hi all,

Are there automated tests available for testing the sanity of the Hadoop
layer, and also negative tests, e.g. one DataNode going down, an HBase
RegionServer going down, the NameNode, the JobTracker, etc.?

By "Hadoop layer" I mean Hadoop, MapReduce, HBase, and ZooKeeper.

What does the Hadoop dev team use for this? Any pointers or documentation
articles would help a lot.

Thanks
Harshad


Re: Hadoop automated tests

2013-10-16 Thread Konstantin Boudnik
[Cc bigtop-dev@]

We have stack tests as part of the Bigtop project. We don't do fault-injection
tests like you describe just yet, but that would be a great contribution to
the project.

Cos



Re: is jdk required to run hadoop or jre alone is sufficient

2013-10-16 Thread Harsh J
You will need a JDK. Certain tools, like Sqoop, have a dependency on the
JDK for compiling generated code at runtime, and will not function
without one.
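A quick way to check which of the two a node actually has (a sketch: `javac` ships only with a JDK, while a JRE provides just the `java` runtime):

```shell
# A JDK bundles the compiler (javac); a JRE provides only the java runtime.
# Sqoop-style runtime code generation needs javac, hence the full JDK.
if command -v javac >/dev/null 2>&1; then
  JAVA_KIND=jdk
else
  JAVA_KIND=jre_or_none
fi
echo "Detected: $JAVA_KIND"
```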

On Wed, Oct 16, 2013 at 10:38 AM, oc tsdb oc.t...@gmail.com wrote:
 Hi,

 I would like to know whether a JRE alone is sufficient to run Hadoop
 services, or whether a JDK is required.

 We are planning to install the latest stable version of Hadoop.

 Thanks,

 Oc.tsdb



-- 
Harsh J


How to execute wordcount with compression?

2013-10-16 Thread xeon

Hi,


I want to execute wordcount on YARN with compression enabled, over a
directory containing several files; for that I need to compress the input.


dir1/file1.txt
dir1/file2.txt
dir1/file3.txt
dir1/file4.txt
dir1/file5.txt

1 - Should I compress the whole directory or each file in it?

2 - Should I use gzip or bzip2?

3 - Do I need to setup any yarn configuration file?

4 - While the job is running, are the files decompressed before the
mappers run and compressed again after the reducers finish?
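For reference, a minimal sketch of one common approach (the example-jar path and the output directory name are assumptions to check against your install): compress each file individually rather than the directory, prefer bzip2 for large inputs since it is splittable while gzip is not, and let Hadoop decompress inputs transparently based on the file extension; only output compression needs explicit properties.

```shell
# Create a small sample input dir (stand-in for dir1/file1..5.txt).
mkdir -p dir1
printf 'hello world\nhello yarn\n' > dir1/file1.txt

# Q1/Q2: compress each file individually; bzip2 is splittable, so a large
# .bz2 file can still be divided among several mappers (a .gz file cannot).
bzip2 -kf dir1/file1.txt   # -k keeps the original, -f overwrites old output

# Q3/Q4: no YARN config change is needed for the input side; Hadoop picks
# the codec from the .bz2 extension and decompresses on the fly before the
# mappers. Compressing the job output is opt-in via -D properties.
if command -v hadoop >/dev/null 2>&1; then
  hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    dir1 wordcount-out
fi
```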


--
Thanks,



Re: HDFS / Federated HDFS - Doubts

2013-10-16 Thread Steve Edison
I have a couple of questions about HDFS federation:

Can I specify different block store directories for each namespace on a
datanode?
Can I have some datanodes dedicated to a particular namespace only?

This seems quite interesting. Way to go!


On Tue, Oct 1, 2013 at 9:52 PM, Krishna Kumaar Natarajan
natar...@umn.edu wrote:

 Hi All,

 While trying to understand federated HDFS in detail I had a few doubts,
 and I am listing them down for your help.

 1. In the case of *HDFS (without federation)*, is the metadata about the
 blocks belonging to files in HDFS maintained in the main memory of the
 NameNode, or is it stored on the NameNode's permanent storage and brought
 into main memory on demand? [Krishna] Based on my understanding, I assume
 the entire metadata is in main memory, which is an issue by itself. Please
 correct me if my understanding is wrong.
 2. In the case of *federated HDFS*, is the metadata about the blocks
 belonging to files in a particular namespace maintained in the main memory
 of the NameNode, or is it stored on the NameNode's permanent storage and
 brought into main memory on demand?
 3. Is the metadata stored on separate cluster nodes (block management
 layer separation), as discussed in Appendix B of this document?
 https://issues.apache.org/jira/secure/attachment/12453067/high-level-design.pdf
 4. I would like to know if the following proposals are already implemented
 in federated HDFS (slide 17 of
 http://www.slideshare.net/hortonworks/hdfs-futures-namenode-federation-for-improved-efficiency-and-scalability):
    - Separation of namespace and block management layers (same as qn. 3)
    - Partial namespace in memory for further scalability
    - Moving a partial namespace from one NameNode to another

 Thanks,
 Krishna



Re: Hosting Hadoop

2013-10-16 Thread alex bohr
Hi Dhaval,
Sorry just saw this email (oops) so might not be relevant - but:
We didn't encounter too many of the funky issues we were worried about
regarding random resource constraints or random outages that might happen
when sharing a physical box with unknown neighbors.

But overall we feel the virtualization is robbing us of significant CPU,
and more importantly they don't have ideal instance types. The m1.xlarges
are too small storage-wise (we ended up paying for more CPU than we needed
to get the amount of storage we needed), and the hs1.8xlarges are too big:
they have 24 drives, and it feels like we lose a good amount of CPU
managing IO across all those drives, and we now have significantly more
storage than we need in order to get enough CPU to keep our SLAs.

For initial set-up, AWS is way quicker than owning hardware. But if you
already have hardware, I think moving to AWS will increase your monthly
bills for comparable performance.


On Wed, Aug 21, 2013 at 11:36 AM, Dhaval Shah
prince_mithi...@yahoo.co.inwrote:

 Alex, did you run into funky issues with EC2/EMR? The kind of issues that
 would come up because it's a virtualized environment? We currently own our
 hardware and are just trying to do an ROI analysis on whether moving to
 Amazon can reduce our admin costs. Currently, administering a Hadoop
 cluster is a bit expensive (in terms of man-hours spent trying to replace
 disks and so on), and we are exploring whether it's possible to avoid some
 of those costs.

 Regards,
 Dhaval

   --
  *From:* alex bohr alexjb...@gmail.com
 *To:* user@hadoop.apache.org
 *Cc:* Dhaval Shah prince_mithi...@yahoo.co.in
 *Sent:* Monday, 12 August 2013 1:41 PM
 *Subject:* Re: Hosting Hadoop

 I've had a good experience running a large Hadoop cluster on EC2
 instances. After almost a year we haven't had any significant downtime,
 just lost a small number of data nodes.
 I don't think EMR is an ideal solution if your cluster will be running
 24/7.

 But for running a large cluster, I don't see how it's more cost-efficient
 to run in the cloud than to own the hardware, and we're trying to move off
 the cloud onto our own hardware. Can I ask why you're looking to move to
 the cloud?


 On Fri, Aug 9, 2013 at 10:42 AM, Nitin Pawar nitinpawar...@gmail.comwrote:

 check altiscale as well


 On Fri, Aug 9, 2013 at 3:05 AM, Dhaval Shah 
 prince_mithi...@yahoo.co.inwrote:

 Thanks for the list Marcos. I will go through the slides/links. I think
 that's helpful

 Regards,
 Dhaval

   --
  *From:* Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com
 *To:* Dhaval Shah prince_mithi...@yahoo.co.in
 *Cc:* user@hadoop.apache.org
 *Sent:* Thursday, 8 August 2013 4:50 PM
 *Subject:* Re: Hosting Hadoop

 Well, it all depends, because many companies use cloud computing
 platforms like Amazon EMR, VMware, or Rackspace Cloud for Hadoop
 hosting:
 http://aws.amazon.com/elasticmapreduce
 http://www.vmware.com/company/news/releases/vmw-mapr-hadoop-062013.html
 http://bitrefinery.com/services/hadoop-hosting
 http://www.joyent.com/products/compute-service/features/hadoop

 There are a lot of companies using HBase hosted in the cloud. The last
 HBaseCon was full of great use cases:
 HBase at Pinterest:
 http://www.hbasecon.com/sessions/apache-hbase-operations-at-pinterest/

 HBase at Groupon
 http://www.hbasecon.com/sessions/apache-hbase-at-groupon/

 A great talk by Benoit on network design for HBase:
 http://www.hbasecon.com/sessions/scalable-network-designs-for-apache-hbase/

 Using Coprocessors to Index Columns in an Elasticsearch Cluster
 http://www.hbasecon.com/sessions/using-coprocessors-to-index-columns/

 2013/8/8, Dhaval Shah prince_mithi...@yahoo.co.in:
  We are exploring the possibility of hosting Hadoop outside of our data
  centers. I am aware that Hadoop in general isn't exactly designed to run
  on virtual hardware. So a few questions:
  1. Are there any providers out there who would host Hadoop on dedicated
  physical hardware?
  2. Has anyone had success hosting Hadoop on virtualized hardware where
  100% uptime and performance/stability are very important (we use HBase
  as a real-time database and it needs to be up all the time)?
 
  Thanks,
  Dhaval


 --
 Marcos Ortiz Valmaseda
 Product Manager at PDVSA
 http://about.me/marcosortiz





  --
 Nitin Pawar







Re: HDFS / Federated HDFS - Doubts

2013-10-16 Thread Suresh Srinivas
On Wed, Oct 16, 2013 at 9:22 AM, Steve Edison sediso...@gmail.com wrote:

 I have a couple of questions about HDFS federation:

 Can I specify different block store directories for each namespace on a
 datanode?


No. The main idea of federation was not to physically partition the storage
across namespaces, but to use all the available storage across the
namespaces, to ensure better utilization.


 Can I have some datanodes dedicated to a particular namespace only?


As I said earlier, all the datanodes are shared across namespaces. If you
want to dedicate datanodes to a particular namespace, you might as well
create two separate clusters, each with its own set of datanodes and its
own namespace.
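To make the shared-storage point concrete, here is a hedged hdfs-site.xml sketch (the service names, hostnames, and data directories are all made up): each namespace gets its own NameNode address, but there is a single dfs.datanode.data.dir, so every datanode stores blocks for all namespaces in the same directories.

```xml
<!-- Sketch only: ns1/ns2, hosts, and paths are hypothetical. -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
<!-- One shared block-store setting: not partitioned per namespace. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```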





-- 
http://hortonworks.com/download/



Re: Hosting Hadoop

2013-10-16 Thread Dhaval Shah
Thanks for sharing the experience, Alex. I had anticipated the kind of
issues you mention here, but just wanted to make sure I explore all
possible options.
 
Regards,
Dhaval








even possible?

2013-10-16 Thread Patai Sangbutsarakum
The question is on CDH3u4. The cluster was set up before I owned it, and
somehow every server process (namenode/jobtracker/datanode/tasktracker) is
run by a user named foo, and all jobs are launched and run by the foo user,
including ownership of the HDFS directory/file structure; basically foo is
everywhere.

Today I started thinking about correcting this by having:
the namenode + datanode run by an hdfs user
the jobtracker + tasktracker run by a mapred user

So far I have a very short list of things that need to change, which I will
try out in the test cluster, e.g.:
create the hdfs and mapred users everywhere
change ownership of dfs.name.dir, dfs.data.dir, and fs.checkpoint.dir to
hdfs
change ownership of mapred.local.dir to mapred
restart the cluster with hdfs for the HDFS side and mapred for the
MapReduce side
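The list above could be sketched as a dry-run script (every path here is hypothetical; substitute the actual values from hdfs-site.xml and mapred-site.xml, and stop the cluster first):

```shell
# Dry run: print each command instead of executing it, since user creation
# and chown need root. Swap 'echo' for direct execution when ready.
run() { echo "+ $*"; }

run groupadd -f hadoop
run useradd -g hadoop hdfs     # assumption: these users don't exist yet
run useradd -g hadoop mapred

# Hypothetical paths standing in for dfs.name.dir, dfs.data.dir,
# fs.checkpoint.dir, and mapred.local.dir:
run chown -R hdfs:hadoop   /data/dfs/name /data/dfs/data /data/dfs/namesecondary
run chown -R mapred:hadoop /data/mapred/local
```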

I am 100% sure I have missed certain things that have to be taken care of,
so I would really appreciate any input.

However, the original question I would love to ask is: is it even feasible,
and does it make sense, to try to change this?


Thanks
P


Re: even possible?

2013-10-16 Thread Pradeep Gollakota
Don't fix it if it ain't broken =P

There shouldn't be any reason why you couldn't change it (back) to the
standard way Cloudera distributions are set up. Off the top of my head, I
can't think of anything you're missing. But at the same time, if your
cluster is working as is, why change it?





Re: Hadoop automated tests

2013-10-16 Thread hdev ml
Thanks, Konstantin. I will take a look at Bigtop and see if it fits our
scenario.
Harshad

