Re: Check compression codec of an HDFS file
The SequenceFile.Reader will work perfectly! (I should have seen that.) As always, thanks Harsh.

On Thu, Dec 5, 2013 at 2:22 AM, Harsh J ha...@cloudera.com wrote:

If you're looking for header/contents-based inspection, you could download the file and run the Linux utility 'file' on it, and it should tell you the format. I don't know about Snappy (AFAIK we don't have snappy frame/container format support in Hadoop yet, although upstream Snappy issue 34 seems resolved now), but gzip files can be identified simply by the magic sequence in their header bytes. If it's sequence files you are looking to analyse, a simple way is to read the first few hundred bytes, which should have the codec string in them. Programmatically, you can use https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec() for sequence files.

On Thu, Dec 5, 2013 at 5:10 AM, alex bohr alexjb...@gmail.com wrote:

What's the best way to check the compression codec that an HDFS file was written with? We use both Gzip and Snappy compression, so I want a way to determine how a specific file is compressed. The closest I found is getCodec, but that relies on the file name suffix, which doesn't exist since Reducers typically don't add a suffix to the filenames they create. Thanks

--
Harsh J
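The header-byte check Harsh describes can be sketched outside Hadoop. The snippet below is an illustration, not part of the original thread (the function name and the byte prefixes passed to it are my own choices); it distinguishes gzip and SequenceFile data by their magic bytes, and as the thread notes, raw Snappy output has no such header to detect:

```python
def sniff_format(header: bytes) -> str:
    """Guess a file's format from its first few bytes.

    A sketch of the header-inspection approach described above;
    this is not a Hadoop API."""
    if header[:2] == b'\x1f\x8b':   # gzip magic bytes
        return 'gzip'
    if header[:3] == b'SEQ':        # Hadoop SequenceFile magic (followed by a version byte)
        return 'sequencefile'
    return 'unknown'                # raw Snappy has no magic header

# Usage: grab the first bytes of the HDFS file, e.g.
#   hadoop fs -cat /path/to/file | head -c 16
# and pass them to sniff_format(). For a SequenceFile, reading a few
# hundred bytes will also expose the codec class name as plain text.
```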
Check compression codec of an HDFS file
What's the best way to check the compression codec that an HDFS file was written with? We use both Gzip and Snappy compression, so I want a way to determine how a specific file is compressed. The closest I found is getCodec (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec(org.apache.hadoop.fs.Path)), but that relies on the file name suffix, which doesn't exist since Reducers typically don't add a suffix to the filenames they create. Thanks
Namenode doesn't accurately know datanodes free-space
I see that the Namenode always reports datanodes as having about 5% more space than they actually do. And I recently added some smaller datanodes into the cluster and the drives filled up to 100%, not respecting the 5 GB I had reserved for MapReduce with this property from mapred-site.xml:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>5368709120</value>
</property>

Has anyone else experienced this behavior?
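One thing worth checking: dfs.datanode.du.reserved is an HDFS-side setting, so the DataNode reads it from hdfs-site.xml; if it only appears in mapred-site.xml, the DataNode would never see it, which could explain the drives filling to 100%. A sketch of where the property would go (the value is the 5 GB from the message above; note the reservation applies per volume):

```xml
<!-- hdfs-site.xml on each DataNode: reserve 5 GB per volume for non-HDFS use -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>5368709120</value>
</property>
```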
DFSClient: Could not complete write history logs
Hi, I've suddenly been having the JobTracker freeze up every couple of hours when it goes into a loop trying to write job history files. I get the error in various jobs, but it's always on writing the _logs/history files. I'm running MRv1: Hadoop 2.0.0-cdh4.4.0. Here's a sample error:

2013-10-25 01:59:54,445 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /user/etl/pipeline/stage02/b0c6fc02-1729-4a57-8799-553f4dd789a4/_logs/history/job_201310242314_0013_1382663618303_gxetl_GX-ETL.Bucketer retrying..

I have to stop and restart the jobtracker and then it happens again, and the intervals between errors have been getting shorter.

I see this ticket: https://issues.apache.org/jira/browse/HDFS-1059 - but I ran fsck and the report says 0 corrupt and 0 under-replicated blocks.

I also found this thread: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201110.mbox/%3ccaf8-mnf7p_kr8snhbng1cdj70vget58_v+jnma21owymrc1...@mail.gmail.com%3E

I'm not familiar with the different IO schedulers, so before I change this on all our datanodes - does anyone recommend using deadline instead of CFQ?

We are using the ext4 file system on our datanodes, which have 24 drives (we checked for any bad drives and found one that wasn't responding and pulled it from the config for that machine, but the errors keep happening). Any other advice on addressing this infinite loop beyond the IO scheduler is much appreciated. Thanks, Alex
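Not an answer to the deadline-vs-CFQ question, but before changing anything it may help to confirm what each drive is currently using. A small sketch (standard Linux sysfs paths, nothing Hadoop-specific; the function name is my own) that reads the active scheduler for every block device:

```python
import glob

def current_schedulers():
    """Return {device: active scheduler} read from sysfs.

    The sysfs file lists the available schedulers with the active one
    in brackets, e.g. 'noop deadline [cfq]'."""
    schedulers = {}
    for path in glob.glob('/sys/block/*/queue/scheduler'):
        device = path.split('/')[3]
        with open(path) as f:
            line = f.read().strip()
        # active scheduler appears in brackets; fall back to the raw line
        active = line.split('[')[1].split(']')[0] if '[' in line else line
        schedulers[device] = active
    return schedulers
```

To switch a single device to deadline for testing (as root): `echo deadline > /sys/block/sdb/queue/scheduler`; the change does not survive a reboot unless persisted via udev rules or kernel boot parameters.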
Re: Hosting Hadoop
Hi Dhaval, Sorry, just saw this email (oops) so this might not be relevant, but: We didn't encounter too many of the funky issues we were worried about - random resource constraints or random outages from sharing a physical box with unknown neighbors. But overall we feel the virtualization is robbing us of significant CPU, and more importantly they don't have ideal instance types. The m1.xlarges are too small storage-wise (we ended up paying for more CPU than we needed to get the amount of storage we needed) and the hs1.8xlarges are too big - they have 24 drives, and it feels like we lose a good amount of CPU controlling IO across all those drives, and we now have significantly more storage than we need in order to get enough CPU to keep our SLAs.

For initial set-up, AWS is way quicker than owning hardware. But if you already have hardware, I think moving to AWS will increase your monthly bills to get comparable performance.

On Wed, Aug 21, 2013 at 11:36 AM, Dhaval Shah prince_mithi...@yahoo.co.in wrote:

Alex, did you run into funky issues with EC2/EMR? The kind of issues that would come up because it's a virtualized environment? We currently own our hardware and are just trying to do an ROI analysis on whether moving to Amazon can reduce our admin costs. Currently administering a Hadoop cluster is a bit expensive (in terms of man hours spent trying to replace disks and so on) and we are exploring whether it's possible to avoid some of those costs. Regards, Dhaval

From: alex bohr alexjb...@gmail.com
To: user@hadoop.apache.org
Cc: Dhaval Shah prince_mithi...@yahoo.co.in
Sent: Monday, 12 August 2013 1:41 PM
Subject: Re: Hosting Hadoop

I've had a good experience running a large Hadoop cluster on EC2 instances. After almost 1 year we haven't had any significant downtime, just lost a small number of data nodes. I don't think EMR is an ideal solution if your cluster will be running 24/7.
But for running a large cluster, I don't see how it's more cost-efficient to run in the cloud than to own the hardware, and we're trying to move off the cloud onto our own hardware. Can I ask why you're looking to move to the cloud?

On Fri, Aug 9, 2013 at 10:42 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

Check Altiscale as well.

On Fri, Aug 9, 2013 at 3:05 AM, Dhaval Shah prince_mithi...@yahoo.co.in wrote:

Thanks for the list, Marcos. I will go through the slides/links. I think that's helpful. Regards, Dhaval

From: Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com
To: Dhaval Shah prince_mithi...@yahoo.co.in
Cc: user@hadoop.apache.org
Sent: Thursday, 8 August 2013 4:50 PM
Subject: Re: Hosting Hadoop

Well, it all depends, because many companies use cloud computing platforms like Amazon EMR, VMware, and Rackspace Cloud for Hadoop hosting:
http://aws.amazon.com/elasticmapreduce
http://www.vmware.com/company/news/releases/vmw-mapr-hadoop-062013.html
http://bitrefinery.com/services/hadoop-hosting
http://www.joyent.com/products/compute-service/features/hadoop

There are a lot of companies using HBase hosted in the cloud. The last HBaseCon was full of great use cases:
HBase at Pinterest: http://www.hbasecon.com/sessions/apache-hbase-operations-at-pinterest/
HBase at Groupon: http://www.hbasecon.com/sessions/apache-hbase-at-groupon/
A great talk by Benoit on scalable network design for HBase: http://www.hbasecon.com/sessions/scalable-network-designs-for-apache-hbase/
Using Coprocessors to Index Columns in an Elasticsearch Cluster: http://www.hbasecon.com/sessions/using-coprocessors-to-index-columns/

2013/8/8, Dhaval Shah prince_mithi...@yahoo.co.in:

We are exploring the possibility of hosting Hadoop outside of our data centers. I am aware that Hadoop in general isn't exactly designed to run on virtual hardware. So a few questions:
1. Are there any providers out there who would host Hadoop on dedicated physical hardware?
2. Has anyone had success hosting Hadoop on virtualized hardware where 100% uptime and performance/stability are very important (we use HBase as a real-time database and it needs to be up all the time)?

Thanks, Dhaval

--
Marcos Ortiz Valmaseda
Product Manager at PDVSA
http://about.me/marcosortiz

--
Nitin Pawar
in place upgrade to CDH4
Hi, I'm working on upgrading my cluster from CDH3u5 to CDH4, trying to do the upgrade in place rather than creating a new cluster and migrating over. I'm doing this on a test cluster right now, but ran into an issue.

First I uninstalled the CDH3 packages and installed the CDH4 ones, then upgraded the namenode and started the namenode service. Then I started the datanode service on one of the data nodes, and the machine started filling up quickly. It seems like it's rewriting the data into a new format.

Is this correct - does the upgrade process rewrite the old data into a new format? And if so, does that mean I need a lot of free space on the data nodes that are being upgraded? Thanks
fsck -move is copying not moving
I have some corrupt blocks that I want to move to lost+found and work on recovering from the good blocks. So I ran:

hadoop fsck /my/bad/filepath -move

And it copied a bunch of files to lost+found/my/bad/filepath, but the corrupt files are still at /my/bad/filepath. Is that expected? I thought fsck should move, not copy, the corrupt files.

I then ran fsck /my/bad/filepath -delete, and it deleted the bad file, so it's all fine - but that extra step seems unnecessary. I'm on CDH3u5. Thanks
Re: Hosting Hadoop
I've had a good experience running a large Hadoop cluster on EC2 instances. After almost 1 year we haven't had any significant downtime, just lost a small number of data nodes. I don't think EMR is an ideal solution if your cluster will be running 24/7. But for running a large cluster, I don't see how it's more cost-efficient to run in the cloud than to own the hardware, and we're trying to move off the cloud onto our own hardware. Can I ask why you're looking to move to the cloud?
Best Practices: mapred.job.tracker.handler.count, dfs.namenode.handler.count
Hi, I'm looking for some feedback on how to decide how many threads to assign to the Namenode and Jobtracker. I currently have 24 data nodes (running CDH3) and am finding a lot of varying advice on how to set these properties and change them as the cluster grows.

Some (older) documentation (http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/, http://hadoop.apache.org/docs/r1.0.4/mapred-default.html) has it in the range of the default of 10 for a smallish cluster. And the O'Reilly Hadoop Operations book puts it a good deal higher and gives a handy, precise formula: the natural log of the number of nodes, times 20:

python -c 'import math ; print int(math.log(24) * 20)'

which equals 63 for 24 nodes.

Does anyone have strong opinions on how to set these variables? Does anyone else use natural log times 20? Are there any other factors beyond the number of nodes that should be considered? I'm assuming memory available on the Namenode/Jobtracker plays a big part, but right now I have a good amount of unused memory, so I'm OK going with a higher number. My jobtracker is occasionally freezing, so this is one of the configs I think might be causing problems.

And the second, less important, part of the question: is there any need to put these properties in their respective config files (mapred-site.xml, hdfs-site.xml) on any node other than the Namenode? I've looked but have never found any good documentation discussing which properties need to be on which machine, and I'd prefer to keep properties off a machine if they don't need to be there (so I don't need to restart anything if the property changes, and environments stay simpler). Thanks
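For what it's worth, the Hadoop Operations rule of thumb above is easy to tabulate for a few cluster sizes. A small sketch (the clamping to the old default of 10 for tiny clusters is my own addition, not from the book):

```python
import math

def handler_count(num_nodes, floor=10):
    """Heuristic from the 'Hadoop Operations' book: 20 * ln(cluster size),
    never dropping below the old default of 10."""
    return max(floor, int(math.log(num_nodes) * 20))

# Tabulate the suggestion for a few cluster sizes
for n in (10, 24, 50, 100):
    print(n, handler_count(n))   # 24 nodes -> 63, matching the one-liner above
```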