Re: Check compression codec of an HDFS file

2013-12-05 Thread alex bohr
The SequenceFile.Reader will work perfectly!  (I should have seen that.)

As always - thanks Harsh


On Thu, Dec 5, 2013 at 2:22 AM, Harsh J ha...@cloudera.com wrote:

 If you're looking for file header/contents based inspection, you could
 download the file and run the Linux utility 'file' on it, and it
 should tell you the format.

 I don't know about Snappy (AFAIK, we don't have snappy
 frame/container format support in Hadoop yet, although upstream Snappy
 issue 34 seems resolved now), but Gzip files can be identified simply
 by their header bytes for the magic sequence.

 If it's sequence files you are looking to analyse, a simple way is to
 read its first few hundred bytes, which should have the codec string
 in it. Programmatically you can use

 https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
 for sequence files.

 On Thu, Dec 5, 2013 at 5:10 AM, alex bohr alexjb...@gmail.com wrote:
  What's the best way to check the compression codec that an HDFS file was
  written with?
 
  We use both Gzip and Snappy compression so I want a way to determine how a
  specific file is compressed.
 
  The closest I found is the getCodec, but that relies on the file name
  suffix ... which doesn't exist, since Reducers typically don't add a suffix
  to the filenames they create.
 
  Thanks



 --
 Harsh J
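
A minimal sketch of the getCompressionCodec() approach described above,
using the Hadoop 1.x SequenceFile.Reader API (the class name here is
illustrative; that the method returns null for uncompressed files is an
assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;

public class SeqFileCodec {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Open the sequence file and ask it directly which codec it was
    // written with.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      CompressionCodec codec = reader.getCompressionCodec();
      System.out.println(codec == null
          ? "uncompressed"
          : codec.getClass().getName());
    } finally {
      reader.close();
    }
  }
}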



Check compression codec of an HDFS file

2013-12-04 Thread alex bohr
What's the best way to check the compression codec that an HDFS file was
written with?

We use both Gzip and Snappy compression so I want a way to determine how a
specific file is compressed.

The closest I found is getCodec
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec(org.apache.hadoop.fs.Path))
but that relies on the file name suffix ... which doesn't exist, since
Reducers typically don't add a suffix to the filenames they create.

Thanks
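
As the reply above notes, gzip files can be identified by their magic
header bytes (0x1f 0x8b). A minimal sketch of that check for a file on
HDFS - class and method names here are illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipHeaderCheck {
  // Returns true if the file starts with the standard gzip magic bytes.
  public static boolean isGzip(FileSystem fs, Path p) throws IOException {
    FSDataInputStream in = fs.open(p);
    try {
      return in.read() == 0x1f && in.read() == 0x8b;
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(isGzip(fs, new Path(args[0])) ? "gzip" : "not gzip");
  }
}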


Namenode doesn't accurately know datanodes free-space

2013-10-29 Thread alex bohr
I see that the Namenode always reports datanodes as having about 5% more
space than they actually have.

And I recently added some smaller datanodes into the cluster, and the drives
filled up to 100%, not respecting the 5GB I had reserved for MapReduce
with this property from mapred-site.xml:
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>5368709120</value>
</property>
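
(For reference, a rough sketch of the arithmetic that setting implies -
illustrative numbers, not the actual DataNode code; the reserved value
applies per volume:)

public class ReservedSpaceMath {
  public static void main(String[] args) {
    // Illustrative only: dfs.datanode.du.reserved is subtracted, per volume,
    // from the raw capacity HDFS treats as usable.
    long volumeCapacity = 2000L * 1024 * 1024 * 1024; // hypothetical ~2 TB drive
    long reserved = 5368709120L;                      // 5 GB, as configured above
    System.out.println("usable by HDFS: " + (volumeCapacity - reserved));
  }
}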

Has anyone else experienced this behavior?


DFSClient: Could not complete write history logs

2013-10-25 Thread alex bohr
Hi,
I've suddenly been having the JobTracker freeze up every couple hours when
it goes into a loop trying to write Job history files.

I get the error in various jobs, but it's always on writing the
_logs/history files.

I'm running MRv1: Hadoop 2.0.0-cdh4.4.0

Here's a sample error:
2013-10-25 01:59:54,445 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete
/user/etl/pipeline/stage02/b0c6fc02-1729-4a57-8799-553f4dd789a4/_logs/history/job_201310242314_0013_1382663618303_gxetl_GX-ETL.Bucketer
retrying..

I have to stop and restart the jobtracker and then it happens again, and
the intervals between errors have been getting shorter.

I see this ticket:
https://issues.apache.org/jira/browse/HDFS-1059
But I ran fsck and the report says 0 corrupt and 0 under-replicated blocks.

I also found this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201110.mbox/%3ccaf8-mnf7p_kr8snhbng1cdj70vget58_v+jnma21owymrc1...@mail.gmail.com%3E

I'm not familiar with the different IO schedulers, so before I change this
on all our datanodes - does anyone recommend using deadline instead of CFQ?
We are using the ext4 file system on our datanodes, which have 24 drives. (We
checked for bad drives and found one that wasn't responding and pulled it
from the config for that machine, but the errors keep happening.)

Any other advice on addressing this infinite loop, beyond the IO scheduler,
is much appreciated.
Thanks,
Alex


Re: Hosting Hadoop

2013-10-16 Thread alex bohr
Hi Dhaval,
Sorry just saw this email (oops) so might not be relevant - but:
We didn't encounter too many of the funky issues we were worried about -
the random resource constraints or random outages that can happen when
sharing a physical box with unknown neighbors.

But overall we feel the virtualization is robbing us of significant CPU,
and more importantly AWS doesn't have ideal instance types.  The m1.xlarges
are too small storage-wise (we ended up paying for more CPU than we needed
to get the amount of storage we needed), and the hs1.8xlarges are too big -
they have 24 drives, and it feels like we lose a good amount of CPU
controlling IO across all those drives.  We now have significantly more
storage than we need in order to get enough CPU to keep our SLAs.

For initial set-up, AWS is way quicker than owning hardware.  But if you
already have hardware, I think moving to AWS will increase your monthly
bills to get comparable performance.


On Wed, Aug 21, 2013 at 11:36 AM, Dhaval Shah
prince_mithi...@yahoo.co.in wrote:

 Alex, did you run into funky issues with EC2/EMR? The kind of issues that
 would come up because it's a virtualized environment? We currently own our
 hardware and are just trying to do an ROI analysis on whether moving to
 Amazon can reduce our admin costs. Currently administering a Hadoop cluster
 is a bit expensive (in terms of man hours spent trying to replace disks and
 so on), and we are exploring whether it's possible to avoid some of those
 costs.

 Regards,
 Dhaval

   --
  *From:* alex bohr alexjb...@gmail.com
 *To:* user@hadoop.apache.org
 *Cc:* Dhaval Shah prince_mithi...@yahoo.co.in
 *Sent:* Monday, 12 August 2013 1:41 PM
 *Subject:* Re: Hosting Hadoop

 I've had good experience running a large hadoop cluster on EC2 instances.
  After almost 1 year we haven't had any significant down time, just lost a
 small # of data nodes.
 I don't think EMR is an ideal solution if your cluster will be running
 24/7.

 But for running a large cluster, I don't see how it's more cost
 efficient to run in the cloud than to own the hardware, and we're trying to
 move off the cloud onto our own hardware.  Can I ask why you're looking to
 move to the cloud?


 On Fri, Aug 9, 2013 at 10:42 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 check altiscale as well


 On Fri, Aug 9, 2013 at 3:05 AM, Dhaval Shah
 prince_mithi...@yahoo.co.in wrote:

 Thanks for the list Marcos. I will go through the slides/links. I think
 that's helpful

 Regards,
 Dhaval

   --
  *From:* Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com
 *To:* Dhaval Shah prince_mithi...@yahoo.co.in
 *Cc:* user@hadoop.apache.org
 *Sent:* Thursday, 8 August 2013 4:50 PM
 *Subject:* Re: Hosting Hadoop

 Well, it all depends, because many companies use Cloud Computing
 platforms like Amazon EMR, VMware, or Rackspace Cloud for Hadoop
 hosting:
 http://aws.amazon.com/elasticmapreduce
 http://www.vmware.com/company/news/releases/vmw-mapr-hadoop-062013.html
 http://bitrefinery.com/services/hadoop-hosting
 http://www.joyent.com/products/compute-service/features/hadoop

 There are a lot of companies using HBase hosted in the cloud. The last
 HBaseCon was full of great use cases:
 HBase at Pinterest:
 http://www.hbasecon.com/sessions/apache-hbase-operations-at-pinterest/

 HBase at Groupon
 http://www.hbasecon.com/sessions/apache-hbase-at-groupon/

 A great talk by Benoit on network design for HBase:
 http://www.hbasecon.com/sessions/scalable-network-designs-for-apache-hbase/

 Using Coprocessors to Index Columns in an Elasticsearch Cluster
 http://www.hbasecon.com/sessions/using-coprocessors-to-index-columns/

 2013/8/8, Dhaval Shah prince_mithi...@yahoo.co.in:
  We are exploring the possibility of hosting Hadoop outside of our data
  centers. I am aware that Hadoop in general isn't exactly designed to run on
  virtual hardware. So a few questions:
  1. Are there any providers out there who would host Hadoop on dedicated
  physical hardware?
  2. Has anyone had success hosting Hadoop on virtualized hardware where 100%
  uptime and performance/stability are very important (we use HBase as a
  real-time database and it needs to be up all the time)?
 
  Thanks,
  Dhaval


 --
 Marcos Ortiz Valmaseda
 Product Manager at PDVSA
 http://about.me/marcosortiz





  --
 Nitin Pawar







in place upgrade to CDH4

2013-09-18 Thread alex bohr
Hi,
I'm working on upgrading my cluster from CDH3u5 to CDH4.  Trying to do the
upgrade in place rather than creating a new cluster and migrating over.

Doing this on a test cluster right now, but ran into an issue -
First I uninstalled the CDH3 packages and installed the CDH4 ones, then
upgraded the namenode and then started the namenode service.
Then I started the datanode service on one of the data nodes and the
machine started filling up quickly.
It seems like it's re-writing the data into a new format.  Is this correct -
does the upgrade process rewrite the old data into a new format?  And if so,
does that mean I need a lot of free space on the data nodes that are being
upgraded?

Thanks


fsck -move is copying not moving

2013-09-13 Thread alex bohr
I have some corrupt blocks that I want to move to lost+found and work on
recovering from the good blocks.

So I ran
hadoop fsck  /my/bad/filepath -move

And it copied a bunch of files to lost+found/my/bad/filepath.  But the
corrupt files are still at  /my/bad/filepath.

Is that expected?  I thought fsck should Move, not Copy, the corrupt files.

... I then ran fsck /my/bad/filepath -delete and it deleted the bad file,
so it's all fine, but that seems unnecessary.

I'm on CDH3u5.

Thanks


Re: Hosting Hadoop

2013-08-12 Thread alex bohr
I've had good experience running a large hadoop cluster on EC2 instances.
 After almost 1 year we haven't had any significant down time, just lost a
small # of data nodes.
I don't think EMR is an ideal solution if your cluster will be running 24/7.

But for running a large cluster, I don't see how it's more cost
efficient to run in the cloud than to own the hardware, and we're trying to
move off the cloud onto our own hardware.  Can I ask why you're looking to
move to the cloud?


On Fri, Aug 9, 2013 at 10:42 AM, Nitin Pawar nitinpawar...@gmail.com wrote:

 check altiscale as well


 On Fri, Aug 9, 2013 at 3:05 AM, Dhaval Shah
 prince_mithi...@yahoo.co.in wrote:

 Thanks for the list Marcos. I will go through the slides/links. I think
 that's helpful

 Regards,
 Dhaval

   --
  *From:* Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com
 *To:* Dhaval Shah prince_mithi...@yahoo.co.in
 *Cc:* user@hadoop.apache.org
 *Sent:* Thursday, 8 August 2013 4:50 PM
 *Subject:* Re: Hosting Hadoop

 Well, it all depends, because many companies use Cloud Computing
 platforms like Amazon EMR, VMware, or Rackspace Cloud for Hadoop
 hosting:
 http://aws.amazon.com/elasticmapreduce
 http://www.vmware.com/company/news/releases/vmw-mapr-hadoop-062013.html
 http://bitrefinery.com/services/hadoop-hosting
 http://www.joyent.com/products/compute-service/features/hadoop

 There are a lot of companies using HBase hosted in the cloud. The last
 HBaseCon was full of great use cases:
 HBase at Pinterest:
 http://www.hbasecon.com/sessions/apache-hbase-operations-at-pinterest/

 HBase at Groupon
 http://www.hbasecon.com/sessions/apache-hbase-at-groupon/

 A great talk by Benoit on network design for HBase:

 http://www.hbasecon.com/sessions/scalable-network-designs-for-apache-hbase/

 Using Coprocessors to Index Columns in an Elasticsearch Cluster
 http://www.hbasecon.com/sessions/using-coprocessors-to-index-columns/

 2013/8/8, Dhaval Shah prince_mithi...@yahoo.co.in:
  We are exploring the possibility of hosting Hadoop outside of our data
  centers. I am aware that Hadoop in general isn't exactly designed to run on
  virtual hardware. So a few questions:
  1. Are there any providers out there who would host Hadoop on dedicated
  physical hardware?
  2. Has anyone had success hosting Hadoop on virtualized hardware where 100%
  uptime and performance/stability are very important (we use HBase as a
  real-time database and it needs to be up all the time)?
 
  Thanks,
  Dhaval


 --
 Marcos Ortiz Valmaseda
 Product Manager at PDVSA
 http://about.me/marcosortiz





 --
 Nitin Pawar



Best Practices: mapred.job.tracker.handler.count, dfs.namenode.handler.count

2013-03-04 Thread Alex Bohr
Hi,
I'm looking for some feedback on how to decide how many threads to assign
to the Namenode and Jobtracker.

I currently have 24 data nodes (running CDH3) and am finding a lot of varying
advice on how to set these properties and change them as the cluster grows.

Some (older) documentation
(http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/,
http://hadoop.apache.org/docs/r1.0.4/mapred-default.html) has it in the
range of the default 10 for a smallish cluster.
And the O'Reilly Hadoop Operations book puts it a good deal higher and
gives a handy, precise formula: natural log of # of nodes x 20, or:
python -c 'import math ; print int(math.log(24) * 20)'
Which = 63 for 24 nodes.
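
A quick sketch of that heuristic in Java for a few cluster sizes (this is
just the book's rule of thumb, nothing more):

public class HandlerCount {
  public static void main(String[] args) {
    // Hadoop Operations rule of thumb: handlers ~= ln(nodes) * 20.
    for (int nodes : new int[] {10, 24, 50, 100}) {
      System.out.println(nodes + " nodes -> "
          + (int) (Math.log(nodes) * 20) + " handlers");
    }
  }
}

For 24 nodes that prints 63, matching the number above.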

Does anyone have strong opinions on how to set these variables?  Does
anyone else use the natural log x 20?
Any other factors beyond # of nodes that should be considered?  I'm assuming
memory available on the NameNode/JobTracker plays a big part, but right now
I have a good amount of unused memory, so I'm OK going with a higher #.
My JobTracker is occasionally freezing, so this is one of the configs I
think might be causing problems.

And the second, less important part of the question: is there any need to put
these properties in their respective config files (mapred-site.xml,
hdfs-site.xml) on any node other than the Namenode?
I've looked but have never found any good documentation discussing which
properties need to be on which machine, and I'd prefer to keep properties
off a machine if they don't need to be there (so I don't need to restart
anything if a property changes, and to keep environments simpler).

Thanks