Re: HDFS and Openstack - avoiding excessive redundancy

2011-11-11 Thread Graeme Seaton
One advantage to using Hadoop replication, though, is that it provides a greater pool of potential servers for M/R jobs to execute on. If you simply use Openstack replication, it will appear to the JobTracker that a particular block exists on only a single server, so tasks for that block can only be scheduled on that server.

Re: HDFS and Openstack - avoiding excessive redundancy

2011-11-11 Thread Dejan Menges
The replication factor for HDFS can easily be changed to 1 in hdfs-site.xml if you don't need its redundancy. Regards, Dejo
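
For illustration, a minimal sketch (assuming the 0.20-era Java API used elsewhere in these threads) of lowering the replication factor per-file rather than cluster-wide. The hdfs-site.xml property Dejan refers to is dfs.replication; note that it only affects files written afterwards, so existing files must be changed explicitly. The path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same knob as the dfs.replication property in hdfs-site.xml;
            // applies to files created through this configuration.
            conf.setInt("dfs.replication", 1);

            // Existing files keep their old factor; change them explicitly.
            FileSystem fs = FileSystem.get(conf);
            fs.setReplication(new Path("/data/example"), (short) 1);
        }
    }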

HDFS and Openstack - avoiding excessive redundancy

2011-11-11 Thread Edmon Begoli
A question related to standing up cloud infrastructure for running Hadoop/HDFS. We are building up an infrastructure using Openstack, which has its own storage management redundancy. We are planning to use Openstack to instantiate Hadoop nodes (HDFS, M/R tasks, Hive, HBase) on demand. The problem is that HDFS then adds its own replication on top of Openstack's, which gives us excessive redundancy.

RE: Input split for a streaming job!

2011-11-11 Thread Tim Broberg
Or you could use the LZO patch and get *fast* splittable compression that doesn't depend on the bz2 generalized splittability scheme:
http://www.cloudera.com/blog/2009/06/parallel-lzo-splittable-compression-for-hadoop/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-c
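
A hedged sketch of what using the LZO patch looks like from the Java API, assuming the hadoop-lzo library from the linked posts is installed and the .lzo inputs have already been indexed so they can be split:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import com.hadoop.mapreduce.LzoTextInputFormat; // from the hadoop-lzo package

    public class LzoJobSketch {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "lzo-example");
            // Splits .lzo files at the block boundaries recorded by the indexer.
            job.setInputFormatClass(LzoTextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... mapper/reducer setup as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }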

Re: Input split for a streaming job!

2011-11-11 Thread bejoy . hadoop
Hi Raj, AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it for production. You can play around with it, but a better approach would be patching your CDH3u1 with the required patches for splittable BZip2; just make sure that your new patch doesn't break anything else.

Re: Input split for a streaming job!

2011-11-11 Thread Raj V
Tim, I am using CDH3u1 (0.20.2+923), which does not have the patch. I will try and use 0.21. Raj

RE: Input split for a streaming job!

2011-11-11 Thread Tim Broberg
What version of Hadoop are you using? We just stumbled on the Jira item for BZIP2 splitting, and it appears to have been added in 0.21. When I diff 0.20.205 vs trunk, I see:

    < public class BZip2Codec implements
    <     org.apache.hadoop.io.compress.CompressionCodec {
    ---
    > @InterfaceAudience.Public ...
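
A quick way to confirm which side of that diff a given build is on (hedged sketch; the SplittableCompressionCodec interface only exists from 0.21 on, so on 0.20.x this won't even compile, which answers the question by itself):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplitCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            CompressionCodec codec =
                new CompressionCodecFactory(conf).getCodec(new Path("input.bz2"));
            // True only when the codec implements the splittable interface.
            System.out.println("splittable: "
                + (codec instanceof SplittableCompressionCodec));
        }
    }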

Re: Input split for a streaming job!

2011-11-11 Thread Raj V
Joey, Anirudh, Bejoy, I am using the TextInputFormat class (org.apache.hadoop.mapred.TextInputFormat), the input files were created using a 32MB block size, and the files are bzip2. So all things point to my input files being splittable. I will continue poking around. Best regards, Raj

Re: Slow shuffle stage?

2011-11-11 Thread Joey Echeverria
You can enable speculative execution on a per-job basis, assuming your administrator hasn't marked it final. The parameter you want is mapred.map.tasks.speculative.execution. -Joey
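
A minimal sketch of setting that parameter per-job with the 0.20-era API (again assuming the admin hasn't marked the property final):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Per-job override; ignored if the admin marked the property final.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            // Equivalent convenience setter on JobConf:
            conf.setMapSpeculativeExecution(true);
        }
    }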

Re: Slow shuffle stage?

2011-11-11 Thread Keith Wiley
Oh, so the reported shuffle time incorporates the mapper time. I suppose that makes sense since the two stages overlap. That explains a lot. As to the map times, it's mostly due to task attempt failures. Any attempt that fails costs a lot of time. It is almost never a data-driven error, so subsequent attempts usually succeed.

Re: Slow shuffle stage?

2011-11-11 Thread Keith Wiley
892 nodes, 4 tasks each, 3:1 mapper/reducer ratio. Each map task outputs four records, ~18MB each. They are fairly evenly distributed to the 17 reducers. As to the bandwidth of the cluster, I don't really know. I'll look into that.

Re: Slow shuffle stage?

2011-11-11 Thread Joey Echeverria
Another thing to look at is the map outliers. The shuffle will start by default when 5% of the maps are done, but won't finish until after the last map is done. Since one of your maps took 37 minutes, your shuffle will take at least that long. I would check the following: Is the input skewed?
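
The 5% threshold Joey mentions is itself tunable; a hedged sketch, assuming the 0.20-era property name mapred.reduce.slowstart.completed.maps (default 0.05):

    import org.apache.hadoop.mapred.JobConf;

    public class SlowstartSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Delay reducer (and thus shuffle) start until 80% of maps finish,
            // e.g. to keep reduce slots free when map outliers are the problem.
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        }
    }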

Re: Input split for a streaming job!

2011-11-11 Thread Bejoy KS
Hi Raj, Is your Streaming job using WholeFileInputFormat or some custom InputFormat that reads files as a whole? If that is the case, then this is the expected behavior. Also, you mentioned you changed dfs.block.size to 32 MB. AFAIK this value would be applicable only for new files written after the change; existing files keep the block size they were created with.
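
A minimal sketch of that last point: dfs.block.size is read at write time, so rewriting the inputs (or creating them with an explicit block size, as below) is what actually changes the block count. The path and buffer/replication values are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Block size is fixed per file at creation; existing files keep theirs.
            FSDataOutputStream out = fs.create(
                new Path("/data/rewritten"),   // hypothetical path
                true,                          // overwrite
                4096,                          // io buffer size
                (short) 3,                     // replication
                32L * 1024 * 1024);            // 32 MB block size
            out.close();
        }
    }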

Re: Input split for a streaming job!

2011-11-11 Thread Anirudh Jhina
Raj, What InputFormat are you using? The compressed format is not splittable, so if you have 73 gzip files, there will be 73 corresponding mappers, one for each file. Look at the TextInputFormat.isSplitable() description. Thanks, ~Anirudh
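
For reference, the gist of that method in the old (org.apache.hadoop.mapred) API is roughly the following, which is why any file whose suffix maps to a compression codec comes back non-splittable on pre-0.21 releases (a paraphrased sketch, not the exact source):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class IsSplitableSketch extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            // Pre-0.21 behavior: any recognized codec suffix means "don't split",
            // so each .gz (or .bz2) file becomes exactly one map task.
            CompressionCodecFactory codecs =
                new CompressionCodecFactory(new JobConf());
            return codecs.getCodec(file) == null;
        }
    }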