One advantage of using Hadoop replication, though, is that it provides a
greater pool of potential servers for M/R jobs to execute on. If you
simply use OpenStack replication, it will appear to the JobTracker that a
particular block exists on only a single server, and tasks should only be
executed on that server.
The HDFS replication factor can easily be changed to 1 in hdfs-site.xml if you
don't need its redundancy.
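For reference, a minimal hdfs-site.xml fragment that does this (dfs.replication is the standard property; the usual default is 3) would look something like:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Note that this only affects files written after the change; the replication of existing files can be adjusted with `hadoop fs -setrep`.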
Regards,
Dejo
Sent from my iPhone
On 12. 11. 2011., at 03:58, Edmon Begoli wrote:
A question related to standing up cloud infrastructure for running Hadoop/HDFS.
We are building an infrastructure using OpenStack, which has its own
storage-management redundancy.
We are planning to use OpenStack to instantiate Hadoop nodes (HDFS,
M/R tasks, Hive, HBase) on demand.
The problem
Or you could use the LZO patch and get *fast* splittable compression that
doesn't depend on the bz2 generalized splittability scheme:
http://www.cloudera.com/blog/2009/06/parallel-lzo-splittable-compression-for-hadoop/
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-c
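If you go the LZO route, the codecs from the hadoop-lzo project have to be registered in core-site.xml; a sketch along these lines (class names are the ones used by that project, and the exact list may differ in your setup):

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
```

Splitting then comes from indexing the .lzo files (the project ships an LzoIndexer for that), not from the codec alone.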
Hi Raj
AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it
for production. You can play around with it, but a better approach would be
patching your CDH3u1 with the required patches for splittable BZip2; just make
sure that your new patch doesn't break anything else.
Tim
I am using CDH3u1 (0.20.2+923), which does not have the patch.
I will try using 0.21.
Raj
>
>From: Tim Broberg
>To: "common-user@hadoop.apache.org" ; Raj V
>; Joey Echeverria
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a s
What version of hadoop are you using?
We just stumbled on the Jira item for BZIP2 splitting, and it appears to have
been added in 0.21.
When I diff 0.20.205 vs trunk, I see
< public class BZip2Codec implements
< org.apache.hadoop.io.compress.CompressionCodec {
---
> @InterfaceAudience.Publi
Joey, Anirudh, Bejoy,
I am using the TextInputFormat class (org.apache.hadoop.mapred.TextInputFormat),
and the input files were created using a 32MB block size and are bzip2-compressed.
So all signs point to my input files being splittable.
I will continue poking around.
- best regards
Raj
You can enable speculative execution on a per-job basis, assuming your
administrator hasn't marked it final. The parameter you want is
mapred.map.tasks.speculative.execution.
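If the job driver goes through ToolRunner/GenericOptionsParser, the parameter can be passed per run on the command line; a sketch with placeholder jar, class, and path names:

```shell
hadoop jar my-job.jar MyJob \
    -D mapred.map.tasks.speculative.execution=true \
    in/ out/
```

Equivalently, call conf.setBoolean("mapred.map.tasks.speculative.execution", true) on the JobConf in the driver before submitting.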
-Joey
On Fri, Nov 11, 2011 at 10:57 AM, Keith Wiley wrote:
Oh, so the reported shuffle time incorporates the mapper time. I suppose that
makes sense since the two stages overlap. That explains a lot.
As to the map times, it is mostly due to task attempt failures. Any attempt that
fails costs a lot of time. It is almost never a data-driven error, so
sub
892 nodes, 4 tasks each, 3:1 mapper/reducer ratio. Each map task outputs four
records, ~18MB each. They are fairly evenly distributed to the 17 reducers.
As to the bandwidth of the cluster, I don't really know. I'll look into that.
On Nov 10, 2011, at 7:07 PM, Prashant Sharma wrote:
> Can y
Another thing to look at is the map outlier. The shuffle will start by default
when 5% of the maps are done, but won't finish until after the last map is
done. Since one of your maps took 37 minutes, your shuffle will take at least
that long.
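For what it's worth, that 5% threshold is itself tunable: raising mapred.reduce.slowstart.completed.maps per job delays when reducers start fetching (the value below is illustrative; the default is 0.05):

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```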
I would check the following:
Is the input skewed?
Does t
Hi Raj
Is your Streaming job using WholeFileInputFormat or some custom
input format that reads files as a whole? If that is the case, then this is
the expected behavior.
Also, you mentioned you changed dfs.block.size to 32 MB. AFAIK
this value would be applicable only for new fi
Raj,
What InputFormat are you using? Gzip is not splittable, so
if you have 73 gzip files, there will be exactly 73 corresponding mappers, one
per file. Look at the TextInputFormat.isSplitable() description.
Thanks,
~Anirudh
On Thu, Nov 10, 2011 at 2:40 PM, Raj V wrote: