Hi Erik,

On Fri, Jun 26, 2009 at 4:24 PM, Erik Forsberg<forsb...@opera.com> wrote:

> I'm considering using a Hadoop version with .bz2 support - probably
> Cloudera's 18.3 dist, but if I understand correctly, .bz2 files are
> not split.

Yes. bzip2-compressed files are not splittable in current versions;
support may be introduced in a later version. You may be interested
in this patch:
https://issues.apache.org/jira/browse/HADOOP-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

> I expect that for most jobs, the number of log files will exceed the
> number of cores in my hadoop cluster.
>
> Is it possible to estimate if I'll get a performance hit
> because of the lack of splitting under these circumstances?

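Because the bzip2 files are not split, each file is processed by a
single map task, so it is as if your HDFS block size were 720M. Even
though the number of your log files may exceed the number of cores in
your cluster, such large units of work will make load balancing worse:
a core that finishes early has no smaller pieces left to pick up.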
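
To get a rough feel for the size of the hit, here is a minimal
sketch in Python. It is a toy model, not a Hadoop benchmark. The
file count (10), the slot count (8), the 64 MB split size and the
uniform throughput are all assumptions for illustration, and
`makespan` and `split_sizes` are hypothetical helper names. It
ignores task startup cost, data locality and decompression speed.

```python
import heapq


def makespan(task_sizes_mb, slots, mb_per_sec=1.0):
    """Greedy schedule: each task runs on the slot that frees up first.

    Returns the time at which the last task finishes.
    """
    free_at = [0.0] * slots  # time at which each slot becomes idle
    heapq.heapify(free_at)
    for size in task_sizes_mb:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + size / mb_per_sec)
    return max(free_at)


def split_sizes(file_mb, split_mb):
    """Sizes of the map inputs if a file could be split every split_mb."""
    full, rest = divmod(file_mb, split_mb)
    return [split_mb] * full + ([rest] if rest else [])


# Assumed workload: 10 log files of 720 MB each, 8 map slots.
files_mb = [720] * 10
slots = 8

# Unsplit bzip2: one map task per file.
unsplit = makespan(files_mb, slots)

# Hypothetical splittable case with an assumed 64 MB split size.
splits = [s for f in files_mb for s in split_sizes(f, 64)]
split = makespan(splits, slots)

print("unsplit:", unsplit, "split:", split)
```

In this model the unsplit job needs two full waves of 720 MB tasks,
with six slots idle during the second wave, while the split job
keeps all slots busy almost until the end. The gap shrinks as the
number of files grows relative to the number of slots.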


-- 
Zhong Wang
