Hi Abhishek
  Merging happens as a last stage of Hive jobs. Say your Hive query is 
translated to n MR jobs; when you enable merge you can set a target size 
for the merged files (usually the HDFS block size). After those n MR jobs, 
Hive automatically triggers a map-only job that merges the smaller output 
files into larger ones. The intermediate output files are not retained, so 
only the final, sufficiently large files remain in HDFS.
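For reference, the merge behaviour described above is controlled by a handful of Hive properties; a typical setup looks like the following (the values shown are illustrative — check the defaults for your Hive release):

```sql
-- Merge small files produced by map-only jobs
SET hive.merge.mapfiles=true;
-- Merge small files produced by map-reduce jobs
SET hive.merge.mapredfiles=true;
-- Target size for the merged files (here ~256 MB, roughly an HDFS block)
SET hive.merge.size.per.task=256000000;
-- Only trigger the merge job when the average output file is below this size
SET hive.merge.smallfiles.avgsize=16000000;
```

When the average size of a job's output files falls below hive.merge.smallfiles.avgsize, Hive launches the extra map-only merge job automatically.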

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: Abhishek Pratap Singh <manu.i...@gmail.com>
Date: Mon, 26 Mar 2012 14:41:53 
To: <common-u...@hadoop.apache.org>; <bejoy.had...@gmail.com>
Reply-To: user@hive.apache.org
Cc: <user@hive.apache.org>
Subject: Re: Zero Byte file in HDFS

Thanks Bejoy for this input. Will this merging combine all the small
files up to at least the block size for the very first mappers of the Hive job?
Well, I'll explore this. My interest in deleting the zero-byte files from
HDFS comes from reducing the cost of bookkeeping these files in the system. The
metadata for each file in HDFS occupies roughly 150 bytes of namenode memory, so
thousands or millions of zero-byte files in HDFS will cost a lot.
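A minimal sketch of the programmatic cleanup idea (approach 1 in my original mail below), illustrated here against a local directory with Python's os module; on HDFS the same walk-and-delete pattern maps onto the FileSystem API (listStatus, FileStatus.getLen, delete) or a Python binding to it:

```python
import os

def delete_zero_byte_files(root):
    """Walk `root` and delete every regular file whose size is 0 bytes.

    Local-filesystem sketch only; on HDFS the equivalent calls are
    FileSystem.listStatus(), FileStatus.getLen() and FileSystem.delete().
    """
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                deleted.append(path)
    return deleted
```

On a real cluster you would want to batch the deletions rather than issue one RPC per file, to avoid hammering the namenode.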

Regards,
Abhishek


On Mon, Mar 26, 2012 at 2:27 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:

> Hi Abhishek
>       I can propose a better solution: enable merge in Hive, so that the
> smaller files are merged up to at least the HDFS block size (your choice),
> which will benefit subsequent Hive jobs on the same data.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -----Original Message-----
> From: Abhishek Pratap Singh <manu.i...@gmail.com>
> Date: Mon, 26 Mar 2012 14:20:18
> To: <common-u...@hadoop.apache.org>; <user@hive.apache.org>
> Reply-To: common-u...@hadoop.apache.org
> Subject: Zero Byte file in HDFS
>
> Hi All,
>
> I was just going through an implementation scenario for avoiding or
> deleting zero-byte files in HDFS. I'm using a Hive partitioned table where
> the data in each partition comes from an INSERT OVERWRITE command using a
> SELECT from a few other tables.
> Sometimes 0-byte files are generated in those partitions, and over time
> the number of these files in HDFS will grow enormously, degrading the
> performance of Hadoop jobs on that table / folder. I'm looking for the best
> way to avoid generating, or to delete, the zero-byte files.
>
> I can think of a few ways to implement this:
>
> 1) Programmatically, using the FileSystem object to clean up the zero-byte
> files.
> 2) Using a combination of hadoop fs and Linux commands to identify the
> zero-byte files and delete them.
> 3) LazyOutputFormat (applicable in custom Hadoop-based jobs).
>
> Kindly advise on efficient ways to achieve this.
>
> Regards,
> Abhishek
>
>
