Hi All,

I was going through an implementation scenario for avoiding or deleting zero-byte files in HDFS. I am using a partitioned Hive table where the data in each partition comes from an INSERT OVERWRITE command using a SELECT over a few other tables. Sometimes zero-byte files are generated in those partitions, and over time the number of these files in HDFS grows enormously, degrading the performance of Hadoop jobs on that table/folder. I am looking for the best way to either avoid generating these zero-byte files or to delete them.
I can think of a few ways to implement this:

1) Programmatically, using the FileSystem object to find and delete the zero-byte files.
2) Using a combination of Hadoop fs and Linux commands to identify the zero-byte files and delete them.
3) Using LazyOutputFormat (applicable in Hadoop-based custom jobs), so that empty output files are never created in the first place.

Kindly guide me on efficient ways to achieve this.

Regards,
Abhishek
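For approach (1), this is roughly the logic I have in mind: walk the tree, check each file's length, and delete the empty ones. Against HDFS this would go through the Java FileSystem API (listFiles/getFileStatus/delete); the sketch below shows the same logic against a local directory purely for illustration, with throwaway file names.

```python
import os
import tempfile

def delete_zero_byte_files(root):
    """Remove every regular file of size 0 under root; return deleted paths."""
    deleted = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                deleted.append(path)
    return deleted

# Demo on a throwaway directory with one empty and one non-empty file
# (file names here are made up for the example).
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "empty_000000_0"), "w").close()
    with open(os.path.join(root, "data_000001_0"), "w") as f:
        f.write("rows")
    removed = delete_zero_byte_files(root)
    print(len(removed))       # 1
    print(os.listdir(root))   # ['data_000001_0']
```

In the HDFS version, the same loop would use `FileStatus.getLen() == 0` as the size check and `FileSystem.delete(path, false)` for the removal.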
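For approach (2), something like the pipeline below is what I am considering: `hdfs dfs -ls -R` prints size in column 5 and path in column 8, so awk can pick out zero-byte regular files (skipping directory lines, which start with `d`). The table path is hypothetical; since the real pipeline needs a running cluster, the filter is demonstrated here on a captured sample of `-ls -R` output.

```shell
# Real pipeline (requires HDFS; shown for reference):
#   hdfs dfs -ls -R /warehouse/mytable \
#     | awk '$1 !~ /^d/ && $5 == 0 {print $8}' \
#     | xargs -r -n 100 hdfs dfs -rm

# The awk filter, demonstrated on sample `hdfs dfs -ls -R` output:
sample='-rw-r--r--   3 hive hadoop          0 2015-06-01 10:00 /warehouse/mytable/dt=2015-06-01/000000_0
-rw-r--r--   3 hive hadoop       1024 2015-06-01 10:00 /warehouse/mytable/dt=2015-06-01/000001_0
drwxr-xr-x   - hive hadoop          0 2015-06-01 10:00 /warehouse/mytable/dt=2015-06-02'

# Prints only the zero-byte file's path.
echo "$sample" | awk '$1 !~ /^d/ && $5 == 0 {print $8}'
```

Batching with `xargs -n 100` keeps each `hdfs dfs -rm` invocation (and its JVM startup cost) amortized over many paths, and `-r` avoids running `rm` at all when no zero-byte files are found.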