Also consider setting up a periodic Spark job (or similar with Impala or Hive) to read the Avro files and rewrite them in a columnar format (Parquet or ORC). That gives you small-files compaction (assuming you delete the source Avro files afterwards) as well as better analytical read performance from the columnar files.
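Something along these lines would do it (just a rough sketch, not a drop-in job: the HDFS paths, the dt=2018-10-12 partition, the output file count and the spark-avro dependency are placeholders you'd adapt, and you'd schedule it with cron, Oozie or whatever you already run):

import org.apache.spark.sql.SparkSession

object CompactAvroToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-avro-to-parquet")
      .getOrCreate()

    // Read all of the small Avro files Flume has closed for one date
    // partition. (Spark 2.4+ with the spark-avro package; older versions
    // use "com.databricks.spark.avro" as the format name.)
    val events = spark.read
      .format("avro")
      .load("hdfs:///flume/events/dt=2018-10-12/*.avro")

    // coalesce() is the actual compaction step: it caps the number of
    // output files regardless of how many small input files there were.
    events.coalesce(4)
      .write
      .mode("append")
      .parquet("hdfs:///warehouse/events_parquet/dt=2018-10-12")

    spark.stop()
    // A separate cleanup step would delete the source Avro files once the
    // Parquet output has been verified.
  }
}

Run per partition, the same job also covers the "merge smaller files into larger ones" step Rickard mentions below.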
Mike

On Fri, Oct 12, 2018 at 12:20 AM Rickard Cardell <[email protected]> wrote:
>
> On Fri, Apr 20, 2018 at 20:49 Nitin Kumar <[email protected]> wrote:
>
>> Hi All,
>>
>> I am using Flume v1.8, in which the Flume agent consists of a Kafka Channel and
>> an HDFS Sink.
>> I am able to write data as Avro files on HDFS into an external Hive table,
>> but the problem is that whenever Flume gets restarted it closes the current file
>> and opens a new one, which leaves me with many small files. (Data is
>> partitioned by date.)
>>
>> Can't Flume append to an existing file to avoid creating a new one?
>
> Hi
> No, not with hdfs-sink at least.
>
>> Also, how can I solve this problem, which leads to the creation of too many
>> small files?
>
> We also used the hdfs-sink, but because of the high maintenance we switched to
> the hbase-sink instead, which also gave us deduplication. The major drawback is
> that it requires an extra step, an HBase-to-HDFS job.
>
> Your many-small-files problem might be solved with an extra step, e.g. an
> Oozie job that merges smaller files into larger ones.
>
> That would also solve the problem of the leftover temp files that Flume
> doesn't clean up in some circumstances.
>
> /Rickard
>
>> Any help would be appreciated.
>>
>> Regards,
>> Nitin Kumar
