HDFS small file generation problem

nibiau Sun, 27 Sep 2015 06:37:03 -0700

Hello,
I'm still investigating the small-file problem caused by my Spark Streaming 
jobs.
These jobs receive a lot of small events (10 KB on average), and I have to 
store them in HDFS so that Pig jobs can process them on demand.
The problem is that this generates a lot of small files in HDFS (several 
million), which can be problematic.
I looked into using HBase or archive files, but in the end I don't want to go 
that way.
So, what about this solution?
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into one big daily file
- I launch my Pig jobs on this big file
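
To make the nightly merge step concrete, here is a minimal sketch. It uses a 
local directory as a stand-in for HDFS, and the function name, directory layout 
and file names are all hypothetical; on a real cluster the same idea would go 
through an HDFS client rather than the local filesystem.

```python
import shutil
import tempfile
from pathlib import Path


def merge_small_files(src_dir: Path, daily_file: Path) -> int:
    """Concatenate every file in src_dir into daily_file, then delete the inputs.

    daily_file must live outside src_dir. Returns the number of files merged.
    """
    small_files = sorted(p for p in src_dir.iterdir() if p.is_file())
    with daily_file.open("wb") as out:
        for path in small_files:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)
    # Remove the inputs only after the merged file has been written and closed.
    for path in small_files:
        path.unlink()
    return len(small_files)


if __name__ == "__main__":
    # Hypothetical layout: one "incoming" directory filled by the streaming job.
    work = Path(tempfile.mkdtemp())
    incoming = work / "incoming"
    incoming.mkdir()
    for i in range(3):
        (incoming / f"event-{i:05d}.txt").write_text(f"event {i}\n")
    merged = merge_small_files(incoming, work / "daily-2015-09-27.txt")
    print(merged, "files merged")
```

The inputs are deleted only once the daily file is complete, so a failed merge 
leaves the small files in place and the job can simply be rerun.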


Another question I have:
- Is it possible to append my events on the fly to one big (daily) file?
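
What I have in mind for the append variant is roughly the following sketch. 
Again the local filesystem stands in for HDFS, and the function name and the 
file naming scheme are hypothetical; whether the equivalent works on HDFS 
depends on append being supported on the cluster.

```python
import tempfile
from datetime import date
from pathlib import Path


def append_event(base_dir: Path, day: date, event: bytes) -> Path:
    """Append one newline-terminated event to the daily file for `day`."""
    daily_file = base_dir / f"events-{day.isoformat()}.log"
    # Mode "ab" creates the file on first use and appends on later calls.
    with daily_file.open("ab") as out:
        out.write(event.rstrip(b"\n") + b"\n")
    return daily_file


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    today = date(2015, 9, 27)
    append_event(base, today, b"first event")
    path = append_event(base, today, b"second event")
    print(path.read_bytes())
```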

Thanks a lot
Nicolas
