Hello,

I'm still investigating the small-file problem caused by my Spark Streaming jobs. These jobs receive a lot of small events (about 10 KB on average), and I have to store them in HDFS so that Pig jobs can process them on demand.

The problem is that I end up generating a lot of small files in HDFS (several million), which can be problematic. I looked into using HBase or archive files, but in the end I would rather not go that way.

So, what about this solution:
- Spark Streaming generates several million small files in HDFS on the fly
- Each night, I merge them into one big daily file
- I run my Pig jobs on this big file
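To make the nightly merge step concrete, here is a minimal sketch in Python of what I have in mind. It is only an illustration: the local filesystem stands in for HDFS, and the function name `merge_small_files` and the `part-*` file pattern are assumptions of mine, not an existing API. It also assumes every event record ends with a newline, so that records stay separable after concatenation.

```python
import glob
import os
import shutil


def merge_small_files(input_dir, output_path, pattern="part-*"):
    """Concatenate every file matching `pattern` in `input_dir` into `output_path`.

    Hypothetical helper: local paths stand in for HDFS paths here.
    Returns the number of small files that were merged.
    """
    # Sort the inputs so that the merge order is deterministic.
    small_files = sorted(glob.glob(os.path.join(input_dir, pattern)))
    with open(output_path, "wb") as merged:
        for path in small_files:
            with open(path, "rb") as src:
                # Stream the bytes across without loading a whole file in memory.
                shutil.copyfileobj(src, merged)
    return len(small_files)
```

A nightly job would call something like `merge_small_files("/data/events/2015-01-01", "/data/daily/2015-01-01.log")` (both paths are made up) and delete the small files only once the merged file has been written successfully.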
Another question: is it possible to append to a big (daily) file, adding my events to it on the fly?

Thanks a lot,
Nicolas