Re: HDFS small file generation problem

nibiau Fri, 02 Oct 2015 09:38:27 -0700

OK, thanks, but can I also update existing data instead of only inserting new data?

----- Original Message -----
From: "Brett Antonides" <banto...@gmail.com>
To: user@spark.apache.org
Sent: Friday, October 2, 2015 18:18:18
Subject: Re: HDFS small file generation problem

I had a very similar problem and solved it with Hive and ORC files, using the
Spark SQLContext.
* Create a table in Hive stored as ORC (I recommend partitioning it
too)
* Use SQLContext.sql to insert data into the table
* Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge
your many small files into larger files sized for your HDFS block size
* Since the CONCATENATE command operates on files in place, it is transparent to
any downstream processing
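
The steps above can be sketched as the HiveQL statements one might pass to
SQLContext.sql. This is only an illustration, not the exact setup described
in the message: the table name (events), the columns (message_id, payload),
the partition column (dt), and the staging table (staging_events) are all
hypothetical, and the sketch only builds the statement strings so it runs
without a cluster. It assumes a Hive-enabled context would be used to
execute them.

```python
# Sketch: build the HiveQL statements for the create / insert / concatenate
# cycle. All table, column, and partition names are hypothetical.

def create_table_sql(table, columns, partition_col):
    """CREATE TABLE statement for a partitioned table stored as ORC."""
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    return (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
            f"PARTITIONED BY ({partition_col} STRING) STORED AS ORC")

def insert_sql(table, partition_col, partition_value, select_cols, source):
    """INSERT statement that loads one partition from a source table."""
    return (f"INSERT INTO TABLE {table} "
            f"PARTITION ({partition_col}='{partition_value}') "
            f"SELECT {', '.join(select_cols)} FROM {source}")

def concatenate_sql(table, partition_col, partition_value):
    """ALTER TABLE ... CONCATENATE statement for one partition."""
    return (f"ALTER TABLE {table} "
            f"PARTITION ({partition_col}='{partition_value}') CONCATENATE")

statements = [
    create_table_sql("events",
                     [("message_id", "STRING"), ("payload", "STRING")], "dt"),
    insert_sql("events", "dt", "2015-10-02",
               ["message_id", "payload"], "staging_events"),
    concatenate_sql("events", "dt", "2015-10-02"),
]

for stmt in statements:
    # With a live Hive-enabled context this would be: sqlContext.sql(stmt)
    print(stmt)
```

Running the concatenate statement per partition keeps each merge small; how
often to run it depends on how quickly small files accumulate.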

Cheers, 
Brett

On Fri, Oct 2, 2015 at 3:48 PM, < nib...@free.fr > wrote: 


Hello,
Yes, but:
- In the Java API I can't find an API to create an HDFS archive.
- As soon as I receive a message (with a messageID), I need to replace the
existing file with the new one (the file name is the messageID). Is that
possible with an archive?

Thanks,
Nicolas

----- Original Message -----
From: "Jörn Franke" < jornfra...@gmail.com >
To: nib...@free.fr , "user" < user@spark.apache.org >
Sent: Monday, September 28, 2015 23:53:56
Subject: Re: HDFS small file generation problem

Use a Hadoop archive.

On Sun, Sep 27, 2015 at 15:36, < nib...@free.fr > wrote:


Hello,
I'm still investigating the small-file problem caused by my Spark Streaming
jobs.
My Spark Streaming jobs receive a lot of small events (about 10 KB on average),
and I have to store them in HDFS so that Pig jobs can process them
on demand.
The problem is that I generate a lot of small files in HDFS (several
million), which can be problematic.
I looked into using HBase or archive files, but in the end I don't want to go
that way.
So, what about this solution:
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into one big daily file
- I launch my Pig jobs on this big file?
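
The nightly merge step could look roughly like the sketch below. It is only
an illustration under simplifying assumptions: a local directory stands in
for the HDFS directory, each small file is assumed to hold one
newline-terminated event, and the function and file names (merge_small_files,
incoming, daily.txt) are hypothetical.

```python
import os
import shutil
import tempfile

def merge_small_files(src_dir, dest_path):
    """Concatenate every regular file in src_dir into dest_path, in name order.

    dest_path must live outside src_dir. Returns the number of files merged.
    """
    names = sorted(
        n for n in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, n))
    )
    with open(dest_path, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return len(names)

# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as root:
    incoming = os.path.join(root, "incoming")
    os.mkdir(incoming)
    for i in range(3):
        with open(os.path.join(incoming, f"msg-{i}.txt"), "w") as f:
            f.write(f"event {i}\n")

    daily = os.path.join(root, "daily.txt")
    merged = merge_small_files(incoming, daily)
    with open(daily) as f:
        content = f.read()

print(merged, "files merged")
```

On a real cluster the same idea would run against HDFS rather than the local
file system, and the small files would be removed only after the daily file
has been written successfully.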

I have another question:
- Is it possible to append my events on the fly to a big (daily) file?

Thanks a lot,
Nicolas 

--------------------------------------------------------------------- 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
For additional commands, e-mail: user-h...@spark.apache.org 

