Re: RE : Re: HDFS small file generation problem
Hive was originally not designed for updates, because it was purely warehouse-focused; the most recent version can do updates, deletes, etc. in a transactional way. However, you may also use HBase with Phoenix for that, depending on your other functional and non-functional requirements.

On Sat, 3 Oct 2015 at 16:48, wrote:

> Thanks a lot. Why did you say "the most recent version"?
>
> ----- Original Message -----
> From: "Jörn Franke"
> To: "nibiau"
> Cc: banto...@gmail.com, user@spark.apache.org
> Sent: Saturday, 3 October 2015 13:56:43
> Subject: Re: RE : Re: HDFS small file generation problem
>
> Yes, the most recent version, or you can use Phoenix on top of HBase. I
> recommend trying out both and seeing which one is the most suitable.
>
> On Sat, 3 Oct 2015 at 13:13, nibiau < nib...@free.fr > wrote:
>
> Hello,
> Thanks. If I understand correctly, Hive can be usable in my context?
>
> Nicolas
>
> Sent from my Samsung mobile device
>
> Jörn Franke < jornfra...@gmail.com > wrote:
>
> If you use transactional tables in Hive together with INSERT, UPDATE and
> DELETE, then it does the "concatenate" for you automatically at regular
> intervals. Currently this works only with tables in ORC format (STORED AS ORC).
>
> On Sat, 3 Oct 2015 at 11:45, < nib...@free.fr > wrote:
>
> Hello,
> So, is Hive a solution for my need?
> - I receive small messages (10KB) identified by an ID (a product ID, for example)
> - Each message I receive is the latest picture of my product ID, so I basically
> just want to store the latest picture of each product inside HDFS,
> in order to process it in batch later.
>
> If I use Hive, I suppose I have to INSERT and UPDATE records and
> periodically CONCATENATE.
> After a CONCATENATE, I suppose the records are still updatable.
>
> Thanks for confirming whether this can be a solution for my use case, or for any other idea.
>
> Thanks a lot!
> Nicolas
>
> ----- Original Message -----
> From: "Jörn Franke" < jornfra...@gmail.com >
> To: nib...@free.fr, "Brett Antonides" < banto...@gmail.com >
> Cc: user@spark.apache.org
> Sent: Saturday, 3 October 2015 11:17:51
> Subject: Re: HDFS small file generation problem
>
> You can update data in Hive if you use the ORC format.
>
> On Sat, 3 Oct 2015 at 10:42, < nib...@free.fr > wrote:
>
> Hello,
> Finally Hive is not a solution, as I cannot update the data.
> And for an archive file I think it would be the same issue.
> Any other solutions?
>
> Nicolas
>
> ----- Original Message -----
> From: nib...@free.fr
> To: "Brett Antonides" < banto...@gmail.com >
> Cc: user@spark.apache.org
> Sent: Friday, 2 October 2015 18:37:22
> Subject: Re: HDFS small file generation problem
>
> Ok thanks, but can I also update data instead of inserting data?
>
> ----- Original Message -----
> From: "Brett Antonides" < banto...@gmail.com >
> To: user@spark.apache.org
> Sent: Friday, 2 October 2015 18:18:18
> Subject: Re: HDFS small file generation problem
>
> I had a very similar problem and solved it with Hive and ORC files using
> the Spark SQLContext:
> * Create a table in Hive stored as an ORC file (I recommend using
> partitioning too)
> * Use SQLContext.sql to INSERT data into the table
> * Use SQLContext.sql to periodically run ALTER TABLE ... CONCATENATE to
> merge your many small files into larger files optimized for your HDFS block size
> * Since the CONCATENATE command operates on files in place, it is
> transparent to any downstream processing
>
> Cheers,
> Brett
>
> On Fri, Oct 2, 2015 at 3:48 PM, < nib...@free.fr > wrote:
>
> Hello,
> Yes, but:
> - In the Java API I don't find an API to create an HDFS archive
> - As soon as I receive a message (with a messageID) I need to replace the
> old existing file with the new one (the file name being the messageID); is
> that possible with an archive?
>
> Thanks
> Nicolas
>
> ----- Original Message -----
> From: "Jörn Franke" < jornfra...@gmail.com >
> To: nib...@free.fr, "user" < user@spark.apache.org >
> Sent: Monday, 28 September 2015 23:53:56
> Subject: Re: HDFS small file generation problem
>
> Use hadoop archive.
>
> On Sun, 27 Sep 2015 at 15:36, < nib...@free.fr > wrote:
>
> Hello,
> I'm still investigating the small-file generation problem created by my
> Spark Streaming jobs.
> Indeed, my Spark Streaming jobs receive a lot of small events (avg 10 KB),
> and I have to store them inside HDFS in order to process them with Pig jobs
> on demand.
> The problem is that I generate a lot of small files in HDFS (several
> millions), and that can be problematic.
> I investigated using HBase or an archive file, but in the end I don't want
> to do that.
> So, what about this solution:
> - Spark Streaming generates on the fly several millions of small files in HDFS
> - Each night I merge them into one big daily file
> - I launch my Pig jobs on this big file
>
> Another question I have:
> - Is it possible to append to a big (daily) file by adding my events on the fly?
>
> Thanks a lot
> Nicolas
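The Hive/ORC workflow suggested in this thread (create an ORC table, insert through SQLContext.sql, periodically run ALTER TABLE ... CONCATENATE) could be sketched as follows. This only builds the SQL strings; in a real Spark job each one would be passed to sqlContext.sql(...). The table, column, and partition names are invented for illustration, and the exact INSERT form depends on your Hive/Spark version.

```python
def orc_workflow_statements(table="product_events", day="2015-10-03"):
    """Build the create/insert/concatenate SQL for one partition of an ORC table.

    Table, columns, and partition layout are hypothetical; pass each
    statement to sqlContext.sql(...) in a real Spark job.
    """
    create = ("CREATE TABLE IF NOT EXISTS {t} "
              "(product_id STRING, payload STRING) "
              "PARTITIONED BY (event_day STRING) STORED AS ORC").format(t=table)
    # INSERT ... SELECT from a staging table or registered DataFrame (schematic).
    insert = ("INSERT INTO TABLE {t} PARTITION (event_day='{d}') "
              "SELECT product_id, payload FROM staging_events").format(t=table, d=day)
    # Run periodically: merges the partition's many small ORC files in place,
    # which is transparent to downstream readers.
    concatenate = ("ALTER TABLE {t} PARTITION (event_day='{d}') "
                   "CONCATENATE").format(t=table, d=day)
    return [create, insert, concatenate]

for stmt in orc_workflow_statements():
    print(stmt)
```

Partitioning (here by day) matters because CONCATENATE is run per partition, so you only pay the merge cost for the partitions still receiving small files.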
Re: RE : Re: HDFS small file generation problem
Thanks a lot. Why did you say "the most recent version"?
Re: RE : Re: HDFS small file generation problem
Yes, the most recent version, or you can use Phoenix on top of HBase. I recommend trying out both and seeing which one is the most suitable.
RE : Re: HDFS small file generation problem
Hello,
Thanks. If I understand correctly, Hive can be usable in my context?

Nicolas

Sent from my Samsung mobile device
Re: HDFS small file generation problem
Another alternative is HBase with Phoenix as the SQL layer on top.
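One detail worth noting about the HBase + Phoenix route for this use case: Phoenix has no separate UPDATE statement; its UPSERT INTO writes the row for a given primary key, overwriting any previous version, which maps directly onto "keep only the latest message per product ID". A minimal sketch with an invented table name follows; the statements would be executed over JDBC or the Phoenix query server (not shown), and real code should bind the `?` placeholders per message rather than format values into the string.

```python
def phoenix_upsert_sql(table="PRODUCT_LAST_PICTURE"):
    """Build hypothetical Phoenix DDL plus the per-message UPSERT.

    UPSERT INTO either inserts a new row or overwrites the existing row
    with the same primary key, so repeated messages for one ID keep only
    the latest payload.
    """
    create = ("CREATE TABLE IF NOT EXISTS {t} "
              "(ID VARCHAR PRIMARY KEY, PAYLOAD VARCHAR)").format(t=table)
    # The ? placeholders are bound per message over JDBC.
    upsert = "UPSERT INTO {t} (ID, PAYLOAD) VALUES (?, ?)".format(t=table)
    return create, upsert

create, upsert = phoenix_upsert_sql()
print(create)
print(upsert)
```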
Re: HDFS small file generation problem
If you use transactional tables in Hive together with INSERT, UPDATE and DELETE, then it does the "concatenate" for you automatically at regular intervals. Currently this works only with tables in ORC format (STORED AS ORC).
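As a concrete illustration of the transactional-table setup described above, here is a sketch for the Hive 1.x generation. The table name and bucket count are invented, and in practice the settings live in hive-site.xml rather than per session: ACID tables must be stored as ORC, be bucketed, and carry the 'transactional' property, and the metastore-side compactor does the automatic merging.

```python
# Hypothetical ACID table: ORC storage, bucketing, and the transactional
# property are all required for Hive 1.x ACID tables.
ddl = """
CREATE TABLE product_events (
  product_id STRING,
  payload    STRING
)
CLUSTERED BY (product_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true')
""".strip()

# Settings the Hive deployment needs for ACID plus automatic compaction
# (the background process that plays the role of "concatenate"):
settings = {
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.support.concurrency": "true",
    "hive.compactor.initiator.on": "true",
    "hive.compactor.worker.threads": "1",
}

print(ddl)
for key, value in settings.items():
    print("SET {}={};".format(key, value))
```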
Re: HDFS small file generation problem
Hello,
So, is Hive a solution for my need?
- I receive small messages (10KB) identified by an ID (a product ID, for example)
- Each message I receive is the latest picture of my product ID, so I basically just want to store the latest picture of each product inside HDFS, in order to process it in batch later.

If I use Hive, I suppose I have to INSERT and UPDATE records and periodically CONCATENATE.
After a CONCATENATE, I suppose the records are still updatable.

Thanks for confirming whether this can be a solution for my use case, or for any other idea.

Thanks a lot!
Nicolas
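Whatever the storage (Hive ACID, HBase, or per-ID files), the semantics being asked for are a keyed upsert. A storage-free sketch, with an in-memory dict standing in for HDFS and invented field names:

```python
def apply_message(state, product_id, picture):
    """Keep only the most recent picture for each product ID."""
    state[product_id] = picture  # a newer message overwrites the older one
    return state

state = {}
apply_message(state, "p1", "picture-v1")
apply_message(state, "p2", "picture-v1")
apply_message(state, "p1", "picture-v2")  # replaces p1's earlier picture

print(state)  # {'p1': 'picture-v2', 'p2': 'picture-v1'}
```

The batch job then only ever sees one record per product ID, which is why an update-capable store (rather than append-only small files) fits this use case.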
Re: HDFS small file generation problem
You can update data in Hive if you use the ORC format.
Re: HDFS small file generation problem
Hello Nicolas,

The Hive solution just concatenates the files; it does not alter or change the records.
Re: HDFS small file generation problem
Hello,
Finally, Hive is not a solution for me, as I cannot update the data. And for the archive file I think it would be the same issue. Any other solutions?

Nicolas
Re: HDFS small file generation problem
Ok, thanks, but can I also update data instead of inserting it?
Re: HDFS small file generation problem
I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext:

* Create a table in Hive stored as an ORC file (I recommend using partitioning too)
* Use SQLContext.sql to insert data into the table
* Use SQLContext.sql to periodically run ALTER TABLE ... CONCATENATE to merge your many small files into larger files optimized for your HDFS block size
* Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing

Cheers,
Brett
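A HiveQL sketch of the steps above (table, column, and partition names are hypothetical; CONCATENATE works on ORC tables in Hive 0.11+, and the single-row INSERT ... VALUES form needs Hive 0.14+ — from Spark you would more typically insert from a DataFrame or temp table via SQLContext.sql):

```sql
-- Hypothetical events table, partitioned by day, stored as ORC
CREATE TABLE events (
  product_id STRING,
  payload    STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Insert incoming events (e.g. issued through SQLContext.sql)
INSERT INTO TABLE events PARTITION (dt = '2015-10-03')
VALUES ('prod-42', '...');

-- Periodically merge the partition's many small files in place
ALTER TABLE events PARTITION (dt = '2015-10-03') CONCATENATE;
```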
Re: HDFS small file generation problem
Hello,
Yes, but:
- In the Java API I don't find an API to create a HDFS archive
- As soon as I receive a message (with a messageID) I need to replace the old existing file with the new one (the file name being the messageID); is that possible with an archive?

Tks
Nicolas
Re: HDFS small file generation problem
Use hadoop archive.
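For reference, a Hadoop Archive (HAR) is built with the `hadoop archive` tool; this is a CLI sketch with hypothetical paths, and it needs a running cluster. Note that a HAR file is immutable once written, so individual files inside it cannot be replaced afterwards:

```
# Pack the small files under /user/nicolas/events (hypothetical path)
# into a single archive in /user/nicolas/archives
hadoop archive -archiveName events-2015-09.har -p /user/nicolas events /user/nicolas/archives

# The archive contents stay readable through the har:// scheme
hadoop fs -ls har:///user/nicolas/archives/events-2015-09.har/events
```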
Re: HDFS small file generation problem
You could try a couple of things:

a) Use Kafka for stream processing: store the current incoming events and the Spark Streaming job output in Kafka rather than on HDFS, and dual-write to HDFS too in a micro-batched mode, e.g. every x minutes. Kafka is better suited to processing lots of small events.

b) Coalesce the small files on HDFS into a big hourly or daily file. Use HDFS partitioning to ensure that your Pig job reads the smallest number of partitions.

Deenar
Re: HDFS small file generation problem
I would suggest not writing small files to HDFS. Rather, you can hold them in memory, maybe off-heap, and then flush them to HDFS using another job, similar to https://github.com/ptgoetz/storm-hdfs (not sure if Spark already has something like it).

--
Best Regards,
Ayan Guha
HDFS small file generation problem
Hello,
I'm still investigating the small file generation problem created by my Spark Streaming jobs. My Spark Streaming jobs receive a lot of small events (avg 10 KB), and I have to store them in HDFS in order to process them with Pig jobs on demand. The problem is that I generate a lot of small files in HDFS (several million), and that can be problematic. I investigated using HBase or an archive file, but in the end I don't want to do that.

So, what about this solution:
- Spark Streaming generates several million small files on the fly in HDFS
- Each night I merge them into a big daily file
- I launch my Pig jobs on this big file

Another question I have:
- Is it possible to append to a big (daily) file by adding my events on the fly?

Tks a lot
Nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
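The nightly-merge idea above can be sketched as follows. This is a minimal local-filesystem stand-in for illustration only (a real job would go through the HDFS client API or a Spark/MapReduce job; the directory and file names are hypothetical — files are assumed to be named after their messageID, so there is already only one, latest, file per ID):

```python
import os
import glob

def merge_small_files(input_dir, output_path):
    """Concatenate every small event file in input_dir into one big file.

    Each small file is named after its messageID, so a later write with the
    same ID has already replaced the earlier file; merging therefore keeps
    only the latest event per ID. Small files are removed once merged.
    """
    with open(output_path, "wb") as out:
        for path in sorted(glob.glob(os.path.join(input_dir, "*"))):
            with open(path, "rb") as f:
                out.write(f.read())
                out.write(b"\n")  # one event per line in the merged file
            os.remove(path)  # drop the small file once it is merged

# Example: two small event files merged into one daily file
os.makedirs("events", exist_ok=True)
with open("events/prod-1", "w") as f:
    f.write("event for product 1")
with open("events/prod-2", "w") as f:
    f.write("event for product 2")

merge_small_files("events", "daily-2015-09-27.txt")
```

The appending question is the hard part on real HDFS: HDFS files are append-only at best, so "replacing" an event inside a big daily file is not possible in place, which is why the compaction-job approach (or Hive/HBase) keeps coming up in this thread.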