"As per my understanding, storing 5minutes file means we could not create
RDD more granular than 5minutes."

This depends on the file format. Many file formats are splittable (Parquet,
for example), meaning that a reader can seek to arbitrary points within a
file instead of having to consume it as one unit.
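For example, here is a minimal sketch assuming the files are Parquet, that a
SQLContext with the DataFrame API (Spark 1.4+) is available, and that records
carry a timestamp column; the column name "eventTime" is an assumption, not
something stated in the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FiveMinuteFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("five-minute-files"))
    val sqlContext = new SQLContext(sc)

    // A glob path loads every 5-minute file for one day in a single call;
    // Spark lists the matching files and plans input splits from them.
    val day = sqlContext.read.parquet("/analytics/2015/05/02/*")

    // Granularity after loading is per record, not per file: a filter on the
    // (assumed) "eventTime" column can select a window narrower than 5 minutes.
    val oneMinute = day.filter(
      "eventTime >= '2015-05-02 13:52:00' AND eventTime < '2015-05-02 13:53:00'")
    println(oneMinute.count())

    sc.stop()
  }
}

The point is only that file boundaries do not limit the granularity of the
data you can work with once it is loaded.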

2015-05-05 12:45 GMT-04:00 Rendy Bambang Junior <rendy.b.jun...@gmail.com>:

> Let's say I am storing my data in HDFS with the folder structure and file
> partitioning shown below:
> /analytics/2015/05/02/partition-2015-05-02-13-50-0000
> Note that a new file is created every 5 minutes.
>
> As per my understanding, storing 5-minute files means we could not create
> an RDD more granular than 5 minutes.
>
> On the other hand, when we want to aggregate monthly data, the number of
> files will be enormous (around 84,000 files).
>
> My question is: what are the considerations for deciding that the number of
> files to be loaded into an RDD is just 'too many'? Is 84,000 'too many' files?
>
> One thing that comes to my mind is the overhead when Spark tries to open
> each file, but I'm not sure whether that is a valid concern.
>
> Rendy
>