Hi All,


I have a set of time-series data files in Parquet format. The files for
each day follow a naming convention, but I will not know in advance how
many files there are for a given day:


20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
...
201501010a.parq
...


Now I am trying to write a program to process the data, and I want to make
sure each day's data ends up in one partition. Of course I could load
everything into one big RDD and repartition it, but that would be very
slow. Since I already know the time series from the file names, is it
possible to load the data into an RDD while preserving the partitioning? I
know I can preserve one partition per file, but is there any way to load
the RDD and preserve partitioning based on a set of files, i.e. one
partition covering multiple files?
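For illustration, here is a minimal sketch (plain Python, assuming the
naming convention shown above; the function name is mine) of how the file
names could first be grouped into per-day sets, so that each group can
then be loaded as one logical unit:

```python
import re
from collections import defaultdict

def group_files_by_day(filenames):
    """Group Parquet file names like '20150102c.parq' by their date prefix."""
    groups = defaultdict(list)
    pattern = re.compile(r"^(\d{8})[a-z]+\.parq$")  # YYYYMMDD + suffix letter(s)
    for name in filenames:
        m = pattern.match(name)
        if m:
            groups[m.group(1)].append(name)
    return dict(groups)

files = ["20150101a.parq", "20150101b.parq",
         "20150102a.parq", "20150102b.parq", "20150102c.parq"]
print(group_files_by_day(files))
# → {'20150101': ['20150101a.parq', '20150101b.parq'],
#    '20150102': ['20150102a.parq', '20150102b.parq', '20150102c.parq']}
```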


I think it should be possible, because when I load an RDD from 100 files
(assume they span 100 days), I will get 100 partitions (if I disable file
splitting when loading). Then I could use a special coalesce to
repartition the RDD? But I don't know whether that is possible in Spark
today.
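As an alternative to a special coalesce, one approach I have been
considering (a sketch under my own assumptions, not tested on a real
cluster) is to key each record by its day and pass a custom partition
function to the pair-RDD `partitionBy` method, so that every record of one
day hashes to the same partition. The partition function itself is plain
Python:

```python
# make_day_partitioner is a name I made up for this sketch; the returned
# function is intended to be passed as the partitionFunc argument of
# RDD.partitionBy(numPartitions, partitionFunc) on a (day, record) pair RDD.
def make_day_partitioner(days):
    """Return a partition function mapping a 'YYYYMMDD' key to a partition index."""
    index = {day: i for i, day in enumerate(sorted(days))}
    return lambda day: index[day]

days = ["20150101", "20150102", "20150103"]
part = make_day_partitioner(days)
print([part(d) for d in days])  # → [0, 1, 2]: one partition index per day
```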


Regards,


Shuai 
