Hi Shuai,
It should certainly be possible to do it that way, but I would recommend
against it. If you look at HadoopRDD, it's doing all sorts of little
book-keeping that you would most likely want to mimic, e.g., tracking the
number of bytes and records that are read, setting up all the hadoop
It should be *possible* to do what you want ... but if I understand you
right, there isn't really any very easy way to do it. I think you would
need to write your own subclass of RDD, which has its own logic on how the
input files get divided among partitions. You can probably subclass
Hi All,
If I have a set of time series data files in parquet format, and
the data for each day are stored following a naming convention, but I will not
know how many files there are for one day.
20150101a.parq
20150101b.parq
20150102a.parq
20150102b.parq
20150102c.parq
.
201501010a.parq
.
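
To make the naming convention concrete, here is a rough sketch (plain Python, no Spark involved) of how the file names could be grouped by day before deciding how to load them. It assumes the first 8 characters of each name are the date (yyyyMMdd); the function name group_files_by_day is made up for illustration and is not part of any Spark API.

```python
from collections import defaultdict


def group_files_by_day(file_names):
    """Group file names by their leading 8-character date prefix."""
    groups = defaultdict(list)
    for name in file_names:
        # Assumption: the first 8 characters are the day, e.g. "20150101".
        day = name[:8]
        groups[day].append(name)
    # Sort days and the names within each day for a deterministic result.
    return {day: sorted(names) for day, names in sorted(groups.items())}


files = [
    "20150101a.parq",
    "20150101b.parq",
    "20150102a.parq",
    "20150102b.parq",
    "20150102c.parq",
]
by_day = group_files_by_day(files)
for day, names in by_day.items():
    print(day, names)
```

This only works out which files belong to which day; how those groups then map onto partitions is a separate question.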