Re: How to preserve/preset partition information when load time series data?

2015-03-16 Thread Imran Rashid
Hi Shuai, It should certainly be possible to do it that way, but I would recommend against it. If you look at HadoopRDD, it's doing all sorts of little bookkeeping that you would most likely want to mimic, e.g., tracking the number of bytes and records that are read, setting up all the hadoop

Re: How to preserve/preset partition information when load time series data?

2015-03-11 Thread Imran Rashid
It should be *possible* to do what you want ... but if I understand you right, there isn't really any easy way to do it. I think you would need to write your own subclass of RDD, which has its own logic for how the input files get divided among partitions. You can probably subclass

How to preserve/preset partition information when load time series data?

2015-03-09 Thread Shuai Zheng
Hi All, Suppose I have a set of time series data files in Parquet format. The data for each day are stored using a naming convention, but I will not know how many files there are for one day. 20150101a.parq 20150101b.parq 20150102a.parq 20150102b.parq 20150102c.parq . 201501010a.parq .
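
For illustration only, the grouping that such a layout calls for (all files of one day ending up in the same partition) might be planned roughly as below. This is a plain Python sketch, not Spark code and not something posted in the thread: the function name `group_files_by_day` and the pattern `FILE_PATTERN` are hypothetical, and the naming convention (an 8-digit date, a letter suffix, then `.parq`) is an assumption read off the example file names. A custom RDD subclass would need to produce an equivalent grouping in its `getPartitions` method.

```python
import re
from collections import defaultdict

# Assumed naming convention: 8-digit date (YYYYMMDD), a letter suffix, ".parq".
FILE_PATTERN = re.compile(r"^(\d{8})[a-z]+\.parq$")


def group_files_by_day(file_names):
    """Return (day, files) pairs sorted by day, one pair per intended partition."""
    by_day = defaultdict(list)
    for name in file_names:
        match = FILE_PATTERN.match(name)
        if match is None:
            # Fail loudly rather than guess which day an odd name belongs to.
            raise ValueError(f"unexpected file name: {name}")
        by_day[match.group(1)].append(name)
    return [(day, sorted(files)) for day, files in sorted(by_day.items())]


# File names taken from the example listing; the order is deliberately shuffled.
files = [
    "20150102b.parq",
    "20150101a.parq",
    "20150102a.parq",
    "20150101b.parq",
    "20150102c.parq",
]

partitions = group_files_by_day(files)
for index, (day, names) in enumerate(partitions):
    print(index, day, names)
```

The number of files per day does not need to be known in advance here: each distinct date prefix becomes one group, however many files share it.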