Re: How to preserve/preset partition information when load time series data?

2015-03-16 Thread Imran Rashid
Hi Shuai,

It should certainly be possible to do it that way, but I would recommend
against it.  If you look at HadoopRDD, it's doing all sorts of little
book-keeping that you would most likely want to mimic, e.g., tracking the
number of bytes & records that are read, setting up all the Hadoop
configuration, splits, readers, scheduling tasks for locality, etc.  That's
why I suggested that really you want to just create a small variant of
HadoopRDD.
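
For concreteness, here's a rough, untested sketch of the kind of thing I
mean -- a standalone RDD inspired by HadoopRDD rather than a true subclass
of it.  DailyFilesRDD, DayPartition, and the yyyyMMdd-prefix parsing are
names I'm making up for illustration, and it skips the record-reader,
metrics, and locality book-keeping a real variant would want:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import scala.io.Source

// One partition holds every file for a single day.
class DayPartition(override val index: Int, val files: Seq[String])
  extends Partition

class DailyFilesRDD(sc: SparkContext, dir: String)
    extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    val fs = FileSystem.get(new Configuration())
    val names = fs.listStatus(new Path(dir)).map(_.getPath.toString)
    // Files look like 20150102c.parq, so the first 8 chars of the
    // base name identify the day (adjust the parsing to your naming).
    names.groupBy(n => new Path(n).getName.take(8))
      .toSeq.sortBy(_._1).zipWithIndex
      .map { case ((_, dayFiles), i) => new DayPartition(i, dayFiles) }
      .toArray[Partition]
  }

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val fs = FileSystem.get(new Configuration())
    split.asInstanceOf[DayPartition].files.iterator.flatMap { f =>
      // A real version would use a proper parquet reader here; streaming
      // raw lines just keeps the sketch short.
      Source.fromInputStream(fs.open(new Path(f))).getLines()
    }
  }
}

You'd also want to override getPreferredLocations so the scheduler can
still place tasks near the blocks -- that's exactly the kind of
book-keeping HadoopRDD gives you for free.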

hope that helps,
Imran


On Sat, Mar 14, 2015 at 11:10 AM, Shawn Zheng szheng.c...@gmail.com wrote:

 Sorry for the late reply.

 But I just thought of one solution: if I load just the file names themselves
 (not the contents of the files), then I have an RDD[key, Iterable[filename]],
 and I can run mapPartitionsToPair on it with preservesPartitioning=true.

 Do you think that is a right solution? I am not sure whether it has
 potential issues if I try to fake/enforce the partitioning in my own way.
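
 Something like this sketch is what I have in mind (rough and untested;
 readFile and the yyyyMMdd prefix parsing are just placeholders):

 import org.apache.spark.HashPartitioner

 val byDay = sc.parallelize(allFileNames)      // names only, no file contents
   .map(f => (f.take(8), f))                   // key = yyyyMMdd prefix
   .groupByKey(new HashPartitioner(numDays))   // RDD[(String, Iterable[String])]

 val rows = byDay.mapPartitions(
   iter => iter.flatMap { case (day, files) => files.iterator.flatMap(readFile) },
   preservesPartitioning = true)               // keep the day-based partitioning

 (One thing I notice: with a HashPartitioner two different days can still
 hash into the same partition, so each partition holds one or more whole
 days rather than exactly one.)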

 Regards,

 Shuai

 On Wed, Mar 11, 2015 at 8:09 PM, Imran Rashid iras...@cloudera.com
 wrote:

 It should be *possible* to do what you want ... but if I understand you
 right, there isn't really any very easy way to do it.  I think you would
 need to write your own subclass of RDD, which has its own logic on how the
 input files get divided among partitions.  You can probably subclass
 HadoopRDD and just modify getPartitions().  Your logic could look at the
 day in each filename to decide which partition it goes into.  You'd need to
 make corresponding changes for HadoopPartition and the compute() method.

 (Or if you can't subclass HadoopRDD directly, you can use it for
 inspiration.)

 On Mon, Mar 9, 2015 at 11:18 AM, Shuai Zheng szheng.c...@gmail.com
 wrote:

 Hi All,



 If I have a set of time series data files, they are in parquet format
 and the data for each day are stored following a naming convention, but I
 will not know how many files there are for one day.



 20150101a.parq

 20150101b.parq

 20150102a.parq

 20150102b.parq

 20150102c.parq

 …

 201501010a.parq

 …



 Now I am trying to write a program to process the data, and I want to make
 sure each day's data goes into one partition. Of course I could load
 everything into one big RDD and repartition it, but that would be very slow.
 Since I already know the day from each file name, is it possible for me to
 load the data into the RDD and also preserve the partitioning? I know I can
 preserve the partitioning by each file, but is there any way for me to load
 the RDD and preserve the partitioning based on a set of files: one partition
 per day, multiple files per partition?



 I think it is possible because when I load an RDD from 100 files (assume
 they cross 100 days), I will have 100 partitions (if I disable file
 splitting when loading). Then I could use a special coalesce to repartition
 the RDD? But I don't know whether that is possible in current Spark.
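
 (For the disable-split part, I believe setting a huge minimum split size
 on the Hadoop configuration keeps FileInputFormat from splitting files --
 a sketch, assuming standard Hadoop behavior:

 // With a huge min split size, each file becomes a single split,
 // hence one partition per file.
 sc.hadoopConfiguration.set("mapred.min.split.size", Long.MaxValue.toString)

 The special coalesce is the part I am not sure Spark supports.)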



 Regards,



 Shuai





