Re: Process time series RDD after sortByKey

2015-03-17 Thread Shawn Zheng
*Sent:* Monday, March 16, 2015 11:22 AM *To:* Shawn Zheng; user@spark.apache.org *Subject:* Re: Process time series RDD after sortByKey Hi Shuai, On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng szheng.c...@gmail.com

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
@spark.apache.org *Subject:* Re: Process time series RDD after sortByKey Hi Shuai, On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng szheng.c...@gmail.com wrote: Sorry I responded late. Zhan Zhang's solution is very interesting and I looked into it, but it is not what I want. Basically I

Re: Process time series RDD after sortByKey

2015-03-16 Thread Imran Rashid
Hi Shuai, On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng szheng.c...@gmail.com wrote: Sorry I responded late. Zhan Zhang's solution is very interesting and I looked into it, but it is not what I want. Basically I want to run the job sequentially and also gain parallelism. So if possible, if

RE: Process time series RDD after sortByKey

2015-03-16 Thread Shuai Zheng
valuable approach to me, so I am eager to learn. Regards, Shuai From: Imran Rashid [mailto:iras...@cloudera.com] Sent: Monday, March 16, 2015 11:22 AM To: Shawn Zheng; user@spark.apache.org Subject: Re: Process time series RDD after sortByKey Hi Shuai, On Sat, Mar 14, 2015

Re: Process time series RDD after sortByKey

2015-03-11 Thread Imran Rashid
this is a very interesting use case. First of all, it's worth pointing out that if you really need to process the data sequentially, you are fundamentally limiting the parallelism you can get. E.g., if you need to process the entire data set sequentially, then you can't get any parallelism. If
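
A minimal sketch of the tradeoff Imran describes (the names dayRDD and runningSum and the toy data are assumptions, not from the thread): after sortByKey the data is range-partitioned and sorted within each partition, so a mapPartitions pass is sequential in time within each partition while the partitions themselves still run in parallel.

    import org.apache.spark.{SparkConf, SparkContext}

    object PerPartitionSequential {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("per-partition-seq").setMaster("local[4]"))

        // (timestamp, value) pairs standing in for one day of time series data
        val dayRDD = sc.parallelize(
          Seq(3L -> 1.0, 1L -> 2.0, 4L -> 0.5, 2L -> 1.5), numSlices = 2)

        // Range-partitioned and sorted inside each partition
        val sorted = dayRDD.sortByKey()

        // Sequential scan inside each partition: a running sum over time,
        // kept per partition (not a single global running sum).
        val runningSum = sorted.mapPartitions { it =>
          var acc = 0.0
          it.map { case (ts, v) => acc += v; (ts, acc) }
        }

        runningSum.collect().foreach(println)
        sc.stop()
      }
    }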

Re: Process time series RDD after sortByKey

2015-03-09 Thread Zhan Zhang
Does a code flow similar to the following work for you? It processes each partition of an RDD sequentially: var iterPartition = 0; while (iterPartition < rdd.partitions.length) { val res = sc.runJob(rdd, (it: Iterator[T]) => someFunc(it), Seq(iterPartition), allowLocal = true); /* some other function after processing one partition */ iterPartition += 1 }
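
A self-contained sketch of the same pattern, under assumed names (dayRDD, a per-partition sum) rather than Zhan Zhang's verbatim code: each partition of the sorted RDD is submitted as its own job, so partition i+1 is only touched after partition i has finished. The allowLocal flag in the snippet above belongs to the Spark 1.x runJob signature and was removed in later releases, so this sketch omits it.

    import org.apache.spark.{SparkConf, SparkContext}

    object SequentialPartitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("sequential-partitions").setMaster("local[4]"))

        val dayRDD = sc.parallelize(
          Seq(3L -> 1.0, 1L -> 2.0, 4L -> 0.5, 2L -> 1.5), numSlices = 2)
        val sorted = dayRDD.sortByKey()

        var iterPartition = 0
        while (iterPartition < sorted.partitions.length) {
          // One Spark job per partition; returns an Array with a single result.
          val res = sc.runJob(
            sorted,
            (it: Iterator[(Long, Double)]) => it.map(_._2).sum,
            Seq(iterPartition))
          // Driver-side work between partitions goes here.
          println(s"partition $iterPartition -> sum ${res.head}")
          iterPartition += 1
        }
        sc.stop()
      }
    }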

Process time series RDD after sortByKey

2015-03-09 Thread Shuai Zheng
Hi All, I am processing some time series data. One day might have 500 GB, so each hour is around 20 GB of data. I need to sort the data before I start processing. Assume I can sort it successfully with dayRDD.sortByKey, but after that I might have thousands of partitions (to
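
For context, a minimal sketch of the setup described here (the input path, record layout, and the 24-partition figure are assumptions): key each record by its timestamp and sort with sortByKey; its numPartitions argument is one way to keep the resulting partition count manageable (e.g. roughly one partition per hour) instead of ending up with thousands.

    import org.apache.spark.{SparkConf, SparkContext}

    object SortDayOfTimeSeries {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("sort-day").setMaster("local[*]"))

        // Hypothetical input: "timestampMillis,value" lines for one day
        // (~500 GB in the real case; malformed lines are ignored here for brevity).
        val dayRDD = sc.textFile("hdfs:///data/timeseries/2015-03-09/*.csv")
          .map { line =>
            val Array(ts, v) = line.split(",", 2)
            (ts.toLong, v.toDouble)
          }

        // Range-partitioned and sorted by timestamp; 24 partitions ~ one per hour.
        val sortedDay = dayRDD.sortByKey(ascending = true, numPartitions = 24)

        println(s"partitions after sort: ${sortedDay.partitions.length}")
        sc.stop()
      }
    }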