On Sat, Mar 14, 2015 at 11:02 AM, Shawn Zheng szheng.c...@gmail.com wrote:
Sorry for the late response.
Zhan Zhang's solution is very interesting and I looked into it, but it is
not what I want. Basically I want to run the job sequentially and also gain
parallelism. So if possible, [...] it is a valuable approach
to me, so I am eager to learn.
Regards,
Shuai
From: Imran Rashid [mailto:iras...@cloudera.com]
Sent: Monday, March 16, 2015 11:22 AM
To: Shawn Zheng; user@spark.apache.org
Subject: Re: Process time series RDD after sortByKey
Hi Shuai,
this is a very interesting use case. First of all, it's worth pointing out
that if you really need to process the data sequentially, you are
fundamentally limiting the parallelism you can get. E.g., if you need to
process the entire data set sequentially, then you can't get any
parallelism. If [...]
Does a code flow similar to the following work for you? It processes each
partition of an RDD sequentially:

var iterPartition = 0
while (iterPartition < rdd.partitions.length) {
  val res = sc.runJob(rdd, (it: Iterator[T]) => someFunc(it), Seq(iterPartition),
    allowLocal = true)
  // some other function after processing
  iterPartition += 1
}
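For reference, here is a minimal sketch of that loop wired into a self-contained local-mode application, against the Spark 1.x API (where `SparkContext.runJob` still takes an `allowLocal` flag). The 4-partition RDD, the `someFunc` body (a per-partition sum), and the app name are illustrative assumptions, not anything from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SequentialPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("seq-partitions"))

    // A sorted RDD with 4 range partitions, standing in for dayRDD.sortByKey.
    val rdd = sc.parallelize(1 to 100).map(k => (k, k * 2))
      .sortByKey(numPartitions = 4)

    // Hypothetical per-partition function: here it just sums the values.
    def someFunc(it: Iterator[(Int, Int)]): Long = it.map(_._2.toLong).sum

    var iterPartition = 0
    while (iterPartition < rdd.partitions.length) {
      // Run a job on exactly one partition; because the RDD is range-
      // partitioned by sortByKey, partitions are visited in key order.
      val res = sc.runJob(rdd, someFunc _, Seq(iterPartition), allowLocal = true)
      println(s"partition $iterPartition -> ${res.head}")
      iterPartition += 1
    }
    sc.stop()
  }
}
```

Each `runJob` call is a separate job on a single partition, so the partitions themselves are handled one after another while the work inside each partition can still use whatever parallelism the cluster provides for that job.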
Hi All,
I am processing some time series data. One day might be around 500GB, so
each hour is around 20GB of data.
I need to sort the data before I start processing. Assume I can sort it
successfully:
dayRDD.sortByKey
but after that, I might have thousands of partitions (to
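One way to keep the partition count from exploding is that `sortByKey` takes an explicit `numPartitions` argument, so the sorted RDD's partitioning can be chosen directly (e.g. one partition per hour) instead of inherited from the input. A sketch against the Spark 1.x RDD API; the `dayRDD` contents here are made-up stand-in data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortWithChosenPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("sort-by-key"))

    // Stand-in for dayRDD: (timestamp, payload) pairs.
    val dayRDD: RDD[(Long, String)] =
      sc.parallelize(0L until 1000L).map(t => (t, s"event-$t"))

    // sortByKey range-partitions the data; passing numPartitions explicitly
    // (e.g. 24, one per hour of the day) bounds the number of output
    // partitions rather than leaving it to default to thousands.
    val sorted = dayRDD.sortByKey(ascending = true, numPartitions = 24)

    println(sorted.partitions.length) // the chosen partition count
    sc.stop()
  }
}
```

Because `sortByKey` uses a range partitioner, each of those partitions holds a contiguous key range, which fits the hour-by-hour processing described above.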