Won't you be able to use a CASE statement to generate a virtual column (like
partition_num), then use analytic SQL with PARTITION BY on this virtual column?
That way, the full dataset will be scanned just once.
Yong
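A minimal sketch of this idea (the bucket boundaries, the bucket_id name, and
the sum aggregate are all assumptions for illustration; in Spark SQL the
bucketing would be a CASE expression and the aggregate an analytic function
with PARTITION BY bucket_id):

```python
# Hypothetical sketch of the single-scan idea: derive a bucket_id with a
# CASE-style expression, then aggregate per bucket in one pass over the data.
#
# The equivalent Spark SQL virtual column would look something like:
#   SELECT *,
#          CASE WHEN key BETWEEN 1  AND 10 THEN 1
#               WHEN key BETWEEN 11 AND 20 THEN 2
#               ELSE 3 END AS bucket_id
#   FROM events
# followed by an analytic function with PARTITION BY bucket_id.

def bucket_id(key):
    """CASE-style bucketing on the key (assumed ranges)."""
    if 1 <= key <= 10:
        return 1
    if 11 <= key <= 20:
        return 2
    return 3

def sum_per_bucket(rows):
    """One scan: tag each (key, value) row with its bucket, aggregating as we go."""
    totals = {}
    for key, value in rows:
        b = bucket_id(key)
        totals[b] = totals.get(b, 0) + value
    return totals

rows = [(3, 10), (7, 20), (15, 5), (18, 5), (42, 1)]
print(sum_per_bucket(rows))  # {1: 30, 2: 10, 3: 1}
```

The point is that every bucket's result comes out of the same single pass,
instead of filtering the full dataset once per bucket.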
Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' condition in SparkSQL
Thanks Yong for your response.
Let me see if I can understand what you're suggesting: when I load the whole
data set into Spark (I'm using a custom Hadoop InputFormat), I will add an
extra field, like bucket_id, to each element in the RDD.
For example
Key:
1-10, bucket_id=1
11-20, bucket_id=2
and do whatever analytic function you want.
Yong
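The load-time tagging described above could be sketched like this (plain
Python standing in for the RDD map step; the field names and the bucketing
scheme are assumptions):

```python
# Hypothetical sketch of tagging each record with a bucket_id as it is
# loaded, instead of filtering the data set once per bucket. In Spark this
# would be an rdd.map(...) right after reading with the custom InputFormat.

def tag_with_bucket(record):
    """Attach a bucket_id field derived from the record's key."""
    key, payload = record
    bucket = (key - 1) // 10 + 1  # assumed scheme: keys 1-10 -> 1, 11-20 -> 2, ...
    return {"key": key, "bucket_id": bucket, "payload": payload}

records = [(3, "a"), (15, "b"), (27, "c")]
tagged = [tag_with_bucket(r) for r in records]  # one pass over the data
print([t["bucket_id"] for t in tagged])  # [1, 2, 3]
```

Once every element carries a bucket_id, downstream analytic functions can
partition by it without re-scanning the source data.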
Date: Thu, 29 Oct 2015 12:53:35 -0700
Subject: Re: RDD's filter() or using 'where' condition in SparkSQL
From: anfernee...@gmail.com
To: java8...@hotmail.com
CC: user@spark.apache.org