RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
Won't you be able to use a CASE statement to generate a virtual column (like partition_num), and then use an analytic SQL PARTITION BY over this virtual column? In that case, the full data set will be scanned just once. Yong
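A minimal sketch of that idea (the table name events, the columns key/value, and the bucket boundaries are invented for illustration; this is written against the modern SparkSession API, whereas on the Spark 1.x of this thread window functions required a HiveContext):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("case-bucket-sketch").getOrCreate()

    // Single scan: the CASE expression derives a virtual bucket_id, and the
    // analytic function partitions by it, so no per-bucket filter() passes.
    spark.sql("""
      SELECT key, value,
             RANK() OVER (PARTITION BY bucket_id ORDER BY value DESC) AS rnk
      FROM (
        SELECT key, value,
               CASE WHEN key BETWEEN 1 AND 10  THEN 1
                    WHEN key BETWEEN 11 AND 20 THEN 2
                    ELSE 3
               END AS bucket_id
        FROM events
      ) bucketed
    """).show()

The contrast is with running one filter() or WHERE per key range, which re-reads the source once per bucket; the CASE-plus-window form touches each row exactly once.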

Re: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
Thanks Yong for your response. Let me see if I can understand what you're suggesting: when I load the whole data set into Spark (I'm using a custom Hadoop InputFormat), I will add an extra field to each element in the RDD, like bucket_id. For example: keys 1-10 get bucket_id=1, keys 11-20 get bucket_id=2, and so on.
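A hedged sketch of that load-time tagging (the (key, value) pairs below stand in for whatever the custom InputFormat actually yields; the boundaries follow the example in the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("bucket-id-sketch"))

    // Stand-in for records produced by the custom Hadoop InputFormat.
    val raw = sc.parallelize(Seq((3, "a"), (12, "b"), (27, "c")))

    // Tag each element with a derived bucket_id: 1-10 -> 1, 11-20 -> 2, ...
    val withBucket = raw.map { case (key, value) =>
      val bucketId = (key - 1) / 10 + 1
      (bucketId, key, value)
    }

    withBucket.collect().foreach(println)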

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
... you can do whatever analytic function you want. Yong
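As an illustration of the "whatever analytic function you want" point (toy data and column names invented here; the thread itself only discusses the SQL form), the same virtual column can also feed any window function through the DataFrame API:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("analytic-sketch").getOrCreate()
    import spark.implicits._

    // Toy records standing in for the real data set.
    val df = Seq((3, 5.0), (7, 2.0), (12, 9.0), (18, 1.0))
      .toDF("key", "value")
      .withColumn("bucket_id", ((col("key") - 1) / 10).cast("int") + 1)

    // Any analytic function can now be partitioned by the virtual column.
    val w = Window.partitionBy("bucket_id").orderBy(col("value").desc)
    df.withColumn("rank", rank().over(w))
      .withColumn("bucket_avg", avg("value").over(Window.partitionBy("bucket_id")))
      .show()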