If I run the above scenario without the limit clause, will it then check across all the partitions?

On Dec 24, 2015 9:26 AM, "汪洋" <tiandiwo...@icloud.com> wrote:
> I see.
>
> Thanks.
>
>
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>
> There has to be a central point collaboratively collecting exactly
> 10000 records. Currently the approach uses one single partition, which
> is easy to implement.
> Otherwise, the driver has to count the number of records in each partition
> and then decide how many records to materialize in each partition,
> because some partitions may not have enough records, and some may
> even be empty.
>
> I didn't see any straightforward workaround for this.
>
> Thanks.
>
> Zhan Zhang
>
>
> On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
>
> It is an application running as an http server, so I collect the data as
> the response.
>
> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:
>
> When you call collect() it will bring all the data to the driver. Do you
> mean to call persist() instead?
>
> ------------------------------
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
>
> Hi,
> I am using spark sql in a way like this:
>
> sqlContext.sql("select * from table limit 10000").map(...).collect()
>
> The problem is that the limit clause collects all 10,000 records
> into a single partition, so the map afterwards runs in only one
> partition and is really slow. I tried to use repartition, but it is
> kind of a waste to collect all those records into one partition, then
> shuffle them around, and then collect them again.
>
> Is there a way to work around this?
> BTW, there is no order by clause and I do not care which 10000 records I
> get, as long as the total number is less than or equal to 10000.
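A minimal sketch of one possible workaround (not something proposed in the thread), assuming Spark 1.x DataFrames and that any rows will do, possibly fewer than 10000. `expensiveTransform` is a placeholder for the map(...) in the original question:

    // Instead of LIMIT (which funnels everything into one partition),
    // take a slice from each partition so the expensive map stays parallel.
    val df = sqlContext.sql("select * from table")

    val n = 10000
    val numPartitions = df.rdd.partitions.length
    val perPartition = math.ceil(n.toDouble / numPartitions).toInt

    val result = df.rdd
      .mapPartitions(_.take(perPartition))   // at most perPartition rows per partition
      .map(expensiveTransform)               // runs across all partitions, unlike after LIMIT
      .collect()                             // at most roughly n rows reach the driver
      .take(n)                               // trim the small overshoot from the ceiling

Each partition contributes at most perPartition rows, so the result can come up short when some partitions are small or empty, which is exactly the caveat Zhan Zhang raised; that seems acceptable here since the original question allows fewer than 10000 rows.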