I see. Thanks.
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
> 
> There has to be a central point to coordinate collecting exactly 10000 records; currently the approach is to use a single partition, which is easy to implement.
> Otherwise, the driver has to count the number of records in each partition and then decide how many records to materialize in each partition, because some partitions may not have enough records, and some may even be empty.
> 
> I didn't see any straightforward workaround for this.
> 
> Thanks.
> 
> Zhan Zhang
> 
> 
> On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com <mailto:tiandiwo...@icloud.com>> wrote:
> 
>> It is an application running as an http server, so I collect the data as the response.
>> 
>>> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com <mailto:justupl...@hotmail.com>> wrote:
>>> 
>>> When you call collect() it will bring all the data to the driver. Do you mean to call persist() instead?
>>> 
>>> From: tiandiwo...@icloud.com <mailto:tiandiwo...@icloud.com>
>>> Subject: Problem using limit clause in spark sql
>>> Date: Wed, 23 Dec 2015 21:26:51 +0800
>>> To: user@spark.apache.org <mailto:user@spark.apache.org>
>>> 
>>> Hi,
>>> I am using spark sql in a way like this:
>>> 
>>> sqlContext.sql("select * from table limit 10000").map(...).collect()
>>> 
>>> The problem is that the limit clause collects all the 10,000 records into a single partition, so the map afterwards runs in only one partition and is really slow. I tried to use repartition, but it seems wasteful to collect all those records into one partition, shuffle them around, and then collect them again.
>>> 
>>> Is there a way to work around this?
>>> BTW, there is no order by clause and I do not care which 10,000 records I get, as long as the total number is less than or equal to 10,000.
>> 
> 
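A rough sketch of the repartition workaround mentioned in the thread, assuming Spark 1.x with a SQLContext; the table name, the parallelism of 8, and the per-row transformation are placeholders, not from the original code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("limit-workaround"))
val sqlContext = new SQLContext(sc)

// limit materializes all 10,000 rows in a single partition, so a map
// applied directly would run as a single task.
val limited = sqlContext.sql("select * from table limit 10000")

// Spreading the 10,000 rows across 8 partitions before the map buys
// parallelism for the map stage at the cost of one extra shuffle.
val result = limited
  .repartition(8)
  .map(row => row.getString(0)) // placeholder for the real per-row work
  .collect()

Whether the extra shuffle pays off depends on how expensive the per-row work is; as noted above, there does not seem to be a way to make limit itself produce more than one partition.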