There has to have a central point to collaboratively collecting exactly 10000 records, currently the approach is using one single partitions, which is easy to implement. Otherwise, the driver has to count the number of records in each partition and then decide how many records to be materialized in each partition, because some partition may not have enough number of records, sometimes it is even empty.
I didn’t see any straightforward walk around for this. Thanks. Zhan Zhang On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com<mailto:tiandiwo...@icloud.com>> wrote: It is an application running as an http server. So I collect the data as the response. 在 2015年12月24日,上午8:22,Hudong Wang <justupl...@hotmail.com<mailto:justupl...@hotmail.com>> 写道: When you call collect() it will bring all the data to the driver. Do you mean to call persist() instead? ________________________________ From: tiandiwo...@icloud.com<mailto:tiandiwo...@icloud.com> Subject: Problem using limit clause in spark sql Date: Wed, 23 Dec 2015 21:26:51 +0800 To: user@spark.apache.org<mailto:user@spark.apache.org> Hi, I am using spark sql in a way like this: sqlContext.sql(“select * from table limit 10000”).map(...).collect() The problem is that the limit clause will collect all the 10,000 records into a single partition, resulting the map afterwards running only in one partition and being really slow.I tried to use repartition, but it is kind of a waste to collect all those records into one partition and then shuffle them around and then collect them again. Is there a way to work around this? BTW, there is no order by clause and I do not care which 10000 records I get as long as the total number is less or equal then 10000.