I see.  

Thanks.


> On Dec 24, 2015, at 11:44 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
> 
> There has to be a central point that collaboratively collects exactly 10000 
> records. Currently the approach is to use a single partition, which is easy 
> to implement. 
> Otherwise, the driver would have to count the number of records in each 
> partition and then decide how many records to materialize from each one, 
> because some partitions may not have enough records, and some may even be 
> empty.
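> 
> Roughly, that per-partition bookkeeping would look something like the sketch 
> below (just an illustration against the RDD behind the DataFrame; the query 
> and the 10000 limit are taken from the example in this thread):
> 
> // The query without the limit clause.
> val df = sqlContext.sql("select * from table")
> val rdd = df.rdd
> 
> // One extra pass so the driver learns how many records each partition holds.
> val counts = rdd.mapPartitions(it => Iterator(it.size)).collect()
> 
> // The driver decides how many records to keep from each partition so the
> // total is at most 10000, filling partitions in order.
> val quotas = counts.scanLeft(0)(_ + _).zip(counts).map { case (before, c) =>
>   math.max(0, math.min(c, 10000 - before))
> }
> 
> // Each partition materializes only its quota. Note this recomputes the RDD
> // unless it is cached, so it costs another pass over the data.
> val limited = rdd.mapPartitionsWithIndex((i, it) => it.take(quotas(i)))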
> 
> I didn't see any straightforward workaround for this.
> 
> Thanks.
> 
> Zhan Zhang
> 
> 
> 
> On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
> 
>> It is an application running as an HTTP server, so I collect the data as the 
>> response.
>> 
>>> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:
>>> 
>>> When you call collect() it will bring all the data to the driver. Do you 
>>> mean to call persist() instead?
>>> 
>>> From: tiandiwo...@icloud.com
>>> Subject: Problem using limit clause in spark sql
>>> Date: Wed, 23 Dec 2015 21:26:51 +0800
>>> To: user@spark.apache.org
>>> 
>>> Hi,
>>> I am using Spark SQL like this:
>>> 
>>> sqlContext.sql("select * from table limit 10000").map(...).collect()
>>> 
>>> The problem is that the limit clause collects all 10,000 records into a 
>>> single partition, so the map afterwards runs in only one partition and is 
>>> really slow. I tried to use repartition, but it seems wasteful to collect 
>>> all those records into one partition, then shuffle them around, and then 
>>> collect them again.
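>>> 
>>> (The repartition attempt was roughly the following, with 200 as an arbitrary 
>>> partition count:)
>>> 
>>> sqlContext.sql("select * from table limit 10000").repartition(200).map(...).collect()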
>>> 
>>> Is there a way to work around this? 
>>> BTW, there is no order by clause, and I do not care which 10000 records I 
>>> get as long as the total number is less than or equal to 10000.
>> 
> 
