There has to be a central point for collaboratively collecting exactly 10,000
records; the current approach uses a single partition, which is easy to
implement.
Otherwise, the driver would have to count the number of records in each
partition and then decide how many records to materialize from each partition,
because some partitions may not have enough records, and some may even be
empty.

I didn't see any straightforward workaround for this.
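
For illustration, that two-pass approach would look roughly like this (an
untested sketch; takeAcrossPartitions is a made-up name, and it assumes you
start from an RDD[Row]):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row

  // Hypothetical helper sketching the two-pass approach described above.
  def takeAcrossPartitions(rdd: RDD[Row], n: Int): RDD[Row] = {
    // Pass 1: count the records in each partition on the executors.
    val counts: Array[(Int, Long)] = rdd
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size.toLong)))
      .collect()

    // Driver side: walk the partitions in order, giving each a quota
    // until n records are accounted for; later partitions get zero.
    var remaining = n.toLong
    val quota: Map[Int, Int] = counts.sortBy(_._1).map { case (idx, c) =>
      val take = math.min(c, remaining)
      remaining -= take
      (idx, take.toInt)
    }.toMap

    // Pass 2: materialize only each partition's quota, in parallel.
    rdd.mapPartitionsWithIndex((idx, it) => it.take(quota.getOrElse(idx, 0)))
  }

Note that this evaluates the RDD twice, so in practice you would want to cache
it before the first pass.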

Thanks.

Zhan Zhang



On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:

It is an application running as an HTTP server, so I collect the data as the
response.

On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:

When you call collect(), it will bring all the data to the driver. Do you mean
to call persist() instead?

________________________________
From: tiandiwo...@icloud.com
Subject: Problem using limit clause in spark sql
Date: Wed, 23 Dec 2015 21:26:51 +0800
To: user@spark.apache.org

Hi,
I am using Spark SQL like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause collects all 10,000 records into a single
partition, so the map afterwards runs in only one partition and is really
slow. I tried to use repartition, but it seems wasteful to collect all those
records into one partition, shuffle them around, and then collect them again.

Is there a way to work around this?
BTW, there is no order by clause and I do not care which 10,000 records I get,
as long as the total number is less than or equal to 10,000.
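
Concretely, what I tried looks roughly like this (a sketch; the partition
count of 100 and process() are just placeholders):

  val limited = sqlContext.sql("select * from table limit 10000")
  // limit still funnels everything into one partition; repartition then
  // shuffles the rows back out so the map can run in parallel.
  val result = limited.repartition(100).map(row => process(row)).collect()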

