If I run the above scenario without the limit clause, will it then work
across all the partitions?
On Dec 24, 2015 9:26 AM, "汪洋" <tiandiwo...@icloud.com> wrote:

> I see.
>
> Thanks.
>
>
> On Dec 24, 2015, at 11:44 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:
>
> There has to be a central point to collaboratively collect exactly
> 10000 records; currently the approach is to use a single partition, which
> is easy to implement.
> Otherwise, the driver would have to count the number of records in each
> partition and then decide how many records to materialize from each
> partition, because some partitions may not have enough records, and some
> may even be empty.
>
> I didn’t see any straightforward workaround for this.
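>
> (For illustration, a rough sketch of that two-pass, driver-coordinated
> approach written with plain RDD operations; distributedLimit is a made-up
> helper name, and this is not how Spark itself implements limit.)
>
>   import org.apache.spark.rdd.RDD
>   import scala.reflect.ClassTag
>
>   // Sketch only: take exactly n records while keeping the result distributed.
>   def distributedLimit[T: ClassTag](rdd: RDD[T], n: Long): RDD[T] = {
>     // Pass 1: the driver learns how many records each partition holds.
>     val counts = rdd
>       .mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
>       .collect()
>       .sortBy(_._1)
>       .map(_._2)
>
>     // The driver assigns a quota to each partition; short or empty
>     // partitions contribute less, and later partitions make up the rest.
>     var remaining = n
>     val quotas = counts.map { c =>
>       val take = math.min(c, remaining)
>       remaining -= take
>       take
>     }
>
>     // Pass 2: each partition materializes only its quota.
>     val bQuotas = rdd.sparkContext.broadcast(quotas)
>     rdd.mapPartitionsWithIndex((i, it) => it.take(bQuotas.value(i).toInt))
>   }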
>
> Thanks.
>
> Zhan Zhang
>
>
>
> On Dec 23, 2015, at 5:32 PM, 汪洋 <tiandiwo...@icloud.com> wrote:
>
> It is an application running as an HTTP server, so I collect the data as
> the response.
>
> On Dec 24, 2015, at 8:22 AM, Hudong Wang <justupl...@hotmail.com> wrote:
>
> When you call collect() it will bring all the data to the driver. Do you
> mean to call persist() instead?
>
> ------------------------------
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
>
> Hi,
> I am using Spark SQL in a way like this:
>
> sqlContext.sql("select * from table limit 10000").map(...).collect()
>
> The problem is that the limit clause collects all the 10,000 records
> into a single partition, so the map afterwards runs in only one
> partition and is really slow. I tried to use repartition, but it seems
> wasteful to collect all those records into one partition, shuffle them
> around, and then collect them again.
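>
> (Roughly what that workaround looks like; 32 is just a hypothetical
> partition count, and row => row stands in for the real map function.)
>
>   val limited = sqlContext.sql("select * from table limit 10000") // one partition after the limit
>   val result = limited
>     .repartition(32)   // shuffle just to spread the rows out again
>     .map(row => row)   // placeholder for the real map(...)
>     .collect()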
>
> Is there a way to work around this?
> BTW, there is no order by clause and I do not care which 10000 records I
> get, as long as the total number is less than or equal to 10000.
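>
> (Since any 10000 rows will do, one rough alternative, sketched under the
> assumption that a somewhat smaller result is acceptable: skip the SQL
> limit and cap each partition instead, so the rows stay spread across
> partitions. The names below are only illustrative.)
>
>   val df = sqlContext.sql("select * from table")
>   val numParts = df.rdd.partitions.length
>   val perPartition = 10000 / math.max(numParts, 1) // integer division keeps the total at or below 10000
>
>   val result = df.rdd
>     .mapPartitions(it => it.take(perPartition)) // at most perPartition rows per partition
>     .map(row => row)                            // placeholder for the real map(...)
>     .collect()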
>
>
>
>
>
