Sorry for the late reply. The raw data is 20G and consists of two columns; we generated it with this generator: <https://github.com/lizbai0821/DataGenerator/blob/master/src/main/scala/Gen.scala>. The test queries are very simple:

1) select ColA from Table limit 1
2) select ColA from Table
3) select ColA from Table where ColB=0
4) select ColA from Table where ColB=0 limit 1

We found that with `result.collect()`, queries 1) and 4) do stop early once adequate results are collected. However, when we instead save the result with `result.write.parquet`, there is no early stop and much more data is scanned than with `result.collect()`.
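Roughly, the comparison was run like the sketch below (the input path, output path, and timing helper are illustrative placeholders, not our exact harness):

import org.apache.spark.sql.SparkSession

object LimitTest {
  // Wall-clock timer used for the numbers in the table below.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("limit-early-stop-test").getOrCreate()

    // The 20G two-column Parquet data produced by Gen.scala; path is a placeholder.
    spark.read.parquet("/data/gen/twocol.parquet").createOrReplaceTempView("Table")

    val result = spark.sql("select ColA from Table limit 1")

    // Early stop observed: finishes in ~18s.
    time("collect") { result.collect() }

    // No early stop observed: scans far more data, ~2 minutes.
    time("write.parquet") { result.write.parquet("/tmp/limit-test-out") }
  }
}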
Below is a detailed summary of the tests:

Query                                        Method of Saving Results   Run Time
select ColA from Table limit 1               result.write.parquet       1m 56s
select ColA from Table                       result.write.parquet       1m 40s
select ColA from Table where ColB=0 limit 1  result.write.parquet       1m 32s
select ColA from Table where ColB=0          result.write.parquet       1m 21s
select ColA from Table limit 1               result.collect()           18s
select ColA from Table where ColB=0 limit 1  result.collect()           18s

Thanks.

Best,
Liz

> On 27 Oct 2016, at 2:16 AM, Michael Armbrust <mich...@databricks.com> wrote:
>
> That is surprising then, you may have found a bug. What timings are you
> seeing? Can you reproduce it with data you can share? I'd open a JIRA if so.
>
> On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai <liz...@icloud.com> wrote:
> We used Parquet as the data source. The query is like "select ColA from table
> limit 1"; its query plan is attached. (However, its run time is just the same
> as "select ColA from table".)
> We expected an early stop upon getting 1 result, rather than scanning all
> records and applying the limit only in the final collect phase.
> Btw, I agree with Mich's concern: limit push-down is impossible when table
> joins are involved. But some cases, such as "Filter + Projection + Limit",
> will benefit from limit push-down.
> May I know if there is any detailed solution for this?
>
> Thanks so much.
>
> Best,
> Liz
>
> <queryplan.png>
>> On 25 Oct 2016, at 5:54 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> It is not about limits on specific tables. We do support that. The case
>> I'm describing involves pushing limits across system boundaries. It is
>> certainly possible to do this, but the current datasource API does not
>> provide this information (other than the implicit limit that is pushed down
>> to the consumed iterator of the data source).
>>
>> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> This is an interesting point.
>>
>> As far as I know, in any database (practically all RDBMSs: Oracle, SAP,
>> etc.), the LIMIT affects the collection part of the result set.
>>
>> The result set is computed fully from the query, which may involve multiple
>> joins on multiple underlying tables.
>>
>> Applying LIMIT to each underlying table does not make sense and would not be
>> industry standard, AFAIK.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>> On 24 October 2016 at 06:48, Michael Armbrust <mich...@databricks.com> wrote:
>> - dev + user
>>
>> Can you give more info about the query? Maybe a full explain()? Are you
>> using a datasource like JDBC? The API does not currently push down limits,
>> but the documentation talks about how you can use a query instead of a table
>> if that is what you are looking to do.
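[Note on the "query instead of a table" suggestion above: for a JDBC source it looks roughly like the sketch below; the URL, credentials, and table name are made-up placeholders.]

// Push the limit into the source query by passing a subquery as "dbtable".
val limited = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(SELECT ColA FROM big_table LIMIT 500) AS t")
  .option("user", "user")
  .option("password", "password")
  .load()
// The LIMIT now runs inside the database, so only 500 rows cross the wire.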
>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <liz...@icloud.com> wrote:
>> Hi all,
>>
>> Let me clarify the problem:
>>
>> Suppose we have a simple table `A` with 100,000,000 records.
>>
>> Problem:
>> When we execute the SQL query `select * from A limit 500`,
>> it scans through all 100,000,000 records.
>> The expected behaviour is that once 500 records are found, the engine stops
>> scanning.
>>
>> Detailed observation:
>> We found that there are "GlobalLimit / LocalLimit" physical operators:
>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
>> But during query plan generation, GlobalLimit / LocalLimit is not applied to
>> the query plan.
>>
>> Could you please help us inspect the LIMIT problem?
>> Thanks.
>>
>> Best,
>> Liz
>>> On 23 Oct 2016, at 10:11 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>> Hi, Liz,
>>>
>>> CollectLimit means "Take the first `limit` elements and collect them to a
>>> single partition."
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> 2016-10-23 5:21 GMT-07:00 Ran Bai <liz...@icloud.com>:
>>> Hi all,
>>>
>>> I found that the runtime of a query is the same with or without the "LIMIT"
>>> keyword. We looked into it and found that "GlobalLimit / LocalLimit" do
>>> appear in the logical plan, but there is no corresponding physical operator.
>>> Is this a bug or something else? Attached are the logical and physical plans
>>> when running "SELECT * FROM seq LIMIT 1".
>>>
>>> More specifically, we expected an early stop upon getting adequate results.
>>> Thanks so much.
>>>
>>> Best,
>>> Liz
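[For anyone reproducing this: a quick way to check which limit operator the planner actually chose is to print the full plan. The snippet below is illustrative; the exact operator names vary across Spark versions.]

// explain(true) prints the parsed, analyzed, optimized, and physical plans;
// look for CollectLimit vs. GlobalLimit/LocalLimit in the physical plan.
spark.sql("SELECT * FROM seq LIMIT 1").explain(true)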