Sorry for the late reply. The raw data is 20G and consists of two columns; we generated it with this generator: <https://github.com/lizbai0821/DataGenerator/blob/master/src/main/scala/Gen.scala>. The test queries are very simple:

1) select ColA from Table limit 1
2) select ColA from Table
3) select ColA from Table where ColB=0
4) select ColA from Table where ColB=0 limit 1

We found that with `result.collect()`, queries 1) and 4) do stop early once adequate results are collected. However, when we instead save the result with `result.write.parquet`, there is no early stop and much more data is scanned than with `result.collect()`.
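Roughly, the comparison was run like the sketch below (the input path, output path, and timing helper are illustrative placeholders, not our exact harness):

import org.apache.spark.sql.SparkSession

object LimitTest {
  // Wall-clock timer used for the numbers in the table below.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("limit-early-stop-test").getOrCreate()

    // The 20G two-column Parquet data produced by Gen.scala; path is a placeholder.
    spark.read.parquet("/data/gen/twocol.parquet").createOrReplaceTempView("Table")

    val result = spark.sql("select ColA from Table limit 1")

    // Early stop observed: finishes in ~18s.
    time("collect") { result.collect() }

    // No early stop observed: scans far more data, ~2 minutes.
    time("write.parquet") { result.write.parquet("/tmp/limit-test-out") }
  }
}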
Below is a detailed summary of the tests:

Query                                        Method of Saving Results   Run Time
select ColA from Table limit 1               result.write.parquet       1m 56s
select ColA from Table                       result.write.parquet       1m 40s
select ColA from Table where ColB=0 limit 1  result.write.parquet       1m 32s
select ColA from Table where ColB=0          result.write.parquet       1m 21s
select ColA from Table limit 1               result.collect()           18s
select ColA from Table where ColB=0 limit 1  result.collect()           18s

Thanks.

Best,
Liz

> On 27 Oct 2016, at 2:16 AM, Michael Armbrust <mich...@databricks.com> wrote:
>
> That is surprising then, you may have found a bug. What timings are you
> seeing? Can you reproduce it with data you can share? I'd open a JIRA if so.
>
> On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai <liz...@icloud.com> wrote:
> We used Parquet as the data source. The query is like "select ColA from table
> limit 1"; its query plan is attached. (However, its run time is just the same
> as "select ColA from table".)
> We expected an early stop upon getting 1 result, rather than scanning all
> records and applying the limit only in the final collect phase.
> Btw, I agree with Mich's concern: limit push-down is impossible when table
> joins are involved. But some cases, such as "Filter + Projection + Limit",
> will benefit from limit push-down.
> May I know if there is any detailed solution for this?
>
> Thanks so much.
>
> Best,
> Liz
>
> <queryplan.png>
>> On 25 Oct 2016, at 5:54 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> It is not about limits on specific tables. We do support that. The case
>> I'm describing involves pushing limits across system boundaries. It is
>> certainly possible to do this, but the current datasource API does not
>> provide this information (other than the implicit limit that is pushed down
>> to the consumed iterator of the data source).
>>
>> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> This is an interesting point.
>>
>> As far as I know, in any database (practically all RDBMSs: Oracle, SAP,
>> etc.), the LIMIT affects the collection part of the result set.
>>
>> The result set is computed fully from the query, which may involve multiple
>> joins on multiple underlying tables.
>>
>> Applying LIMIT to each underlying table does not make sense and would not be
>> industry standard, AFAIK.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>> On 24 October 2016 at 06:48, Michael Armbrust <mich...@databricks.com> wrote:
>> - dev + user
>>
>> Can you give more info about the query? Maybe a full explain()? Are you
>> using a datasource like JDBC? The API does not currently push down limits,
>> but the documentation talks about how you can use a query instead of a table
>> if that is what you are looking to do.
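[Note on the "query instead of a table" suggestion above: for a JDBC source it looks roughly like the sketch below; the URL, credentials, and table name are made-up placeholders.]

// Push the limit into the source query by passing a subquery as "dbtable".
val limited = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "(SELECT ColA FROM big_table LIMIT 500) AS t")
  .option("user", "user")
  .option("password", "password")
  .load()
// The LIMIT now runs inside the database, so only 500 rows cross the wire.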
>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai <liz...@icloud.com> wrote:
>> Hi all,
>>
>> Let me clarify the problem:
>>
>> Suppose we have a simple table `A` with 100,000,000 records.
>>
>> Problem:
>> When we execute the SQL query `select * from A limit 500`,
>> it scans through all 100,000,000 records.
>> The expected behaviour is that once 500 records are found, the engine stops
>> scanning.
>>
>> Detailed observation:
>> We found that there are "GlobalLimit / LocalLimit" physical operators:
>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
>> But during query plan generation, GlobalLimit / LocalLimit is not applied to
>> the query plan.
>>
>> Could you please help us inspect the LIMIT problem?
>> Thanks.
>>
>> Best,
>> Liz
>>> On 23 Oct 2016, at 10:11 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>> Hi, Liz,
>>>
>>> CollectLimit means "Take the first `limit` elements and collect them to a
>>> single partition."
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> 2016-10-23 5:21 GMT-07:00 Ran Bai <liz...@icloud.com>:
>>> Hi all,
>>>
>>> I found that the runtime of a query is the same with or without the "LIMIT"
>>> keyword. We looked into it and found that "GlobalLimit / LocalLimit" do
>>> appear in the logical plan, but there is no corresponding physical operator.
>>> Is this a bug or something else? Attached are the logical and physical plans
>>> when running "SELECT * FROM seq LIMIT 1".
>>>
>>> More specifically, we expected an early stop upon getting adequate results.
>>> Thanks so much.
>>>
>>> Best,
>>> Liz
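[For anyone reproducing this: a quick way to check which limit operator the planner actually chose is to print the full plan. The snippet below is illustrative; the exact operator names vary across Spark versions.]

// explain(true) prints the parsed, analyzed, optimized, and physical plans;
// look for CollectLimit vs. GlobalLimit/LocalLimit in the physical plan.
spark.sql("SELECT * FROM seq LIMIT 1").explain(true)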