Hi all,

We ran some tests with Spark 2.0 on Parquet data. The table below summarizes 
what we have found about the `LIMIT` keyword (a sketch of the kind of driver 
we used follows the table). Query-2 shows that Spark SQL does stop scanning 
early once it has fetched enough results. But we are puzzled by Query-1 and 
Query-3: it seems that either writing the result RDD out as Parquet or 
filtering on a column causes much more data to be scanned.
No. | SQL statement                                  | Filter | Method of saving result | Runtime (s) | Input data size
----+------------------------------------------------+--------+-------------------------+-------------+----------------
 1  | select ColA from Table limit 1                 | no     | writeParquet            | 216         | 205 MB
 2  | select ColA from Table limit 1                 | no     | collect                 | 22          | 38.3 KB
 3  | select ColA from Table where ColB = 50 limit 1 | yes    | collect                 | 229         | 1776.4 MB
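
For reference, here is a minimal sketch of how such a test can be driven. The 
paths, the app name, and the final explain() call are placeholders added for 
illustration, not our exact code:

  import org.apache.spark.sql.SparkSession

  object LimitTest {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("LimitTest").getOrCreate()

      // Register the Parquet table; the path is a placeholder.
      spark.read.parquet("/path/to/Table").createOrReplaceTempView("Table")

      // Query-1: limit 1, result written back out as Parquet.
      spark.sql("select ColA from Table limit 1")
        .write.parquet("/path/to/output")   // placeholder output path

      // Query-2: same query, result collected to the driver.
      spark.sql("select ColA from Table limit 1").collect()

      // Query-3: limit 1 with a filter, result collected to the driver.
      spark.sql("select ColA from Table where ColB = 50 limit 1").collect()

      // explain() prints the physical plan, which shows whether an early
      // stop (a CollectLimit-style operator) is planned for each query.
      spark.sql("select ColA from Table limit 1").explain()

      spark.stop()
    }
  }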
We are wondering whether this is a bug or expected behavior. Could you please 
help us understand it?
Thanks.

Best regards,
Liz
