Hi all,

We used Parquet and Spark 2.0 for our testing. The table below summarizes what we have found about the `Limit` keyword. Query-2 shows that Spark SQL stops scanning early once it has enough results. But Query-1 and Query-3 puzzle us: it seems that either writing the result RDD out as Parquet, or filtering on a column, causes Spark to scan much more data.

No.  SQL statement                                   Filter  Method of saving result  Runtime (s)  Input data size
1    select ColA from Table limit 1                  no      writeParquet             216          205 MB
2    select ColA from Table limit 1                  no      Collect                  22           38.3 KB
3    select ColA from Table where ColB = 50 limit 1  yes     Collect                  229          1776.4 MB

We are wondering whether this is a bug or expected behavior. Could you please help us understand it? Thanks.
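For what it's worth, here is a minimal, Spark-free sketch of the early-stop behavior we believe Query-2 benefits from: scanning partition by partition and stopping as soon as `limit` rows are found. The partition layout and the predicate below are made up for illustration; they are not our real data.

```python
def scan_with_limit(partitions, predicate, limit):
    """Scan partition by partition, stopping once `limit` matching rows are found.
    Returns (matching_rows, number_of_rows_scanned)."""
    results, scanned = [], 0
    for part in partitions:
        for row in part:
            scanned += 1
            if predicate(row):
                results.append(row)
                if len(results) >= limit:
                    # Early stop: no need to touch the remaining data.
                    return results, scanned
    return results, scanned

# Toy data: 10 partitions of 1000 rows each (values 0..9999).
partitions = [list(range(p * 1000, (p + 1) * 1000)) for p in range(10)]

# Query-2 style: no filter, so the very first row satisfies "limit 1".
rows2, scanned2 = scan_with_limit(partitions, lambda r: True, limit=1)
# -> scans exactly 1 row

# Query-3 style: a selective filter forces scanning many rows before
# the first match, even though the query still says "limit 1".
rows3, scanned3 = scan_with_limit(partitions, lambda r: r == 9500, limit=1)
# -> scans 9501 rows before it can stop
```

If this model matches what Spark does internally, it would explain why the filtered query (Query-3) reads so much input despite the `Limit`, but not why writing the result as Parquet (Query-1) does.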
Best regards, Liz