Nathan Grand created SPARK-30792:
------------------------------------

             Summary: Dataframe .limit() performance improvements
                 Key: SPARK-30792
                 URL: https://issues.apache.org/jira/browse/SPARK-30792
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Nathan Grand


It seems that
{code:java}
.limit(){code}
is much less efficient than one would expect when reading a large dataset from
Parquet:
{code:java}
val sample = spark.read.parquet("/Some/Large/Data.parquet").limit(1000)
// Do something with sample ...{code}
This might take hours, depending on the size of the data.

By comparison,
{code:java}
spark.read.parquet("/Some/Large/Data.parquet").show(1000){code}
is essentially instant.
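
My guess (an assumption on my part, not confirmed against the Spark source) is
that when the limit is the last step before a collect-style action, as in
show(), Spark plans it as a CollectLimit, which scans only a few partitions at
a time; but when the limited DataFrame is consumed by further operations,
Spark plans a LocalLimit, a shuffle to a single partition, and a GlobalLimit,
which still schedules a task for every input split. Comparing the physical
plans should show the difference; the following sketch reuses the placeholder
path from above:
{code:java}
val df = spark.read.parquet("/Some/Large/Data.parquet")

// Limit consumed directly by a collect-style action: I'd expect a
// CollectLimit node here, which fetches a few partitions at a time.
df.limit(1000).explain()

// Limit followed by further work: I'd expect LocalLimit +
// Exchange SinglePartition + GlobalLimit, which still runs a task
// per input split before the sample is available.
df.limit(1000).groupBy().count().explain(){code}
If that is right, the cost is proportional to the number of input splits
rather than to the 1000 rows actually needed.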

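In the meantime, a possible workaround (only a sketch; I haven't verified it
avoids the slow path in all cases) is to pull the sample to the driver with
take(), which goes through the same incremental path as show(), and rebuild a
small DataFrame from it when cluster-side processing is still needed:
{code:java}
val df = spark.read.parquet("/Some/Large/Data.parquet")

// take() goes through the same incremental collect path as show().
val rows = df.take(1000)

// Rebuild a small in-memory DataFrame from the collected sample for
// any further cluster-side processing.
val sample = spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), df.schema)
// Do something with sample ...{code}
This trades driver memory for the scan cost, so it only makes sense while the
sample stays small.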