Nathan Grand created SPARK-30792:
------------------------------------

             Summary: Dataframe .limit() performance improvements
                 Key: SPARK-30792
                 URL: https://issues.apache.org/jira/browse/SPARK-30792
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Nathan Grand
It seems that
{code:java}
.limit(){code}
is much less efficient than it could be, and than one would expect, when reading a large dataset from Parquet:
{code:java}
val sample = spark.read.parquet("/Some/Large/Data.parquet").limit(1000)

// Do something with sample ...{code}
This might take hours, depending on the size of the data. By comparison,
{code:java}
spark.read.parquet("/Some/Large/Data.parquet").show(1000){code}
is essentially instant.
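A plausible explanation (an assumption on my part, not verified by profiling) is that show(n) and take(n) are planned as a CollectLimit, which fetches partitions incrementally until n rows are found, whereas limit(n) embedded in a larger plan becomes a LocalLimit plus a GlobalLimit whose shuffle to a single partition can end up reading the entire input. If that is the cause, one workaround is to collect the sample with take(n) and re-parallelize it; the sketch below is a minimal illustration reusing the path and sample size from the example above:
{code:java}
// Workaround sketch, assuming the slowdown comes from the GlobalLimit
// shuffle that limit(1000) introduces into downstream plans.
// take(1000) uses the same incremental collect path as show(1000), so it
// returns quickly; the rows are then re-parallelized into a small DataFrame.
val df = spark.read.parquet("/Some/Large/Data.parquet")
val rows = df.take(1000)  // fast, like show(1000)
val sample = spark.createDataFrame(
  spark.sparkContext.parallelize(rows), df.schema)  // driver-sized data only

// Do something with sample ...{code}
Note that this materializes the sampled rows on the driver, so it is only appropriate when n is small.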