Jerry Lam created SPARK-10731:
---------------------------------

             Summary: The head() implementation of dataframe is very slow
                 Key: SPARK-10731
                 URL: https://issues.apache.org/jira/browse/SPARK-10731
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.5.0
            Reporter: Jerry Lam

    df = sqlContext.read.parquet("someparquetfiles")
    df.head()

The above lines take over 15 minutes. It seems the DataFrame requires 3 stages to return the first row: it reads all the data (about 1 billion rows) and runs Limit twice. The take(1) implementation in the RDD performs much better.
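
To make the comparison above reproducible, here is a minimal sketch that times both ways of fetching the first row. It assumes an existing `sqlContext` as in the snippet above; the helper names `timed` and `compare_first_row` are hypothetical and not part of Spark. The `explain()` call is included only as a possible way to see where the Limit operators show up in the physical plan.

```python
import time


def timed(label, fn):
    """Run fn(), print the elapsed wall-clock time, and return fn's result."""
    start = time.time()
    result = fn()
    print("%s took %.1f s" % (label, time.time() - start))
    return result


def compare_first_row(sqlContext, path):
    """Fetch the first row via DataFrame.head() and via RDD.take(1).

    `sqlContext` is assumed to be an already-created SQLContext and
    `path` a Parquet location; both are supplied by the caller.
    """
    df = sqlContext.read.parquet(path)

    # The call reported as slow in this issue.
    head_row = timed("df.head()", lambda: df.head())

    # The RDD path that the report says performs much better.
    rdd_rows = timed("df.rdd.take(1)", lambda: df.rdd.take(1))
    rdd_row = rdd_rows[0] if rdd_rows else None

    # Print the physical plan of a one-row limit for inspection.
    df.limit(1).explain()

    return head_row, rdd_row
```

In a pyspark shell this could be run as `compare_first_row(sqlContext, "someparquetfiles")`. The timings will depend on the data layout and cluster, so the sketch only shows how to measure the gap, not how large it will be.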