Hi Spark Developers,

I just ran some very simple operations on a dataset and was surprised by the execution plan of take(1), head(), and first().
For your reference, this is what I did in PySpark 1.5:

df = sqlContext.read.parquet("someparquetfiles")
df.head()

The lines above take over 15 minutes. I was frustrated because I can do better without using Spark :) Since I like Spark, I tried to figure out why. It seems the DataFrame requires 3 stages to give me the first row: it reads all the data (about 1 billion rows) and runs Limit twice. Instead of head(), show(1) runs much faster. Not to mention that if I do:

df.rdd.take(1)  # runs much faster

Is this expected? Why are head()/first()/take() so slow for DataFrames? Is it a bug in the optimizer, or did I do something wrong?

Best Regards,

Jerry
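The gap Jerry describes can be sketched in plain Python. This is a hypothetical illustration of the two evaluation strategies (materialize everything, then apply a limit, versus pull rows lazily and stop early), not actual Spark internals; the function names are made up for the sketch:

```python
import itertools

# A generator that counts how many rows are actually produced,
# standing in for a billion-row parquet scan.
produced = {"count": 0}

def rows(total):
    for i in range(total):
        produced["count"] += 1
        yield i

def full_scan_then_limit(source, n):
    # Materialize every row first (analogous to a plan that scans
    # all partitions before applying Limit), then keep the first n.
    materialized = list(source)
    return materialized[:n]

def early_stop_take(source, n):
    # Pull only the first n rows and stop (analogous to RDD.take,
    # which tries the first partition and stops once it has enough).
    return list(itertools.islice(source, n))

produced["count"] = 0
full_scan_then_limit(rows(1_000_000), 1)
print(produced["count"])   # 1000000 rows produced for one result

produced["count"] = 0
early_stop_take(rows(1_000_000), 1)
print(produced["count"])   # 1 row produced for one result
```

Both calls return the same single row; the difference is only in how much of the input gets read before the limit applies, which matches the wall-clock gap between head() and rdd.take(1) in the report above.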