Hi Yin,

You are right! I just tried the Scala version with the above lines, and it works as expected. I'm not sure whether this also happens in 1.4 for pyspark, but I thought the pyspark code just calls the Scala code via Py4J, so I didn't expect this bug to be pyspark-specific. That actually surprises me a bit. I created a ticket for this (SPARK-10731 <https://issues.apache.org/jira/browse/SPARK-10731>).
Best Regards,

Jerry

On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:

> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>
>> Thanks,
>>
>> Yin
>>
>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark Developers,
>>>
>>> I just ran some very simple operations on a dataset and was surprised
>>> by the execution plan of take(1), head(), and first().
>>>
>>> For your reference, this is what I did in pyspark 1.5:
>>>
>>> df = sqlContext.read.parquet("someparquetfiles")
>>> df.head()
>>>
>>> The above lines take over 15 minutes. I was frustrated because I can
>>> do better without using Spark :) Since I like Spark, I tried to figure
>>> out why. It seems the DataFrame requires 3 stages to give me the first
>>> row: it reads all the data (about 1 billion rows) and runs Limit twice.
>>>
>>> show(1) runs much faster than head(). Not to mention that if I do:
>>>
>>> df.rdd.take(1)  # runs much faster
>>>
>>> Is this expected? Why are head()/first()/take() so slow for
>>> DataFrames? Is it a bug in the optimizer, or did I do something wrong?
>>>
>>> Best Regards,
>>>
>>> Jerry
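For anyone wanting to reproduce the comparison Jerry describes, a minimal sketch is below. It assumes pyspark 1.5 (SQLContext-era API), a local Spark installation, and that "someparquetfiles" is the placeholder path from the original message; the timings and plans will of course depend on the actual data.

```python
# Sketch reproducing the slow-head() report from the thread (pyspark 1.5 API).
# "someparquetfiles" is a placeholder path taken from the original message.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", "head-vs-take")
sqlContext = SQLContext(sc)

df = sqlContext.read.parquet("someparquetfiles")

# Slow path reported in the thread: plans multiple stages and scans far
# more data than needed before the Limit is applied.
print(df.head())

# Faster alternatives mentioned in the thread:
df.show(1)             # much faster than head()
print(df.rdd.take(1))  # dropping to the RDD API is also much faster

# Inspect the physical plan to see how the Limit is planned:
df.limit(1).explain()

sc.stop()
```

Comparing the `explain()` output of `df.limit(1)` against what `head()` actually executes is the quickest way to confirm whether the extra stages come from the Python-side planning, which is what the JIRA above is tracking.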