btw, does 1.4 have the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark Developers,
>>
>> I just ran some very simple operations on a dataset. I was surprised by
>> the execution plan of take(1), head(), or first().
>>
>> For your reference, this is what I did in pyspark 1.5:
>>
>> df = sqlContext.read.parquet("someparquetfiles")
>> df.head()
>>
>> The above lines take over 15 minutes. I was frustrated because I can do
>> better without using Spark :) Since I like Spark, I tried to figure out
>> why. It seems the DataFrame requires 3 stages to give me the first row. It
>> reads all the data (which is about 1 billion rows) and runs Limit twice.
>>
>> show(1) runs much faster than head(). Not to mention that if I do:
>>
>> df.rdd.take(1)  # runs much faster
>>
>> Is this expected? Why are head/first/take so slow for DataFrames? Is it a
>> bug in the optimizer, or did I do something wrong?
>>
>> Best Regards,
>>
>> Jerry