Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers,
I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5:
    df = sqlContext.read.parquet("someparquetfiles")
    df.head()
The above lines take over 15 minutes. I

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote:
> Hi Yin,
>
> You are right! I just tried the scala version with the above lines, it
> works as expected.
> I'm not sure if it happens

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Yin,
You are right! I just tried the Scala version with the above lines, and it works as expected. I'm not sure if it also happens in 1.4 for pyspark, but I thought the pyspark code just calls the Scala code via py4j. I didn't expect this bug to be pyspark-specific. That surprises me actually a

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is that df.rdd does not work very well with limit. In Scala, df.limit(1).rdd will also trigger the issue you observed. I will add this to the JIRA.
On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote:
> I just noticed you found 1.4 has the same issue. I
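For anyone who wants to check this on their own data, here is a minimal sketch that times the actions mentioned in this thread side by side. It assumes an already loaded pyspark DataFrame; the helper names time_action and compare_first_row are hypothetical and not part of Spark, and limit(1).collect() is included only as an assumed baseline that does not go through df.rdd. It makes no claim about which call will be slow on a given version.

```python
import time


def time_action(label, action):
    """Run a zero-argument callable once; return (label, seconds, result)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return label, elapsed, result


def compare_first_row(df):
    """Time several ways of fetching one row from `df`.

    `df` is assumed to be a pyspark.sql.DataFrame, e.g. the result of
    sqlContext.read.parquet(...). Each action runs exactly once, so the
    numbers are rough and may be affected by caching or cluster load.
    """
    return [
        time_action("head()", lambda: df.head()),
        time_action("take(1)", lambda: df.take(1)),
        # Assumed baseline: limit without the DataFrame-to-RDD conversion.
        time_action("limit(1).collect()", lambda: df.limit(1).collect()),
        # The pattern named above as triggering the issue.
        time_action("limit(1).rdd.collect()", lambda: df.limit(1).rdd.collect()),
    ]


def print_timings(timings):
    """Print one line per timed action."""
    for label, seconds, _ in timings:
        print("%-24s %.2f s" % (label, seconds))
```

Usage would look like print_timings(compare_first_row(df)) after loading df as in the original report. If the limit(1).rdd variant is much slower than limit(1).collect(), that would be consistent with the diagnosis above.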

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue.
On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 have the same problem?
On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
>
>> Hi

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry,
Looks like it is a Python-specific issue. Can you create a JIRA?
Thanks,
Yin
On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprised by the
> execution plan of take(1),