btw, does 1.4 have the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Spark Developers,
>>
>> I just ran some very simple operations on a dataset. I was surprised by
>> the execution plan of take(1), head(), or first().
>>
>> For your reference, this is what I did in pyspark 1.5:
>>
>> df = sqlContext.read.parquet("someparquetfiles")
>> df.head()
>>
>> The above lines take over 15 minutes. I was frustrated because I can do
>> better without using Spark :) Since I like Spark, I tried to figure out
>> why. It seems the DataFrame requires 3 stages to give me the first row. It
>> reads all the data (which is about 1 billion rows) and runs Limit twice.
>>
>> show(1) runs much faster than head(). Not to mention that if I do:
>>
>> df.rdd.take(1)  # runs much faster
>>
>> Is this expected? Why are head/first/take so slow for DataFrames? Is it a
>> bug in the optimizer, or did I do something wrong?
>>
>> Best Regards,
>>
>> Jerry