I just noticed you found that 1.4 has the same issue. I've added that to
the ticket as well.

On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Yin,
>
> You are right! I just tried the Scala version with the above lines, and it
> works as expected.
> I'm not sure if it also happens in 1.4 for PySpark, but I thought the
> PySpark code just calls the Scala code via Py4J, so I didn't expect this
> bug to be PySpark-specific. That actually surprises me a bit. I created a
> ticket for this (SPARK-10731
> <https://issues.apache.org/jira/browse/SPARK-10731>).
>
> Best Regards,
>
> Jerry
>
>
> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> btw, does 1.4 have the same problem?
>>
>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>>
>>> Hi Jerry,
>>>
>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>
>>>> Hi Spark Developers,
>>>>
>>>> I just ran some very simple operations on a dataset. I was surprised by
>>>> the execution plan of take(1), head() or first().
>>>>
>>>> For your reference, this is what I did in pyspark 1.5:
>>>> df=sqlContext.read.parquet("someparquetfiles")
>>>> df.head()
>>>>
>>>> The above lines take over 15 minutes. I was frustrated because I can do
>>>> better without using Spark :) Since I like Spark, I tried to figure out
>>>> why. It seems the DataFrame requires 3 stages to give me the first row.
>>>> It reads all the data (about 1 billion rows) and runs Limit twice.
>>>>
>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>> do:
>>>>
>>>> df.rdd.take(1)  # runs much faster
>>>>
>>>> Is this expected? Why are head/first/take so slow for DataFrames? Is it
>>>> a bug in the optimizer, or did I do something wrong?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry
>>>>
>>>
>>>
>>
>
