Hi Spark Developers,
I just ran some very simple operations on a dataset. I was surprised by the
execution plan of take(1), head() or first().
For your reference, this is what I did in pyspark 1.5:
df=sqlContext.read.parquet("someparquetfiles")
df.head()
The above lines take over 15 minutes. I
Seems 1.4 has the same issue.
On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
> btw, does 1.4 have the same problem?
>
> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
>
>> Hi Jerry,
>>
>> Looks like it is a Python-specific issue. Can you create
Hi Yin,
You are right! I just tried the Scala version of the above lines, and it
works as expected.
I'm not sure if it also happens in 1.4 for PySpark, but I thought the
PySpark code just calls the Scala code via py4j. I didn't expect this
bug to be PySpark-specific. That surprises me actually a
I just noticed you found 1.4 has the same issue. I added that as well in
the ticket.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote:
> Hi Yin,
>
> You are right! I just tried the Scala version of the above lines, and it
> works as expected.
> I'm not sure if it happens
Looks like the problem is that df.rdd does not work well with limit. In
Scala, df.limit(1).rdd will also trigger the issue you observed. I will add
this to the JIRA.
On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote:
> I just noticed you found 1.4 has the same issue. I
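For anyone who wants to check this on their own data, here is a minimal sketch (not taken from the JIRA) that times a few ways of fetching a single row. It assumes `df` is a PySpark DataFrame such as the one read from "someparquetfiles" above. The helper names `time_action` and `compare_first_row_paths` are hypothetical, and `df.limit(1).rdd.collect()` is only my guess at a Python analogue of the Scala `df.limit(1).rdd` case.

```python
import time


def time_action(label, action):
    """Run a zero-argument callable and report its wall-clock time."""
    start = time.time()
    result = action()
    elapsed = time.time() - start
    print("%-28s %.2f s" % (label, elapsed))
    return result, elapsed


def compare_first_row_paths(df):
    """Time several ways of fetching one row from `df`.

    `df` is assumed to be a pyspark.sql.DataFrame, e.g. the result of
    sqlContext.read.parquet("someparquetfiles") from earlier in the thread.
    """
    paths = [
        ("df.head()", lambda: df.head()),
        ("df.limit(1).collect()", lambda: df.limit(1).collect()),
        # Assumed Python analogue of the Scala df.limit(1).rdd case.
        ("df.limit(1).rdd.collect()", lambda: df.limit(1).rdd.collect()),
    ]
    timings = {}
    for label, action in paths:
        _, timings[label] = time_action(label, action)
    return timings
```

Running `compare_first_row_paths(df)` in the same pyspark shell, together with `df.limit(1).explain(True)` to print the plans, may help show which path is slow. I have not verified the timings myself.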
Hi Jerry,
Looks like it is a Python-specific issue. Can you create a JIRA?
Thanks,
Yin
On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
> Hi Spark Developers,
>
> I just ran some very simple operations on a dataset. I was surprised by the
> execution plan of take(1),
btw, does 1.4 have the same problem?
On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote:
> Hi Jerry,
>
> Looks like it is a Python-specific issue. Can you create a JIRA?
>
> Thanks,
>
> Yin
>
> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote:
>
>> Hi