BTW, I am working on documentation related to this subject at https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences.
On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin <r...@databricks.com> wrote:

> We have some early stuff there but not quite ready to talk about it in
> public yet (I hope soon though). Will shoot you a separate email on it.
>
> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>
>> Thanks for the reply Reynold - has this shim project started?
>> I'd love to contribute to it, as it looks like I have started making a
>> bunch of helper functions to do something similar for my current task
>> and would prefer not doing it in isolation.
>> I was considering making a git repo and pushing stuff there just this
>> morning, but if there are already folks working on it, I'd prefer to
>> collaborate.
>>
>> Note: I'm not recommending we make the logical plan mutable (as I am
>> scared of that too!). I think there are other ways of handling that, but
>> we can go into details later.
>>
>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> We have been thinking about some of these issues. Some of them are
>>> harder to do; e.g., Spark DataFrames are fundamentally immutable, and
>>> making the logical plan mutable would be a significant deviation from
>>> the current paradigm that might confuse the hell out of some users. We
>>> are considering building a shim layer as a separate project on top of
>>> Spark (so we can make rapid releases based on feedback) just to test
>>> this out and see how well it could work in practice.
>>>
>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I was doing some Spark to pandas (and vice versa) conversion because
>>>> some of the pandas code we have doesn't work on huge data, and some
>>>> Spark code runs very slowly on small data.
>>>>
>>>> It was nice to see that PySpark has syntax similar to the common
>>>> pandas operations that the Python community is used to:
>>>>
>>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>>> Column selects: df[['col1', 'col2']]
>>>> Row filters: df[df['col1'] < 3.0]
>>>>
>>>> I was wondering about a bunch of other functions in pandas that seem
>>>> common, and I figured there must have been a discussion about them in
>>>> the community - hence this thread.
>>>>
>>>> Has there been any discussion on adding the following functions?
>>>>
>>>> *Column setters*:
>>>> In pandas:
>>>> df['col3'] = df['col1'] * 3.0
>>>> While I do the following in PySpark:
>>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>>
>>>> *Column apply()*:
>>>> In pandas:
>>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>>> While I do the following in PySpark:
>>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>>>
>>>> I understand that this one cannot be as simple as in pandas, due to
>>>> the output type that's needed here.
>>>> But it could be done like:
>>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>>
>>>> The multi-column version in pandas is:
>>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>>> Maybe this can be done in PySpark as below, or, if we can pass a
>>>> pyspark.sql.Row directly, it would be similar (?):
>>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>>>
>>>> *Rename*:
>>>> In pandas:
>>>> df.rename(columns={...})
>>>> While I do the following in PySpark:
>>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>>
>>>> *To dictionary*:
>>>> In pandas:
>>>> df.to_dict(orient='list')
>>>> While I do the following in PySpark:
>>>> {f.name: [row[i] for row in df.collect()]
>>>>  for i, f in enumerate(df.schema.fields)}
>>>>
>>>> I thought I'd start the discussion with these and come back to some of
>>>> the others I see that could be helpful.
>>>>
>>>> *Note* (with the column functions in mind): I understand that the
>>>> DataFrame cannot be modified, and I am not suggesting we change that
>>>> or any underlying principle - I'm just trying to add syntactic sugar
>>>> here.
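
For readers following the thread, here is a minimal, self-contained sketch
that collects the workarounds discussed above in their current PySpark form.
The SparkSession setup, sample data, and column names (col1, col2, col3, ...)
are illustrative assumptions, not from the thread:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.master('local[*]').getOrCreate()
    df = spark.createDataFrame([(1.0, 'a'), (2.0, 'b'), (5.0, 'a')],
                               ['col1', 'col2'])

    # Column setter: pandas df['col3'] = df['col1'] * 3.0
    df = df.withColumn('col3', df['col1'] * 3.0)

    # Column apply(): Python floats map to Spark's DoubleType,
    # hence the 'double' return type on the UDF.
    df = df.withColumn('col4', F.udf(lambda x: x * 3.0, 'double')(df['col1']))

    # Multi-column apply: each input column becomes one UDF argument.
    df = df.withColumn('col5',
                       F.udf(lambda a, b: a + b, 'double')(df['col1'], df['col3']))

    # Bulk rename: pandas df.rename(columns={'col2': 'colB'})
    df = df.toDF(*[{'col2': 'colB'}.get(c, c) for c in df.columns])

    # To dictionary: pandas df.to_dict(orient='list')
    as_dict = {f.name: [row[f.name] for row in df.collect()]
               for f in df.schema.fields}

For single-column renames PySpark already offers
df.withColumnRenamed('col2', 'colB'), and when the data fits in driver
memory, df.toPandas().to_dict(orient='list') covers the last case directly.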