We have some early stuff there, but it's not quite ready to talk about in public yet (I hope soon though). I'll shoot you a separate email on it.
On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:

> Thanks for the reply Reynold - has this shim project started?
> I'd love to contribute to it, as it looks like I have started making a
> bunch of helper functions to do something similar for my current task and
> would prefer not doing it in isolation.
> I was considering making a git repo and pushing stuff there just this
> morning, but if there are already folks working on it I'd prefer
> collaborating.
>
> Note - I'm not recommending we make the logical plan mutable (as I am
> scared of that too!). I think there are other ways of handling that - but
> we can go into details later.
>
> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>
>> We have been thinking about some of these issues. Some of them are harder
>> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
>> logical plan mutable is a significant deviation from the current paradigm
>> that might confuse the hell out of some users. We are considering building
>> a shim layer as a separate project on top of Spark (so we can make rapid
>> releases based on feedback) just to test this out and see how well it
>> could work in practice.
>>
>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>
>>> Hi,
>>> I was doing some Spark to pandas (and vice versa) conversion because
>>> some of the pandas code we have doesn't work on huge data, and some
>>> Spark code runs very slowly on small data.
>>>
>>> It was nice to see that PySpark has syntax similar to the common pandas
>>> operations the Python community is used to:
>>>
>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>> Column selects: df[['col1', 'col2']]
>>> Row filters: df[df['col1'] < 3.0]
>>>
>>> I was wondering about a bunch of other functions in pandas that seem
>>> common, and thought there must have been a discussion about them in the
>>> community - hence this thread.
>>>
>>> Has there been any discussion on adding the following functions?
>>>
>>> *Column setters*:
>>> In pandas:
>>> df['col3'] = df['col1'] * 3.0
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>
>>> *Column apply()*:
>>> In pandas:
>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>>
>>> I understand that this one cannot be as simple as in pandas because of
>>> the output type that's needed here. But it could be done like:
>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>
>>> The multi-column case in pandas is:
>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>> Maybe this could be done in PySpark as follows, or, if we could pass a
>>> pyspark.sql.Row directly, it would look similar (?):
>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>>
>>> *Rename*:
>>> In pandas:
>>> df.rename(columns={...})
>>> While I do the following in PySpark:
>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>
>>> *To Dictionary*:
>>> In pandas:
>>> df.to_dict(orient='list')
>>> While I do the following in PySpark:
>>> {f.name: [row[i] for row in df.collect()] for i, f in
>>>  enumerate(df.schema.fields)}
>>>
>>> I thought I'd start the discussion with these and come back to some of
>>> the others I see that could be helpful.
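For the last two items, a minimal sketch of what such sugar could look like
as plain helper functions, so nothing on the DataFrame itself has to change.
The names rename_columns and to_dict_of_lists are hypothetical, not an
existing PySpark API:

    def rename_columns(df, columns):
        # pandas: df.rename(columns={'col2': 'col3'})
        # Builds a new DataFrame; the original is untouched.
        return df.toDF(*[columns.get(c, c) for c in df.columns])

    def to_dict_of_lists(df):
        # pandas: df.to_dict(orient='list')
        # Collects to the driver once, so only sensible for small results.
        rows = df.collect()
        return {f.name: [row[i] for row in rows]
                for i, f in enumerate(df.schema.fields)}

    renamed = rename_columns(df, {'col2': 'col3'})
    as_dict = to_dict_of_lists(renamed)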
>>>
>>> *Note*: (with the column functions in mind) I understand that the
>>> DataFrame itself cannot be modified, and I am not suggesting we change
>>> that or any underlying principle. I'm just trying to add syntactic
>>> sugar here.
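Taking the note above at face value, here is a minimal sketch of how the
column-setter and apply() sugar could be layered on top of Spark without
touching immutability. The class name PandasLikeFrame and the method
apply_column are hypothetical, not an existing PySpark API:

    from pyspark.sql import functions as F

    class PandasLikeFrame(object):
        """Thin wrapper around an immutable pyspark.sql.DataFrame.

        'Mutation' only rebinds self._df to a new DataFrame; the underlying
        logical plan is never modified in place.
        """

        def __init__(self, df):
            self._df = df

        def __getitem__(self, key):
            # Delegates to existing PySpark behaviour: df[['col1', 'col2']]
            # selects, df[bool_expr] filters, df['col1'] returns a Column.
            return self._df[key]

        def __setitem__(self, name, col):
            # df['col3'] = df['col1'] * 3.0  ->  withColumn under the hood.
            self._df = self._df.withColumn(name, col)

        def apply_column(self, name, func, return_type, *cols):
            # df.apply_column('col3', lambda x: x * 3.0, 'float', 'col1')
            # wraps F.udf with the explicit return type discussed above.
            udf = F.udf(func, return_type)
            self._df = self._df.withColumn(
                name, udf(*[self._df[c] for c in cols]))

        def to_spark(self):
            return self._df

Whether __setitem__ rebinds an internal reference (as sketched here) or
returns a new wrapper is exactly the kind of question a separate shim
project could iterate on quickly, as suggested earlier in the thread.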