Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
Nice, will test it out +1


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
We just made the repo public: https://github.com/databricks/spark-pandas

Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Timothee Hunter
To add more details to what Reynold mentioned: as you said, there are going
to be some slight differences between Pandas and Spark in any case, simply
because Spark needs to know the return types of the functions. In your case,
you would need to slightly refactor your apply method as follows (in Python 3)
to add type hints:

```
def f(x) -> float: return x * 3.0
df['col3'] = df['col1'].apply(f)
```

This has the benefit of keeping your code fully compliant with both pandas
and pyspark. We will share more information in the future.
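
As a rough sketch of what such annotation handling could look like with today's PySpark API (the helper name `apply_typed` and the type mapping below are illustrative assumptions, not the actual spark-pandas API):

```
import typing

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Illustrative mapping from Python annotations to Spark SQL types.
_PY_TO_SPARK = {float: T.DoubleType(), int: T.LongType(), str: T.StringType()}

def apply_typed(col, func):
    # Read the function's return annotation and use it as the UDF return type.
    return_type = _PY_TO_SPARK[typing.get_type_hints(func)['return']]
    return F.udf(func, return_type)(col)

# def f(x) -> float: return x * 3.0
# df = df.withColumn('col3', apply_typed(df['col1'], f))
```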

Tim


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Hyukjin Kwon
BTW, I am working on documentation related to this subject at
https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences.


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
We have some early stuff there, but it's not quite ready to talk about in
public yet (I hope soon, though). I will shoot you a separate email about it.


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
Thanks for the reply, Reynold. Has this shim project started?
I'd love to contribute to it, as I have started making a bunch of helper
functions to do something similar for my current task and would prefer not
to do it in isolation.
I was considering making a git repo and pushing my work there just this
morning, but if folks are already working on it, I'd prefer to collaborate.

Note: I'm not recommending we make the logical plan mutable (I am scared of
that too!). I think there are other ways of handling that, but we can go
into the details later.


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
We have been thinking about some of these issues. Some of them are harder
to do, e.g. Spark DataFrames are fundamentally immutable, and making the
logical plan mutable is a significant deviation from the current paradigm
that might confuse the hell out of some users. We are considering building
a shim layer as a separate project on top of Spark (so we can make rapid
releases based on feedback) just to test this out and see how well it could
work in practice.
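
To make that concrete, here is a minimal sketch, purely as an assumption of what such a shim could do, of Pandas-style column assignment that rebinds a new immutable DataFrame instead of mutating the logical plan (the class name `PandasLikeFrame` is illustrative):

```
class PandasLikeFrame:
    """Thin wrapper over an immutable pyspark.sql.DataFrame (sketch only)."""

    def __init__(self, sdf):
        self._sdf = sdf

    def __getitem__(self, key):
        return self._sdf[key]  # a Column, as in PySpark today

    def __setitem__(self, name, col):
        # No plan mutation: build a new DataFrame and rebind the reference.
        self._sdf = self._sdf.withColumn(name, col)

    def to_spark(self):
        return self._sdf

# pdf = PandasLikeFrame(spark_df)
# pdf['col3'] = pdf['col1'] * 3.0   # rebinds, rather than mutates, the plan
```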


PySpark syntax vs Pandas syntax

2019-03-26 Thread Abdeali Kothari
Hi,
I was doing some Spark to Pandas (and vice versa) conversion because some
of the Pandas code we have doesn't work on huge data, and some Spark code
runs very slowly on small data.

It was nice to see that PySpark has syntax similar to the common Pandas
operations that the Python community is used to.

GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
Column selects: df[['col1', 'col2']]
Row Filters: df[df['col1'] < 3.0]
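
As a quick sketch for anyone who wants to try these equivalences locally (assuming a local SparkSession named `spark`; the sample data is illustrative):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').getOrCreate()
df = spark.createDataFrame([(1.0, 'a'), (2.0, 'b'), (4.0, 'a')],
                           ['col1', 'col2'])

df.groupby(['col2']).agg({'col2': 'count'}).show()  # GroupBy agg
df[['col1', 'col2']].show()                         # column select
df[df['col1'] < 3.0].show()                         # row filter
```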

I was wondering about a bunch of other Pandas functions that seem common,
and thought there must have been a discussion about them in the community,
hence this thread.

I was wondering whether there has been discussion on adding the following
functions:

*Column setters*:
In Pandas:
df['col3'] = df['col1'] * 3.0
While I do the following in PySpark:
df = df.withColumn('col3', df['col1'] * 3.0)

*Column apply()*:
In Pandas:
df['col3'] = df['col1'].apply(lambda x: x * 3.0)
While I do the following in PySpark:
df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))

I understand that this one cannot be as simple as in Pandas because of the
output type that's needed here, but it could be done like:
df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
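
In the meantime, something close to this proposed syntax can be approximated with a small helper on today's API (the name `apply_with_type` is purely illustrative):

```
from pyspark.sql import functions as F

def apply_with_type(col, func, return_type):
    # Wrap func in a UDF with an explicit return type and apply it to col.
    return F.udf(func, return_type)(col)

# df = df.withColumn('col3',
#                    apply_with_type(df['col1'], lambda x: x * 3.0, 'float'))
```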

The multi-column case in Pandas is:
df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
Maybe this could be done in PySpark as follows, or, if we could pass a
pyspark.sql.Row directly, it would look similar (?):
df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
'float')
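
For comparison, the closest equivalent with today's PySpark API would be passing both columns to a UDF, roughly:

```
from pyspark.sql import functions as F

f = F.udf(lambda col1, col2: col1 * 3.0, 'float')
df = df.withColumn('col3', f(df['col1'], df['col2']))
```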

*Rename*:
In Pandas:
df.rename(columns={...})
While I do the following in PySpark:
df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
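
A small helper sketch that mimics Pandas' rename(columns=...) using withColumnRenamed (the helper name `rename_columns` is illustrative):

```
def rename_columns(df, columns):
    # Apply each old -> new mapping; unknown names are silently ignored,
    # matching Pandas' default errors='ignore' behaviour.
    for old, new in columns.items():
        df = df.withColumnRenamed(old, new)
    return df

# df = rename_columns(df, {'col2': 'col3'})
```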

*To Dictionary*:
In Pandas:
df.to_dict(orient='list')
While I do the following in PySpark:
{f.name: [row[i] for row in df.collect()] for i, f in
enumerate(df.schema.fields)}
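
And a sketch of a to_dict(orient='list')-style helper (it collects everything to the driver, so it is only for small data; `to_dict_of_lists` is an illustrative name):

```
def to_dict_of_lists(df):
    rows = df.collect()  # brings all data to the driver
    return {name: [row[i] for row in rows]
            for i, name in enumerate(df.columns)}

# result = to_dict_of_lists(df)
```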

I thought I'd start the discussion with these and come back to some of the
others I see that could be helpful.

*Note* (with the column functions in mind): I understand that a DataFrame
cannot be modified, and I am not suggesting we change that or any underlying
principle; I am just trying to add syntactic sugar here.