BTW, I am working on documentation related to this subject at https://issues.apache.org/jira/browse/SPARK-26022 to describe the differences.
On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin <r...@databricks.com> wrote:

> We have some early stuff there but not quite ready to talk about it in
> public yet (I hope soon though). Will shoot you a separate email on it.
>
> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>
>> Thanks for the reply Reynold - has this shim project started?
>> I'd love to contribute to it, as it looks like I have started making a
>> bunch of helper functions to do something similar for my current task
>> and would prefer not doing it in isolation.
>> I was considering making a git repo and pushing stuff there just this
>> morning, but if there are already folks working on it, I'd prefer to
>> collaborate.
>>
>> Note: I'm not recommending we make the logical plan mutable (as I am
>> scared of that too!). I think there are other ways of handling that, but
>> we can go into details later.
>>
>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> We have been thinking about some of these issues. Some of them are
>>> harder to do; e.g., Spark DataFrames are fundamentally immutable, and
>>> making the logical plan mutable would be a significant deviation from
>>> the current paradigm that might confuse the hell out of some users. We
>>> are considering building a shim layer as a separate project on top of
>>> Spark (so we can make rapid releases based on feedback) just to test
>>> this out and see how well it could work in practice.
>>>
>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I was doing some Spark to pandas (and vice versa) conversion because
>>>> some of the pandas code we have doesn't work on huge data, and some
>>>> Spark code runs very slowly on small data.
>>>>
>>>> It was nice to see that PySpark has syntax similar to the common
>>>> pandas operations that the Python community is used to:
>>>>
>>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>>> Column selects: df[['col1', 'col2']]
>>>> Row filters: df[df['col1'] < 3.0]
>>>>
>>>> I was wondering about a bunch of other functions in pandas that seem
>>>> common, and I figured there must have been a discussion about them in
>>>> the community - hence this thread.
>>>>
>>>> Has there been any discussion on adding the following functions?
>>>>
>>>> *Column setters*:
>>>> In pandas:
>>>> df['col3'] = df['col1'] * 3.0
>>>> While I do the following in PySpark:
>>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>>
>>>> *Column apply()*:
>>>> In pandas:
>>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>>> While I do the following in PySpark:
>>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>>>
>>>> I understand that this one cannot be as simple as in pandas, due to
>>>> the output type that's needed here.
>>>> But it could be done like:
>>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>>
>>>> The multi-column version in pandas is:
>>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>>> Maybe this can be done in PySpark as below, or, if we can pass a
>>>> pyspark.sql.Row directly, it would be similar (?):
>>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>>>
>>>> *Rename*:
>>>> In pandas:
>>>> df.rename(columns={...})
>>>> While I do the following in PySpark:
>>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>>
>>>> *To dictionary*:
>>>> In pandas:
>>>> df.to_dict(orient='list')
>>>> While I do the following in PySpark:
>>>> {f.name: [row[i] for row in df.collect()]
>>>>  for i, f in enumerate(df.schema.fields)}
>>>>
>>>> I thought I'd start the discussion with these and come back to some of
>>>> the others I see that could be helpful.
>>>>
>>>> *Note* (with the column functions in mind): I understand that the
>>>> DataFrame cannot be modified, and I am not suggesting we change that
>>>> or any underlying principle - I'm just trying to add syntactic sugar
>>>> here.
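
For readers following the thread, here is a minimal, self-contained sketch
that collects the workarounds discussed above in their current PySpark form.
The SparkSession setup, sample data, and column names (col1, col2, col3, ...)
are illustrative assumptions, not from the thread:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.master('local[*]').getOrCreate()
    df = spark.createDataFrame([(1.0, 'a'), (2.0, 'b'), (5.0, 'a')],
                               ['col1', 'col2'])

    # Column setter: pandas df['col3'] = df['col1'] * 3.0
    df = df.withColumn('col3', df['col1'] * 3.0)

    # Column apply(): Python floats map to Spark's DoubleType,
    # hence the 'double' return type on the UDF.
    df = df.withColumn('col4', F.udf(lambda x: x * 3.0, 'double')(df['col1']))

    # Multi-column apply: each input column becomes one UDF argument.
    df = df.withColumn('col5',
                       F.udf(lambda a, b: a + b, 'double')(df['col1'], df['col3']))

    # Bulk rename: pandas df.rename(columns={'col2': 'colB'})
    df = df.toDF(*[{'col2': 'colB'}.get(c, c) for c in df.columns])

    # To dictionary: pandas df.to_dict(orient='list')
    as_dict = {f.name: [row[f.name] for row in df.collect()]
               for f in df.schema.fields}

For single-column renames PySpark already offers
df.withColumnRenamed('col2', 'colB'), and when the data fits in driver
memory, df.toPandas().to_dict(orient='list') covers the last case directly.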