We have some early stuff there, but it's not quite ready to talk about in public yet (I hope soon though). I'll shoot you a separate email on it.
On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:

> Thanks for the reply Reynold - has this shim project started?
> I'd love to contribute to it, as it looks like I have started making a
> bunch of helper functions to do something similar for my current task and
> would prefer not doing it in isolation.
> I was considering making a git repo and pushing stuff there just this
> morning, but if there are already folks working on it I'd prefer
> collaborating.
>
> Note - I'm not recommending we make the logical plan mutable (as I am
> scared of that too!). I think there are other ways of handling that - but
> we can go into details later.
>
> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>
>> We have been thinking about some of these issues. Some of them are harder
>> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
>> logical plan mutable is a significant deviation from the current paradigm
>> that might confuse the hell out of some users. We are considering building
>> a shim layer as a separate project on top of Spark (so we can make rapid
>> releases based on feedback) just to test this out and see how well it
>> could work in practice.
>>
>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <abdealikoth...@gmail.com> wrote:
>>
>>> Hi,
>>> I was doing some Spark to pandas (and vice versa) conversion because
>>> some of the pandas code we have doesn't work on huge data, and some
>>> Spark code runs very slowly on small data.
>>>
>>> It was nice to see that PySpark has syntax similar to the common pandas
>>> operations the Python community is used to:
>>>
>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>> Column selects: df[['col1', 'col2']]
>>> Row filters: df[df['col1'] < 3.0]
>>>
>>> I was wondering about a bunch of other functions in pandas that seem
>>> common, and thought there must have been a discussion about them in the
>>> community - hence this thread.
>>>
>>> Has there been any discussion on adding the following functions?
>>>
>>> *Column setters*:
>>> In pandas:
>>> df['col3'] = df['col1'] * 3.0
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>
>>> *Column apply()*:
>>> In pandas:
>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>>>
>>> I understand that this one cannot be as simple as in pandas because of
>>> the output type that's needed here. But it could be done like:
>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>
>>> The multi-column case in pandas is:
>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>> Maybe this could be done in PySpark as follows, or, if we could pass a
>>> pyspark.sql.Row directly, it would look similar (?):
>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0), 'float')
>>>
>>> *Rename*:
>>> In pandas:
>>> df.rename(columns={...})
>>> While I do the following in PySpark:
>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>
>>> *To Dictionary*:
>>> In pandas:
>>> df.to_dict(orient='list')
>>> While I do the following in PySpark:
>>> {f.name: [row[i] for row in df.collect()] for i, f in
>>>  enumerate(df.schema.fields)}
>>>
>>> I thought I'd start the discussion with these and come back to some of
>>> the others I see that could be helpful.
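For the last two items, a minimal sketch of what such sugar could look like
as plain helper functions, so nothing on the DataFrame itself has to change.
The names rename_columns and to_dict_of_lists are hypothetical, not an
existing PySpark API:

    def rename_columns(df, columns):
        # pandas: df.rename(columns={'col2': 'col3'})
        # Builds a new DataFrame; the original is untouched.
        return df.toDF(*[columns.get(c, c) for c in df.columns])

    def to_dict_of_lists(df):
        # pandas: df.to_dict(orient='list')
        # Collects to the driver once, so only sensible for small results.
        rows = df.collect()
        return {f.name: [row[i] for row in rows]
                for i, f in enumerate(df.schema.fields)}

    renamed = rename_columns(df, {'col2': 'col3'})
    as_dict = to_dict_of_lists(renamed)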
>>>
>>> *Note*: (with the column functions in mind) I understand that the
>>> DataFrame itself cannot be modified, and I am not suggesting we change
>>> that or any underlying principle. I'm just trying to add syntactic
>>> sugar here.
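Taking the note above at face value, here is a minimal sketch of how the
column-setter and apply() sugar could be layered on top of Spark without
touching immutability. The class name PandasLikeFrame and the method
apply_column are hypothetical, not an existing PySpark API:

    from pyspark.sql import functions as F

    class PandasLikeFrame(object):
        """Thin wrapper around an immutable pyspark.sql.DataFrame.

        'Mutation' only rebinds self._df to a new DataFrame; the underlying
        logical plan is never modified in place.
        """

        def __init__(self, df):
            self._df = df

        def __getitem__(self, key):
            # Delegates to existing PySpark behaviour: df[['col1', 'col2']]
            # selects, df[bool_expr] filters, df['col1'] returns a Column.
            return self._df[key]

        def __setitem__(self, name, col):
            # df['col3'] = df['col1'] * 3.0  ->  withColumn under the hood.
            self._df = self._df.withColumn(name, col)

        def apply_column(self, name, func, return_type, *cols):
            # df.apply_column('col3', lambda x: x * 3.0, 'float', 'col1')
            # wraps F.udf with the explicit return type discussed above.
            udf = F.udf(func, return_type)
            self._df = self._df.withColumn(
                name, udf(*[self._df[c] for c in cols]))

        def to_spark(self):
            return self._df

Whether __setitem__ rebinds an internal reference (as sketched here) or
returns a new wrapper is exactly the kind of question a separate shim
project could iterate on quickly, as suggested earlier in the thread.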