On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu> wrote:
> I don't know much about Python style, but I think the point Wes made about
> usability on the JIRA is pretty compelling. IMHO the number of methods on a
> Spark DataFrame might not be much larger than on a Pandas DataFrame. Given
> that users seem to be okay with the possibility of collisions in Pandas, I
> think sticking with (1) is not a bad idea.
>

That is true for interactive work, but Spark DataFrames can handle very
large datasets and are likely to be used in production workflows. So I
think it is reasonable for us to care more about compatibility issues
than Pandas does.

> Also, is it possible to detect such collisions in Python? A fourth option
> (4) might be to detect that `df` contains a column named `name` and print
> a warning on `df.name` telling the user that the method is shadowing the
> column.

Maybe we can inspect the stack frame in which `df.name` gets called and
warn users in `df.select(df.name)` but not in `name = df.name`, though
that could be tricky to implement. A simpler collision check is sketched
below.
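
To make (4) concrete, here is a minimal, purely illustrative sketch that
warns whenever an attribute access on a DataFrame also matches a column
name. It relies only on the existing `columns` property; the mixin name
and warning text are made up, and it does not attempt the frame
inspection above, so it would warn in both of the cases mentioned.

import warnings

class ColumnCollisionWarningMixin(object):
    """Hypothetical mixin for pyspark.sql.DataFrame (not code in master)."""

    def __getattribute__(self, name):
        value = object.__getattribute__(self, name)
        # Skip internal names and `columns` itself; the lookup below goes
        # through object.__getattribute__, so it does not re-enter this hook.
        if not name.startswith("_") and name != "columns":
            if name in object.__getattribute__(self, "columns"):
                warnings.warn(
                    "df.%s matches both a DataFrame attribute and a column; "
                    "use df[%r] to reference the column unambiguously."
                    % (name, name), stacklevel=2)
        return value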

-Xiangrui

>
> Thanks
> Shivaram
>
>
> On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Hi all,
>>
>> In PySpark, a DataFrame column can be referenced using df["abcd"]
>> (__getitem__) and df.abcd (__getattr__). There is a discussion on
>> SPARK-7035 on compatibility issues with the __getattr__ approach, and
>> I want to collect more inputs on this.
>>
>> Basically, if in the future we introduce a new method on DataFrame, it
>> may break user code that uses the same attribute to reference a column,
>> or silently change its behavior. For example, if we add `name()` to
>> DataFrame in the next release, all existing code using `df.name` to
>> reference a column called "name" will break. If we add `name` as a
>> property instead of a method, all existing code using `df.name` may
>> still work but with a different meaning: `df.select(df.name)` no longer
>> selects the column called "name" but the column whose name matches the
>> value returned by `df.name`.
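>>
>> A small example of the failure mode (the sample data and the future
>> `name` property are hypothetical; `sqlContext` is the usual one from the
>> PySpark shell):
>>
>> df = sqlContext.createDataFrame([("Alice", 1)], ["name", "age"])
>> df.select(df["name"])  # always the column "name"
>> df.select(df.name)     # today: the column "name"; once DataFrame gains a
>>                        # `name` property, whatever that property points to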
>>
>> There are several proposed solutions:
>>
>> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
>> latter, which is future-proof. This is the current solution in master
>> (https://github.com/apache/spark/pull/5971). But I think users may still
>> be unaware of the compatibility issue and prefer `df.abcd` over
>> `df["abcd"]` because the former can be auto-completed.
>> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
>> JIRA page: "I actually dragged my feet on the __getattr__ issue for
>> several months back in the day, then finally added it (and tab
>> completion in IPython with __dir__), and immediately noticed a huge
>> quality-of-life improvement when using pandas for actual (esp.
>> interactive) work."
>> 3. Replace df.abcd with df.abcd_ (with a trailing "_"). Both df.abcd_ and
>> df["abcd"] would be future-proof, and df.abcd_ could still be
>> auto-completed. The obvious tradeoff is the extra "_" appearing in the
>> code; a rough sketch follows below.
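>>
>> Purely for illustration, (3) could look roughly like this inside
>> DataFrame.__getattr__ (a sketch, not proposed code):
>>
>> def __getattr__(self, name):
>>     # Only trailing-underscore names resolve to columns, so newly added
>>     # DataFrame methods can never shadow a column reference.
>>     if name.endswith("_") and name[:-1] in self.columns:
>>         return self[name[:-1]]
>>     raise AttributeError(name)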
>>
>> My preference is 3 > 1 > 2. Your input would be greatly appreciated.
>> Thanks!
>>
>> Best,
>> Xiangrui
>>
>

