Re: Dataframe's .drop in PySpark doesn't accept Column

Olivier Girardot Sun, 31 May 2015 09:58:02 -0700

I understand the rationale, but when you need to reference a column whose
name is not unique, for example after a join, the API can be confusing.
However, I figured out that you can use a "qualified" name for the column
with the *other-dataframe.column_name* syntax, so maybe we just need to
document this well...
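
To make the ambiguity concrete, here is a minimal pure-Python sketch. It is a toy model and not Spark code: the joined schema is represented as (qualifier, name) pairs, the helper functions are hypothetical, and the `premium` and `score` columns are invented for illustration.

```python
# Toy model only: this is NOT Spark's API. A joined schema is modelled as a
# list of (qualifier, column_name) pairs so that duplicate names can coexist.

def drop_by_name(schema, name):
    """Drop every column called `name`, whichever frame it came from."""
    return [(q, n) for (q, n) in schema if n != name]


def drop_qualified(schema, qualifier, name):
    """Drop only the column `name` that belongs to `qualifier`."""
    return [(q, n) for (q, n) in schema if (q, n) != (qualifier, name)]


# Hypothetical result of joining df with only_the_best on pol_no.
joined = [
    ("df", "pol_no"),
    ("df", "premium"),
    ("only_the_best", "pol_no"),
    ("only_the_best", "score"),
]

# An unqualified name cannot tell the two pol_no columns apart.
unqualified = drop_by_name(joined, "pol_no")

# A qualified reference removes exactly one of them.
qualified = drop_qualified(joined, "only_the_best", "pol_no")
```

In this model the unqualified drop removes both `pol_no` columns, while the qualified one keeps `df.pol_no`, which is the behaviour I would expect from a qualified reference.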



On Sun, 31 May 2015 at 12:18, 范文臣 <cloud0...@163.com> wrote:

> `Column` in `DataFrame` is a general concept: `field1` is a column,
> `field + 1` is a column, and `field1 < field2` is also a column. An API like
> `select` should accept `Column` because it needs general expressions. But
> `drop` can only drop existing columns, which are not general expressions. So
> I think it makes sense for `drop` to accept only a String column name.
>
>
>
> At 2015-05-31 02:41:52, "Reynold Xin" <r...@databricks.com> wrote:
>
> Name resolution is not that easy, I think. Wenchen can maybe give you some
> advice on how resolution should work for this one.
>
>
> On Sat, May 30, 2015 at 9:37 AM, Yijie Shen <henry.yijies...@gmail.com>
> wrote:
>
>> I think it is enough to match the Column’s expr as an UnresolvedAttribute
>> and use the UnresolvedAttribute’s name to match the schema’s field name.
>>
>> There seems to be no need to treat expr as anything more general. :)
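
A rough pure-Python sketch of the matching idea above, under loud assumptions: the class names only echo Catalyst's, `col` and `drop` are hypothetical helpers, and the schema is a plain list of field names. It is not how Spark implements anything.

```python
# Toy model only: the class names echo Catalyst's, but none of this is Spark code.

class Expr:
    """Base class for the toy expression tree."""


class UnresolvedAttribute(Expr):
    """A bare reference to a column by name, e.g. field1."""
    def __init__(self, name):
        self.name = name


class Add(Expr):
    """A derived expression such as field1 + 1."""
    def __init__(self, left, right):
        self.left, self.right = left, right


class Column:
    """Wraps any expression, reflecting that a Column is a general concept."""
    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        return Column(Add(self.expr, other))


def col(name):
    """Hypothetical helper: build a Column that is a plain attribute reference."""
    return Column(UnresolvedAttribute(name))


def drop(schema, column):
    """Return schema without the column; accept a name or a plain-attribute Column."""
    if isinstance(column, str):
        name = column
    elif isinstance(column, Column) and isinstance(column.expr, UnresolvedAttribute):
        name = column.expr.name
    else:
        raise TypeError("drop() needs a column name or a plain column reference")
    return [field for field in schema if field != name]


schema = ["pol_no", "premium"]
remaining = drop(schema, col("pol_no"))   # plain attribute: accepted
try:
    drop(schema, col("pol_no") + 1)       # general expression: rejected
    rejected = False
except TypeError:
    rejected = True
```

The sketch matches on the bare name only, so it sidesteps qualified names entirely, which is the harder resolution question raised elsewhere in this thread.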
>>
>> On May 30, 2015 at 11:14:05 PM, Girardot Olivier (
>> o.girar...@lateral-thoughts.com) wrote:
>>
>> Jira done : https://issues.apache.org/jira/browse/SPARK-7969
>> I've already started working on it, but it's less trivial than it seems
>> because I don't know exactly how the catalog works internally,
>> or how to get the qualified name of a column to match it against the
>> schema/catalog.
>>
>> Regards,
>>
>> Olivier.
>>
>> On Sat, 30 May 2015 at 09:54, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yea would be great to support a Column. Can you create a JIRA, and
>>> possibly a pull request?
>>>
>>>
>>> On Fri, May 29, 2015 at 2:45 AM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Actually, the Scala API is also based only on column names.
>>>>
>>>> On Fri, 29 May 2015 at 11:23, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi,
>>>>> Testing 1.4 a bit more, it seems that the .drop() method in PySpark
>>>>> doesn't accept a Column as input:
>>>>>
>>>>>
>>>>>     .join(only_the_best, only_the_best.pol_no == df.pol_no, "inner").drop(only_the_best.pol_no)\
>>>>>   File "/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py", line 1225, in drop
>>>>>     jdf = self._jdf.drop(colName)
>>>>>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 523, in __call__
>>>>>     (new_args, temp_args) = self._get_args(args)
>>>>>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 510, in _get_args
>>>>>     temp_arg = converter.convert(arg, self.gateway_client)
>>>>>   File "/usr/local/lib/python2.7/site-packages/py4j/java_collections.py", line 490, in convert
>>>>>     for key in object.keys():
>>>>> TypeError: 'Column' object is not callable
>>>>>
>>>>> This doesn't seem very consistent with the rest of the API, and it is
>>>>> especially annoying when executing joins, because drop("my_key") is not a
>>>>> qualified reference to the column.
>>>>>
>>>>> What do you think about changing that? Or what is the best practice
>>>>> as a workaround?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>>>>
>>>>
>>>
>
