I understand the rationale, but when you need to reference a column whose name is not unique, for example in a join, the API can be confusing. However, I figured out that you can use a "qualified" name for the column with the *other-dataframe.column_name* syntax, so maybe we just need to document this well...
On Sun, May 31, 2015 at 12:18, 范文臣 <cloud0...@163.com> wrote:

> `Column` in `DataFrame` is a general concept. `field1` is a column,
> `field + 1` is a column, `field1 < field2` is also a column. For an API like
> `select`, it should accept `Column` as we need general expressions. But for
> `drop`, we can only drop existing columns, which are not general expressions.
> So I think it makes sense to only allow String in `drop` as the column name.
>
> At 2015-05-31 02:41:52, "Reynold Xin" <r...@databricks.com> wrote:
>
> Name resolution is not as easy, I think. Wenchen can maybe give you some
> advice on resolution about this one.
>
> On Sat, May 30, 2015 at 9:37 AM, Yijie Shen <henry.yijies...@gmail.com>
> wrote:
>
>> I think just matching the Column’s expr as UnresolvedAttribute and using
>> UnresolvedAttribute’s name to match the schema’s field name is enough.
>>
>> Seems no need to regard expr as a more general one. :)
>>
>> On May 30, 2015 at 11:14:05 PM, Girardot Olivier (
>> o.girar...@lateral-thoughts.com) wrote:
>>
>> Jira done: https://issues.apache.org/jira/browse/SPARK-7969
>> I've already started working on it but it's less trivial than it seems
>> because I don't exactly know the inner workings of the catalog,
>> and how to get the qualified name of a column to match it against the
>> schema/catalog.
>>
>> Regards,
>>
>> Olivier.
>>
>> On Sat, May 30, 2015 at 09:54, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yea would be great to support a Column. Can you create a JIRA, and
>>> possibly a pull request?
>>>
>>> On Fri, May 29, 2015 at 2:45 AM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Actually, the Scala API too is only based on column name
>>>>
>>>> On Fri, May 29, 2015 at 11:23, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi,
>>>>> Testing 1.4 a bit more, it seems that the .drop() method in PySpark
>>>>> doesn't accept a Column as input datatype:
>>>>>
>>>>> .join(only_the_best, only_the_best.pol_no == df.pol_no, "inner").drop(only_the_best.pol_no)
>>>>>   File "/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.py", line 1225, in drop
>>>>>     jdf = self._jdf.drop(colName)
>>>>>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 523, in __call__
>>>>>     (new_args, temp_args) = self._get_args(args)
>>>>>   File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 510, in _get_args
>>>>>     temp_arg = converter.convert(arg, self.gateway_client)
>>>>>   File "/usr/local/lib/python2.7/site-packages/py4j/java_collections.py", line 490, in convert
>>>>>     for key in object.keys():
>>>>> TypeError: 'Column' object is not callable
>>>>>
>>>>> It doesn't seem very consistent with the rest of the APIs, and is
>>>>> especially annoying when executing joins, because drop("my_key") is not a
>>>>> qualified reference to the column.
>>>>>
>>>>> What do you think about changing that? Or what is the best practice
>>>>> as a workaround?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.