Cheolsoo Park created SPARK-8908:
------------------------------------

             Summary: Calling distinct() with parentheses throws error in Scala DataFrame
                 Key: SPARK-8908
                 URL: https://issues.apache.org/jira/browse/SPARK-8908
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0, 1.5.0
            Reporter: Cheolsoo Park
            Priority: Minor


To reproduce, call {{distinct()}} on a DataFrame in spark-shell. For example:
{code}
scala> sqlContext.table("my_table").distinct()

<console>:19: error: not enough arguments for method apply: (colName: String)org.apache.spark.sql.Column in class DataFrame.
Unspecified value parameter colName.
{code}
This is confusing because {{distinct}} in DataFrame is an alias of 
{{dropDuplicates}}, and both {{dropDuplicates}} and {{dropDuplicates()}} work.

Here is a summary:
||Scala code||Works||
|DF.distinct|Y|
|DF.distinct()|N|
|DF.dropDuplicates|Y|
|DF.dropDuplicates()|Y|

Looking at the definition of {{distinct}}, it is declared without {{()}}:
{code}
override def distinct: DataFrame = dropDuplicates()
{code}
As a result, what seems to be happening is the following:
{code}
distinct()
=> dropDuplicates()()
=> DataFrame() // because dropDuplicates() returns DF
=> DataFrame.apply() // fails because apply() takes a column parameter
{code}
I can verify that adding {{()}} to the definition makes both {{distinct}} and 
{{distinct()}} work.
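
The behavior described above can be reproduced without Spark. The following is a minimal sketch, not Spark's actual code: {{MiniFrame}}, {{MiniColumn}}, {{distinctFixed}} and {{DistinctParensDemo}} are hypothetical stand-ins invented for illustration. It assumes only what the report states, namely that {{distinct}} is declared without a parameter list and that the class has an {{apply(colName: String)}} method.

```scala
// Hypothetical stand-ins for Spark's Column and DataFrame; not the real classes.
final class MiniColumn(val name: String)

final class MiniFrame(val rows: Seq[Int]) {
  // Mirrors the shape of DataFrame.apply(colName): select a column by name.
  def apply(colName: String): MiniColumn = new MiniColumn(colName)

  def dropDuplicates(): MiniFrame = new MiniFrame(rows.distinct)

  // Declared without a parameter list, as in the reported definition.
  def distinct: MiniFrame = dropDuplicates()

  // Declared with an empty parameter list, as in the proposed fix.
  def distinctFixed(): MiniFrame = dropDuplicates()
}

object DistinctParensDemo {
  def main(args: Array[String]): Unit = {
    val df = new MiniFrame(Seq(1, 1, 2))

    // An argument list written after `distinct` is passed to apply on the
    // result, so this is the same as df.distinct.apply("id").
    val col: MiniColumn = df.distinct("id")
    println(col.name)

    // For the same reason, df.distinct() means df.distinct.apply() and does
    // not compile ("not enough arguments for method apply"), so it stays
    // commented out here.
    // df.distinct()

    // Without parentheses the parameterless method is simply invoked.
    println(df.distinct.rows)

    // With an empty parameter list in the definition, the call with
    // parentheses resolves to the method itself.
    println(df.distinctFixed().rows)
  }
}
```

The line {{df.distinct("id")}} compiling and returning a column is the clearest sign that the parentheses are being applied to the returned object rather than to {{distinct}} itself.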


