[ https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-3519.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.0

Issue resolved by pull request 2383
[https://github.com/apache/spark/pull/2383]

> PySpark RDDs are missing the distinct(n) method
> -----------------------------------------------
>
>                 Key: SPARK-3519
>                 URL: https://issues.apache.org/jira/browse/SPARK-3519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>            Assignee: Matthew Farrellee
>             Fix For: 1.2.0
>
>
> {{distinct()}} works but {{distinct(N)}} doesn't.
> {code}
> >>> sc.parallelize([1,1,2]).distinct()
> PythonRDD[15] at RDD at PythonRDD.scala:43
> >>> sc.parallelize([1,1,2]).distinct(2)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: distinct() takes exactly 1 argument (2 given)
> {code}
> The PySpark docs only call out [the {{distinct()}} signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
> but the programming guide [includes the {{distinct(N)}} signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
> as well.
> {quote}
> {noformat}
> distinct([numTasks]))   Return a new dataset that contains the distinct
> elements of the source dataset.
> {noformat}
> {quote}
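
For readers unfamiliar with the optional argument in the distinct([numTasks]) signature quoted above, here is a minimal in-memory sketch of the behaviour the programming guide describes. It is plain Python, not PySpark and not the code from pull request 2383. The function name `distinct_sketch` and the parameter name `num_partitions` are hypothetical, chosen only for illustration. The sketch assumes equal elements are routed to the same output partition by hash, so each partition can deduplicate on its own.

```python
def distinct_sketch(items, num_partitions=None):
    """Toy model of distinct([numTasks]); hypothetical helper, not a PySpark API."""
    # Assumption: with no argument, fall back to a single output partition.
    if num_partitions is None:
        num_partitions = 1
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")

    # One dict per output partition; dict keys keep first-seen order.
    partitions = [dict() for _ in range(num_partitions)]
    for x in items:
        # Equal elements hash alike, so duplicates always meet in one partition.
        partitions[hash(x) % num_partitions].setdefault(x, None)

    # Each partition now holds only distinct elements.
    return [list(p) for p in partitions]


# Mirrors the two calls in the report: no argument, then an explicit count of 2.
one_part = distinct_sketch([1, 1, 2])
two_parts = distinct_sketch([1, 1, 2], 2)
```

In this model the argument changes only how the distinct elements are spread across output partitions; the set of elements returned is the same either way.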