[ https://issues.apache.org/jira/browse/SPARK-17681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526929#comment-15526929 ]

Josh Rosen commented on SPARK-17681:
------------------------------------

I don't think that the current behavior is wrong. If {{drop()}} behaved as you 
suggest, then I think we would see some odd anomalies when both adding and 
dropping columns. For instance, the following two examples currently return 
equivalent DataFrames:

{code}
scala> val df = Seq((1,2)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.drop("a").drop("b").withColumn("newCol", expr("1")).show()
+------+
|newCol|
+------+
|     1|
+------+

scala> df.withColumn("newCol", expr("1")).drop("a").drop("b").show()
+------+
|newCol|
+------+
|     1|
+------+
{code}

Under your suggested semantics, the first DataFrame would become empty after 
both columns are dropped (collecting it would return zero rows). That would 
mean one of two things: either the two results would differ depending on the 
order of the {{drop}} and {{withColumn}} calls, or the {{withColumn}} call 
would have taken a DataFrame with zero rows and increased its row count 
(which doesn't make sense).
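To make the anomaly concrete, here is a toy model in plain Scala (no Spark needed). {{ToyDF}}, {{dropProposed}}, and their fields are invented for illustration only; they model a DataFrame as a column list plus a row count and compare the current semantics against the proposed ones:

```scala
// Toy model of a DataFrame: column names plus a row count.
// (Illustration only; real Spark DataFrames carry actual rows.)
case class ToyDF(columns: Seq[String], rowCount: Long) {
  // Current Spark semantics: drop never changes the row count.
  def drop(col: String): ToyDF = copy(columns = columns.filterNot(_ == col))

  // Proposed semantics: a DataFrame left with no columns has zero rows.
  def dropProposed(col: String): ToyDF = {
    val remaining = columns.filterNot(_ == col)
    ToyDF(remaining, if (remaining.isEmpty) 0L else rowCount)
  }

  def withColumn(col: String): ToyDF = copy(columns = columns :+ col)
}

val df = ToyDF(Seq("a", "b"), 1L)

// Current semantics: the order of drop/withColumn does not matter.
val current1 = df.drop("a").drop("b").withColumn("newCol")
val current2 = df.withColumn("newCol").drop("a").drop("b")
// Both have rowCount 1.

// Proposed semantics: the two orderings diverge.
val proposed1 = df.dropProposed("a").dropProposed("b").withColumn("newCol")
val proposed2 = df.withColumn("newCol").dropProposed("a").dropProposed("b")
// proposed1 has rowCount 0, proposed2 has rowCount 1.
```

Under the proposed semantics, an operation that only touches columns silently changes the row count, and the result depends on call order.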

If dropping a column doesn't change the number of rows when going from 2 
columns to 1, then for consistency it should also not affect the number of 
rows when going from 1 column to none.
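The same point can be sketched in plain Scala by modeling rows as maps from column name to value ({{dropColumn}} is a hypothetical helper, not a Spark API): dropping a column removes a key from every row, but it never removes a row, so an empty record still counts.

```scala
// One row with two columns, modeled as a map from column name to value.
val rows: Seq[Map[String, Int]] = Seq(Map("a" -> 1, "b" -> 2))

// Dropping a column removes the key from every row; rows themselves survive.
def dropColumn(data: Seq[Map[String, Int]], col: String): Seq[Map[String, Int]] =
  data.map(_ - col)

val oneCol = dropColumn(rows, "a")   // 1 row with a single column "b"
val noCols = dropColumn(oneCol, "b") // 1 row with no columns: an empty record
// The count stays 1 at every step, matching Spark's behavior.
```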

Therefore, I'm inclined to say that this is not an issue, but I'm curious to 
hear if you have a rationale for why this should behave differently.

> Empty DataFrame with non-zero rows after using drop
> ---------------------------------------------------
>
>                 Key: SPARK-17681
>                 URL: https://issues.apache.org/jira/browse/SPARK-17681
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.6.0, 2.0.0
>            Reporter: Ian Hellstrom
>
> It is possible for a {{DataFrame}} with no columns to have a non-zero 
> number of rows, even though its contents are empty:
> {code}
> val df = Seq((1,2)).toDF("a", "b")
> df.drop("a").drop("b").count
> {code}
> The problem is also present in 2.0.0:
> {code}
> import org.apache.spark._
> import org.apache.spark.sql._
> val conf = new SparkConf()
> val sc = new SparkContext("local", "demo", conf)
> val ss = SparkSession.builder.getOrCreate()
> import ss.implicits._
> case class Data(a: Int, b: Int)
> val rdd = sc.parallelize(List(Data(1,2)))
> val ds = ss.createDataset(rdd)
> ds.drop("a").drop("b").count
> {code}
> In both the pre-2.0 and 2.0 releases the returned number is 1 instead of 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
