[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886192#comment-16886192 ]

David Lo commented on SPARK-19477:
----------------------------------

[~cloud_fan] Could you please elaborate on your suggested workaround? `ds.map(equality)` isn't compiling for me. Like the original poster, I am interested in dropping columns that are not defined in a case class when doing a DataFrame-to-Dataset[case class] conversion.

> [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-19477
>                 URL: https://issues.apache.org/jira/browse/SPARK-19477
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Don Drake
>            Priority: Major
>              Labels: bulk-closed
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset.
> For example, in 1.6 the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3: String)
> defined class F
>
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
>
> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
>
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
>
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> +---+---+---+
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3: String)
> defined class F
>
> scala> import spark.implicits._
> import spark.implicits._
>
> scala> val df = Seq(("a","b","c","x"), ("d","e","f","y"), ("h","i","j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
>
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
>
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
>
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
>
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
>
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
>
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
>
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
> {code}

--
This message was sent by Atlassian JIRA (v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
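The select-based workaround that David Lo and Don Drake describe in this thread (deriving the column list from the case-class encoder, then projecting before the conversion) can be sketched as follows. This is a sketch only, assuming the spark-shell session from the issue description (`df` and case class `F` as defined there); `wanted` is an illustrative name:

```scala
import org.apache.spark.sql.{Dataset, Encoders}

// Take the target column list from the case-class encoder's schema...
val wanted = Encoders.product[F].schema.fieldNames.map(df.col)

// ...and project before converting, so the extra column c4 is dropped eagerly
val ds: Dataset[F] = df.select(wanted: _*).as[F]

// ds.schema now matches Encoders.product[F].schema
```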
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862562#comment-15862562 ]

Wenchen Fan commented on SPARK-19477:
-------------------------------------

A simple workaround would be `ds.map(equality)`.
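`equality` is not a Scala standard-library function, which is presumably why the snippet fails to compile for David Lo; the identity-based variant shown elsewhere in this thread uses `scala.Predef.identity`, which does compile. A minimal sketch, assuming the spark-shell session from the issue description (`ds`, case class `F`, and `import spark.implicits._` in scope):

```scala
import org.apache.spark.sql.Dataset

// identity is scala.Predef.identity: A => A.
// The typed map forces each row through F's encoder (deserialize, then
// re-serialize), which discards columns not defined in the case class.
val trimmed: Dataset[F] = ds.map(identity)
```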
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862560#comment-15862560 ]

Don Drake commented on SPARK-19477:
-----------------------------------

How does laziness apply here? If I read/create a DataFrame with extra columns, then do `ds = df.as[XYZ]` and immediately `ds.write.parquet("file")`, the write action should trigger any lazy functionality, if I understand this correctly. Do you have a suggested workaround? I'm currently retrieving the encoder for the case class to get its schema, then calling `ds.select()` on the columns from that schema.
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861918#comment-15861918 ]

Michael Armbrust commented on SPARK-19477:
------------------------------------------

If a lot of people are confused by this being lazy, we can change it (didn't we already change it in 1.6 -> 2.0 in the other direction?). It would have to be configurable though, since removing columns could be a breaking change.
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795 ]

Christophe Préaud commented on SPARK-19477:
-------------------------------------------

As a workaround, mapping the dataset with *identity* fixes the issue:

```scala
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856329#comment-15856329 ]

Wenchen Fan commented on SPARK-19477:
-------------------------------------

I think the most confusing part may be the fact that `Dataset.as[T]` is lazy. In your example, there is a `df` with 4 columns, and when you call `df.as[F]`, it still has 4 columns. But when you do `df.as[F].map(x => x)`, it has 3 columns. Maybe it's more intuitive to make `Dataset.as[T]` eagerly change the underlying plan to adjust the columns? cc [~marmbrus] [~lian cheng]
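The laziness of `Dataset.as[T]` described above can be observed directly. A sketch, continuing the Spark 2.1.0 session from the issue description, with the column lists reflecting the behavior reported in this thread:

```scala
// as[F] alone only attaches the encoder; the analyzed plan still
// produces all four columns
df.as[F].columns              // Array(f1, f2, f3, c4)

// A typed transformation routes every row through F's encoder, so the
// column not defined in the case class disappears
df.as[F].map(x => x).columns  // Array(f1, f2, f3)
```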
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855677#comment-15855677 ]

Song Jun commented on SPARK-19477:
----------------------------------

Thanks, I got it~
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1582#comment-1582 ]

Wenchen Fan commented on SPARK-19477:
-------------------------------------

The type T of a Dataset and the schema of a Dataset should be treated differently, as you can apply both relational operations (select, where, ...) and typed operations (map, filter, ...). When you call `show`, it prints the data in columnar format, so it's more like a relational operation and should respect the schema instead of the type T/encoder.
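The relational-versus-typed distinction above can be made concrete. In the 2.1.0 session from the issue description, the same `ds` answers to both views (a sketch of the reported behavior, not a definitive specification):

```scala
// Relational operation: resolved against ds.schema, where c4 still exists
ds.select("c4").show()

// Typed operation: resolved against T = F via the encoder, so only the
// case-class fields f1/f2/f3 are visible inside the lambda
ds.map(f => (f.f1, f.f3)).show()
```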