[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888246#comment-16888246 ] Alexander Tronchin-James commented on SPARK-14543: -- OK, thanks! > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888222#comment-16888222 ] Ryan Blue commented on SPARK-14543: --- {{byName}} was never added to Apache Spark. The change was rejected, so it is only available in Netflix's Spark branch. I resolved this with "later" because we are including by-name resolution in the DSv2 work. The replacement for {{DataFrameWriter}} will default to name-based resolution. > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887508#comment-16887508 ] Alexander Tronchin-James commented on SPARK-14543: -- Did the calling syntax change for this? I'm using 2.4.x and can't find anything about .byName on writers in the docs, but maybe I'm just bad at searching the docs... > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829854#comment-15829854 ] Gopal Nagar commented on SPARK-14543: - I am facing similar issue in Spark 1.6.1. Please advise if this issue present in Spark 1.6.1 as well. > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236315#comment-15236315 ] Apache Spark commented on SPARK-14543: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/12313 > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue > > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org