Ryan Blue created SPARK-14543:
---------------------------------

             Summary: SQL/Hive insertInto has unexpected results
                 Key: SPARK-14543
                 URL: https://issues.apache.org/jira/browse/SPARK-14543
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Ryan Blue


The Hive write path adds a pre-insertion cast (projection) to reconcile the 
incoming data columns with the target table's schema. Columns are matched by 
position, and casts are inserted wherever the two column types differ.
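
Roughly, the positional matching looks like this (a plain-Scala illustration, 
not Spark's actual code; the schemas are borrowed from the example further 
down):

{code}
// Hypothetical illustration: incoming columns are paired with table columns
// by position, and a cast to the table column's type is inserted for each pair.
val incomingCols = Seq("id", "count", "total")                    // data column order
val tableSchema  = Seq("id" -> "bigint", "total" -> "bigint", "count" -> "int")

val projection = incomingCols.zip(tableSchema).map { case (in, (out, tpe)) =>
  s"CAST($in AS $tpe) AS $out"
}
// projection: Seq(CAST(id AS bigint) AS id,
//                 CAST(count AS bigint) AS total,   <-- count lands in total
//                 CAST(total AS int) AS count)      <-- total lands in count
{code}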

When the columns aren't correctly aligned, this produces unexpected results. I 
ran into it through an incorrect {{partitionBy}} call (addressed by 
SPARK-14459), which surfaced as an error saying an int could not be cast to an 
array. However, if the misaligned columns are loosely compatible, for example 
string and float, then no error or warning is produced and the data is written 
to the wrong columns through unexpected cast chains (string -> bigint -> float).
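
A minimal reproduction sketch (assuming {{sqlContext}} is a HiveContext; the 
table and column names here are made up):

{code}
// The incoming DataFrame's columns are in a different order than the table,
// but insertInto matches by position, so the values land in the wrong columns
// with no error or warning.
sqlContext.sql("CREATE TABLE t (name string, price float)")

val df = sqlContext.createDataFrame(Seq((1.5f, "widget"))).toDF("price", "name")
df.write.insertInto("t")   // "widget" is cast to float (null), 1.5 becomes the name
{code}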

A real-world use case that will hit this is when a table definition changes by 
adding a column in the middle of the table. Spark SQL statements that copy from 
that table to a destination table will then map the columns differently, but 
the inserted casts mask the problem. The last column's data is dropped without 
a reliable warning for the user.
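
For example (a sketch using made-up table names, with the DDL issued through 
{{sqlContext.sql}}):

{code}
// dst was created to match src, then src gained a column in the middle
sqlContext.sql("CREATE TABLE dst (id bigint, count int)")
sqlContext.sql("CREATE TABLE src (id bigint, flag int, count int)")  // flag added later

// Matched by position: src.flag is cast into dst.count, and src.count,
// the last column, is silently dropped.
sqlContext.sql("INSERT INTO TABLE dst SELECT * FROM src")
{code}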

This highlights a few problems:
* Too many or too few incoming data columns should cause an AnalysisException 
to be thrown
* Only "safe" casts should be inserted automatically, like int -> long, using 
UpCast
* Pre-insertion casts currently ignore extra columns because they pair columns 
with zip (see the sketch after this list)
* The pre-insertion cast logic differs between Hive's MetastoreRelation and 
LogicalRelation
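
For the zip problem specifically, the truncation is easy to see in isolation 
(plain Scala, illustrative column names):

{code}
// zip pairs elements by position and silently drops anything left over,
// which is how an extra incoming column can disappear without any warning
val incoming = Seq("id", "flag", "count")   // 3 incoming data columns
val table    = Seq("id", "count")           // 2 table columns
incoming.zip(table)
// => Seq((id,id), (flag,count))  -- the incoming "count" column is dropped
{code}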

Also, I think there should be an option to match input data to output columns 
by name. The API allows operations on whole tables, which hides the column 
resolution problem: it's easy to copy from one table to another without listing 
the columns, and in the API it is common to work with columns by name rather 
than by position. Matching columns by name is closer to what users expect, so I 
propose adding something like this:

{code}
CREATE TABLE src (id bigint, count int, total bigint)
CREATE TABLE dst (id bigint, total bigint, count int)

sqlContext.table("src").write.byName.insertInto("dst")
{code}
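
Until something like {{byName}} exists, one workaround sketch (assuming the 
destination table already exists) is to reorder the incoming columns to match 
the destination schema by name before inserting:

{code}
import org.apache.spark.sql.functions.col

// Align the source columns to dst's column order by name, then insert by position
val dstCols = sqlContext.table("dst").columns            // names in table order
sqlContext.table("src")
  .select(dstCols.map(col): _*)
  .write.insertInto("dst")
{code}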



