[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-04-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema. For 
example, we may add new columns to the table after importing existing 
partitions. When using {{spark-sql}} to query the data in a partition whose 
schema is different from the table schema, problems may arise. Part of them 
have been solved in [PR #4289|https://github.com/apache/spark/pull/4289]. 
However, after adding new column(s) to the table, when inserting data into old 
partitions, values of newly added columns are all {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}

This bug also causes wrong aggregate results: for example, {{SELECT COUNT(1) FROM table_with_partition WHERE key1 IS NOT NULL}} returns 0 instead of 2.
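For intuition only (this is not Spark's actual code path), one plausible way to picture the failure: the partition {{ds='1'}} still carries the schema it had before the {{ALTER TABLE ... ADD COLUMNS}} statements, so writes and reads against it effectively use the old two-column layout and the table-level columns {{key1}} and {{destlng}} come back as NULL. A minimal sketch of that padding behaviour in plain Python (all names and the {{read_partition}} helper are illustrative, not Spark APIs):

```python
# Illustrative simulation of an old partition whose recorded schema
# predates ALTER TABLE ... ADD COLUMNS. Names are hypothetical.
table_schema = ["key", "value", "key1", "destlng", "ds"]
partition_schema = ["key", "value", "ds"]  # schema recorded when ds='1' was created

# Rows as supplied by the second INSERT OVERWRITE: (key, value, key1, destlng)
inserted = [(1, "1", "test", 1.11), (2, "2", "test", 1.11)]

def read_partition(rows, ds):
    # Only columns the *partition* schema knows about survive the round trip,
    # so key1/destlng are lost and read back as NULL (None in this sketch).
    out = []
    for row in rows:
        by_name = dict(zip(["key", "value", "key1", "destlng"], row))
        kept = {c: by_name[c] for c in partition_schema if c != "ds"}
        kept["ds"] = ds
        out.append([kept.get(c) for c in table_schema])
    return out

for r in read_partition(inserted, "1"):
    print(r)  # e.g. [1, '1', None, None, '1']
```

This reproduces the shape of the actual result above: every table column missing from the partition's recorded schema surfaces as NULL.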



 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


