[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)
val testData = TestHive.sparkContext.parallelize(
  (1 to 10).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

// initialize the table
sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL
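For reference, a minimal sketch (assuming the TestHive context from Spark's Hive test harness, as in the steps above, and that those steps have already been run) of how the wrong count shows up; the expected value of 10 matches the ten rows inserted from testData:

import org.apache.spark.sql.hive.test.TestHive._

// Every row of partition ds='1' was rewritten with key1 = 'test', so this count
// should be 10. With the bug, key1 is read back as NULL for the old partition
// and the query returns 0 instead.
val nonNullKey1 = sql(
  "select count(1) from table_with_partition where key1 is not NULL"
).collect().head.getLong(0)
println(s"rows with non-NULL key1: $nonNullKey1 (expected 10)")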

  was:
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems are already solved (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value is NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)
val testData = TestHive.sparkContext.parallelize(
  (1 to 10).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")

sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL

  was:
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)
val testData = TestHive.sparkContext.parallelize(
  (1 to 10).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

// initialize the table
sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
-
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL

  was:
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems are already solved (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value is NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)
val testData = TestHive.sparkContext.parallelize(
  (1 to 10).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")

sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL

  was:
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems are already solved (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value is NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)
val testData = TestHive.sparkContext.parallelize(
  (1 to 10).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]
 
result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]



 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if we add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
-
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL

  was:
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
-
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Summary: [SPARK-SQL]when the partition schema does not match table 
schema(ADD COLUMN), new column value is NULL  (was: [SPARK-SQL]when the 
partition schema does not match table schema(ADD COLUMN), new column is NULL)

 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu







[jira] [Created] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column is NULL

2015-03-31 Thread dongxu (JIRA)
dongxu created SPARK-6644:
-

 Summary: [SPARK-SQL]when the partition schema does not match table 
schema(ADD COLUMN), new column is NULL
 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems are already solved (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value is NULL.

[Steps to reproduce]:

case class TestData(key: Int, value: String)
val testData = TestHive.sparkContext.parallelize(
  (1 to 10).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
result:
[1,1,null,null,1]
[2,2,null,null,1]
 
result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]







[jira] [Updated] (SPARK-6644) [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), new column value is NULL

2015-03-31 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-6644:
--
Description: 
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
-
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL

  was:
In Hive, the schema of a partition may differ from the table schema, for example after a new column is added. When we use spark-sql to query data from a partition whose schema differs from the table schema, some problems have already been solved by PR 4289 (https://github.com/apache/spark/pull/4289), but if you add a new column and then put new data into the old partition, the new column's value still comes back as NULL.

[Steps to reproduce]:
--
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize(
  (1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) " +
  s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value FROM testData")

// add columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') SELECT key, value, 'test', 1.11 FROM testData")

sql("select * from table_with_partition where ds='1'").collect().foreach(println)
 
-
result:
[1,1,null,null,1]
[2,2,null,null,1]

result we expect:
[1,1,test,1.11,1]
[2,2,test,1.11,1]

This bug also produces wrong query results, for example when we run:

select count(1) from table_with_partition where key1 is not NULL


 [SPARK-SQL]when the partition schema does not match table schema(ADD COLUMN), 
 new column value is NULL
 --

 Key: SPARK-6644
 URL: https://issues.apache.org/jira/browse/SPARK-6644
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu


[jira] [Updated] (SPARK-5616) Add examples for PySpark API

2015-02-08 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-5616:
--
Description: 
There are fewer PySpark API examples than Spark Scala API examples. For example:

1. Broadcast: how to use the broadcast API.
2. Module: how to import another Python file packaged in a zip file.

Add more examples for newcomers who want to use PySpark.

  was:
There are fewer PySpark API examples than Spark Scala API examples. For example:

1. Broadcast: how to use the broadcast API.
2. Module: how to import another Python file packaged in a zip file.

Add more examples for newcomers who want to use PySpark.


 Add examples for PySpark API
 

 Key: SPARK-5616
 URL: https://issues.apache.org/jira/browse/SPARK-5616
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: dongxu
Priority: Minor
  Labels: examples, pyspark, python







[jira] [Created] (SPARK-5616) Add examples for PySpark API

2015-02-05 Thread dongxu (JIRA)
dongxu created SPARK-5616:
-

 Summary: Add examples for PySpark API
 Key: SPARK-5616
 URL: https://issues.apache.org/jira/browse/SPARK-5616
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: dongxu
 Fix For: 1.3.0


There are fewer PySpark API examples than Spark Scala API examples. For example:

1. Broadcast: how to use the broadcast API.
2. Module: how to import another Python file packaged in a zip file.

Add more examples for newcomers who want to use PySpark.






[jira] [Updated] (SPARK-5527) Add standalone document configiration to explain how to make cluster conf file consistency

2015-02-02 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-5527:
--
Description: 
We must keep the conf files on all nodes consistent when we start our standalone cluster. For example, we set SPARK_WORKER_INSTANCES=2 to start 2 workers on each machine.
I see this code in $SPARK_HOME/sbin/spark-daemon.sh:

    if [ "$SPARK_MASTER" != "" ]; then
      echo "rsync from $SPARK_MASTER"
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' \
        --exclude='contrib/hod/logs/*' "$SPARK_MASTER/" "$SPARK_HOME"
    fi

I think we had better mention this in the documentation.

  was:
We need to keep the conf files on all nodes consistent when we start our standalone cluster. For example, we set SPARK_WORKER_INSTANCES=2 to start 2 workers on each machine.
I see this code in $SPARK_HOME/sbin/spark-daemon.sh:

    if [ "$SPARK_MASTER" != "" ]; then
      echo "rsync from $SPARK_MASTER"
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' \
        --exclude='contrib/hod/logs/*' "$SPARK_MASTER/" "$SPARK_HOME"
    fi

I think we had better mention this in the documentation.


 Add standalone document configiration to explain  how to make cluster conf 
 file consistency
 ---

 Key: SPARK-5527
 URL: https://issues.apache.org/jira/browse/SPARK-5527
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.0, 1.3.0
Reporter: dongxu
Priority: Minor
  Labels: documentation, starter
   Original Estimate: 10m
  Remaining Estimate: 10m







[jira] [Updated] (SPARK-5527) Improvements to standalone doc. - how to make cluster conf file consistency

2015-02-02 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-5527:
--
Summary: Improvements to standalone doc. - how to make cluster conf file 
consistency  (was: Add standalone document configiration to explain  how to 
make cluster conf file consistency)

 Improvements to standalone doc. - how to make cluster conf file consistency
 ---

 Key: SPARK-5527
 URL: https://issues.apache.org/jira/browse/SPARK-5527
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.0, 1.3.0
Reporter: dongxu
Priority: Minor
  Labels: documentation, starter
   Original Estimate: 10m
  Remaining Estimate: 10m







[jira] [Updated] (SPARK-5527) Improvements to standalone doc. - how to sync cluster conf file

2015-02-02 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-5527:
--
Summary: Improvements to standalone doc. - how to sync cluster conf file  
(was: Improvements to standalone doc. - how to make cluster conf file 
consistency)

 Improvements to standalone doc. - how to sync cluster conf file
 ---

 Key: SPARK-5527
 URL: https://issues.apache.org/jira/browse/SPARK-5527
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.0, 1.3.0
Reporter: dongxu
Priority: Minor
  Labels: documentation, starter
   Original Estimate: 10m
  Remaining Estimate: 10m







[jira] [Created] (SPARK-5527) Add standalone document configiration to explain how to make cluster conf file consistency

2015-02-02 Thread dongxu (JIRA)
dongxu created SPARK-5527:
-

 Summary: Add standalone document configiration to explain  how to 
make cluster conf file consistency
 Key: SPARK-5527
 URL: https://issues.apache.org/jira/browse/SPARK-5527
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.0, 1.3.0
Reporter: dongxu
Priority: Minor


We need to keep the conf files on all nodes consistent when we start our standalone cluster. For example, we set SPARK_WORKER_INSTANCES=2 to start 2 workers on each machine.
I see this code in $SPARK_HOME/sbin/spark-daemon.sh:

    if [ "$SPARK_MASTER" != "" ]; then
      echo "rsync from $SPARK_MASTER"
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' \
        --exclude='contrib/hod/logs/*' "$SPARK_MASTER/" "$SPARK_HOME"
    fi

I think we had better mention this in the documentation.






[jira] [Created] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)

2014-11-02 Thread dongxu (JIRA)
dongxu created SPARK-4201:
-

 Summary: Can't use concat() on partition column in where condition 
(Hive compatibility problem)
 Key: SPARK-4201
 URL: https://issues.apache.org/jira/browse/SPARK-4201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.0.0
 Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1
Reporter: dongxu
Priority: Minor


The team used Hive for queries; we are trying to move them to spark-sql.
When I run a statement like:
select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';
it does not work, although the same statement works well in Hive.
I have to rewrite the SQL as:
select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29.
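For reference, a minimal sketch of the failing statement and the rewritten workaround as spark-sql calls, assuming a HiveContext built on an existing SparkContext and that the partitioned table gulfstream_day_driver_base_2 already exists; this is illustrative only, not the exact code the team runs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("concat-on-partition-column"))
val hiveContext = new HiveContext(sc)

// Fails in spark-sql with the "Task not serializable" error shown below,
// although the same statement works in Hive:
// hiveContext.sql("select count(1) from gulfstream_day_driver_base_2 " +
//   "where concat(year,month,day) = '20140929'").collect()

// Rewritten workaround that filters on the partition columns directly:
hiveContext.sql("select count(1) from gulfstream_day_driver_base_2 " +
  "where year = 2014 and month = 09 and day = 29").collect().foreach(println)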
Here is the error log:
14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from  
gulfstream_day_driver_base_2 where  concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
None), 
Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
 = 20140929))

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
None), 
Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
 = 20140929))

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
None), 
Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
 = 20140929))

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
at 
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
at 
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at 

[jira] [Updated] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)

2014-11-02 Thread dongxu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dongxu updated SPARK-4201:
--
Description: 
The team used Hive for queries; we are trying to move them to spark-sql.
When I run a statement like:
select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';
it does not work, although the same statement works well in Hive.
I have to rewrite the SQL as:
select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29.
Here is the error log:
14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from  
gulfstream_day_driver_base_2 where  concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
None), 
Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
 = 20140929))

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
None), 
Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
 = 20140929))

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, 
None), 
Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341)
 = 20140929))

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
at 
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
at 
org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)