[jira] [Updated] (SPARK-6644) [SPARK-SQL] when the partition schema does not match the table schema (ADD COLUMN), the new column value is NULL

[ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dongxu updated SPARK-6644:
--------------------------
Description:

In Hive, the schema of a partition may differ from the table schema, for example after we add a new column. When we use Spark SQL to query a partition whose schema differs from the table schema, some problems were already solved by PR 4289 (https://github.com/apache/spark/pull/4289). However, if we add a new column and then put new data into the old partition, the new column's value is read back as NULL. Steps to reproduce:

    case class TestData(key: Int, value: String)
    val testData = TestHive.sparkContext.parallelize(
      (1 to 2).map(i => TestData(i, i.toString))).toDF()
    testData.registerTempTable("testData")

    sql("DROP TABLE IF EXISTS table_with_partition")
    sql("CREATE TABLE IF NOT EXISTS table_with_partition (key int, value string) " +
      s"PARTITIONED BY (ds string) LOCATION '${tmpDir.toURI.toString}'")
    sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') " +
      "SELECT key, value FROM testData")

    // add columns to the table
    sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 string)")
    sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng double)")
    sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') " +
      "SELECT key, value, 'test', 1.11 FROM testData")

    sql("SELECT * FROM table_with_partition WHERE ds='1'").collect().foreach(println)

Result:

    [1,1,null,null,1]
    [2,2,null,null,1]

Result we expect:

    [1,1,test,1.11,1]
    [2,2,test,1.11,1]

This bug also causes wrong counts when we query the new column, for example:

    select count(1) from table_with_partition where key1 is not NULL

Key: SPARK-6644
URL: https://issues.apache.org/jira/browse/SPARK-6644
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: dongxu
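To make the wrong count concrete, here is a minimal sketch of the check, assuming the reproduction above has just run; the expected and observed values are inferred from the result rows shown, and this snippet is not from the original report:

    // The second INSERT OVERWRITE writes key1 = 'test' for both rows,
    // so the count should be 2; with the bug the column reads back as
    // NULL and the count comes out as 0.
    val nonNullKey1 =
      sql("SELECT count(1) FROM table_with_partition WHERE key1 IS NOT NULL")
        .collect().head.getLong(0)
    println(nonNullKey1) // expected: 2, observed with the bug: 0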
[jira] [Updated] (SPARK-5616) Add examples for PySpark API

[ https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dongxu updated SPARK-5616:
--------------------------
Description:

There are fewer PySpark API examples than Spark Scala API examples. For example:
1. Broadcast: how to use the broadcast API.
2. Modules: how to import another Python file packaged in a zip file.
Add more examples for newcomers who want to use PySpark.

Key: SPARK-5616
URL: https://issues.apache.org/jira/browse/SPARK-5616
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: dongxu
Priority: Minor
Labels: examples, pyspark, python
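For reference, a minimal sketch of broadcast usage, shown in Scala and mirroring what the requested PySpark example would demonstrate; the lookup map and RDD contents are made up for illustration:

    // Ship a read-only lookup table to every executor once,
    // instead of serializing it with each task closure.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val resolved = sc.parallelize(Seq("a", "b", "c"))
      .map(k => lookup.value.getOrElse(k, 0))
    println(resolved.collect().mkString(",")) // prints 1,2,0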
[jira] [Updated] (SPARK-5527) Improvements to standalone doc. - how to sync cluster conf file

[ https://issues.apache.org/jira/browse/SPARK-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dongxu updated SPARK-5527:
--------------------------

Key: SPARK-5527
URL: https://issues.apache.org/jira/browse/SPARK-5527
Project: Spark
Issue Type: Documentation
Components: Documentation
Affects Versions: 1.2.0, 1.3.0
Reporter: dongxu
Priority: Minor
Labels: documentation, starter
Original Estimate: 10m
Remaining Estimate: 10m

We must keep the conf files consistent across all nodes when we start a standalone cluster. For example, we set SPARK_WORKER_INSTANCES=2 to start 2 workers on each machine. I see this code in $SPARK_HOME/sbin/spark-daemon.sh (quoting restored here; the mail stripped the double quotes):

    if [ "$SPARK_MASTER" != "" ]; then
      echo rsync from "$SPARK_MASTER"
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' \
        --exclude='contrib/hod/logs/*' "$SPARK_MASTER/" "$SPARK_HOME"
    fi

I think we had better mention this in the documentation.
[jira] [Updated] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)

[ https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dongxu updated SPARK-4201:
--------------------------
Description:

The team used Hive for queries, and we are trying to move them to Spark SQL. When I run a statement like

    select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';

it fails, although it works well in Hive. I have to rewrite the SQL as

    select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29

Here is the error log:

14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
        at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
        at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
        at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
        at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
        ... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
        at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
        at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
        at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
        ... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
        at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
        at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
        at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)

Key: SPARK-4201
URL: https://issues.apache.org/jira/browse/SPARK-4201
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.0, 1.1.0
Environment: Hive 0.12 + Hadoop 2.4/Hadoop 2.2 + Spark 1.1
Reporter: dongxu
Priority: Minor
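A minimal sketch of both queries from spark-shell, for anyone reproducing this. It assumes an existing SparkContext named sc and the table from the report; the snippet is not from the original report, and the workaround quotes the partition values as strings:

    // Assumed setup: sc is the SparkContext from spark-shell.
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)

    // Fails on Spark 1.0/1.1 with "Task not serializable"; per the plan in
    // the log, the concat() predicate over partition columns is pushed into
    // the HiveTableScan as a Hive UDF, which cannot be serialized.
    hc.sql("SELECT count(1) FROM gulfstream_day_driver_base_2 " +
      "WHERE concat(year, month, day) = '20140929'").collect()

    // Workaround from the report: filter each partition column directly.
    hc.sql("SELECT count(1) FROM gulfstream_day_driver_base_2 " +
      "WHERE year = '2014' AND month = '09' AND day = '29'").collect()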