[jira] [Assigned] (HUDI-5835) spark cannot read mor table after execute update statement
[ https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng reassigned HUDI-5835: -- Assignee: Tao Meng > spark cannot read mor table after execute update statement > -- > > Key: HUDI-5835 > URL: https://issues.apache.org/jira/browse/HUDI-5835 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.13.0 >Reporter: Tao Meng >Assignee: Tao Meng >Priority: Blocker > > The Avro schema created by Spark SQL is missing the Avro record name and namespace. > This makes the read schema and the write schema of the log file > incompatible. > > {code:java} > spark.sql( >s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long, > | ff decimal(38, 10) > |) using hudi > | location '${tablePath.toString}' > | tblproperties ( > | type = 'mor', > | primaryKey = 'id', > | preCombineField = 'ts' > | ) > """.stripMargin) > spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") > checkAnswer(s"select id, name, price, ts from $tableName")( > Seq(1, "a1", 10.0, 1000) > ) > spark.sql(s"update $tableName set price = 22 where id = 1") > checkAnswer(s"select id, name, price, ts from $tableName")( // this assertion fails > Seq(1, "a1", 22.0, 1000) > ) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
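Why a missing record name and namespace breaks the log-file merge can be seen with plain Avro, outside of Hudi: schema resolution matches records by full name, so a reader schema generated without the proper name/namespace is reported incompatible with the writer schema even when the fields line up. Below is a minimal sketch using Avro's SchemaBuilder and SchemaCompatibility; the record and namespace strings are made up for illustration and are not the ones Hudi actually generates.

{code:scala}
import org.apache.avro.{Schema, SchemaBuilder, SchemaCompatibility}

object AvroNameCompatSketch {
  def main(args: Array[String]): Unit = {
    // Writer schema with an explicit record name and namespace (illustrative values).
    val withName: Schema = SchemaBuilder.record("tbl_record").namespace("hoodie.tbl")
      .fields()
      .requiredInt("id")
      .requiredDouble("price")
      .endRecord()

    // Reader schema built without the proper name/namespace, as described in the issue.
    val withoutName: Schema = SchemaBuilder.record("topLevelRecord")
      .fields()
      .requiredInt("id")
      .requiredDouble("price")
      .endRecord()

    // Avro matches record schemas by full name, so this reports INCOMPATIBLE
    // even though the two field lists are identical.
    val result = SchemaCompatibility.checkReaderWriterCompatibility(withoutName, withName)
    println(result.getType)
  }
}
{code}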
[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement
[ https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-5835: --- Description: avro schema create by sparksql miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible {code:java} // code placeholder spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( failed Seq(1, "a1", 22.0, 1000) ) {code} was: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible {code:java} // code placeholdertest("Test Add Column and Update Table") { withTempDir { tmp => val tableName = generateTableName //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } {code} > spark cannot read mor table after execute update statement > -- > > Key: HUDI-5835 > URL: https://issues.apache.org/jira/browse/HUDI-5835 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.13.0 >Reporter: Tao Meng >Priority: Blocker > > avro schema create by sparksql miss avro name and namespace, > This will lead the read schema and write schema of the log file to be > incompatible > > {code:java} > // code placeholder > spark.sql( >s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long, > | ff decimal(38, 10) > |) using hudi > | location '${tablePath.toString}' > | tblproperties ( > | type = 'mor', > | primaryKey = 'id', > | preCombineField = 'ts' > | ) > """.stripMargin) > spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") > checkAnswer(s"select id, name, price, ts from $tableName")( > Seq(1, "a1", 10.0, 1000) > ) > spark.sql(s"update $tableName set price = 22 where id = 1") > checkAnswer(s"select id, name, price, ts from $tableName")( failed > Seq(1, "a1", 22.0, 1000) > ) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement
[ https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-5835: --- Description: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible {code:java} // code placeholdertest("Test Add Column and Update Table") { withTempDir { tmp => val tableName = generateTableName //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } {code} was: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible {code:java} // code placeholder {code} test("Test Add Column and Update Table") \{ withTempDir { tmp => val tableName = generateTableName //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } > spark cannot read mor table after execute update statement > -- > > Key: HUDI-5835 > URL: https://issues.apache.org/jira/browse/HUDI-5835 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.13.0 >Reporter: Tao Meng >Priority: Blocker > > avro schema create by sparksql will miss avro name and namespace, > This will lead the read schema and write schema of the log file to be > incompatible > > {code:java} > // code placeholdertest("Test Add Column and Update Table") { withTempDir { > tmp => val tableName = generateTableName //spark.sql("SET > 
hoodie.datasource.read.extract.partition.values.from.path=true") val > tablePath = new Path(tmp.getCanonicalPath, tableName) // create table > spark.sql( s""" |create table $tableName ( | id int, | name string, | price > double, | ts long, | ff decimal(38, 10) |) using hudi | location > '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = > 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table > spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") > checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", > 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") > checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", > 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") > checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, > "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName > set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, > ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement
[ https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-5835: --- Description: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible {code:java} // code placeholder {code} test("Test Add Column and Update Table") \{ withTempDir { tmp => val tableName = generateTableName //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } was: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible test("Test Add Column and Update Table") { withTempDir { tmp => val tableName = generateTableName //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } > spark cannot read mor table after execute update statement > -- > > Key: HUDI-5835 > URL: https://issues.apache.org/jira/browse/HUDI-5835 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.13.0 >Reporter: Tao Meng >Priority: Blocker > > avro schema create by sparksql will miss avro name and namespace, > This will lead the read schema and write schema of the log file to be > incompatible > > {code:java} > // code placeholder > {code} > test("Test Add Column and Update Table") \{ withTempDir { tmp => val > tableName = generateTableName //spark.sql("SET > 
hoodie.datasource.read.extract.partition.values.from.path=true") val > tablePath = new Path(tmp.getCanonicalPath, tableName) // create table > spark.sql( s""" |create table $tableName ( | id int, | name string, | price > double, | ts long, | ff decimal(38, 10) |) using hudi | location > '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = > 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table > spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") > checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", > 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") > checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", > 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") > checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, > "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName > set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, > ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement
[ https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-5835: --- Description: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible test("Test Add Column and Update Table") { withTempDir { tmp => val tableName = generateTableName //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } } was: avro schema create by sparksql will miss avro name and namespace, This will lead the read schema and write schema of the log file to be incompatible ``` test("Test upsert table") { withTempDir { tmp => val tableName = generateTableName val tablePath = new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long, | ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 22.0, 1000) ) } } ``` > spark cannot read mor table after execute update statement > -- > > Key: HUDI-5835 > URL: https://issues.apache.org/jira/browse/HUDI-5835 > Project: Apache Hudi > Issue Type: Bug > Components: spark >Affects Versions: 0.13.0 >Reporter: Tao Meng >Priority: Blocker > > avro schema create by sparksql will miss avro name and namespace, > This will lead the read schema and write schema of the log file to be > incompatible > > test("Test Add Column and Update Table") { > withTempDir { tmp => > val tableName = generateTableName > //spark.sql("SET > hoodie.datasource.read.extract.partition.values.from.path=true") > val tablePath = new Path(tmp.getCanonicalPath, tableName) > // create table > spark.sql( > s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long, > | ff decimal(38, 10) > |) using hudi > | location '${tablePath.toString}' > | tblproperties ( > | type = 'mor', > | primaryKey = 'id', > | preCombineField = 'ts' > | ) > """.stripMargin) > // insert data to table > spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") > checkAnswer(s"select id, name, 
price, ts from $tableName")( > Seq(1, "a1", 10.0, 1000) > ) > spark.sql(s"update $tableName set price = 22 where id = 1") > checkAnswer(s"select id, name, price, ts from $tableName")( > Seq(1, "a1", 22.0, 1000) > ) > spark.sql(s"alter table $tableName add column new_col1 int") > checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( > Seq(1, "a1", 22.0, 1000, null) > ) > // update and check > spark.sql(s"update $tableName set price = price * 2 where id = 1") > checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( > Seq(1, "a1", 44.0, 1000, null) > ) > } > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5835) spark cannot read mor table after execute update statement
Tao Meng created HUDI-5835: -- Summary: spark cannot read mor table after execute update statement Key: HUDI-5835 URL: https://issues.apache.org/jira/browse/HUDI-5835 Project: Apache Hudi Issue Type: Bug Components: spark Affects Versions: 0.13.0 Reporter: Tao Meng The Avro schema created by Spark SQL is missing the Avro record name and namespace, which makes the read schema and the write schema of the log file incompatible.
```
test("Test upsert table") {
  withTempDir { tmp =>
    val tableName = generateTableName
    val tablePath = new Path(tmp.getCanonicalPath, tableName)
    // create table
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  ff decimal(38, 10)
         |) using hudi
         | location '${tablePath.toString}'
         | tblproperties (
         |  type = 'mor',
         |  primaryKey = 'id',
         |  preCombineField = 'ts'
         | )
       """.stripMargin)
    // insert data to table
    spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
    checkAnswer(s"select id, name, price, ts from $tableName")(
      Seq(1, "a1", 10.0, 1000)
    )
    spark.sql(s"update $tableName set price = 22 where id = 1")
    // reading the table fails after the update
    checkAnswer(s"select id, name, price, ts from $tableName")(
      Seq(1, "a1", 22.0, 1000)
    )
  }
}
```
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5294) Support type change for schema on read enable + reconcile schema
Tao Meng created HUDI-5294: -- Summary: Support type change for schema on read enable + reconcile schema Key: HUDI-5294 URL: https://issues.apache.org/jira/browse/HUDI-5294 Project: Apache Hudi Issue Type: Improvement Reporter: Tao Meng Fix For: 0.12.2 https://github.com/apache/hudi/issues/7283 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5194) fix schema evolution bugs
[ https://issues.apache.org/jira/browse/HUDI-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-5194: --- Description: # Fix the bug that history schema files cannot be cleaned by FileBasedInternalSchemaStorageManager. # Fix the bug that schema evolution does not work well in non-batch read mode on Spark 3.1.x. # Optimize the implementation for compaction. > fix schema evolution bugs > - > > Key: HUDI-5194 > URL: https://issues.apache.org/jira/browse/HUDI-5194 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Tao Meng >Priority: Major > > # Fix the bug that history schema files cannot be cleaned by > FileBasedInternalSchemaStorageManager. > # Fix the bug that schema evolution does not work well in non-batch read > mode on Spark 3.1.x. > # Optimize the implementation for compaction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-5194) fix schema evolution bugs
[ https://issues.apache.org/jira/browse/HUDI-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-5194: --- Issue Type: Bug (was: New Feature) > fix schema evolution bugs > - > > Key: HUDI-5194 > URL: https://issues.apache.org/jira/browse/HUDI-5194 > Project: Apache Hudi > Issue Type: Bug >Reporter: Tao Meng >Priority: Major > > # Fix the bug that history schema files cannot be cleaned by > FileBasedInternalSchemaStorageManager. > # Fix the bug that schema evolution does not work well in non-batch read > mode on Spark 3.1.x. > # Optimize the implementation for compaction. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5194) fix schema evolution bugs
Tao Meng created HUDI-5194: -- Summary: fix schema evolution bugs Key: HUDI-5194 URL: https://issues.apache.org/jira/browse/HUDI-5194 Project: Apache Hudi Issue Type: New Feature Reporter: Tao Meng -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5000) Support schema evolution for Hive
[ https://issues.apache.org/jira/browse/HUDI-5000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng reassigned HUDI-5000: -- Assignee: Tao Meng > Support schema evolution for Hive > - > > Key: HUDI-5000 > URL: https://issues.apache.org/jira/browse/HUDI-5000 > Project: Apache Hudi > Issue Type: New Feature > Components: hive >Reporter: Tao Meng >Assignee: Tao Meng >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > Support of schema evolution presented in RFC-33 when read by hive > Operations: add column, rename column, change type of column, drop column -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5000) Support schema evolution for Hive
[ https://issues.apache.org/jira/browse/HUDI-5000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng reassigned HUDI-5000: -- Assignee: (was: Tao Meng) > Support schema evolution for Hive > - > > Key: HUDI-5000 > URL: https://issues.apache.org/jira/browse/HUDI-5000 > Project: Apache Hudi > Issue Type: New Feature > Components: hive >Reporter: Tao Meng >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > > Support of schema evolution presented in RFC-33 when read by hive > Operations: add column, rename column, change type of column, drop column -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5000) Support schema evolution for Hive
Tao Meng created HUDI-5000: -- Summary: Support schema evolution for Hive Key: HUDI-5000 URL: https://issues.apache.org/jira/browse/HUDI-5000 Project: Apache Hudi Issue Type: New Feature Components: hive Reporter: Tao Meng Fix For: 0.13.0 Support of schema evolution presented in RFC-33 when read by hive Operations: add column, rename column, change type of column, drop column -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4898) for mor table, presto/hive should respect payload class when merging the parquet file and log file
Tao Meng created HUDI-4898: -- Summary: for mor table, presto/hive should respect payload class when merging the parquet file and log file Key: HUDI-4898 URL: https://issues.apache.org/jira/browse/HUDI-4898 Project: Apache Hudi Issue Type: Bug Components: hive, trino-presto Affects Versions: 0.12.0 Environment: hadoop 3.2.x hive3.1.x presto Reporter: Tao Meng Fix For: 0.12.2 For MOR tables, Presto/Hive currently ignores the payload class when merging the parquet file and the log file. See line 115 in RealtimeCompactedRecordReader:
```
// TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This is required since the
// deltaRecord may not be a full record and needs values of columns from the parquet
Option rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
```
-- This message was sent by Atlassian Jira (v8.20.10#820010)
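The TODO above points at the intended fix: the record reader should delegate the base/delta merge to the table's payload class instead of always taking the delta record as-is. The sketch below is Hudi-independent; the trait is a simplified stand-in for org.apache.hudi.common.model.HoodieRecordPayload, and all names and the map-based records are illustrative only.

{code:scala}
// Simplified stand-in for Hudi's payload contract; the real interface is
// HoodieRecordPayload, whose combineAndGetUpdateValue decides how a
// delta-log record is merged onto the base-file record.
trait RecordPayload {
  // None signals a delete; Some(merged) is the record that should be emitted.
  def combineAndGetUpdateValue(baseRecord: Map[String, Any]): Option[Map[String, Any]]
}

// Example payload that keeps base-file values for columns the delta record does not carry,
// rather than blindly replacing the whole row with a possibly partial delta record.
final class PartialUpdatePayload(delta: Map[String, Any]) extends RecordPayload {
  override def combineAndGetUpdateValue(baseRecord: Map[String, Any]): Option[Map[String, Any]] =
    Some(baseRecord ++ delta.filter { case (_, v) => v != null })
}

object MergeSketch extends App {
  val base  = Map[String, Any]("id" -> 1, "name" -> "a1", "price" -> 10.0)
  val delta = Map[String, Any]("id" -> 1, "price" -> 22.0, "name" -> null)

  // What the record reader should do: let the payload decide the merge result.
  val merged = new PartialUpdatePayload(delta).combineAndGetUpdateValue(base)
  println(merged) // Some(Map(id -> 1, name -> a1, price -> 22.0)) -- ordering may vary
}
{code}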
[jira] [Closed] (HUDI-1675) Externalize all Hudi configurations
[ https://issues.apache.org/jira/browse/HUDI-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng closed HUDI-1675. -- Resolution: Fixed close it, as Hudi currently has this capability > Externalize all Hudi configurations > --- > > Key: HUDI-1675 > URL: https://issues.apache.org/jira/browse/HUDI-1675 > Project: Apache Hudi > Issue Type: Improvement > Components: configs >Affects Versions: 0.9.0 >Reporter: tao meng >Priority: Minor > Fix For: 0.12.0 > > > # Externalize all Hudi configurations (separate configuration file) > # Save table related properties into hoodie.properties file. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HUDI-4184) Creating external table in Spark SQL modifies "hoodie.properties"
[ https://issues.apache.org/jira/browse/HUDI-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550983#comment-17550983 ] Tao Meng commented on HUDI-4184: I suggest moving the schema out of hoodie.properties: 1) We have already stored the latest schema in the commit file. 2) This schema is only written once, so if schema evolution happens in the future, the schema in hoodie.properties will become inconsistent with the current table schema. > Creating external table in Spark SQL modifies "hoodie.properties" > - > > Key: HUDI-4184 > URL: https://issues.apache.org/jira/browse/HUDI-4184 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Sagar Sumit >Priority: Critical > > My setup was like following: > # There's a table existing in one AWS account > # I'm trying to access that table from Spark SQL from _another_ AWS account > that only has Read permissions to the bucket with the table. > # Now when issuing "CREATE TABLE" Spark SQL command it fails b/c Hudi tries > to modify "hoodie.properties" file for whatever reason, even though i'm not > modifying the table and just trying to create table in the catalog. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (HUDI-3921) Fixed schema evolution cannot work with HUDI-3855
Tao Meng created HUDI-3921: -- Summary: Fixed schema evolution cannot work with HUDI-3855 Key: HUDI-3921 URL: https://issues.apache.org/jira/browse/HUDI-3921 Project: Apache Hudi Issue Type: Bug Reporter: Tao Meng Fix For: 0.11.0 Fixed schema evolution cannot work with HUDI-3855 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (HUDI-1816) when query incr view of hudi table by using spark-sql, the query result is wrong
[ https://issues.apache.org/jira/browse/HUDI-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng closed HUDI-1816. -- Resolution: Not A Problem not a problem, close it > when query incr view of hudi table by using spark-sql, the query result is > wrong > > > Key: HUDI-1816 > URL: https://issues.apache.org/jira/browse/HUDI-1816 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.8.0 > Environment: spark2.4.5 hive 3.1.1hadoop 3.1.1 >Reporter: tao meng >Priority: Critical > Fix For: 0.11.0 > > > test step1: > create a partitioned hudi table (mor / cow) > val base_data = spark.read.parquet("/tmp/tb_base") > val upsert_data = spark.read.parquet("/tmp/tb_upsert") > base_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, > MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, > "col2").option(RECORDKEY_FIELD_OPT_KEY, > "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, > "col0").option(KEYGENERATOR_CLASS_OPT_KEY, > "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, > "bulk_insert").option(HIVE_SYNC_ENABLED_OPT_KEY, > "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, > "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY, > "testdb").option(HIVE_TABLE_OPT_KEY, > "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, > "false").option("hoodie.bulkinsert.shuffle.parallelism", > 4).option("hoodie.insert.shuffle.parallelism", > 4).option("hoodie.upsert.shuffle.parallelism", > 4).option("hoodie.delete.shuffle.parallelism", > 4).option("hoodie.datasource.write.hive_style_partitioning", > "true").option(TABLE_NAME, > "tb_test_mor_par").mode(Overwrite).save(s"/tmp/testdb/tb_test_mor_par") > upsert_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, > MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, > "col2").option(RECORDKEY_FIELD_OPT_KEY, > "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, > "col0").option(KEYGENERATOR_CLASS_OPT_KEY, > "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, > "upsert").option(HIVE_SYNC_ENABLED_OPT_KEY, > "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, > "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY, > "testdb").option(HIVE_TABLE_OPT_KEY, > "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, > "false").option("hoodie.bulkinsert.shuffle.parallelism", > 4).option("hoodie.insert.shuffle.parallelism", > 4).option("hoodie.upsert.shuffle.parallelism", > 4).option("hoodie.delete.shuffle.parallelism", > 4).option("hoodie.datasource.write.hive_style_partitioning", > "true").option(TABLE_NAME, > "tb_test_mor_par").mode(Append).save(s"/tmp/testdb/tb_test_mor_par") > query incr view by sparksql: > set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330; > set hoodie.tb_test_mor_par.consume.max.commits=3; > select > _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 from > testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' order > by primary_key; > +---+---+++++ > |_hoodie_commit_time|primary_key|col0|col1|col6 |col7| > +---+---+++++ > |20210420155738 |20 |77 |sC |158788760400|739 | > |20210420155738 |21 |66 |ps |160979049700|61 | > |20210420155738 |22 |47 |1P |158460042900|835 | > |20210420155738 |23 |36 |5K |160763480800|538 | > |20210420155738 |24 |1 |BA |160685711300|775 | > |20210420155738 |24 |101 |BA |160685711300|775 | > |20210420155738 |24 |100 |BA 
|160685711300|775 | > |20210420155738 |24 |102 |BA |160685711300|775 | > +---+---+++++ > > primary key 24 is repeated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3408) fixed the bug that BUCKET_INDEX cannot process special characters
[ https://issues.apache.org/jira/browse/HUDI-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng closed HUDI-3408. -- Resolution: Won't Fix This is not a common problem, let me close it > fixed the bug that BUCKET_INDEX cannot process special characters > - > > Key: HUDI-3408 > URL: https://issues.apache.org/jira/browse/HUDI-3408 > Project: Apache Hudi > Issue Type: Bug > Components: core >Affects Versions: 0.10.1 > Environment: spark3.1.1 >Reporter: Tao Meng >Priority: Major > Labels: pull-request-available > > BucketIdentifier use split(":") to split recordKeyName and recordKeyValue, > if current recordKeyValue is ":::" , above split give a wrong result. > test("test Bucket") { > withTempDir { tmp => > Seq("mor").foreach { tableType => > val tableName = generateTableName > val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}" > spark.sql( > s""" > |create table $tableName ( > | id int, comb int, col0 int, col1 bigint, col2 float, col3 double, col4 > decimal(10,4), col5 string, col6 date, col7 timestamp, col8 boolean, col9 > binary, par date > |) using hudi > | location '$tablePath' > | options ( > | type = '$tableType', > | primaryKey = 'id,col0,col5', > | preCombineField = 'comb', > | hoodie.index.type = 'BUCKET', > | hoodie.bucket.index.num.buckets = '900' > | ) > | partitioned by (par) > """.stripMargin) > spark.sql( > s""" > | insert into $tableName values > | (1,1,11,11,101.01,1001.0001,11.0001,':::','2021-12-25','2021-12-25 > 12:01:01',true,'a01','2021-12-25'), > | > (2,2,12,12,102.02,1002.0002,12.0002,'a02','2021-12-25','2021-12-25 > 12:02:02',true,'a02','2021-12-25'), > | > (3,3,13,13,103.03,1003.0003,13.0003,'a03','2021-12-25','2021-12-25 > 12:03:03',false,'a03','2021-12-25'), > | > (4,4,14,14,104.04,1004.0004,14.0004,'a04','2021-12-26','2021-12-26 > 12:04:04',true,'a04','2021-12-26'), > | > (5,5,15,15,105.05,1005.0005,15.0005,'a05','2021-12-26','2021-12-26 > 12:05:05',false,'a05','2021-12-26') > |""".stripMargin) > } > } > } > > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.hudi.index.bucket.BucketIdentifier.lambda$getBucketId$2(BucketIdentifier.java:46) > at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321) > at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) > at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > at > java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:46) > at > org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:36) > at > org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:80) > at > org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:70) > at > org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:125) > ... 25 more > -- This message was sent by Atlassian Jira (v8.20.1#820001)
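The failure mode above is easiest to see in isolation: a record-key segment of the form field:value is split on every ':', so a value that itself consists of colons collapses to a single-element array and the value lookup throws. The sketch below is not the actual BucketIdentifier code; it only illustrates the parsing problem and a safer alternative (splitting with a limit so only the first ':' separates field from value).

{code:scala}
object RecordKeyParsingSketch extends App {
  // One segment of a composite record key as "field:value", where the value itself is ":::".
  val keyPart = "col5:" + ":::"

  // Naive parsing, similar to the failing code path: split on every ':'.
  // Trailing empty strings are dropped, so this yields Array("col5") and
  // reading element 1 throws ArrayIndexOutOfBoundsException, as in the stack trace above.
  val naive = keyPart.split(":")
  println(naive.mkString("[", ", ", "]")) // [col5]

  // Safer parsing: split only at the first ':' so the value may contain ':' characters.
  val Array(field, value) = keyPart.split(":", 2)
  println(s"field=$field value=$value") // field=col5 value=:::
}
{code}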
[jira] [Created] (HUDI-3742) Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression
Tao Meng created HUDI-3742: -- Summary: Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression Key: HUDI-3742 URL: https://issues.apache.org/jira/browse/HUDI-3742 Project: Apache Hudi Issue Type: Improvement Components: spark Reporter: Tao Meng Fix For: 0.11.0 Currently we disable the parquet enableVectorizedReader for MOR incremental reads and set "spark.sql.parquet.recordLevelFilter.enabled" = "true" to filter the data, which is slow. -- This message was sent by Atlassian Jira (v8.20.1#820001)
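For reference, these are the two Spark SQL settings the ticket is about; the sketch below only shows toggling them on a SparkSession to illustrate the intended combination (vectorized reader on, row-level record filtering off). It is not a description of how Hudi wires these configs internally.

{code:scala}
import org.apache.spark.sql.SparkSession

object IncrementalReadConfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-read-conf")
      .master("local[2]")
      // Keep the vectorized parquet reader enabled for the incremental scan ...
      .config("spark.sql.parquet.enableVectorizedReader", "true")
      // ... instead of relying on row-at-a-time record-level filtering.
      .config("spark.sql.parquet.recordLevelFilter.enabled", "false")
      .getOrCreate()

    println(spark.conf.get("spark.sql.parquet.enableVectorizedReader"))
    spark.stop()
  }
}
{code}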
[jira] [Created] (HUDI-3719) High performance costs of AvroSerializer in Datasource writing
Tao Meng created HUDI-3719: -- Summary: High performance costs of AvroSerializer in Datasource writing Key: HUDI-3719 URL: https://issues.apache.org/jira/browse/HUDI-3719 Project: Apache Hudi Issue Type: Bug Components: spark Reporter: Tao Meng Fix For: 0.11.0 https://github.com/apache/hudi/issues/5107 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column
[ https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Meng updated HUDI-3646: --- Description: now, when we use sparksql to update hudi table, we find that hudi will change the nullability attribute of a column eg: {code:java} // code placeholder val tableName = generateTableName val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}" // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long |) using hudi | location '$tablePath' | options ( | type = '$tableType', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000") spark.sql(s"select * from $tableName").printSchema() // update data spark.sql(s"update $tableName set price = 20 where id = 1") spark.sql(s"select * from $tableName").printSchema() {code} |-- _hoodie_commit_time: string (nullable = true) |-- _hoodie_commit_seqno: string (nullable = true) |-- _hoodie_record_key: string (nullable = true) |-- _hoodie_partition_path: string (nullable = true) |-- _hoodie_file_name: string (nullable = true) |-- id: integer (nullable = true) |-- name: string (nullable = true) *|-- price: double (nullable = true)* |-- ts: long (nullable = true) |-- _hoodie_commit_time: string (nullable = true) |-- _hoodie_commit_seqno: string (nullable = true) |-- _hoodie_record_key: string (nullable = true) |-- _hoodie_partition_path: string (nullable = true) |-- _hoodie_file_name: string (nullable = true) |-- id: integer (nullable = true) |-- name: string (nullable = true) *|-- price: double (nullable = false )* |-- ts: long (nullable = true) the nullable attribute of price has been changed to false, This is not the result we want was: now, when we use sparksql to update hudi table, we find that hudi will change the nullability attribute of a column eg: ``` test("Test Update Table") { withTempDir { tmp => Seq("cow", "mor").foreach {tableType => val tableName = generateTableName val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}" // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long |) using hudi | location '$tablePath' | options ( | type = '$tableType', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000") spark.sql(s"select * from $tableName").printSchema() checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) // update data spark.sql(s"update $tableName set price = 20 where id = 1") spark.sql(s"select * from $tableName").printSchema() checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 20.0, 1000) ) // update data spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 40.0, 1000) ) } } } } ``` |-- _hoodie_commit_time: string (nullable = true) |-- _hoodie_commit_seqno: string (nullable = true) |-- _hoodie_record_key: string (nullable = true) |-- _hoodie_partition_path: string (nullable = true) |-- _hoodie_file_name: string (nullable = true) |-- id: integer (nullable = true) |-- name: string (nullable = true) *|-- price: double (nullable = true)* |-- ts: long (nullable = true) |-- _hoodie_commit_time: string (nullable = true) |-- _hoodie_commit_seqno: string (nullable = true) |-- _hoodie_record_key: string 
(nullable = true) |-- _hoodie_partition_path: string (nullable = true) |-- _hoodie_file_name: string (nullable = true) |-- id: integer (nullable = true) |-- name: string (nullable = true) *|-- price: double (nullable = false)* |-- ts: long (nullable = true) the nullable attribute of price has been changed to false, This is not the result we want > The Hudi update syntax should not modify the nullability attribute of a column > -- > > Key: HUDI-3646 > URL: https://issues.apache.org/jira/browse/HUDI-3646 > Project: Apache Hudi > Issue Type: Bug > Components: spark-sql >Affects Versions: 0.10.1 > Environment: spark3.1.2 >Reporter: Tao Meng >Priority: Minor > Fix For: 0.12.0 > > > now, when we use sparksql to update hudi table, we find that hudi will > change the nullability attribute of a column > eg: > {code:java} > // code placeholder > val tableName = generateTableName > val tablePath = s"${new Path(tmp.getCanonicalPath, > tableName).toUri.toString}" > // create table > spark.sql( >s""" > |create table $tableName ( > | id int, >
[jira] [Created] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column
Tao Meng created HUDI-3646: -- Summary: The Hudi update syntax should not modify the nullability attribute of a column Key: HUDI-3646 URL: https://issues.apache.org/jira/browse/HUDI-3646 Project: Apache Hudi Issue Type: Bug Components: spark-sql Affects Versions: 0.10.1 Environment: spark3.1.2 Reporter: Tao Meng Fix For: 0.12.0 now, when we use sparksql to update hudi table, we find that hudi will change the nullability attribute of a column eg: ``` test("Test Update Table") { withTempDir { tmp => Seq("cow", "mor").foreach {tableType => val tableName = generateTableName val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}" // create table spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long |) using hudi | location '$tablePath' | options ( | type = '$tableType', | primaryKey = 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000") spark.sql(s"select * from $tableName").printSchema() checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 10.0, 1000) ) // update data spark.sql(s"update $tableName set price = 20 where id = 1") spark.sql(s"select * from $tableName").printSchema() checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 20.0, 1000) ) // update data spark.sql(s"update $tableName set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 40.0, 1000) ) } } } } ``` |-- _hoodie_commit_time: string (nullable = true) |-- _hoodie_commit_seqno: string (nullable = true) |-- _hoodie_record_key: string (nullable = true) |-- _hoodie_partition_path: string (nullable = true) |-- _hoodie_file_name: string (nullable = true) |-- id: integer (nullable = true) |-- name: string (nullable = true) *|-- price: double (nullable = true)* |-- ts: long (nullable = true) |-- _hoodie_commit_time: string (nullable = true) |-- _hoodie_commit_seqno: string (nullable = true) |-- _hoodie_record_key: string (nullable = true) |-- _hoodie_partition_path: string (nullable = true) |-- _hoodie_file_name: string (nullable = true) |-- id: integer (nullable = true) |-- name: string (nullable = true) *|-- price: double (nullable = false)* |-- ts: long (nullable = true) the nullable attribute of price has been changed to false, This is not the result we want -- This message was sent by Atlassian Jira (v8.20.1#820001)
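One way to express the behaviour this ticket asks for is to relax the schema derived from the UPDATE plan back to nullable before it is reconciled with the table schema, since the literal in `set price = 20` is what makes Spark mark the column non-nullable. The sketch below assumes that approach; the asNullable helper is hypothetical, only handles top-level fields, and is not Hudi's actual fix.

{code:scala}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

object NullabilitySketch {
  // Hypothetical helper: keep field names and types, but force every top-level field nullable.
  def asNullable(schema: StructType): StructType =
    StructType(schema.fields.map(f => f.copy(nullable = true)))

  def main(args: Array[String]): Unit = {
    // Roughly the schema Spark derives for "update ... set price = 20":
    // the literal expression makes price non-nullable.
    val updatePlanSchema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("price", DoubleType, nullable = false)))

    // After relaxing, price is nullable again and matches the original table schema.
    println(asNullable(updatePlanSchema).treeString)
  }
}
{code}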
[jira] [Created] (HUDI-3603) Support read DateType for hive2/hive3
Tao Meng created HUDI-3603: -- Summary: Support read DateType for hive2/hive3 Key: HUDI-3603 URL: https://issues.apache.org/jira/browse/HUDI-3603 Project: Apache Hudi Issue Type: Bug Components: hive Affects Versions: 0.10.1 Reporter: Tao Meng Fix For: 0.11.0 Currently Hudi only supports reading DateType on Hive 2; we should support reading DateType on both Hive 2 and Hive 3. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy
[ https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501953#comment-17501953 ] Tao Meng commented on HUDI-3355: [~suryaprasanna] if you have free time, can you try this pr [https://github.com/apache/hudi/pull/4962] preWrite: we record the pending clustering commits preCommit: we choose the completed clustering commits which complete during this operation write time. and check the conflict with this pr, c5 should be failed , since it have conflict with c3 > Issue with out of order commits in the timeline when ingestion writers using > SparkAllowUpdateStrategy > - > > Key: HUDI-3355 > URL: https://issues.apache.org/jira/browse/HUDI-3355 > Project: Apache Hudi > Issue Type: Bug >Reporter: Surya Prasanna Yalla >Assignee: tao meng >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Out of order commits can happen between two commits C1 and C2. If timestamp > of C2 is greater than C1's and completed before C1. > In our use case, we are running clustering in async, and want ingestion > writers to be given preference over clustering. > Following are the configs used by the ingestion writer > {noformat} > "hoodie.clustering.updates.strategy": > "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy" > "hoodie.clustering.rollback.pending.replacecommit.on.conflict": > false{noformat} > This would allow ingestion writers to ignore pending replacecommits on the > timeline and continue writing. > > Consider the following scenario > {code:java} > At instant1 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.inflight -> Started > {code} > {code:java} > At instant2 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.commit -> Completed {code} > {code:java} > At instant3 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.commit (lastSuccessfulCommit seen by C5) > C5.inflight -> Started{code} > {code:java} > At instant4 > C1.commit > C2.commit > C3.replacecommit -> Completed > C4.commit (lastSuccessfulCommit seen by C5) > C5.inflight(continuing) {code} > {code:java} > At instant5 > C1.commit > C2.commit > C3.replacecommit > C4.commit (lastSuccessfulCommit seen by C5) > C5.commit -> Completed (It has conflict with C3 but since it has lower > timestamp than C4, C3 is not considered during conflict resolution){code} > > Here, the lastSuccessfulCommit value that is seen by C5 is C4, even though > the C3 is the one that is committed last. > Ideally when sorting the timeline we should consider the transition times. > So, timeline should look something like, > {code:java} > C1.commit > C2.commit > C4.commit(lastSuccessfulCommit seen by C5) > C3.replacecommit > C5.inflight{code} > So, in this case when the C5 is about to complete, it will consider all the > commits that are completed after C4 which will be C3. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy
[ https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501653#comment-17501653 ] Tao Meng commented on HUDI-3355: [~suryaprasanna] I tried to understand the problem,do you mean that: 1) start time: c3(clustering) < c4 < c5 2) c4 is not conflict with c3(clustering), so c4 can be commit successfully 3) c5 is conflict with c3(but hudi does not check that),so c5 commit successfuly(in fact c5 should be failed) > Issue with out of order commits in the timeline when ingestion writers using > SparkAllowUpdateStrategy > - > > Key: HUDI-3355 > URL: https://issues.apache.org/jira/browse/HUDI-3355 > Project: Apache Hudi > Issue Type: Bug >Reporter: Surya Prasanna Yalla >Assignee: tao meng >Priority: Blocker > Fix For: 0.11.0 > > > Out of order commits can happen between two commits C1 and C2. If timestamp > of C2 is greater than C1's and completed before C1. > In our use case, we are running clustering in async, and want ingestion > writers to be given preference over clustering. > Following are the configs used by the ingestion writer > {noformat} > "hoodie.clustering.updates.strategy": > "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy" > "hoodie.clustering.rollback.pending.replacecommit.on.conflict": > false{noformat} > This would allow ingestion writers to ignore pending replacecommits on the > timeline and continue writing. > > Consider the following scenario > {code:java} > At instant1 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.inflight -> Started > {code} > {code:java} > At instant2 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.commit -> Completed {code} > {code:java} > At instant3 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.commit (lastSuccessfulCommit seen by C5) > C5.inflight -> Started{code} > {code:java} > At instant4 > C1.commit > C2.commit > C3.replacecommit -> Completed > C4.commit (lastSuccessfulCommit seen by C5) > C5.inflight(continuing) {code} > {code:java} > At instant5 > C1.commit > C2.commit > C3.replacecommit > C4.commit (lastSuccessfulCommit seen by C5) > C5.commit -> Completed (It has conflict with C3 but since it has lower > timestamp than C4, C3 is not considered during conflict resolution){code} > > Here, the lastSuccessfulCommit value that is seen by C5 is C4, even though > the C3 is the one that is committed last. > Ideally when sorting the timeline we should consider the transition times. > So, timeline should look something like, > {code:java} > C1.commit > C2.commit > C4.commit(lastSuccessfulCommit seen by C5) > C3.replacecommit > C5.inflight{code} > So, in this case when the C5 is about to complete, it will consider all the > commits that are completed after C4 which will be C3. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy
[ https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501645#comment-17501645 ] Tao Meng commented on HUDI-3355: no problem > Issue with out of order commits in the timeline when ingestion writers using > SparkAllowUpdateStrategy > - > > Key: HUDI-3355 > URL: https://issues.apache.org/jira/browse/HUDI-3355 > Project: Apache Hudi > Issue Type: Bug >Reporter: Surya Prasanna Yalla >Assignee: tao meng >Priority: Blocker > Fix For: 0.11.0 > > > Out of order commits can happen between two commits C1 and C2. If timestamp > of C2 is greater than C1's and completed before C1. > In our use case, we are running clustering in async, and want ingestion > writers to be given preference over clustering. > Following are the configs used by the ingestion writer > {noformat} > "hoodie.clustering.updates.strategy": > "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy" > "hoodie.clustering.rollback.pending.replacecommit.on.conflict": > false{noformat} > This would allow ingestion writers to ignore pending replacecommits on the > timeline and continue writing. > > Consider the following scenario > {code:java} > At instant1 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.inflight -> Started > {code} > {code:java} > At instant2 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.commit -> Completed {code} > {code:java} > At instant3 > C1.commit > C2.commit > C3.replacecommit.inflight > C4.commit (lastSuccessfulCommit seen by C5) > C5.inflight -> Started{code} > {code:java} > At instant4 > C1.commit > C2.commit > C3.replacecommit -> Completed > C4.commit (lastSuccessfulCommit seen by C5) > C5.inflight(continuing) {code} > {code:java} > At instant5 > C1.commit > C2.commit > C3.replacecommit > C4.commit (lastSuccessfulCommit seen by C5) > C5.commit -> Completed (It has conflict with C3 but since it has lower > timestamp than C4, C3 is not considered during conflict resolution){code} > > Here, the lastSuccessfulCommit value that is seen by C5 is C4, even though > the C3 is the one that is committed last. > Ideally when sorting the timeline we should consider the transition times. > So, timeline should look something like, > {code:java} > C1.commit > C2.commit > C4.commit(lastSuccessfulCommit seen by C5) > C3.replacecommit > C5.inflight{code} > So, in this case when the C5 is about to complete, it will consider all the > commits that are completed after C4 which will be C3. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2762) Ensure hive can query insert only logs in MOR
[ https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494502#comment-17494502 ] Tao Meng commented on HUDI-2762: [~alexey.kudinkin] i This problem is hive's problem. Hive will filter out all the file which startwith . see org.apache.hadoop.hive.common.FileUtils in hive > Ensure hive can query insert only logs in MOR > - > > Key: HUDI-2762 > URL: https://issues.apache.org/jira/browse/HUDI-2762 > Project: Apache Hudi > Issue Type: Task > Components: hive >Reporter: Rajesh Mahindra >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > > Currently, we are able to query MOR tables that have base parquet files with > inserts an logs files with updates. However, we are currently unable to query > tables with insert only log files. Both _ro and _rt tables are returning 0 > rows. However, hms does create the table and partitions for the table. > > One sample table is here: > [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/®ion=us-east-2] > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (HUDI-2762) Ensure hive can query insert only logs in MOR
[ https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494502#comment-17494502 ] Tao Meng edited comment on HUDI-2762 at 2/18/22, 9:59 AM: -- [~alexey.kudinkin] This problem is hive's problem. Hive will filter out all the file which startwith "." see org.apache.hadoop.hive.common.FileUtils in hive was (Author: mengtao): [~alexey.kudinkin] i This problem is hive's problem. Hive will filter out all the file which startwith . see org.apache.hadoop.hive.common.FileUtils in hive > Ensure hive can query insert only logs in MOR > - > > Key: HUDI-2762 > URL: https://issues.apache.org/jira/browse/HUDI-2762 > Project: Apache Hudi > Issue Type: Task > Components: hive >Reporter: Rajesh Mahindra >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > > Currently, we are able to query MOR tables that have base parquet files with > inserts an logs files with updates. However, we are currently unable to query > tables with insert only log files. Both _ro and _rt tables are returning 0 > rows. However, hms does create the table and partitions for the table. > > One sample table is here: > [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/®ion=us-east-2] > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
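What the comment refers to can be reproduced without Hive itself: the listing filter treats any file whose name starts with '.' or '_' as hidden, and Hudi MOR log files start with '.', so an insert-only file group with no base parquet file yields zero visible files. Below is a minimal stand-alone sketch of that filter logic (not the Hive source); the file names are made up for illustration.

{code:scala}
import org.apache.hadoop.fs.{Path, PathFilter}

object HiddenFileFilterSketch extends App {
  // The listing rule described above: names starting with '_' or '.' are treated as hidden.
  val hiddenFileFilter: PathFilter = new PathFilter {
    override def accept(p: Path): Boolean = {
      val name = p.getName
      !name.startsWith("_") && !name.startsWith(".")
    }
  }

  // Hudi MOR log files start with '.', so an insert-only file group that has
  // log files but no base parquet file becomes invisible to such a listing.
  val paths = Seq(
    new Path("/tbl/par1/abc123_0-1-0_20220218.parquet"),
    new Path("/tbl/par1/.abc123_20220218.log.1_0-1-0"))

  paths.foreach(p => println(s"${p.getName} accepted=${hiddenFileFilter.accept(p)}"))
}
{code}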
[jira] [Created] (HUDI-3408) fixed the bug that BUCKET_INDEX cannot process special characters
Tao Meng created HUDI-3408: -- Summary: fixed the bug that BUCKET_INDEX cannot process special characters Key: HUDI-3408 URL: https://issues.apache.org/jira/browse/HUDI-3408 Project: Apache Hudi Issue Type: Bug Components: core Affects Versions: 0.10.1 Environment: spark3.1.1 Reporter: Tao Meng BucketIdentifier use split(":") to split recordKeyName and recordKeyValue, if current recordKeyValue is ":::" , above split give a wrong result. test("test Bucket") { withTempDir { tmp => Seq("mor").foreach { tableType => val tableName = generateTableName val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}" spark.sql( s""" |create table $tableName ( | id int, comb int, col0 int, col1 bigint, col2 float, col3 double, col4 decimal(10,4), col5 string, col6 date, col7 timestamp, col8 boolean, col9 binary, par date |) using hudi | location '$tablePath' | options ( | type = '$tableType', | primaryKey = 'id,col0,col5', | preCombineField = 'comb', | hoodie.index.type = 'BUCKET', | hoodie.bucket.index.num.buckets = '900' | ) | partitioned by (par) """.stripMargin) spark.sql( s""" | insert into $tableName values | (1,1,11,11,101.01,1001.0001,11.0001,':::','2021-12-25','2021-12-25 12:01:01',true,'a01','2021-12-25'), | (2,2,12,12,102.02,1002.0002,12.0002,'a02','2021-12-25','2021-12-25 12:02:02',true,'a02','2021-12-25'), | (3,3,13,13,103.03,1003.0003,13.0003,'a03','2021-12-25','2021-12-25 12:03:03',false,'a03','2021-12-25'), | (4,4,14,14,104.04,1004.0004,14.0004,'a04','2021-12-26','2021-12-26 12:04:04',true,'a04','2021-12-26'), | (5,5,15,15,105.05,1005.0005,15.0005,'a05','2021-12-26','2021-12-26 12:05:05',false,'a05','2021-12-26') |""".stripMargin) } } } Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.hudi.index.bucket.BucketIdentifier.lambda$getBucketId$2(BucketIdentifier.java:46) at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321) at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) at org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:46) at org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:36) at org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:80) at org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:70) at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:125) ... 25 more -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3347) Updating table schema fails w/ hms mode w/ schema evolution
[ https://issues.apache.org/jira/browse/HUDI-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489291#comment-17489291 ] Tao Meng commented on HUDI-3347: [~shivnarayan] This problem seems to be caused by the wrong version of hive jar package i try to reproduce the problem with hive2.3/hive3.1 , but no problem happen. i checkout the hive source code, org.apache.hadoop.hive.metastore.IMetaStoreClient.alter_table_with_environmentContext this Interface has not changed since hive2.0. could you pls paste your hive/spark version number > Updating table schema fails w/ hms mode w/ schema evolution > --- > > Key: HUDI-3347 > URL: https://issues.apache.org/jira/browse/HUDI-3347 > Project: Apache Hudi > Issue Type: Task > Components: hive-sync >Reporter: sivabalan narayanan >Assignee: tao meng >Priority: Critical > Fix For: 0.11.0 > > > When table schema got upgraded with a new batch of write, hms mode sync > fails. > > steps to reproduce using our docker demo set up. > I used 0.10.0 to test this out. > adhoc-1 > {code:java} > $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] > --driver-class-path $HADOOP_CONF_DIR --conf > spark.sql.hive.convertMetastoreParquet=false --deploy-mode client > --driver-memory 1G --executor-memory 3G --num-executors 1 --packages > org.apache.spark:spark-avro_2.11:2.4.4 {code} > {code:java} > import java.sql.Timestamp > import spark.implicits._import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val df1 = Seq( > ("row1", 1, "part1" ,1578283932000L ), > ("row2", 1, "part1", 1578283942000L) > ).toDF("row", "ppath", "preComb","eventTime") > df1.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > option(RECORDKEY_FIELD_OPT_KEY, "row"). > option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > option("hoodie.datasource.hive_sync.mode","hms"). > option("hoodie.datasource.hive_sync.database","default"). > option("hoodie.datasource.hive_sync.table","timestamp_tbl1"). > > option("hoodie.datasource.hive_sync.partition_fields","year,month,day"). > option("hoodie.datasource.hive_sync.enable","true"). > option(TABLE_NAME, "timestamp_tbl1"). > mode(Overwrite). > save("/tmp/hudi_timestamp_tbl1") > {code} > // evol schema > {code:java} > val df2 = Seq( > ("row1", 1, "part1" ,1678283932000L, "abcd" ), > ("row2", 1, "part1", 1678283942000L, "defg") > ).toDF("row", "ppath", "preComb", "eventTime", "randomStr") > df2.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > option(RECORDKEY_FIELD_OPT_KEY, "row"). > option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). 
> option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > option("hoodie.datasource.hive_sync.mode","hms"). > option("hoodie.datasource.hive_sync.database","default"). > option("hoodie.datasource.hive_sync.table","timestamp_tbl1"). > > option("hoodie.datasource.hive_sync.partition_fields","year,month,day"). > option("hoodie.datasource.hive_sync.enable","true"). > option(TABLE_NAME, "timestamp_tbl1"). > mode(Append). > save("/tmp/hudi_timestamp_tbl1") > {code} > stacktrace > {code:java} > scala> df2.write.format("hudi"). > | options(getQuickstartWriteConfigs). > | option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > | option(RECORDKEY_FIELD_OPT_KEY, "row"). > | option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > | > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > | >
[jira] [Comment Edited] (HUDI-3347) Updating table schema fails w/ hms mode w/ schema evolution
[ https://issues.apache.org/jira/browse/HUDI-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484721#comment-17484721 ] Tao Meng edited comment on HUDI-3347 at 1/31/22, 2:51 PM: -- [~shivnarayan] yes, will do it. Pls assign it to me, thanks was (Author: mengtao): yes, will do it. > Updating table schema fails w/ hms mode w/ schema evolution > --- > > Key: HUDI-3347 > URL: https://issues.apache.org/jira/browse/HUDI-3347 > Project: Apache Hudi > Issue Type: Task > Components: hive-sync >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.11.0 > > > When table schema got upgraded with a new batch of write, hms mode sync > fails. > > steps to reproduce using our docker demo set up. > I used 0.10.0 to test this out. > adhoc-1 > {code:java} > $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] > --driver-class-path $HADOOP_CONF_DIR --conf > spark.sql.hive.convertMetastoreParquet=false --deploy-mode client > --driver-memory 1G --executor-memory 3G --num-executors 1 --packages > org.apache.spark:spark-avro_2.11:2.4.4 {code} > {code:java} > import java.sql.Timestamp > import spark.implicits._import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val df1 = Seq( > ("row1", 1, "part1" ,1578283932000L ), > ("row2", 1, "part1", 1578283942000L) > ).toDF("row", "ppath", "preComb","eventTime") > df1.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > option(RECORDKEY_FIELD_OPT_KEY, "row"). > option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > option("hoodie.datasource.hive_sync.mode","hms"). > option("hoodie.datasource.hive_sync.database","default"). > option("hoodie.datasource.hive_sync.table","timestamp_tbl1"). > > option("hoodie.datasource.hive_sync.partition_fields","year,month,day"). > option("hoodie.datasource.hive_sync.enable","true"). > option(TABLE_NAME, "timestamp_tbl1"). > mode(Overwrite). > save("/tmp/hudi_timestamp_tbl1") > {code} > // evol schema > {code:java} > val df2 = Seq( > ("row1", 1, "part1" ,1678283932000L, "abcd" ), > ("row2", 1, "part1", 1678283942000L, "defg") > ).toDF("row", "ppath", "preComb", "eventTime", "randomStr") > df2.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > option(RECORDKEY_FIELD_OPT_KEY, "row"). > option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > option("hoodie.datasource.hive_sync.mode","hms"). > option("hoodie.datasource.hive_sync.database","default"). > option("hoodie.datasource.hive_sync.table","timestamp_tbl1"). 
> > option("hoodie.datasource.hive_sync.partition_fields","year,month,day"). > option("hoodie.datasource.hive_sync.enable","true"). > option(TABLE_NAME, "timestamp_tbl1"). > mode(Append). > save("/tmp/hudi_timestamp_tbl1") > {code} > stacktrace > {code:java} > scala> df2.write.format("hudi"). > | options(getQuickstartWriteConfigs). > | option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > | option(RECORDKEY_FIELD_OPT_KEY, "row"). > | option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > | > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > | > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > | > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > | > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00
[jira] [Commented] (HUDI-3347) Updating table schema fails w/ hms mode w/ schema evolution
[ https://issues.apache.org/jira/browse/HUDI-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484721#comment-17484721 ] Tao Meng commented on HUDI-3347: yes, will do it. > Updating table schema fails w/ hms mode w/ schema evolution > --- > > Key: HUDI-3347 > URL: https://issues.apache.org/jira/browse/HUDI-3347 > Project: Apache Hudi > Issue Type: Task > Components: hive-sync >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.11.0 > > > When table schema got upgraded with a new batch of write, hms mode sync > fails. > > steps to reproduce using our docker demo set up. > I used 0.10.0 to test this out. > adhoc-1 > {code:java} > $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] > --driver-class-path $HADOOP_CONF_DIR --conf > spark.sql.hive.convertMetastoreParquet=false --deploy-mode client > --driver-memory 1G --executor-memory 3G --num-executors 1 --packages > org.apache.spark:spark-avro_2.11:2.4.4 {code} > {code:java} > import java.sql.Timestamp > import spark.implicits._import org.apache.hudi.QuickstartUtils._ > import scala.collection.JavaConversions._ > import org.apache.spark.sql.SaveMode._ > import org.apache.hudi.DataSourceReadOptions._ > import org.apache.hudi.DataSourceWriteOptions._ > import org.apache.hudi.config.HoodieWriteConfig._ > val df1 = Seq( > ("row1", 1, "part1" ,1578283932000L ), > ("row2", 1, "part1", 1578283942000L) > ).toDF("row", "ppath", "preComb","eventTime") > df1.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > option(RECORDKEY_FIELD_OPT_KEY, "row"). > option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > option("hoodie.datasource.hive_sync.mode","hms"). > option("hoodie.datasource.hive_sync.database","default"). > option("hoodie.datasource.hive_sync.table","timestamp_tbl1"). > > option("hoodie.datasource.hive_sync.partition_fields","year,month,day"). > option("hoodie.datasource.hive_sync.enable","true"). > option(TABLE_NAME, "timestamp_tbl1"). > mode(Overwrite). > save("/tmp/hudi_timestamp_tbl1") > {code} > // evol schema > {code:java} > val df2 = Seq( > ("row1", 1, "part1" ,1678283932000L, "abcd" ), > ("row2", 1, "part1", 1678283942000L, "defg") > ).toDF("row", "ppath", "preComb", "eventTime", "randomStr") > df2.write.format("hudi"). > options(getQuickstartWriteConfigs). > option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > option(RECORDKEY_FIELD_OPT_KEY, "row"). > option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > option("hoodie.datasource.hive_sync.mode","hms"). > option("hoodie.datasource.hive_sync.database","default"). > option("hoodie.datasource.hive_sync.table","timestamp_tbl1"). > > option("hoodie.datasource.hive_sync.partition_fields","year,month,day"). > option("hoodie.datasource.hive_sync.enable","true"). 
> option(TABLE_NAME, "timestamp_tbl1"). > mode(Append). > save("/tmp/hudi_timestamp_tbl1") > {code} > stacktrace > {code:java} > scala> df2.write.format("hudi"). > | options(getQuickstartWriteConfigs). > | option(PRECOMBINE_FIELD_OPT_KEY, "preComb"). > | option(RECORDKEY_FIELD_OPT_KEY, "row"). > | option(PARTITIONPATH_FIELD_OPT_KEY, "ppath"). > | > option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator"). > | > option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS"). > | > option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd"). > | > option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00"). > | option("hoodie.datasource.hive_sync.mode","hms"). > | option("hoodie.datasource.hive_sync.database","defa
[jira] [Commented] (HUDI-2873) Support optimize data layout by sql and make the build more fast
[ https://issues.apache.org/jira/browse/HUDI-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477743#comment-17477743 ] Tao Meng commented on HUDI-2873: [~shibei] Do you have WeChat? Please add me: 1037817390 > Support optimize data layout by sql and make the build more fast > > > Key: HUDI-2873 > URL: https://issues.apache.org/jira/browse/HUDI-2873 > Project: Apache Hudi > Issue Type: Task > Components: Performance, spark >Reporter: tao meng >Assignee: shibei >Priority: Critical > Labels: sev:high > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2873) Support optimize data layout by sql and make the build more fast
[ https://issues.apache.org/jira/browse/HUDI-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477214#comment-17477214 ] Tao Meng commented on HUDI-2873: [~alexey.kudinkin] [~shibei] 1) Support optimizing the data layout through Spark SQL, just like Delta Lake: OPTIMIZE xx_table ZORDER/HILBERT BY col1, col2; 2) introduce a new write operation that rewrites table data directly (see the sketch below). At present, the performance of the clustering operation is slightly worse than that of a direct overwrite. > Support optimize data layout by sql and make the build more fast > > > Key: HUDI-2873 > URL: https://issues.apache.org/jira/browse/HUDI-2873 > Project: Apache Hudi > Issue Type: Task > Components: Performance, spark >Reporter: tao meng >Priority: Major > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
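To make the two proposals above concrete, here is a hedged sketch. The OPTIMIZE statement is the syntax proposed in the comment (modeled on Delta Lake) and does not exist in Hudi at this point; the option-based alternative uses clustering configs that already appear elsewhere in this digest, and assumes an existing DataFrame df and table path basePath.
{code:scala}
import org.apache.spark.sql.SaveMode

// Proposed SQL (not yet implemented at the time of this comment):
//   OPTIMIZE my_hudi_table ZORDER BY col1, col2

// Today's route: sort-column clustering driven by write options.
df.write.format("hudi").
  option("hoodie.clustering.inline", "true").
  option("hoodie.clustering.inline.max.commits", "1").
  option("hoodie.clustering.plan.strategy.sort.columns", "col1,col2").
  mode(SaveMode.Append).
  save(basePath)
{code}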
[jira] [Commented] (HUDI-2645) Rewrite Zoptimize and other files in scala into Java
[ https://issues.apache.org/jira/browse/HUDI-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477205#comment-17477205 ] Tao Meng commented on HUDI-2645: [~shibei] vc means we had better remove the Scala files from the spark-client module. We have already rewritten most of the z-optimize code in Java; only RangeSample is still a Scala file. > Rewrite Zoptimize and other files in scala into Java > > > Key: HUDI-2645 > URL: https://issues.apache.org/jira/browse/HUDI-2645 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinoth Chandar >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3237) ALTER TABLE column type change fails select query
[ https://issues.apache.org/jira/browse/HUDI-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475398#comment-17475398 ] Tao Meng commented on HUDI-3237: [~biyan900...@gmail.com] [~xushiyan] i agree that we should close this ability on change column type. I am currently working on rfc-33, this feature has already supported in our work > ALTER TABLE column type change fails select query > - > > Key: HUDI-3237 > URL: https://issues.apache.org/jira/browse/HUDI-3237 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.1 >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Major > Fix For: 0.11.0 > > Attachments: image-2022-01-13-17-04-09-038.png > > > {code:sql} > create table if not exists cow_nonpt_nonpcf_tbl ( > id int, > name string, > price double > ) using hudi > options ( > type = 'cow', > primaryKey = 'id' > ); > insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20; > DESC cow_nonpt_nonpcf_tbl; > -- shows id int > ALTER TABLE cow_nonpt_nonpcf_tbl change column id id bigint; > DESC cow_nonpt_nonpcf_tbl; > -- shows id bigint > -- this works fine so far > select * from cow_nonpt_nonpcf_tbl; > -- throws exception > {code} > {code} > org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot > be converted in file > file:///opt/spark-warehouse/cow_nonpt_nonpcf_tbl/ff3c68e6-84d4-4a8a-8bc8-cc58736847aa-0_0-7-7_20220112182401452.parquet. > Column: [id], Expected: bigint, Found: INT32 > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) > at java.lang.Thread.run(Unknown Source) > Caused by: > org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339) > at > 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > ... 20 more > {code} > reported while testing on 0.10.1-rc1 (spark 3.0.3, 3.1.2) -- This message was sent by Atlassian Jira (v8
[jira] [Commented] (HUDI-2874) hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use hive/presto
[ https://issues.apache.org/jira/browse/HUDI-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474530#comment-17474530 ] Tao Meng commented on HUDI-2874: [~shivnarayan] yes, we have already fixed this issue in 0.10 > hudi should remove the temp file which create by > HoodieMergedLogRecordScanner, when we use hive/presto > -- > > Key: HUDI-2874 > URL: https://issues.apache.org/jira/browse/HUDI-2874 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.9.0 > Environment: hive3.1.1 > hadoop3.1.1 >Reporter: tao meng >Priority: Major > Fix For: 0.11.0 > > > when we use hive/presto to query mor table > hudi should remove the temp file which create by HoodieMergedLogRecordScanner > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3164) CTAS fails w/ UnsupportedOperationException when trying to modify immutable map in DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp
[ https://issues.apache.org/jira/browse/HUDI-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468646#comment-17468646 ] tao meng commented on HUDI-3164: [~shivnarayan] leesf has already solved this problem see https://issues.apache.org/jira/browse/HUDI-3140 > CTAS fails w/ UnsupportedOperationException when trying to modify immutable > map in DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp > - > > Key: HUDI-3164 > URL: https://issues.apache.org/jira/browse/HUDI-3164 > Project: Apache Hudi > Issue Type: Task >Reporter: sivabalan narayanan >Priority: Major > Fix For: 0.11.0, 0.10.1 > > > CTAS fails w/ UnsupportedOperationException when trying to modify immutable > map in DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp. > with spark3.2 master. > > {code:java} > val s = """ > create table catalog_sales > USING HUDI > options ( > type = 'cow', > primaryKey = 'cs_item_sk,cs_order_number' > ) > LOCATION 'file:///tmp/catalog_sales_hudi' > PARTITIONED BY (cs_sold_date_sk) > AS SELECT * FROM catalog_sales_ext2 {code} > stacktrace: > {code:java} > java.lang.UnsupportedOperationException > at java.util.Collections$UnmodifiableMap.put(Collections.java:1459) > at > org.apache.hudi.DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp(DataSourceUtils.java:323) > at > org.apache.hudi.spark3.internal.DefaultSource.getTable(DefaultSource.java:59) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:83) > at > org.apache.spark.sql.DataFrameWriter.getTable$1(DataFrameWriter.scala:280) > at > org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:296) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247) > at > org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:478) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159) > at > org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:109) > at > org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:91) > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001)
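The UnsupportedOperationException in the stack trace comes from calling put() on a map wrapped with Collections.unmodifiableMap. The sketch below only illustrates that failure mode and the usual copy-before-mutate fix pattern; it is not the actual HUDI-3140 patch, and the property key shown is simply the Spark conf named in this ticket.
{code:scala}
import java.util.{Collections, HashMap => JHashMap, Map => JMap}

val incoming: JMap[String, String] =
  Collections.unmodifiableMap(new JHashMap[String, String]())

// incoming.put("spark.sql.parquet.writeLegacyFormat", "true")
//   -> java.lang.UnsupportedOperationException, as in the stack trace above.

// Copy into a mutable map first, then overwrite the property on the copy.
val writable = new JHashMap[String, String](incoming)
writable.put("spark.sql.parquet.writeLegacyFormat", "true")
{code}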
[jira] [Created] (HUDI-3096) fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark
Tao Meng created HUDI-3096: -- Summary: fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark Key: HUDI-3096 URL: https://issues.apache.org/jira/browse/HUDI-3096 Project: Apache Hudi Issue Type: Bug Components: Flink Integration Affects Versions: 0.10.0 Environment: flink 1.13.1 spark 3.1.1 Reporter: Tao Meng Fix For: 0.11.0 Currently, Flink writes DecimalType values to parquet as byte[] (BINARY). When Spark reads such a decimal type and finds its precision is small, Spark expects it to be stored as int/long, which causes the following error: Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://x/tmp/hudi/hudi_x/46d44c57-aa43-41e2-a8aa-76dcc9dac7e4_0-4-0_20211221201230.parquet. Column: [c7], Expected: decimal(10,4), Found: BINARY at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:517) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2059) When log exists in mor table, clustering is triggered. The query result shows that the update record in log is lost
[ https://issues.apache.org/jira/browse/HUDI-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458858#comment-17458858 ] tao meng commented on HUDI-2059: [~shivnarayan] as HUDI-2170 merged, it's ok to close > When log exists in mor table, clustering is triggered. The query result > shows that the update record in log is lost > > > Key: HUDI-2059 > URL: https://issues.apache.org/jira/browse/HUDI-2059 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.8.0 > Environment: hadoop 3.1.1 > spark3.1.1/spark2.4.5 > hive3.1.1 >Reporter: tao meng >Assignee: tao meng >Priority: Major > Labels: pull-request-available, sev:high > Fix For: 0.11.0 > > > When log exists in mor table, and clustering is triggered. The query result > shows that the update record of log is lost。 > the reason of this problem is that: hoodie use HoodieFileSliceReader to read > table data and then do clustering. HoodieFileSliceReader call > HoodieMergedLogRecordScanner. > processNextRecord to merge update values and old valuse, when call that > function old values is reserved update values is discarded, this is wrong。 > test step: > // step1 : create hudi mor table > val df = spark.range(0, 1000).toDF("keyid") > .withColumn("col3", expr("keyid")) > .withColumn("age", lit(1)) > .withColumn("p", lit(2)) > df.write.format("hudi"). > option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL). > option(PRECOMBINE_FIELD_OPT_KEY, "col3"). > option(RECORDKEY_FIELD_OPT_KEY, "keyid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "p"). > option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert"). > option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, > classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName). > option("hoodie.insert.shuffle.parallelism", "4"). > option("hoodie.upsert.shuffle.parallelism", "4"). > option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") > .mode(SaveMode.Overwrite).save(basePath) > // step2, update age where keyid < 5 to produce log files > val df1 = spark.range(0, 5).toDF("keyid") > .withColumn("col3", expr("keyid")) > .withColumn("age", lit(1 + 1000)) > .withColumn("p", lit(2)) > df1.write.format("hudi"). > option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL). > option(PRECOMBINE_FIELD_OPT_KEY, "col3"). > option(RECORDKEY_FIELD_OPT_KEY, "keyid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "p"). > option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert"). > option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, > classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName). > option("hoodie.insert.shuffle.parallelism", "4"). > option("hoodie.upsert.shuffle.parallelism", "4"). > option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") > .mode(SaveMode.Append).save(basePath) > // step3, do cluster inline > val df2 = spark.range(6, 10).toDF("keyid") > .withColumn("col3", expr("keyid")) > .withColumn("age", lit(1 + 2000)) > .withColumn("p", lit(2)) > df2.write.format("hudi"). > option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL). > option(PRECOMBINE_FIELD_OPT_KEY, "col3"). > option(RECORDKEY_FIELD_OPT_KEY, "keyid"). > option(PARTITIONPATH_FIELD_OPT_KEY, "p"). > option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert"). > option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, > classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName). > option("hoodie.insert.shuffle.parallelism", "4"). > option("hoodie.upsert.shuffle.parallelism", "4"). 
> option("hoodie.parquet.small.file.limit", "0"). > option("hoodie.clustering.inline", "true"). > option("hoodie.clustering.inline.max.commits", "1"). > option("hoodie.clustering.plan.strategy.target.file.max.bytes", > "1073741824"). > option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). > option("hoodie.clustering.plan.strategy.max.bytes.per.group", > Long.MaxValue.toString) > .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") > .mode(SaveMode.Append).save(basePath) > spark.read.format("hudi") > .load(basePath).select("age").where("keyid = 0").show(100, false) > +---+ > |age| > +---+ > |1 | > +—+ > the result is wrong, since we update the value of age to 1001 at step 2. > > > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3001) clean up temp marker directory when finish bootstrap operation.
tao meng created HUDI-3001: -- Summary: clean up temp marker directory when finish bootstrap operation. Key: HUDI-3001 URL: https://issues.apache.org/jira/browse/HUDI-3001 Project: Apache Hudi Issue Type: Bug Affects Versions: 0.10.0 Environment: spark2.4.5 hadoop 3.1.1 Reporter: tao meng Fix For: 0.11.0 Currently, when the bootstrap operation finishes, the temp marker directory is not deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
[ https://issues.apache.org/jira/browse/HUDI-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2966: --- Priority: Minor (was: Major) > Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner > --- > > Key: HUDI-2966 > URL: https://issues.apache.org/jira/browse/HUDI-2966 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: tao meng >Priority: Minor > Fix For: 0.11.0 > > > Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner When > the query is completed。 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
[ https://issues.apache.org/jira/browse/HUDI-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2966: --- Priority: Major (was: Minor) > Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner > --- > > Key: HUDI-2966 > URL: https://issues.apache.org/jira/browse/HUDI-2966 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: tao meng >Priority: Major > Fix For: 0.11.0 > > > Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner When > the query is completed。 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
tao meng created HUDI-2966: -- Summary: Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner Key: HUDI-2966 URL: https://issues.apache.org/jira/browse/HUDI-2966 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: tao meng Fix For: 0.11.0 Add a TaskCompletionListener for HoodieMergeOnReadRDD to close the logScanner when the query is completed; see the sketch below. -- This message was sent by Atlassian Jira (v8.20.1#820001)
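A minimal sketch of the idea, assuming the RDD opens a closable log scanner per task (the helper below is hypothetical, not the actual HoodieMergeOnReadRDD code): registering a task completion listener guarantees the scanner is closed even when the downstream iterator is never fully consumed.
{code:scala}
import org.apache.spark.TaskContext

def openWithTaskCleanup[T <: AutoCloseable](open: () => T): T = {
  val scanner = open()
  // TaskContext is null outside of a running task, hence the Option guard.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit] { _ => scanner.close() }
  }
  scanner
}
{code}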
[jira] [Created] (HUDI-2958) Automatically set spark.sql.parquet.writelegacyformat. When using bulkinsert to insert data will contains decimal Type.
tao meng created HUDI-2958: -- Summary: Automatically set spark.sql.parquet.writelegacyformat. When using bulkinsert to insert data will contains decimal Type. Key: HUDI-2958 URL: https://issues.apache.org/jira/browse/HUDI-2958 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: tao meng Fix For: 0.11.0 By default, ParquetWriteSupport writes DecimalType to parquet as int32/int64 when the precision of the decimal type is < Decimal.MAX_LONG_DIGITS(), but AvroParquetReader, which is used by HoodieParquetReader, cannot read int32/int64 back as DecimalType. This leads to the following error: Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41) at org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:75) .. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2958) Automatically set spark.sql.parquet.writelegacyformat; When using bulkinsert to insert data which contains decimal Type.
[ https://issues.apache.org/jira/browse/HUDI-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2958: --- Summary: Automatically set spark.sql.parquet.writelegacyformat; When using bulkinsert to insert data which contains decimal Type. (was: Automatically set spark.sql.parquet.writelegacyformat. When using bulkinsert to insert data will contains decimal Type.) > Automatically set spark.sql.parquet.writelegacyformat; When using bulkinsert > to insert data which contains decimal Type. > > > Key: HUDI-2958 > URL: https://issues.apache.org/jira/browse/HUDI-2958 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: tao meng >Priority: Minor > Fix For: 0.11.0 > > > Now by default ParquetWriteSupport will write DecimalType to parquet as > int32/int64 when the scale of decimalType < Decimal.MAX_LONG_DIGITS(), > but AvroParquetReader which used by HoodieParquetReader cannot support read > int32/int64 as DecimalType. this will lead follow error > Caused by: java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary > at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41) > at > org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:75) > .. -- This message was sent by Atlassian Jira (v8.20.1#820001)
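Until the write path sets this automatically, the manual workaround the ticket implies is to enable Spark's legacy parquet format before the bulk insert, so decimals are written as fixed-length byte arrays that AvroParquetReader can read back. This is a sketch under that assumption; the bulk-insert write itself would use the same options as the other examples in this digest.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bulk-insert-decimal").getOrCreate()

// Write decimals in the legacy (fixed-length byte array) layout instead of int32/int64.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// ... then perform the bulk_insert write of the DataFrame containing the decimal column.
{code}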
[jira] [Created] (HUDI-2901) Fixed the bug clustering jobs are not running in parallel
tao meng created HUDI-2901: -- Summary: Fixed the bug clustering jobs are not running in parallel Key: HUDI-2901 URL: https://issues.apache.org/jira/browse/HUDI-2901 Project: Apache Hudi Issue Type: Bug Components: Performance Affects Versions: 0.9.0 Environment: spark2.4.5 Reporter: tao meng Assignee: tao meng Fix For: 0.11.0 Fixed the bug that clustering jobs are not running in parallel. [https://github.com/apache/hudi/issues/4135] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2876) hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use presto
tao meng created HUDI-2876: -- Summary: hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use presto Key: HUDI-2876 URL: https://issues.apache.org/jira/browse/HUDI-2876 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Environment: hive3.1.1 hadoop3.1.1 Reporter: tao meng Fix For: 0.11.0 Currently, when we read a MOR table through Hive/Presto, HoodieMergedLogRecordScanner creates a tmp dir for reading the logs. This directory is only deleted when the JVM exits, but for Presto/Hive local queries the JVM does not exit. As more and more queries run, the tmp dir grows larger and larger; see the cleanup sketch below. -- This message was sent by Atlassian Jira (v8.20.1#820001)
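A sketch of the cleanup idea (not the actual scanner code): File.deleteOnExit only fires at JVM shutdown, which never happens inside a long-lived HiveServer2/Presto process, so the spill directory should be removed explicitly when the reader is closed.
{code:scala}
import java.io.File
import org.apache.commons.io.FileUtils

class SpillingLogReader(spillDir: File) extends AutoCloseable {
  // ... merged log records are spilled under spillDir while the query runs ...
  override def close(): Unit =
    if (spillDir.exists()) FileUtils.deleteDirectory(spillDir) // delete now, not at JVM exit
}
{code}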
[jira] [Created] (HUDI-2874) hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use hive/presto
tao meng created HUDI-2874: -- Summary: hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use hive/presto Key: HUDI-2874 URL: https://issues.apache.org/jira/browse/HUDI-2874 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.9.0 Environment: hive3.1.1 hadoop3.1.1 Reporter: tao meng Fix For: 0.11.0 When we use Hive/Presto to query a MOR table, Hudi should remove the temp file created by HoodieMergedLogRecordScanner. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2873) support optimize data layout by sql and make the build more fast
tao meng created HUDI-2873: -- Summary: support optimize data layout by sql and make the build more fast Key: HUDI-2873 URL: https://issues.apache.org/jira/browse/HUDI-2873 Project: Apache Hudi Issue Type: Sub-task Components: Performance, Spark Integration Reporter: tao meng Fix For: 0.11.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2778) Optimize statistics collection related codes and add more docs for z-order
tao meng created HUDI-2778: -- Summary: Optimize statistics collection related codes and add more docs for z-order Key: HUDI-2778 URL: https://issues.apache.org/jira/browse/HUDI-2778 Project: Apache Hudi Issue Type: Improvement Reporter: tao meng Fix For: 0.10.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2758) remove redundant code in the HoodieRealtimeInputFormatUtils.getRealtimeSplits
tao meng created HUDI-2758: -- Summary: remove redundant code in the HoodieRealtimeInputFormatUtils.getRealtimeSplits Key: HUDI-2758 URL: https://issues.apache.org/jira/browse/HUDI-2758 Project: Apache Hudi Issue Type: Improvement Components: Hive Integration Reporter: tao meng Fix For: 0.10.0 remove redundant code in the HoodieRealtimeInputFormatUtils.getRealtimeSplits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-2697) Minor changes about hbase index config.
tao meng created HUDI-2697: -- Summary: Minor changes about hbase index config. Key: HUDI-2697 URL: https://issues.apache.org/jira/browse/HUDI-2697 Project: Apache Hudi Issue Type: Bug Components: configs Affects Versions: 0.9.0 Reporter: tao meng Fix For: 0.10.0 This applies when we use the HBase index. According to the official documentation, hoodie.hbase.index.update.partition.path and hoodie.index.hbase.put.batch.size.autocompute are optional; however, when we do not specify these parameters, an NPE occurs (a workaround sketch follows): java.lang.NullPointerException at org.apache.hudi.config.HoodieWriteConfig.getHbaseIndexUpdatePartitionPath(HoodieWriteConfig.java:1353) at org.apache.hudi.index.hbase.SparkHoodieHBaseIndex.lambda$locationTagFunction$eda54cbe$1(SparkHoodieHBaseIndex.java:212) -- This message was sent by Atlassian Jira (v8.3.4#803005)
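Until the defaults are fixed, a workaround is to pass the two "optional" properties explicitly so the config lookup never returns null. The option keys below are the ones named in the ticket; the values are illustrative, not recommendations.
{code:scala}
val hbaseIndexOpts = Map(
  "hoodie.index.type" -> "HBASE",
  "hoodie.hbase.index.update.partition.path" -> "false",
  "hoodie.index.hbase.put.batch.size.autocompute" -> "false"
)
// merged into the usual df.write.format("hudi").options(...) call
{code}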
[jira] [Created] (HUDI-2676) Hudi should synchronize owner information to hudi _rt/_ro table
tao meng created HUDI-2676: -- Summary: Hudi should synchronize owner information to hudi _rt/_ro table Key: HUDI-2676 URL: https://issues.apache.org/jira/browse/HUDI-2676 Project: Apache Hudi Issue Type: Bug Components: Spark Integration Affects Versions: 0.9.0 Environment: hudi 0.9.0 spark3.1.1 hive3.1.1 hadoop3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 The Hudi _rt/_ro tables are missing owner information. If permission protection is enabled for the query engine, verification of the protection group will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2674) hudi hive reader should not print read values
[ https://issues.apache.org/jira/browse/HUDI-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2674: --- Summary: hudi hive reader should not print read values (was: hudi hive reader should not log read values) > hudi hive reader should not print read values > - > > Key: HUDI-2674 > URL: https://issues.apache.org/jira/browse/HUDI-2674 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration >Affects Versions: 0.9.0 > Environment: hudi 0.9.0 > hive 3.1.1 > hadoop 3.1.1 >Reporter: tao meng >Assignee: tao meng >Priority: Critical > Fix For: 0.10.0 > > > now when we use hive to query hudi table and set > hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; > all read values will be print. This can lead to performance problems and data > security problems, > as: > xxx 20:10:45,045 | INFO | main | Reading from record reader | > HoodieCombineRealtimeRecordReader.java:69 > xx 20:10:45,045 | INFO | main | "values_0.158268513314199_10": > \{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"16","type5":"IntWritable","value6":"16jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"} > | HoodieCombineRealtimeRecordReader.java:70 > xxx 20:10:45,045 | INFO | main | Reading from record reader | > HoodieCombineRealtimeRecordReader.java:69 > xxx 20:10:45,045 | INFO | main | "values_0.16924293134429924_10": > \{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"96","type5":"IntWritable","value6":"96jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"} > | HoodieCombineRealtimeRecordReader.java:70 > 2021-11-02 20:10:45,045 | INFO | main | Reading from record reader | > HoodieCombineRealtimeRecordReader.java:69 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2674) hudi hive reader should not log read values
tao meng created HUDI-2674: -- Summary: hudi hive reader should not log read values Key: HUDI-2674 URL: https://issues.apache.org/jira/browse/HUDI-2674 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.9.0 Environment: hudi 0.9.0 hive 3.1.1 hadoop 3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 now when we use hive to query hudi table and set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; all read values will be print. This can lead to performance problems and data security problems, as: xxx 20:10:45,045 | INFO | main | Reading from record reader | HoodieCombineRealtimeRecordReader.java:69 xx 20:10:45,045 | INFO | main | "values_0.158268513314199_10": \{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"16","type5":"IntWritable","value6":"16jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"} | HoodieCombineRealtimeRecordReader.java:70 xxx 20:10:45,045 | INFO | main | Reading from record reader | HoodieCombineRealtimeRecordReader.java:69 xxx 20:10:45,045 | INFO | main | "values_0.16924293134429924_10": \{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"96","type5":"IntWritable","value6":"96jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"} | HoodieCombineRealtimeRecordReader.java:70 2021-11-02 20:10:45,045 | INFO | main | Reading from record reader | HoodieCombineRealtimeRecordReader.java:69 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2560) Introduce id_based schema to support full schema evolution
[ https://issues.apache.org/jira/browse/HUDI-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2560: --- Description: Introduce id_based schema to support full schema evolution. (was: +Introduce id_based schema to support full schema evolution+) > Introduce id_based schema to support full schema evolution > -- > > Key: HUDI-2560 > URL: https://issues.apache.org/jira/browse/HUDI-2560 > Project: Apache Hudi > Issue Type: Sub-task > Components: Common Core >Reporter: tao meng >Priority: Major > Fix For: 0.10.0 > > > Introduce id_based schema to support full schema evolution. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2560) Introduce id_based schema to support full schema evolution
[ https://issues.apache.org/jira/browse/HUDI-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2560: --- Description: +Introduce id_based schema to support full schema evolution+ > Introduce id_based schema to support full schema evolution > -- > > Key: HUDI-2560 > URL: https://issues.apache.org/jira/browse/HUDI-2560 > Project: Apache Hudi > Issue Type: Sub-task > Components: Common Core >Reporter: tao meng >Priority: Major > Fix For: 0.10.0 > > > +Introduce id_based schema to support full schema evolution+ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2560) Introduce id_based schema to support full schema evolution
tao meng created HUDI-2560: -- Summary: Introduce id_based schema to support full schema evolution Key: HUDI-2560 URL: https://issues.apache.org/jira/browse/HUDI-2560 Project: Apache Hudi Issue Type: Sub-task Components: Common Core Reporter: tao meng Fix For: 0.10.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2429) Full schema evolution
tao meng created HUDI-2429: -- Summary: Full schema evolution Key: HUDI-2429 URL: https://issues.apache.org/jira/browse/HUDI-2429 Project: Apache Hudi Issue Type: New Feature Components: Common Core Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up
[ https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2214: --- Description: residual temporary files after clustering are not cleaned up // test step step1: do clustering val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList val inputDF1: Dataset[Row] = spark.read.json(spark.sparkContext.parallelize(records1, 2)) inputDF1.write.format("org.apache.hudi") .options(commonOpts) .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL) // option for clustering .option("hoodie.parquet.small.file.limit", "0") .option("hoodie.clustering.inline", "true") .option("hoodie.clustering.inline.max.commits", "1") .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824") .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600") .option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString) .option("hoodie.clustering.plan.strategy.target.file.max.bytes", String.valueOf(12 *1024 * 1024L)) .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon") .mode(SaveMode.Overwrite) .save(basePath) step2: check the temp dir, we find /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty {color:#FF}/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208 {color} is not cleaned up. was: residual temporary files after clustering are not cleaned up > residual temporary files after clustering are not cleaned up > > > Key: HUDI-2214 > URL: https://issues.apache.org/jira/browse/HUDI-2214 > Project: Apache Hudi > Issue Type: Bug > Components: Cleaner >Affects Versions: 0.8.0 > Environment: spark3.1.1 > hadoop3.1.1 >Reporter: tao meng >Assignee: tao meng >Priority: Major > Fix For: 0.10.0 > > > residual temporary files after clustering are not cleaned up > // test step > step1: do clustering > val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList > val inputDF1: Dataset[Row] = > spark.read.json(spark.sparkContext.parallelize(records1, 2)) > inputDF1.write.format("org.apache.hudi") > .options(commonOpts) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), > DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL) > // option for clustering > .option("hoodie.parquet.small.file.limit", "0") > .option("hoodie.clustering.inline", "true") > .option("hoodie.clustering.inline.max.commits", "1") > .option("hoodie.clustering.plan.strategy.target.file.max.bytes", > "1073741824") > .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600") > .option("hoodie.clustering.plan.strategy.max.bytes.per.group", > Long.MaxValue.toString) > .option("hoodie.clustering.plan.strategy.target.file.max.bytes", > String.valueOf(12 *1024 * 1024L)) > .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, > begin_lon") > .mode(SaveMode.Overwrite) > .save(basePath) > step2: check the temp dir, we find > /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty > {color:#FF}/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208 > {color} > is not cleaned up. > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2214) residual temporary files after clustering are not cleaned up
tao meng created HUDI-2214: -- Summary: residual temporary files after clustering are not cleaned up Key: HUDI-2214 URL: https://issues.apache.org/jira/browse/HUDI-2214 Project: Apache Hudi Issue Type: Bug Components: Cleaner Affects Versions: 0.8.0 Environment: spark3.1.1 hadoop3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 residual temporary files after clustering are not cleaned up -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2116) sync 10w partitions to hive by using HiveSyncTool lead to the oom of hive MetaStore
tao meng created HUDI-2116: -- Summary: sync 10w partitions to hive by using HiveSyncTool lead to the oom of hive MetaStore Key: HUDI-2116 URL: https://issues.apache.org/jira/browse/HUDI-2116 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.8.0 Environment: hive3.1.1 hadoop 3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 Syncing 100,000 (10w) partitions to Hive with HiveSyncTool leads to an OOM of the Hive MetaStore. Here is a stress test for HiveSyncTool. env: hive metastore -Xms16G -Xmx16G hive.metastore.client.socket.timeout=10800 ||partitionNum||time consumed|| |100|37s| |1000|168s| |5000|1830s| |10000|timeout| |100000|hive metastore oom| HiveSyncTool syncs all partitions to the Hive metastore at once. When the partition count is large, this puts a lot of pressure on the metastore. For large partition counts we should support batch sync; a sketch of the idea follows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
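A sketch of the batch-sync idea only (not the HiveSyncTool implementation): rather than registering every partition in one metastore call, group the partitions and issue one ADD PARTITION round trip per batch. The addBatch callback, the helper name and the batch size are illustrative.
{code:scala}
def syncInBatches(partitions: Seq[String], batchSize: Int)(addBatch: Seq[String] => Unit): Unit =
  partitions.grouped(batchSize).foreach(addBatch)

// e.g. syncInBatches(allPartitions, 1000) { batch => runAddPartitionDdl(batch) }
// where runAddPartitionDdl is a hypothetical helper that builds one
// ALTER TABLE ... ADD PARTITION statement for the batch.
{code}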
[jira] [Created] (HUDI-2102) support hilbert curve for hudi
tao meng created HUDI-2102: -- Summary: support hilbert curve for hudi Key: HUDI-2102 URL: https://issues.apache.org/jira/browse/HUDI-2102 Project: Apache Hudi Issue Type: Sub-task Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 Support the Hilbert curve for Hudi to optimize Hudi queries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2101) support z-order for hudi
tao meng created HUDI-2101: -- Summary: support z-order for hudi Key: HUDI-2101 URL: https://issues.apache.org/jira/browse/HUDI-2101 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 Support z-order for Hudi to optimize queries; a toy sketch of the z-value idea follows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
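A toy sketch of the idea behind z-ordering, not the Hudi implementation: interleave the bits of two column values into a single z-value and lay files out in that order, so rows that are close in either column stay clustered and min/max statistics can skip files for predicates on either column.
{code:scala}
def zValue(x: Int, y: Int): Long = {
  var z = 0L
  var i = 0
  while (i < 32) {
    z |= ((x >>> i) & 1L) << (2 * i)       // even bit positions come from x
    z |= ((y >>> i) & 1L) << (2 * i + 1)   // odd bit positions come from y
    i += 1
  }
  z
}

// zValue(3, 5) == 39 (binary 100111): sorting by this interleaved value keeps nearby (x, y) pairs together.
{code}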
[jira] [Created] (HUDI-2100) Support Space curve for hudi
tao meng created HUDI-2100: -- Summary: Support Space curve for hudi Key: HUDI-2100 URL: https://issues.apache.org/jira/browse/HUDI-2100 Project: Apache Hudi Issue Type: New Feature Components: Spark Integration Reporter: tao meng Assignee: tao meng Fix For: 0.10.0 Support space-filling curves to optimize the clustering of Hudi files and improve query performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2099) hive lock which state is WAITING should be released, otherwise this hive lock will be locked forever
tao meng created HUDI-2099: -- Summary: hive lock which state is WAITING should be released, otherwise this hive lock will be locked forever Key: HUDI-2099 URL: https://issues.apache.org/jira/browse/HUDI-2099 Project: Apache Hudi Issue Type: Bug Components: Common Core Affects Versions: 0.8.0 Environment: spark3.1.1 hive3.1.1 hadoop3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.8.0 When acquiring the Hive lock fails and the lock state is WAITING, we should release this WAITING lock; otherwise the Hive lock stays locked forever. Test steps: use a Hive lock to control concurrent writes to Hudi; let's call this lock hive_lock. Start three writers writing to the Hudi table with hive_lock concurrently; one of the writers fails to acquire the Hive lock due to contention: *Exception in thread "main" org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object LockResponse(lockid:76, state:WAITING)* Start another writer to write the Hudi table using the same hive_lock; now hive_lock is locked forever and there is no way to acquire it: *Exception in thread "main" org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object LockResponse(lockid:87, state:WAITING)* A sketch of the release logic follows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
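A sketch of the fix idea, assuming the standard Hive metastore client API (this is not the actual Hudi lock-provider patch): when the metastore returns a lock in WAITING state and the writer gives up on it, the lock id must be explicitly unlocked, otherwise the queued entry blocks every later writer exactly as described above.
{code:scala}
import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.{LockRequest, LockResponse, LockState}

def tryAcquire(client: IMetaStoreClient, request: LockRequest): Option[Long] = {
  val response: LockResponse = client.lock(request)
  if (response.getState == LockState.ACQUIRED) {
    Some(response.getLockid)
  } else {
    // Abandoning a WAITING lock without unlocking it leaves it queued forever.
    client.unlock(response.getLockid)
    None
  }
}
{code}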
[jira] [Created] (HUDI-2098) add Hdfs file lock for HUDI
tao meng created HUDI-2098: -- Summary: add Hdfs file lock for HUDI Key: HUDI-2098 URL: https://issues.apache.org/jira/browse/HUDI-2098 Project: Apache Hudi Issue Type: Bug Components: Usability, Utilities Affects Versions: 0.8.0 Environment: spark3.1.1 hive3.1.1 hadoop3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.9.0 now hudi support hive/zk lock for concurrency write, we introduce a new lock type hdfs lock -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-2098) add Hdfs file lock for HUDI
[ https://issues.apache.org/jira/browse/HUDI-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-2098: --- Component/s: (was: Utilities) > add Hdfs file lock for HUDI > --- > > Key: HUDI-2098 > URL: https://issues.apache.org/jira/browse/HUDI-2098 > Project: Apache Hudi > Issue Type: Bug > Components: Usability >Affects Versions: 0.8.0 > Environment: spark3.1.1 > hive3.1.1 > hadoop3.1.1 >Reporter: tao meng >Assignee: tao meng >Priority: Minor > Labels: features > Fix For: 0.9.0 > > > now hudi support hive/zk lock for concurrency write, we introduce a new lock > type hdfs lock > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2090) when hudi metadata is enabled, use different user to query table, the query will failed
tao meng created HUDI-2090: -- Summary: when hudi metadata is enabled, use different user to query table, the query will failed Key: HUDI-2090 URL: https://issues.apache.org/jira/browse/HUDI-2090 Project: Apache Hudi Issue Type: Bug Components: Common Core Affects Versions: 0.8.0 Reporter: tao meng Assignee: tao meng Fix For: 0.9.0 When Hudi metadata is enabled and a different user queries the table, the query fails. The user permissions of the temporary directory generated by DiskBasedMap are incorrect: the directory only grants permissions to the user of the current operation, and other users have no permission to access it, which leads to this problem. Test steps: step 1: create a Hudi table with metadata enabled. step 2: create two users (omm, user2). step 3: 1) use omm to query the Hudi table; DiskBasedMap generates view_map with permissions drwx--. 2) then use user2 to query the Hudi table; user2 has no right to access the view_map created by omm, and the following exception is thrown: org.apache.hudi.exception.HoodieIOException: IOException when creating ExternalSpillableMap at /tmp/view_map -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2089) fix the bug that metadata table cannot support non_partition table
tao meng created HUDI-2089: -- Summary: fix the bug that metadata table cannot support non_partition table Key: HUDI-2089 URL: https://issues.apache.org/jira/browse/HUDI-2089 Project: Apache Hudi Issue Type: Bug Components: Spark Integration Affects Versions: 0.8.0 Environment: spark3.1.1 hive3.1.1 hadoop 3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.9.0
We found that when we enable the metadata table for a non-partitioned hudi table, the following error occurs:
org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata table.org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata table. at org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:447) at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:433) at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:187)
We use hudi 0.8, but we also see this problem in the latest hudi code.
test step:
val df = spark.range(0, 1000).toDF("keyid")
  .withColumn("col3", expr("keyid"))
  .withColumn("age", lit(1))
  .withColumn("p", lit(2))
df.write.format("hudi").
  option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL).
  option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
  option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "").
  option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert").
  option("hoodie.insert.shuffle.parallelism", "4").
  option("hoodie.metadata.enable", "true").
  option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
  .mode(SaveMode.Overwrite).save(basePath)
// upsert the same records again
df.write.format("hudi").
  option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL).
  option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
  option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "").
  option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert").
  option("hoodie.insert.shuffle.parallelism", "4").
  option("hoodie.metadata.enable", "true").
  option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
  .mode(SaveMode.Append).save(basePath)
org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata table.org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata table. 
at org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:447) at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:433) at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:187) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:564) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:230) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:162) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2086) redo the logical of mor_incremental_view for hive
tao meng created HUDI-2086: -- Summary: redo the logical of mor_incremental_view for hive Key: HUDI-2086 URL: https://issues.apache.org/jira/browse/HUDI-2086 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Environment: spark3.1.1 hive3.1.1 hadoop3.1.1 os: suse Reporter: tao meng
There are currently some problems with the mor_incremental_view for hive. For example:
1) *hudi cannot read the latest incremental data that is stored only in logs*
Consider this: create a mor table with bulk_insert, then do an upsert on this table. Now we want to query the latest incremental data with hive/sparksql, but the latest incremental data is stored in log files, so the query returns nothing.
step1: prepare data
val df = spark.sparkContext.parallelize(0 to 20, 2).map(x => testCase(x, x+"jack", Random.nextInt(2))).toDF()
  .withColumn("col3", expr("keyid + 3000"))
  .withColumn("p", lit(1))
step2: do bulk_insert
mergePartitionTable(df, 4, "default", "inc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
step3: do upsert
mergePartitionTable(df, 4, "default", "inc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
step4: check the latest commit time and do the query
spark.sql("set hoodie.inc.consume.mode=INCREMENTAL")
spark.sql("set hoodie.inc.consume.max.commits=1")
spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935")
spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` > '20210628103935' order by keyid").show(100, false)
+-----+----+
|keyid|col3|
+-----+----+
+-----+----+
2) *if we do insert_overwrite/insert_overwrite_table on a hudi mor table, the incremental query result is wrong when we query the data written before the insert_overwrite/insert_overwrite_table*
step1: do bulk_insert
mergePartitionTable(df, 4, "default", "overInc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
now the commits are [20210628160614.deltacommit]
step2: do insert_overwrite_table
mergePartitionTable(df, 4, "default", "overInc", tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table")
now the commits are [20210628160614.deltacommit, 20210628160923.replacecommit]
step3: query the data before the insert_overwrite_table
spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL")
spark.sql("set hoodie.overInc.consume.max.commits=1")
spark.sql("set hoodie.overInc.consume.start.timestamp=0")
spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` > '0' order by keyid").show(100, false)
+-----+----+
|keyid|col3|
+-----+----+
+-----+----+
3) *hive/presto/flink cannot read file groups which have only logs*
When we use the hbase/inmemory index, the mor table produces log files instead of parquet files, and hive/presto cannot read those file groups because they contain only log files. *HUDI-2048* mentions this problem.
However, when we use the spark data source to execute an incremental query, none of the problems above occur. Keeping the mor_incremental_view logic for hive consistent with the spark dataSource is therefore necessary.
We redo the mor_incremental_view logic for hive to solve the problems above and keep it consistent with the spark dataSource.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
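For comparison, the spark datasource incremental query that does return the log-only updates (option keys as used elsewhere in these issues for 0.8.0; basePath is assumed to point at the same table written by the helper above, and the start timestamp comes from step4).

{code:scala}
import org.apache.hudi.DataSourceReadOptions

// Spark datasource incremental read of the same MOR table; unlike the Hive
// incremental view above, this picks up updates that exist only in log files.
val incDf = spark.read.format("hudi").
  option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20210628103935").
  load(basePath)   // basePath: assumed location of the "inc" table above

incDf.select("keyid", "col3").orderBy("keyid").show(100, false)
{code}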
[jira] [Assigned] (HUDI-2086) redo the logical of mor_incremental_view for hive
[ https://issues.apache.org/jira/browse/HUDI-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng reassigned HUDI-2086: -- Assignee: tao meng > redo the logical of mor_incremental_view for hive > - > > Key: HUDI-2086 > URL: https://issues.apache.org/jira/browse/HUDI-2086 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration > Environment: spark3.1.1 > hive3.1.1 > hadoop3.1.1 > os: suse >Reporter: tao meng >Assignee: tao meng >Priority: Major > > now ,There are some problems with mor_incremental_view for hive。 > For example, > 1):*hudi cannot read the lastest incremental datas which are stored by logs* > think that: create a mor table with bulk_insert, and then do upsert for this > table, > no we want to query the latest incremental data by hive/sparksql, however > the lastest incremental datas are stored by logs, when we do query nothings > will return > step1: prepare data > val df = spark.sparkContext.parallelize(0 to 20, 2).map(x => testCase(x, > x+"jack", Random.nextInt(2))).toDF() > .withColumn("col3", expr("keyid + 3000")) > .withColumn("p", lit(1)) > step2: do bulk_insert > mergePartitionTable(df, 4, "default", "inc", tableType = > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") > step3: do upsert > mergePartitionTable(df, 4, "default", "inc", tableType = > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") > step4: check the lastest commit time and do query > spark.sql("set hoodie.inc.consume.mode=INCREMENTAL") > spark.sql("set hoodie.inc.consume.max.commits=1") > spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935") > spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` > > '20210628103935' order by keyid").show(100, false) > +-++ > |keyid|col3| > +-++ > +-++ > > 2):*if we do insert_over_write/insert_over_write_table for hudi mor table, > the incr query result is wrong when we want to query the data before > insert_overwrite/insert_overwrite_table* > step1: do bulk_insert > mergePartitionTable(df, 4, "default", "overInc", tableType = > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") > now the commits is > [20210628160614.deltacommit ] > step2: do insert_overwrite_table > mergePartitionTable(df, 4, "default", "overInc", tableType = > DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table") > now the commits is > [20210628160614.deltacommit, 20210628160923.replacecommit ] > step3: query the data before insert_overwrite_table > spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL") > spark.sql("set hoodie.overInc.consume.max.commits=1") > spark.sql("set hoodie.overInc.consume.start.timestamp=0") > spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` > > '0' order by keyid").show(100, false) > +-++ > |keyid|col3| > +-++ > +-++ > > 3) *hive/presto/flink cannot read file groups which has only logs* > when we use hbase/inmemory as index, mor table will produce log files instead > of parquet file, but now hive/presto cannot read those files since those > files are log files. > *HUDI-2048* mentions this problem. 
> > however when we use spark data source to executre incremental query, there is > no such problem above。keep the logical of mor_incremental_view for hive as > the same logicl as spark dataSource is necessary。 > we redo the logical of mor_incremental_view for hive,to solve above problems > and keep the logical of mor_incremental_view as the same logicl as spark > dataSource > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2059) When log exists in mor table, clustering is triggered. The query result shows that the update record in log is lost
tao meng created HUDI-2059: -- Summary: When log exists in mor table, clustering is triggered. The query result shows that the update record in log is lost Key: HUDI-2059 URL: https://issues.apache.org/jira/browse/HUDI-2059 Project: Apache Hudi Issue Type: Bug Affects Versions: 0.8.0 Environment: hadoop 3.1.1 spark3.1.1/spark2.4.5 hive3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.9.0 When log exists in mor table, and clustering is triggered. The query result shows that the update record of log is lost。 the reason of this problem is that: hoodie use HoodieFileSliceReader to read table data and then do clustering. HoodieFileSliceReader call HoodieMergedLogRecordScanner. processNextRecord to merge update values and old valuse, when call that function old values is reserved update values is discarded, this is wrong。 test step: // step1 : create hudi mor table val df = spark.range(0, 1000).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("age", lit(1)) .withColumn("p", lit(2)) df.write.format("hudi"). option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL). option(PRECOMBINE_FIELD_OPT_KEY, "col3"). option(RECORDKEY_FIELD_OPT_KEY, "keyid"). option(PARTITIONPATH_FIELD_OPT_KEY, "p"). option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert"). option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName). option("hoodie.insert.shuffle.parallelism", "4"). option("hoodie.upsert.shuffle.parallelism", "4"). option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .mode(SaveMode.Overwrite).save(basePath) // step2, update age where keyid < 5 to produce log files val df1 = spark.range(0, 5).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("age", lit(1 + 1000)) .withColumn("p", lit(2)) df1.write.format("hudi"). option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL). option(PRECOMBINE_FIELD_OPT_KEY, "col3"). option(RECORDKEY_FIELD_OPT_KEY, "keyid"). option(PARTITIONPATH_FIELD_OPT_KEY, "p"). option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert"). option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName). option("hoodie.insert.shuffle.parallelism", "4"). option("hoodie.upsert.shuffle.parallelism", "4"). option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .mode(SaveMode.Append).save(basePath) // step3, do cluster inline val df2 = spark.range(6, 10).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("age", lit(1 + 2000)) .withColumn("p", lit(2)) df2.write.format("hudi"). option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL). option(PRECOMBINE_FIELD_OPT_KEY, "col3"). option(RECORDKEY_FIELD_OPT_KEY, "keyid"). option(PARTITIONPATH_FIELD_OPT_KEY, "p"). option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert"). option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName). option("hoodie.insert.shuffle.parallelism", "4"). option("hoodie.upsert.shuffle.parallelism", "4"). option("hoodie.parquet.small.file.limit", "0"). option("hoodie.clustering.inline", "true"). option("hoodie.clustering.inline.max.commits", "1"). option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). 
option("hoodie.clustering.plan.strategy.max.bytes.per.group", Long.MaxValue.toString) .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .mode(SaveMode.Append).save(basePath) spark.read.format("hudi") .load(basePath).select("age").where("keyid = 0").show(100, false) +---+ |age| +---+ |1 | +—+ the result is wrong, since we update the value of age to 1001 at step 2. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2058) support incremental query for insert_overwrite_table/insert_overwrite operation on cow table
tao meng created HUDI-2058: -- Summary: support incremental query for insert_overwrite_table/insert_overwrite operation on cow table Key: HUDI-2058 URL: https://issues.apache.org/jira/browse/HUDI-2058 Project: Apache Hudi Issue Type: Bug Components: Incremental Pull Affects Versions: 0.8.0 Environment: hadoop 3.1.1 spark3.1.1 hive 3.1.1 Reporter: tao meng Assignee: tao meng Fix For: 0.9.0 when incremental query contains multiple commit before and after replacecommit, and the query result contains the data of the old file. Notice: mor table is ok, only cow table has this problem. when query incr_view for cow table, replacecommit is ignored which lead the wrong result. test step: step1: create dataFrame val df = spark.range(0, 10).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("age", lit(1)) .withColumn("p", lit(2)) step2: insert df to a empty hoodie table df.write.format("hudi"). option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL). option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3"). option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid"). option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, ""). option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator"). option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert"). option("hoodie.insert.shuffle.parallelism", "4"). option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .mode(SaveMode.Overwrite).save(basePath) step3: do insert_overwrite df.write.format("hudi"). option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL). option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3"). option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid"). option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, ""). option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator"). option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert_overwrite_table"). option("hoodie.insert.shuffle.parallelism", "4"). option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .mode(SaveMode.Append).save(basePath) step4: query incrematal table spark.read.format("hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "") .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY, currentCommits(0)) .load(basePath).select("keyid").orderBy("keyid").show(100, false) result: the result contains old data +-+ |keyid| +-+ |0 | |0 | |1 | |1 | |2 | |2 | |3 | |3 | |4 | |4 | |5 | |5 | |6 | |6 | |7 | |7 | |8 | |8 | |9 | |9 | +-+ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-1676) Support SQL with spark3
[ https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng closed HUDI-1676. -- Resolution: Fixed > Support SQL with spark3 > --- > > Key: HUDI-1676 > URL: https://issues.apache.org/jira/browse/HUDI-1676 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Affects Versions: 0.9.0 >Reporter: tao meng >Assignee: tao meng >Priority: Major > Labels: pull-request-available, sev:normal > Fix For: 0.9.0 > > > 1、support CTAS for spark3 > 3、support INSERT for spark3 > 4、support merge、update、delete without RowKey constraint for spark3 > 5、support dataSourceV2 for spark3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1676) Support SQL with spark3
[ https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng reassigned HUDI-1676: -- Assignee: tao meng > Support SQL with spark3 > --- > > Key: HUDI-1676 > URL: https://issues.apache.org/jira/browse/HUDI-1676 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Affects Versions: 0.9.0 >Reporter: tao meng >Assignee: tao meng >Priority: Major > Labels: pull-request-available, sev:normal > Fix For: 0.9.0 > > > 1、support CTAS for spark3 > 3、support INSERT for spark3 > 4、support merge、update、delete without RowKey constraint for spark3 > 5、support dataSourceV2 for spark3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1676) Support SQL with spark3
[ https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359310#comment-17359310 ] tao meng commented on HUDI-1676: [~pzw2018] great works > Support SQL with spark3 > --- > > Key: HUDI-1676 > URL: https://issues.apache.org/jira/browse/HUDI-1676 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Affects Versions: 0.9.0 >Reporter: tao meng >Priority: Major > Labels: pull-request-available, sev:normal > Fix For: 0.9.0 > > > 1、support CTAS for spark3 > 3、support INSERT for spark3 > 4、support merge、update、delete without RowKey constraint for spark3 > 5、support dataSourceV2 for spark3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1817) when query incr view of hudi table by using spark-sql. the result is wrong
tao meng created HUDI-1817: -- Summary: when query incr view of hudi table by using spark-sql. the result is wrong Key: HUDI-1817 URL: https://issues.apache.org/jira/browse/HUDI-1817 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.8.0 Environment: spark2.4.5 hive 3.1.1 hadoop 3.1.1 Reporter: tao meng Fix For: 0.9.0 create hudi table (mor or cow) val base_data = spark.read.parquet("/tmp/tb_base") val upsert_data = spark.read.parquet("/tmp/tb_upsert") base_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, "col2").option(RECORDKEY_FIELD_OPT_KEY, "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, "col0").option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, "bulk_insert").option(HIVE_SYNC_ENABLED_OPT_KEY, "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY, "testdb").option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, "false").option("hoodie.bulkinsert.shuffle.parallelism", 4).option("hoodie.insert.shuffle.parallelism", 4).option("hoodie.upsert.shuffle.parallelism", 4).option("hoodie.delete.shuffle.parallelism", 4).option("hoodie.datasource.write.hive_style_partitioning", "true").option(TABLE_NAME, "tb_test_mor_par").mode(Overwrite).save(s"/tmp/testdb/tb_test_mor_par") upsert_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, "col2").option(RECORDKEY_FIELD_OPT_KEY, "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, "col0").option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, "upsert").option(HIVE_SYNC_ENABLED_OPT_KEY, "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY, "testdb").option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, "false").option("hoodie.bulkinsert.shuffle.parallelism", 4).option("hoodie.insert.shuffle.parallelism", 4).option("hoodie.upsert.shuffle.parallelism", 4).option("hoodie.delete.shuffle.parallelism", 4).option("hoodie.datasource.write.hive_style_partitioning", "true").option(TABLE_NAME, "tb_test_mor_par").mode(Append).save(s"/tmp/testdb/tb_test_mor_par") query incr view by sparksql: set hoodie.tb_test_mor_par.consume.mode=INCREMENTAL; set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330; set hoodie.tb_test_mor_par.consume.max.commits=3; select _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 from testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' order by primary_key; +---+---+++++ |_hoodie_commit_time|primary_key|col0|col1|col6 |col7| +---+---+++++ |20210420155738 |20 |77 |sC |158788760400|739 | |20210420155738 |21 |66 |ps |160979049700|61 | |20210420155738 |22 |47 |1P |158460042900|835 | |20210420155738 |23 |36 |5K |160763480800|538 | |20210420155738 |24 |1 |BA |160685711300|775 | |20210420155738 |24 |101 |BA |160685711300|775 | |20210420155738 |24 |100 |BA |160685711300|775 | |20210420155738 |24 |102 |BA |160685711300|775 | +---+---+++++ the primary_key is repeated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1816) when query incr view of hudi table by using spark-sql, the query result is wrong
tao meng created HUDI-1816: -- Summary: when query incr view of hudi table by using spark-sql, the query result is wrong Key: HUDI-1816 URL: https://issues.apache.org/jira/browse/HUDI-1816 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.8.0 Environment: spark2.4.5 hive 3.1.1hadoop 3.1.1 Reporter: tao meng Fix For: 0.9.0 test step1: create a partitioned hudi table (mor / cow) val base_data = spark.read.parquet("/tmp/tb_base") val upsert_data = spark.read.parquet("/tmp/tb_upsert") base_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, "col2").option(RECORDKEY_FIELD_OPT_KEY, "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, "col0").option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, "bulk_insert").option(HIVE_SYNC_ENABLED_OPT_KEY, "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY, "testdb").option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, "false").option("hoodie.bulkinsert.shuffle.parallelism", 4).option("hoodie.insert.shuffle.parallelism", 4).option("hoodie.upsert.shuffle.parallelism", 4).option("hoodie.delete.shuffle.parallelism", 4).option("hoodie.datasource.write.hive_style_partitioning", "true").option(TABLE_NAME, "tb_test_mor_par").mode(Overwrite).save(s"/tmp/testdb/tb_test_mor_par") upsert_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, "col2").option(RECORDKEY_FIELD_OPT_KEY, "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, "col0").option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, "upsert").option(HIVE_SYNC_ENABLED_OPT_KEY, "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY, "testdb").option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, "false").option("hoodie.bulkinsert.shuffle.parallelism", 4).option("hoodie.insert.shuffle.parallelism", 4).option("hoodie.upsert.shuffle.parallelism", 4).option("hoodie.delete.shuffle.parallelism", 4).option("hoodie.datasource.write.hive_style_partitioning", "true").option(TABLE_NAME, "tb_test_mor_par").mode(Append).save(s"/tmp/testdb/tb_test_mor_par") query incr view by sparksql: set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330; set hoodie.tb_test_mor_par.consume.max.commits=3; select _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 from testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' order by primary_key; +---+---+++++ |_hoodie_commit_time|primary_key|col0|col1|col6 |col7| +---+---+++++ |20210420155738 |20 |77 |sC |158788760400|739 | |20210420155738 |21 |66 |ps |160979049700|61 | |20210420155738 |22 |47 |1P |158460042900|835 | |20210420155738 |23 |36 |5K |160763480800|538 | |20210420155738 |24 |1 |BA |160685711300|775 | |20210420155738 |24 |101 |BA |160685711300|775 | |20210420155738 |24 |100 |BA |160685711300|775 | |20210420155738 |24 |102 |BA |160685711300|775 | +---+---+++++ primary key 24 is repeated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
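A quick way to confirm the symptom above, assuming the same spark-sql session settings (consume.start.timestamp etc.) have already been applied: group the incremental result by primary_key and look for keys that appear more than once. Table name and timestamp are the ones from the steps above.

{code:scala}
// Any primary_key returned more than once corresponds to the duplicated rows shown above.
spark.sql(
  """select primary_key, count(*) as cnt
    |from testdb.tb_test_mor_par_rt
    |where _hoodie_commit_time > '20210420145330'
    |group by primary_key
    |having count(*) > 1""".stripMargin).show(false)
{code}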
[jira] [Created] (HUDI-1783) support Huawei Cloud Object Storage
tao meng created HUDI-1783: -- Summary: support Huawei Cloud Object Storage Key: HUDI-1783 URL: https://issues.apache.org/jira/browse/HUDI-1783 Project: Apache Hudi Issue Type: Bug Components: Common Core Reporter: tao meng Fix For: 0.9.0 add support for Huawei Cloud Object Storage -- This message was sent by Atlassian Jira (v8.3.4#803005)
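A minimal usage sketch, assuming OBS is exposed through the Huawei OBSA Hadoop connector with the obs:// scheme (the scheme, bucket and credentials setup are assumptions, not part of this issue); the work here is what would let Hudi treat such a path as a supported storage scheme.

{code:scala}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Assumes a spark-shell session with the OBSA filesystem jars and credentials configured.
val df = spark.range(0, 10).toDF("keyid").withColumn("col3", expr("keyid"))

// Writing a Hudi table directly to a Huawei OBS bucket (bucket/path are made up).
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "hoodie_obs_test").
  option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
  option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
  mode(SaveMode.Overwrite).
  save("obs://my-bucket/warehouse/hoodie_obs_test")
{code}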
[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
[ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1722: --- Description: HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders share same jobConf。 Consider the following questions: we have four getRecordReaders: reader1(its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3(its hoodieRealtimeSplit contains log files) reader4(its hoodieRealtimeSplit contains no log files) now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf will be set to be true, and no hoodie additional projection columns will be added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to be true, no hoodie additional projection columns will be added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) which lead to the result that _hoodie_record_key would be missing and merge step would throw exceptions 2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: Error: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77) at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) Obviously, this is an occasional problem。 if reader2 run first, hoodie additional projection columns will be added to jobConf and in this case the query 
will be ok sparksql can avoid this problem by set spark.hadoop.cloneConf=true which is not recommended in spark, however hive has no way to avoid this problem。 test step: step1: val df = spark.range(0, 10).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String] ("sb1", "rz"))) .withColumn("a2", lit(Array[String] ("sb1", "rz"))) // create hoodie table hive_14b merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") notice: bulk_insert will produce 4 files in hoodie table step2: val df = spark.range(9, 12).toDF("keyid") .withColumn("col3", expr("keyid")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String] ("sb1", "rz"))) .withColumn("a2", lit(Array[String] ("sb1", "rz"))) // upsert table merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") now : we have four base files and one log file in hoodie table step3: spark-sql/beeline: select count(col3) from hive_14b_rt; then the query failed. was: HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealti
[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
[ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1722: --- Description: HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders share same jobConf。 Consider the following questions: we have four getRecordReaders: reader1(its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3(its hoodieRealtimeSplit contains log files) reader4(its hoodieRealtimeSplit contains no log files) now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf will be set to be true, and no hoodie additional projection columns will be added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to be true, no hoodie additional projection columns will be added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) which lead to the result that _hoodie_record_key would be missing and merge step would throw exceptions Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) ... 24 more Caused by: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) Obviously, this is an occasional problem。 if reader2 run first, hoodie additional projection columns will be added to jobConf and in this case the query will be ok sparksql can avoid this problem by set spark.hadoop.cloneConf=true which is not recommended in spark, however hive has no way to avoid this problem。 was: HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders share same jobConf。 Consider the following questions: we have four getRecordReaders: reader1(its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3(its hoodieRealtimeSplit contains log files) reader4(its hoodieRealtimeSplit contains no log files) now reader1 run first, 
HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf will be set to be true, and no hoodie additional projection columns will be added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to be true, no hoodie additional projection columns will be added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) which lead to the result that _hoodie_record_key would be missing and merge step would throw exceptions Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) ... 24 more Caused by: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(Re
[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
[ https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1722: --- Description: HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders share same jobConf。 Consider the following questions: we have four getRecordReaders: reader1(its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3(its hoodieRealtimeSplit contains log files) reader4(its hoodieRealtimeSplit contains no log files) now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf will be set to be true, and no hoodie additional projection columns will be added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to be true, no hoodie additional projection columns will be added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) which lead to the result that _hoodie_record_key would be missing and merge step would throw exceptions Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) ... 24 more Caused by: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) Obviously, this is an occasional problem。 if reader2 run first, hoodie additional projection columns will be added to jobConf and in this case the query will be ok sparksql can avoid this problem by set spark.hadoop.cloneConf=true which is not recommended in spark, however hive has no way to avoid this problem。 was: HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders share same jobConf。 Consider the following questions: we have four getRecordReaders: reader1(its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3(its hoodieRealtimeSplit contains log files) reader4(its hoodieRealtimeSplit contains no log files) now reader1 run first, 
HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf will be set to be true, and no hoodie additional projection columns will be added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to be true, no hoodie additional projection columns will be added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) which lead to the result that _hoodie_record_key would be missing and merge step would throw exceptions Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) ... 24 more Caused by: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompacted
[jira] [Created] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE
tao meng created HUDI-1722: -- Summary: hive beeline/spark-sql query specified field on mor table occur NPE Key: HUDI-1722 URL: https://issues.apache.org/jira/browse/HUDI-1722 Project: Apache Hudi Issue Type: Bug Components: Hive Integration, Spark Integration Affects Versions: 0.7.0 Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1 Reporter: tao meng Fix For: 0.9.0 HUDI-892 introduce this problem。 this pr skip adding projection columns if there are no log files in the hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders share same jobConf。 Consider the following questions: we have four getRecordReaders: reader1(its hoodieRealtimeSplit contains no log files) reader2 (its hoodieRealtimeSplit contains log files) reader3(its hoodieRealtimeSplit contains log files) reader4(its hoodieRealtimeSplit contains no log files) now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf will be set to be true, and no hoodie additional projection columns will be added to jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in jobConf is set to be true, no hoodie additional projection columns will be added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf) which lead to the result that _hoodie_record_key would be missing and merge step would throw exceptions Caused by: java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) ... 24 more Caused by: java.lang.NullPointerException at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36) at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522) Obviously, this is an occasional problem。 if reader2 run first, hoodie additional projection columns will be added to jobConf and in this case the query will be ok -- This message was sent by Atlassian Jira (v8.3.4#803005)
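A toy sketch of why caching an "already added projection columns" flag in a JobConf shared by all record readers is unsafe (the names below are illustrative, not the actual Hudi code): the first split that needs no extra columns flips the flag, and later splits that do need _hoodie_record_key never get it added.

{code:scala}
import scala.collection.mutable

// Stand-in for the shared JobConf: one mutable map reused by every reader.
val sharedConf = mutable.Map[String, String]()

def addProjectionIfNeeded(splitHasLogFiles: Boolean): Unit = {
  // Buggy pattern: the guard is keyed on the shared conf, not on the individual split.
  if (sharedConf.getOrElse("hoodie.read.columns.set", "false") == "false") {
    if (splitHasLogFiles) {
      sharedConf("read.columns") = sharedConf.getOrElse("read.columns", "") + ",_hoodie_record_key"
    }
    sharedConf("hoodie.read.columns.set") = "true"
  }
}

addProjectionIfNeeded(splitHasLogFiles = false) // reader1: no log files, flag gets flipped
addProjectionIfNeeded(splitHasLogFiles = true)  // reader2: needs the key, but is skipped
assert(!sharedConf.getOrElse("read.columns", "").contains("_hoodie_record_key")) // bug reproduced
{code}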
[jira] [Updated] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
[ https://issues.apache.org/jira/browse/HUDI-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1719: --- Description: now hudi use HoodieCombineHiveInputFormat to achieve Incremental query of the mor table. when we have some small files in different partitions, HoodieCombineHiveInputFormat will combine those small file readers. HoodieCombineHiveInputFormat build partition field base on the first file reader in it, however now HoodieCombineHiveInputFormat holds other file readers which come from different partitions. When switching readers, we should update ioctx test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1 test step: step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // create hudi table which has three level partitions p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert current table merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") hive beeline: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; set hoodie.hive_8b.consume.mode=INCREMENTAL; set hoodie.hive_8b.consume.max.commits=3; set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller the earlist commit, so we can query whole commits select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where `_hoodie_commit_time`>'20210325141300' and `keyid` < 5; query result: +-+++-+ |p|p1|p2|keyid| +-+++-+ |0|0|6|0| |0|0|6|1| |0|0|6|2| |0|0|6|3| |0|0|6|4| |0|0|6|4| |0|0|6|0| |0|0|6|3| |0|0|6|2| |0|0|6|1| +-+++-+ this result is wrong, since the second step we insert new data in table which p2=7, however in the query result we cannot find p2=7, all p2= 6 was: now hudi use HoodieCombineHiveInputFormat to achieve Incremental query of the mor table. when we have some small files in different partitions, HoodieCombineHiveInputFormat will combine those small file readers. HoodieCombineHiveInputFormat build partition field base on the first file reader in it, however now HoodieCombineHiveInputFormat holds other file readers which come from different partitions. 
When switching readers, we should update ioctx test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1 test step: step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // create hudi table which has three level partitions p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert current table merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") hive beeline: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; set hoodie.hive_8b.consume.mode=INCREMENTAL; set hoodie.hive_8b.consume.max.commits=3; set hoodie.hive_8b.consume.start.timestamp=20210325141300; select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where `_hoodie_commit_time`>'20210325141300' and `keyid` < 5; query result: ++-+-++ | p | p1 | p2 | keyid | ++-+-++ | 0 | 0 | 6 | 0 | | 0 | 0 | 6 | 1 | | 0 | 0 | 6 | 2 | | 0 | 0 | 6 | 3 | | 0 | 0 | 6 | 4 | | 0 | 0 | 6 | 4 | | 0 | 0 | 6 | 0 | | 0 | 0 | 6 | 3 | | 0 | 0 | 6 | 2 | | 0 | 0 | 6 | 1 | ++-+-++ this result is wrong, since the second step we insert new data in table which p2=7, however in the query result we cannot find p2=7, all p2= 6 > hive on spark/mr,Incremental query of the mor table, the partition field is > incorrect > - > > Key: HUDI-1719 > URL: https://issues.apache.org/jira/browse/HUDI-1719 > Project: Apache Hudi > Issue Type: Bug > Components: Hive
[jira] [Updated] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed
[ https://issues.apache.org/jira/browse/HUDI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1718: --- Description: HoodieCombineHiveInputFormat use "," to join mutil partitions, however hive use "/" to join muit1 partitions. there exists some gap, so modify HoodieCombineHiveInputFormat's logical test env spark2.4.5, hadoop 3.1.1, hive 3.1.1 step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // bulk_insert df, partition by p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // upsert table hive8b merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert") step3: start hive beeline: set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; set hoodie.hive_8b.consume.mode=INCREMENTAL; set hoodie.hive_8b.consume.max.commits=3; set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller the earlist commit, so we can query whole commits select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where `_hoodie_commit_time`>'20210325141300' 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2 2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2 at org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:403) at org.apache.hudi.org.apache.avro.Schema$Field.(Schema.java:396) at org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268) at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286) at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98) at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.(AbstractRealtimeRecordReader.java:67) at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.(RealtimeCompactedRecordReader.java:53) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70) at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.(HoodieRealtimeRecordReader.java:47) at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:175) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177) was: HoodieCombineHiveInputFormat use "," to join mutil partitions, however hive use "/" to join muit1 partitions. there exists some gap, so modify HoodieCombineHiveInputFormat's logical test env spark2.4.5, hadoop 3.1.1, hive 3.1.1 step1: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(6)) .withColumn("a1", lit(Array[String]("sb1", "rz"))) .withColumn("a2", lit(Array[String]("sb1", "rz"))) // bulk_insert df, partition by p,p1,p2 merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert") step2: val df = spark.range(0, 1).toDF("keyid") .withColumn("col3", expr("keyid + 1000")) .withColumn("p", lit(0)) .withColumn("p1", lit(0)) .withColumn("p2", lit(7))
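The gap HUDI-1718 describes, restated in one line: HoodieCombineHiveInputFormat joins the partition column names with "," while the rest of the Hive path expects "/", so the combined string "p,p1,p2" is treated as a single (illegal) Avro field name. A trivial sketch of the intended join, using the column names from the test above:

{code:scala}
val partitionCols = Seq("p", "p1", "p2")

// "p,p1,p2" is not a legal Avro field name, which triggers the SchemaParseException above;
// joining with "/" matches how Hive itself composes multi-level partition specs.
val wrong = partitionCols.mkString(",")   // "p,p1,p2"
val fixed = partitionCols.mkString("/")   // "p/p1/p2"
{code}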
[jira] [Created] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError
tao meng created HUDI-1720: -- Summary: when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError Key: HUDI-1720 URL: https://issues.apache.org/jira/browse/HUDI-1720 Project: Apache Hudi Issue Type: Bug Components: Hive Integration, Spark Integration Affects Versions: 0.7.0, 0.8.0 Reporter: tao meng Fix For: 0.9.0
Currently RealtimeCompactedRecordReader.next handles delete records by recursion, see: [https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]
However, when the log file contains many delete records, this recursive logic in RealtimeCompactedRecordReader.next leads to a StackOverflowError.
test steps:
step1:
val df = spark.range(0, 100).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// bulk_insert 100w rows (keyid from 0 to 100)
merge(df, 4, "default", "hive_9b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
step2:
val df = spark.range(0, 90).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// delete 90w rows (keyid from 0 to 90)
delete(df, 4, "default", "hive_9b")
step3: query on beeline/spark-sql: select count(col3) from hive_9b_rt
2021-03-25 15:33:29,029 | INFO | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1, | Operator.java:1038
2021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError
at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83)
at org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
at org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
at org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
... (the frame at RealtimeCompactedRecordReader.java:106 repeats, once per consecutive delete record, until the stack overflows)
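The fix implied above is to make the delete handling iterative. A minimal sketch, using hypothetical stand-in types rather than the real Hudi reader classes, of how the recursive next() call can be replaced with a loop so that a long run of log-deleted records no longer grows the call stack:
{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch (hypothetical types, not the actual Hudi classes) of the
 * recursion-to-loop rewrite: instead of calling next() recursively for every
 * record the log marks as deleted, the reader keeps looping until it finds a
 * live record, so many consecutive deletes cannot overflow the stack.
 */
public class IterativeMergeReaderSketch {

  /** Stand-in for the base parquet reader. */
  interface BaseReader {
    /** Returns the next record key, or null when the file is exhausted. */
    String next();
  }

  private final BaseReader baseReader;
  private final Map<String, Boolean> deleteMarkers; // key -> deleted in log?

  IterativeMergeReaderSketch(BaseReader baseReader, Map<String, Boolean> deleteMarkers) {
    this.baseReader = baseReader;
    this.deleteMarkers = deleteMarkers;
  }

  /** Returns the next record key that survives the log, or null when done. */
  String next() {
    String key;
    // Loop (instead of recursing) over records that were deleted in the log.
    while ((key = baseReader.next()) != null) {
      if (Boolean.TRUE.equals(deleteMarkers.get(key))) {
        continue; // skip deleted record and keep iterating
      }
      return key; // live record: emit it
    }
    return null; // base file exhausted
  }

  public static void main(String[] args) {
    // A long run of consecutive deletes would blow the stack with recursion,
    // but is handled fine by the loop above.
    Map<String, Boolean> deletes = new HashMap<>();
    for (int i = 0; i < 1_000_000; i++) {
      deletes.put("key-" + i, true);
    }
    final int[] cursor = {0};
    BaseReader reader = () -> cursor[0] < 1_000_001 ? "key-" + cursor[0]++ : null;
    System.out.println(new IterativeMergeReaderSketch(reader, deletes).next()); // key-1000000
  }
}
{code}
The same transformation applies inside RealtimeCompactedRecordReader.next itself: wherever it currently returns a recursive call for a deleted record, it can instead continue a while loop over the base parquet reader.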
[jira] [Created] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
tao meng created HUDI-1719: -- Summary: hive on spark/mr, Incremental query of the mor table, the partition field is incorrect Key: HUDI-1719 URL: https://issues.apache.org/jira/browse/HUDI-1719 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.7.0, 0.8.0 Environment: spark2.4.5, hadoop 3.1.1, hive 3.1.1 Reporter: tao meng Fix For: 0.9.0
Hudi currently uses HoodieCombineHiveInputFormat for incremental queries of the MOR table. When there are small files in different partitions, HoodieCombineHiveInputFormat combines the readers for those small files. It builds the partition field based on the first file reader it holds, even though the other file readers it holds come from different partitions. When switching readers, we should update the ioctx.
test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
test steps:
step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// create hudi table which has three partition levels: p, p1, p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert the current table
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300;
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300' and `keyid` < 5;
query result:
+----+-----+-----+--------+
| p  | p1  | p2  | keyid  |
+----+-----+-----+--------+
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 1      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 4      |
| 0  | 0   | 6   | 0      |
| 0  | 0   | 6   | 3      |
| 0  | 0   | 6   | 2      |
| 0  | 0   | 6   | 1      |
+----+-----+-----+--------+
This result is wrong: in the second step we inserted new data with p2=7, yet the query result contains no rows with p2=7; every row shows p2=6.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
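A minimal sketch of the fix direction described in HUDI-1719, with hypothetical stand-in types rather than the real Hive/Hudi classes: the combined reader refreshes the current-input-path context each time it switches to the next underlying file reader, so partition values are derived from the file actually being read instead of from the first file in the combined split.
{code:java}
import java.util.Iterator;
import java.util.List;

public class CombineReaderSketch {

  /** Stand-in for a single-file reader plus the file it reads. */
  static final class PartitionFileReader {
    final String path; // e.g. ".../p=0/p1=0/p2=7/file.parquet"
    final Iterator<String> rows;
    PartitionFileReader(String path, List<String> rows) {
      this.path = path;
      this.rows = rows.iterator();
    }
  }

  /** Stand-in for the per-task IO context that partition values are read from. */
  static final class IoContext {
    String currentPath;
  }

  private final Iterator<PartitionFileReader> readers;
  private final IoContext ioContext;
  private PartitionFileReader current;

  CombineReaderSketch(List<PartitionFileReader> readers, IoContext ioContext) {
    this.readers = readers.iterator();
    this.ioContext = ioContext;
  }

  /** Returns the next row, switching readers (and updating the context) as needed. */
  String next() {
    while (current == null || !current.rows.hasNext()) {
      if (!readers.hasNext()) {
        return null;
      }
      current = readers.next();
      // The crux of the fix: update the context on every reader switch,
      // not only for the first reader in the combined split.
      ioContext.currentPath = current.path;
    }
    return current.rows.next();
  }

  public static void main(String[] args) {
    IoContext ctx = new IoContext();
    CombineReaderSketch combined = new CombineReaderSketch(
        List.of(
            new PartitionFileReader("/tbl/p2=6/f1.parquet", List.of("row-a")),
            new PartitionFileReader("/tbl/p2=7/f2.parquet", List.of("row-b"))),
        ctx);
    String row;
    while ((row = combined.next()) != null) {
      // The partition value is taken from the file currently being read.
      System.out.println(row + " read from " + ctx.currentPath);
    }
  }
}
{code}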
[jira] [Created] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed
tao meng created HUDI-1718: -- Summary: when query incr view of mor table which has Multi level partitions, the query failed Key: HUDI-1718 URL: https://issues.apache.org/jira/browse/HUDI-1718 Project: Apache Hudi Issue Type: Bug Components: Hive Integration Affects Versions: 0.7.0, 0.8.0 Reporter: tao meng Fix For: 0.9.0
HoodieCombineHiveInputFormat uses "," to join multiple partition columns, while Hive uses "/" to join them. This gap breaks the query, so HoodieCombineHiveInputFormat's logic should be modified accordingly.
test env: spark2.4.5, hadoop 3.1.1, hive 3.1.1
step1:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(6))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// bulk_insert df, partitioned by p, p1, p2
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
step2:
val df = spark.range(0, 1).toDF("keyid")
  .withColumn("col3", expr("keyid + 1000"))
  .withColumn("p", lit(0))
  .withColumn("p1", lit(0))
  .withColumn("p2", lit(7))
  .withColumn("a1", lit(Array[String]("sb1", "rz")))
  .withColumn("a2", lit(Array[String]("sb1", "rz")))
// upsert table hive_8b
merge(df, 4, "default", "hive_8b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
step3: start hive beeline:
set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
set hoodie.hive_8b.consume.mode=INCREMENTAL;
set hoodie.hive_8b.consume.max.commits=3;
set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is smaller than the earliest commit, so we can query all commits
select `p`, `p1`, `p2`, `keyid` from hive_8b_rt where `_hoodie_commit_time` > '20210325141300'
The query fails with:
2021-03-25 14:14:36,036 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0028_m_00_3: Error: org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2
at org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151)
at org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81)
at org.apache.hudi.org.apache.avro.Schema$Field.<init>(Schema.java:403)
at org.apache.hudi.org.apache.avro.Schema$Field.<init>(Schema.java:396)
at org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
at org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:67)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:53)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:47)
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
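A minimal sketch of the delimiter mismatch behind HUDI-1718 (hypothetical helper methods, not the actual Hudi code): when the partition column names are joined with "," but split downstream with "/", the three partition columns collapse into the single illegal Avro field name "p,p1,p2"; joining with "/" keeps them separate.
{code:java}
import java.util.List;

public class PartitionDelimiterSketch {

  /** How the combined split encodes its partition columns for the reader. */
  static String joinPartitionColumns(List<String> columns, String delimiter) {
    return String.join(delimiter, columns);
  }

  /** How the downstream reader turns the property back into field names. */
  static String[] splitPartitionColumns(String property) {
    return property.split("/"); // Hive's convention for multi-level partitions
  }

  public static void main(String[] args) {
    List<String> columns = List.of("p", "p1", "p2");

    // Buggy behaviour: joined with ",", split on "/" -> one field "p,p1,p2",
    // which Avro's Schema.validateName rejects ("Illegal character in: p,p1,p2").
    System.out.println(splitPartitionColumns(joinPartitionColumns(columns, ",")).length); // 1

    // Fixed behaviour: join with the same "/" the split expects -> 3 valid fields.
    System.out.println(splitPartitionColumns(joinPartitionColumns(columns, "/")).length); // 3
  }
}
{code}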
[jira] [Created] (HUDI-1688) hudi write should uncache rdds when the write operation is finished
tao meng created HUDI-1688: -- Summary: hudi write should uncache rdds when the write operation is finished Key: HUDI-1688 URL: https://issues.apache.org/jira/browse/HUDI-1688 Project: Apache Hudi Issue Type: Bug Components: Spark Integration Affects Versions: 0.7.0 Reporter: tao meng Fix For: 0.8.0
Currently, Hudi improves write performance by caching the necessary RDDs; however, when the write operation is finished, those cached RDDs are never uncached, which wastes a lot of memory.
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115]
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214
In our environment:
step1: insert 100GB of data into a hudi table by spark (ok)
step2: insert another 100GB of data into the hudi table by spark again (OOM)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
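A minimal sketch of the pattern HUDI-1688 asks for, assuming a plain Spark job rather than the actual BaseSparkCommitActionExecutor code path: any RDD cached to speed up the write is unpersisted once the commit work is done, so back-to-back writes do not keep accumulating cached blocks.
{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class UnpersistAfterCommitSketch {

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("unpersist-sketch").setMaster("local[2]");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      JavaRDD<Integer> writeStatuses = jsc.parallelize(Arrays.asList(1, 2, 3))
          .persist(StorageLevel.MEMORY_AND_DISK());
      try {
        // Stand-in for the actual write + commit: the cached RDD may be
        // traversed several times here (index lookups, commit metadata, ...).
        long written = writeStatuses.count();
        System.out.println("records written: " + written);
      } finally {
        // The crux of the issue: release the cached blocks once the write
        // operation is finished instead of leaving them pinned in memory.
        writeStatuses.unpersist();
      }
    }
  }
}
{code}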
[jira] [Updated] (HUDI-1676) Support SQL with spark3
[ https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1676: --- Summary: Support SQL with spark3 (was: Support sql with spark3) > Support SQL with spark3 > --- > > Key: HUDI-1676 > URL: https://issues.apache.org/jira/browse/HUDI-1676 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: tao meng >Priority: Major > Fix For: 0.8.0 > > > 1. support CTAS for spark3 > 3. support INSERT for spark3 > 4. support merge, update, delete without RowKey constraint for spark3 > 5. support dataSourceV2 for spark3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1677) Support Clustering and Metatable for SQL performance
tao meng created HUDI-1677: -- Summary: Support Clustering and Metatable for SQL performance Key: HUDI-1677 URL: https://issues.apache.org/jira/browse/HUDI-1677 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: tao meng Fix For: 0.8.0 1. support Metatable to improve SQL write performance 2. support Clustering to improve SQL read performance -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1676) Support sql with spark3
[ https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1676: --- Summary: Support sql with spark3 (was: support sql with spark3) > Support sql with spark3 > --- > > Key: HUDI-1676 > URL: https://issues.apache.org/jira/browse/HUDI-1676 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: tao meng >Priority: Major > Fix For: 0.8.0 > > > 1. support CTAS for spark3 > 3. support INSERT for spark3 > 4. support merge, update, delete without RowKey constraint for spark3 > 5. support dataSourceV2 for spark3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1676) support sql with spark3
tao meng created HUDI-1676: -- Summary: support sql with spark3 Key: HUDI-1676 URL: https://issues.apache.org/jira/browse/HUDI-1676 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: tao meng Fix For: 0.8.0 1. support CTAS for spark3 3. support INSERT for spark3 4. support merge, update, delete without RowKey constraint for spark3 5. support dataSourceV2 for spark3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1675) Externalize all Hudi configurations
tao meng created HUDI-1675: -- Summary: Externalize all Hudi configurations Key: HUDI-1675 URL: https://issues.apache.org/jira/browse/HUDI-1675 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: tao meng Fix For: 0.8.0 # Externalize all Hudi configurations (separate configuration file) # Save table related properties into hoodie.properties file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662: --- Description:
step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
step3: query the mor_rt table with hive-beeline/spark-sql
Execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; the following error occurs:
_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
Root cause analysis: Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the type is INT but the logicalType is date). When Hudi reads the log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable.
see: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
Modification plan: when casting the Avro INT type to a Writable, the logicalType must be considered:
case INT:
if (schema.getLogicalType() != null && schema.getLogicalType().getName().equals("date")) {
  return new DateWritable((Integer) value);
} else {
  return new IntWritable((Integer) value);
}

was:
step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
step3: query the mor_rt table with hive-beeline/spark-sql
Execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; the following error occurs:
_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
Root cause analysis: Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the type is INT but the logicalType is date). When Hudi reads the log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable.
see: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
Modification plan: when casting the Avro INT type to a Writable, the logicalType must be considered:
if (schema.getLogicalType() != null && schema.getLogicalType().getName() == "date") {
  return new DateWritable((Integer) value);
} else {
  return new IntWritable((Integer) value);
}

> Failed to query real-time view use hive/spark-sql when hudi mor table
> contains dateType
>
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1
> spark 2.4.5
> hadoop 3.1.1
> suse os
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
> df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
> merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
> step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
> df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
> merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
> step3: query the mor_rt table with hive-beeline/spark-sql
> Execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; the following error occurs:
> _java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
> Root cause analysis: Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the type is INT but the logicalType is date). When Hudi reads the log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable.
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662: --- Description:
step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
step3: query the mor_rt table with hive-beeline/spark-sql
Execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; the following error occurs:
_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
Root cause analysis: Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the type is INT but the logicalType is date). When Hudi reads the log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable.
see: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
Modification plan: when casting the Avro INT type to a Writable, the logicalType must be considered:
if (schema.getLogicalType() != null && schema.getLogicalType().getName() == "date") {
  return new DateWritable((Integer) value);
} else {
  return new IntWritable((Integer) value);
}

was:
step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
step3: query the mor_rt table with hive-beeline/spark-sql
!image-2021-03-05-10-06-11-949.png!
> Failed to query real-time view use hive/spark-sql when hudi mor table
> contains dateType
>
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1
> spark 2.4.5
> hadoop 3.1.1
> suse os
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
> df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
> merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
> step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
> df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
> merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
> step3: query the mor_rt table with hive-beeline/spark-sql
> Execute the statement select * from huditest.bulkinsert_mor_10g_rt where primary_key = 1000; the following error occurs:
> _java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
> Root cause analysis: Hudi uses the Avro format to store log files, and Avro stores DateType as INT (the type is INT but the logicalType is date). When Hudi reads the log file and converts an Avro INT record to a Writable, the logicalType is not respected, so the DateType is cast to IntWritable.
> see: [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
> Modification plan: when casting the Avro INT type to a Writable, the logicalType must be considered:
> if (schema.getLogicalType() != null && schema.getLogicalType().getName() == "date") {
>   return new DateWritable((Integer) value);
> } else {
>   return new IntWritable((Integer) value);
> }
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType
[ https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao meng updated HUDI-1662: --- Description:
step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
step3: query the mor_rt table with hive-beeline/spark-sql
!image-2021-03-05-10-06-11-949.png!

was:
step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
step3: query the mor_rt table with hive-beeline/spark-sql
!file:///C:/Users/m00443775/AppData/Roaming/eSpace_Desktop/UserData/m00443775/imagefiles/847287BC-86FF-4EFF-9BDC-C2A60FFD3F47.png|id=847287BC-86FF-4EFF-9BDC-C2A60FFD3F47,vspace=3!

> Failed to query real-time view use hive/spark-sql when hudi mor table
> contains dateType
>
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.7.0
> Environment: hive 3.1.1
> spark 2.4.5
> hadoop 3.1.1
> suse os
> Reporter: tao meng
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> step1: prepare a raw DataFrame with DateType, and insert it into the Hudi MOR table
> df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
> merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
> step2: prepare an update DataFrame with DateType, and upsert it into the Hudi MOR table
> df_update = sql("select * from huditest.bulkinsert_mor_10g_rt").withColumn("date", lit(Date.valueOf("2020-11-11")))
> merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
> step3: query the mor_rt table with hive-beeline/spark-sql
> !image-2021-03-05-10-06-11-949.png!
-- This message was sent by Atlassian Jira (v8.3.4#803005)
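A minimal self-contained sketch of the modification plan in HUDI-1662 (a standalone helper with assumed imports, not the actual HoodieRealtimeRecordReaderUtils code): the Avro INT value is turned into a DateWritable when the schema carries the "date" logical type, and into an IntWritable otherwise. The description's snippet uses DateWritable; a Hive 3 deployment, where the cast error mentions DateWritableV2, may need that class instead.
{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class AvroIntToWritableSketch {

  static Writable intToWritable(Object value, Schema schema) {
    if (schema.getLogicalType() != null
        && "date".equals(schema.getLogicalType().getName())) {
      // Avro encodes DATE as days since epoch; DateWritable uses the same unit.
      return new DateWritable((Integer) value);
    }
    return new IntWritable((Integer) value);
  }

  public static void main(String[] args) {
    Schema plainInt = Schema.create(Schema.Type.INT);
    Schema dateInt = LogicalTypes.date().addToSchema(Schema.create(Schema.Type.INT));

    System.out.println(intToWritable(18576, plainInt).getClass().getSimpleName()); // IntWritable
    System.out.println(intToWritable(18576, dateInt).getClass().getSimpleName());  // DateWritable
  }
}
{code}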