[jira] [Assigned] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng reassigned HUDI-5835:
--

Assignee: Tao Meng

> spark cannot read mor table after execute update statement
> --
>
> Key: HUDI-5835
> URL: https://issues.apache.org/jira/browse/HUDI-5835
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Tao Meng
>Assignee: Tao Meng
>Priority: Blocker
>
> The Avro schema created by Spark SQL is missing the Avro record name and namespace,
> which makes the read schema and write schema of the log file incompatible.
>  
> {code:java}
> // code placeholder
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,
>   |  name string,
>   |  price double,
>   |  ts long,
>   |  ff decimal(38, 10)
>   |) using hudi
>   | location '${tablePath.toString}'
>   | tblproperties (
>   |  type = 'mor',
>   |  primaryKey = 'id',
>   |  preCombineField = 'ts'
>   | )
> """.stripMargin)
>  spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
> checkAnswer(s"select id, name, price, ts from $tableName")(
>   Seq(1, "a1", 10.0, 1000)
> )
> spark.sql(s"update $tableName set price = 22 where id = 1")
> checkAnswer(s"select id, name, price, ts from $tableName")(    failed
>   Seq(1, "a1", 22.0, 1000)
> )
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-5835:
---
Description: 
The Avro schema created by Spark SQL is missing the Avro record name and namespace.

This makes the read schema and write schema of the log file incompatible.

 
{code:java}
// code placeholder
 spark.sql(
   s"""
  |create table $tableName (
  |  id int,
  |  name string,
  |  price double,
  |  ts long,
  |  ff decimal(38, 10)
  |) using hudi
  | location '${tablePath.toString}'
  | tblproperties (
  |  type = 'mor',
  |  primaryKey = 'id',
  |  preCombineField = 'ts'
  | )
""".stripMargin)
 spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
checkAnswer(s"select id, name, price, ts from $tableName")(
  Seq(1, "a1", 10.0, 1000)
)
spark.sql(s"update $tableName set price = 22 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(    failed
  Seq(1, "a1", 22.0, 1000)
)

{code}
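
For context, a minimal sketch (not taken from Hudi; the record and field names below are made up) of why the missing record name/namespace matters: Avro schema resolution compares the record's full name, so two schemas with identical fields but different full names are reported as incompatible.
{code:java}
import org.apache.avro.{Schema, SchemaBuilder, SchemaCompatibility}

// Writer schema the way Spark SQL produces it: a record with a generic name and no namespace.
val writerSchema: Schema = SchemaBuilder.record("record")
  .fields()
  .requiredInt("id")
  .requiredString("name")
  .endRecord()

// Reader schema carrying a proper record name and namespace (names are illustrative).
val readerSchema: Schema = SchemaBuilder.record("h0_record").namespace("hoodie.h0")
  .fields()
  .requiredInt("id")
  .requiredString("name")
  .endRecord()

// The fields are identical, but the differing full names make the pair incompatible.
val result = SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema)
println(result.getType) // INCOMPATIBLE
{code}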
 

  was:
avro schema create by sparksql will miss avro name and namespace, 

This will lead the read schema and write schema of the log file to be 
incompatible

 
{code:java}
// code placeholdertest("Test Add Column and Update Table") { withTempDir { tmp 
=> val tableName = generateTableName //spark.sql("SET 
hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath 
= new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" 
|create table $tableName ( | id int, | name string, | price double, | ts long, 
| ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | 
tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | 
) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName 
select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from 
$tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set 
price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from 
$tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add 
column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from 
$tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check 
spark.sql(s"update $tableName set price = price * 2 where id = 1") 
checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, 
"a1", 44.0, 1000, null) ) } }

{code}


> spark cannot read mor table after execute update statement
> --
>
> Key: HUDI-5835
> URL: https://issues.apache.org/jira/browse/HUDI-5835
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Tao Meng
>Priority: Blocker
>
> The Avro schema created by Spark SQL is missing the Avro record name and namespace,
> which makes the read schema and write schema of the log file incompatible.
>  
> {code:java}
> // code placeholder
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,
>   |  name string,
>   |  price double,
>   |  ts long,
>   |  ff decimal(38, 10)
>   |) using hudi
>   | location '${tablePath.toString}'
>   | tblproperties (
>   |  type = 'mor',
>   |  primaryKey = 'id',
>   |  preCombineField = 'ts'
>   | )
> """.stripMargin)
>  spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
> checkAnswer(s"select id, name, price, ts from $tableName")(
>   Seq(1, "a1", 10.0, 1000)
> )
> spark.sql(s"update $tableName set price = 22 where id = 1")
> checkAnswer(s"select id, name, price, ts from $tableName")(    failed
>   Seq(1, "a1", 22.0, 1000)
> )
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-5835:
---
Description: 
The Avro schema created by Spark SQL is missing the Avro record name and namespace.

This makes the read schema and write schema of the log file incompatible.

 
{code:java}
// code placeholder
test("Test Add Column and Update Table") {
  withTempDir { tmp =>
    val tableName = generateTableName
    //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true")
    val tablePath = new Path(tmp.getCanonicalPath, tableName)
    // create table
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  ff decimal(38, 10)
         |) using hudi
         | location '${tablePath.toString}'
         | tblproperties (
         |  type = 'mor',
         |  primaryKey = 'id',
         |  preCombineField = 'ts'
         | )
       """.stripMargin)
    // insert data to table
    spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
    checkAnswer(s"select id, name, price, ts from $tableName")(
      Seq(1, "a1", 10.0, 1000)
    )
    spark.sql(s"update $tableName set price = 22 where id = 1")
    checkAnswer(s"select id, name, price, ts from $tableName")(
      Seq(1, "a1", 22.0, 1000)
    )
    spark.sql(s"alter table $tableName add column new_col1 int")
    checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
      Seq(1, "a1", 22.0, 1000, null)
    )
    // update and check
    spark.sql(s"update $tableName set price = price * 2 where id = 1")
    checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
      Seq(1, "a1", 44.0, 1000, null)
    )
  }
}

{code}

  was:
avro schema create by sparksql will miss avro name and namespace, 

This will lead the read schema and write schema of the log file to be 
incompatible

 
{code:java}
// code placeholder
{code}
test("Test Add Column and Update Table") \{ withTempDir { tmp => val tableName 
= generateTableName //spark.sql("SET 
hoodie.datasource.read.extract.partition.values.from.path=true") val tablePath 
= new Path(tmp.getCanonicalPath, tableName) // create table spark.sql( s""" 
|create table $tableName ( | id int, | name string, | price double, | ts long, 
| ff decimal(38, 10) |) using hudi | location '${tablePath.toString}' | 
tblproperties ( | type = 'mor', | primaryKey = 'id', | preCombineField = 'ts' | 
) """.stripMargin) // insert data to table spark.sql(s"insert into $tableName 
select 1, 'a1', 10, 1000, 10.0") checkAnswer(s"select id, name, price, ts from 
$tableName")( Seq(1, "a1", 10.0, 1000) ) spark.sql(s"update $tableName set 
price = 22 where id = 1") checkAnswer(s"select id, name, price, ts from 
$tableName")( Seq(1, "a1", 22.0, 1000) ) spark.sql(s"alter table $tableName add 
column new_col1 int") checkAnswer(s"select id, name, price, ts, new_col1 from 
$tableName")( Seq(1, "a1", 22.0, 1000, null) ) // update and check 
spark.sql(s"update $tableName set price = price * 2 where id = 1") 
checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, 
"a1", 44.0, 1000, null) ) } }


> spark cannot read mor table after execute update statement
> --
>
> Key: HUDI-5835
> URL: https://issues.apache.org/jira/browse/HUDI-5835
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Tao Meng
>Priority: Blocker
>
> The Avro schema created by Spark SQL is missing the Avro record name and namespace,
> which makes the read schema and write schema of the log file incompatible.
>  
> {code:java}
> // code placeholdertest("Test Add Column and Update Table") { withTempDir { 
> tmp => val tableName = generateTableName //spark.sql("SET 
> hoodie.datasource.read.extract.partition.values.from.path=true") val 
> tablePath = new Path(tmp.getCanonicalPath, tableName) // create table 
> spark.sql( s""" |create table $tableName ( | id int, | name string, | price 
> double, | ts long, | ff decimal(38, 10) |) using hudi | location 
> '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 
> 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table 
> spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") 
> checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 
> 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") 
> checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 
> 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") 
> checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, 
> "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName 
> set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, 
> ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-5835:
---
Description: 
The Avro schema created by Spark SQL is missing the Avro record name and namespace.

This makes the read schema and write schema of the log file incompatible.

 
{code:java}
// code placeholder
test("Test Add Column and Update Table") {
  withTempDir { tmp =>
    val tableName = generateTableName
    //spark.sql("SET hoodie.datasource.read.extract.partition.values.from.path=true")
    val tablePath = new Path(tmp.getCanonicalPath, tableName)
    // create table
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  ff decimal(38, 10)
         |) using hudi
         | location '${tablePath.toString}'
         | tblproperties (
         |  type = 'mor',
         |  primaryKey = 'id',
         |  preCombineField = 'ts'
         | )
       """.stripMargin)
    // insert data to table
    spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
    checkAnswer(s"select id, name, price, ts from $tableName")(
      Seq(1, "a1", 10.0, 1000)
    )
    spark.sql(s"update $tableName set price = 22 where id = 1")
    checkAnswer(s"select id, name, price, ts from $tableName")(
      Seq(1, "a1", 22.0, 1000)
    )
    spark.sql(s"alter table $tableName add column new_col1 int")
    checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
      Seq(1, "a1", 22.0, 1000, null)
    )
    // update and check
    spark.sql(s"update $tableName set price = price * 2 where id = 1")
    checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
      Seq(1, "a1", 44.0, 1000, null)
    )
  }
}
{code}

  was:
avro schema create by sparksql will miss avro name and namespace, 

This will lead the read schema and write schema of the log file to be 
incompatible

 

test("Test Add Column and Update Table") {
withTempDir { tmp =>
val tableName = generateTableName

//spark.sql("SET 
hoodie.datasource.read.extract.partition.values.from.path=true")

val tablePath = new Path(tmp.getCanonicalPath, tableName)
// create table
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long,
| ff decimal(38, 10)
|) using hudi
| location '${tablePath.toString}'
| tblproperties (
| type = 'mor',
| primaryKey = 'id',
| preCombineField = 'ts'
| )
""".stripMargin)

// insert data to table
spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 10.0, 1000)
)

spark.sql(s"update $tableName set price = 22 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 22.0, 1000)
)

spark.sql(s"alter table $tableName add column new_col1 int")

checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
Seq(1, "a1", 22.0, 1000, null)
)

// update and check
spark.sql(s"update $tableName set price = price * 2 where id = 1")
checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
Seq(1, "a1", 44.0, 1000, null)
)
}
}


> spark cannot read mor table after execute update statement
> --
>
> Key: HUDI-5835
> URL: https://issues.apache.org/jira/browse/HUDI-5835
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Tao Meng
>Priority: Blocker
>
> The Avro schema created by Spark SQL is missing the Avro record name and namespace,
> which makes the read schema and write schema of the log file incompatible.
>  
> {code:java}
> // code placeholder
> {code}
> test("Test Add Column and Update Table") \{ withTempDir { tmp => val 
> tableName = generateTableName //spark.sql("SET 
> hoodie.datasource.read.extract.partition.values.from.path=true") val 
> tablePath = new Path(tmp.getCanonicalPath, tableName) // create table 
> spark.sql( s""" |create table $tableName ( | id int, | name string, | price 
> double, | ts long, | ff decimal(38, 10) |) using hudi | location 
> '${tablePath.toString}' | tblproperties ( | type = 'mor', | primaryKey = 
> 'id', | preCombineField = 'ts' | ) """.stripMargin) // insert data to table 
> spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0") 
> checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 
> 10.0, 1000) ) spark.sql(s"update $tableName set price = 22 where id = 1") 
> checkAnswer(s"select id, name, price, ts from $tableName")( Seq(1, "a1", 
> 22.0, 1000) ) spark.sql(s"alter table $tableName add column new_col1 int") 
> checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")( Seq(1, 
> "a1", 22.0, 1000, null) ) // update and check spark.sql(s"update $tableName 
> set price = price * 2 where id = 1") checkAnswer(s"select id, name, price, 
> ts, new_col1 from $tableName")( Seq(1, "a1", 44.0, 1000, null) ) } }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-5835:
---
Description: 
The Avro schema created by Spark SQL is missing the Avro record name and namespace.

This makes the read schema and write schema of the log file incompatible.

 

test("Test Add Column and Update Table") {
withTempDir { tmp =>
val tableName = generateTableName

//spark.sql("SET 
hoodie.datasource.read.extract.partition.values.from.path=true")

val tablePath = new Path(tmp.getCanonicalPath, tableName)
// create table
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long,
| ff decimal(38, 10)
|) using hudi
| location '${tablePath.toString}'
| tblproperties (
| type = 'mor',
| primaryKey = 'id',
| preCombineField = 'ts'
| )
""".stripMargin)

// insert data to table
spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 10.0, 1000)
)

spark.sql(s"update $tableName set price = 22 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 22.0, 1000)
)

spark.sql(s"alter table $tableName add column new_col1 int")

checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
Seq(1, "a1", 22.0, 1000, null)
)

// update and check
spark.sql(s"update $tableName set price = price * 2 where id = 1")
checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
Seq(1, "a1", 44.0, 1000, null)
)
}
}

  was:
avro schema create by sparksql will miss avro name and namespace, 

This will lead the read schema and write schema of the log file to be 
incompatible

```

test("Test upsert table") {
withTempDir { tmp =>
val tableName = generateTableName

val tablePath = new Path(tmp.getCanonicalPath, tableName)
// create table
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long,
| ff decimal(38, 10)
|) using hudi
| location '${tablePath.toString}'
| tblproperties (
| type = 'mor',
| primaryKey = 'id',
| preCombineField = 'ts'
| )
""".stripMargin)

// insert data to table
spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 10.0, 1000)
)

spark.sql(s"update $tableName set price = 22 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 22.0, 1000)
)
}
}

```


> spark cannot read mor table after execute update statement
> --
>
> Key: HUDI-5835
> URL: https://issues.apache.org/jira/browse/HUDI-5835
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Affects Versions: 0.13.0
>Reporter: Tao Meng
>Priority: Blocker
>
> The Avro schema created by Spark SQL is missing the Avro record name and namespace,
> which makes the read schema and write schema of the log file incompatible.
>  
> test("Test Add Column and Update Table") {
> withTempDir { tmp =>
> val tableName = generateTableName
> //spark.sql("SET 
> hoodie.datasource.read.extract.partition.values.from.path=true")
> val tablePath = new Path(tmp.getCanonicalPath, tableName)
> // create table
> spark.sql(
> s"""
> |create table $tableName (
> | id int,
> | name string,
> | price double,
> | ts long,
> | ff decimal(38, 10)
> |) using hudi
> | location '${tablePath.toString}'
> | tblproperties (
> | type = 'mor',
> | primaryKey = 'id',
> | preCombineField = 'ts'
> | )
> """.stripMargin)
> // insert data to table
> spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
> checkAnswer(s"select id, name, price, ts from $tableName")(
> Seq(1, "a1", 10.0, 1000)
> )
> spark.sql(s"update $tableName set price = 22 where id = 1")
> checkAnswer(s"select id, name, price, ts from $tableName")(
> Seq(1, "a1", 22.0, 1000)
> )
> spark.sql(s"alter table $tableName add column new_col1 int")
> checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
> Seq(1, "a1", 22.0, 1000, null)
> )
> // update and check
> spark.sql(s"update $tableName set price = price * 2 where id = 1")
> checkAnswer(s"select id, name, price, ts, new_col1 from $tableName")(
> Seq(1, "a1", 44.0, 1000, null)
> )
> }
> }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5835) spark cannot read mor table after execute update statement

2023-02-23 Thread Tao Meng (Jira)
Tao Meng created HUDI-5835:
--

 Summary: spark cannot read mor table after execute update statement
 Key: HUDI-5835
 URL: https://issues.apache.org/jira/browse/HUDI-5835
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Affects Versions: 0.13.0
Reporter: Tao Meng


The Avro schema created by Spark SQL is missing the Avro record name and namespace.

This makes the read schema and write schema of the log file incompatible.

```

test("Test upsert table") {
withTempDir { tmp =>
val tableName = generateTableName

val tablePath = new Path(tmp.getCanonicalPath, tableName)
// create table
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long,
| ff decimal(38, 10)
|) using hudi
| location '${tablePath.toString}'
| tblproperties (
| type = 'mor',
| primaryKey = 'id',
| preCombineField = 'ts'
| )
""".stripMargin)

// insert data to table
spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000, 10.0")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 10.0, 1000)
)

spark.sql(s"update $tableName set price = 22 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 22.0, 1000)
)
}
}

```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5294) Support type change for schema on read enable + reconcile schema

2022-11-29 Thread Tao Meng (Jira)
Tao Meng created HUDI-5294:
--

 Summary: Support type change for schema on read enable + reconcile 
schema
 Key: HUDI-5294
 URL: https://issues.apache.org/jira/browse/HUDI-5294
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Tao Meng
 Fix For: 0.12.2


https://github.com/apache/hudi/issues/7283



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5194) fix schema evolution bugs

2022-11-10 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-5194:
---
Description: 
# Fix the bug that history schema files cannot be cleaned by FileBasedInternalSchemaStorageManager
 # Fix the bug that schema evolution does not work well in non-batch read mode under Spark 3.1.x
 # Optimize the implementation for compaction.

> fix schema evolution bugs
> -
>
> Key: HUDI-5194
> URL: https://issues.apache.org/jira/browse/HUDI-5194
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Tao Meng
>Priority: Major
>
> # Fix the bug that history schema files cannot be cleaned by FileBasedInternalSchemaStorageManager
>  # Fix the bug that schema evolution does not work well in non-batch read mode under Spark 3.1.x
>  # Optimize the implementation for compaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5194) fix schema evolution bugs

2022-11-10 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-5194:
---
Issue Type: Bug  (was: New Feature)

> fix schema evolution bugs
> -
>
> Key: HUDI-5194
> URL: https://issues.apache.org/jira/browse/HUDI-5194
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Tao Meng
>Priority: Major
>
> # Fix the bug that history schema files cannot be cleaned by FileBasedInternalSchemaStorageManager
>  # Fix the bug that schema evolution does not work well in non-batch read mode under Spark 3.1.x
>  # Optimize the implementation for compaction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5194) fix schema evolution bugs

2022-11-10 Thread Tao Meng (Jira)
Tao Meng created HUDI-5194:
--

 Summary: fix schema evolution bugs
 Key: HUDI-5194
 URL: https://issues.apache.org/jira/browse/HUDI-5194
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Tao Meng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5000) Support schema evolution for Hive

2022-10-18 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng reassigned HUDI-5000:
--

Assignee: Tao Meng

> Support schema evolution for Hive
> -
>
> Key: HUDI-5000
> URL: https://issues.apache.org/jira/browse/HUDI-5000
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive
>Reporter: Tao Meng
>Assignee: Tao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Support the schema evolution presented in RFC-33 when reading via Hive.
> Operations: add column, rename column, change type of column, drop column



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5000) Support schema evolution for Hive

2022-10-18 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng reassigned HUDI-5000:
--

Assignee: (was: Tao Meng)

> Support schema evolution for Hive
> -
>
> Key: HUDI-5000
> URL: https://issues.apache.org/jira/browse/HUDI-5000
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive
>Reporter: Tao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Support the schema evolution presented in RFC-33 when reading via Hive.
> Operations: add column, rename column, change type of column, drop column



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5000) Support schema evolution for Hive

2022-10-09 Thread Tao Meng (Jira)
Tao Meng created HUDI-5000:
--

 Summary: Support schema evolution for Hive
 Key: HUDI-5000
 URL: https://issues.apache.org/jira/browse/HUDI-5000
 Project: Apache Hudi
  Issue Type: New Feature
  Components: hive
Reporter: Tao Meng
 Fix For: 0.13.0


Support the schema evolution presented in RFC-33 when reading via Hive.
Operations: add column, rename column, change type of column, drop column



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4898) for mor table, presto/hive should respect payload class when merging parquet and log files

2022-09-22 Thread Tao Meng (Jira)
Tao Meng created HUDI-4898:
--

 Summary: for mor table, presto/hive should respect payload class when merging parquet and log files
 Key: HUDI-4898
 URL: https://issues.apache.org/jira/browse/HUDI-4898
 Project: Apache Hudi
  Issue Type: Bug
  Components: hive, trino-presto
Affects Versions: 0.12.0
 Environment: hadoop 3.2.x
hive3.1.x
presto

Reporter: Tao Meng
 Fix For: 0.12.2


For MOR tables, Presto/Hive currently ignores the payload class when merging the parquet file and the log files.

Line 115 in RealtimeCompactedRecordReader:

```

// TODO(NA): Invoke preCombine here by converting arrayWritable to Avro. This 
is required since the
// deltaRecord may not be a full record and needs values of columns from the 
parquet
Option rec = 
buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));

```
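
To make this concrete, here is a hedged sketch of what "respecting the payload class" could look like when merging a log (delta) record over a parquet (base) record. The helper and its arguments are placeholders; only combineAndGetUpdateValue is the real HoodieRecordPayload API, and OverwriteWithLatestAvroPayload stands in for whatever payload class the table is configured with.

```
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, IndexedRecord}
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload

// Sketch: build the payload for the delta (log) record and let its
// combineAndGetUpdateValue decide the merged value against the base (parquet) record,
// instead of taking the delta record unconditionally.
def mergeViaPayload(deltaAvro: GenericRecord,
                    orderingVal: Comparable[_],
                    baseAvro: IndexedRecord,
                    writerSchema: Schema) = {
  val payload = new OverwriteWithLatestAvroPayload(deltaAvro, orderingVal)
  // An empty Option from the payload means the record should be treated as deleted.
  payload.combineAndGetUpdateValue(baseAvro, writerSchema)
}
```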



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-1675) Externalize all Hudi configurations

2022-06-26 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng closed HUDI-1675.
--
Resolution: Fixed

Closing it, as Hudi currently has this capability.

> Externalize all Hudi configurations
> ---
>
> Key: HUDI-1675
> URL: https://issues.apache.org/jira/browse/HUDI-1675
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: configs
>Affects Versions: 0.9.0
>Reporter: tao meng
>Priority: Minor
> Fix For: 0.12.0
>
>
> # Externalize all Hudi configurations (separate configuration file)
>  # Save table related properties into hoodie.properties file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-4184) Creating external table in Spark SQL modifies "hoodie.properties"

2022-06-07 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550983#comment-17550983
 ] 

Tao Meng commented on HUDI-4184:


I suggest moving the schema out of hoodie.properties.

1) We have already stored the latest schema in our commit files.

2) This schema would only be written once; if schema evolution happens in the future, the schema in hoodie.properties will become inconsistent with the current table schema.
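
As a hedged illustration of point 1) (API names assumed from recent Hudi releases; treat them as an assumption rather than a reference), the latest table schema can already be resolved from the timeline instead of hoodie.properties:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}

// Resolve the latest Avro schema from the commit timeline; "/path/to/table" is a placeholder.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(new Configuration())
  .setBasePath("/path/to/table")
  .build()
val latestSchema = new TableSchemaResolver(metaClient).getTableAvroSchema()
{code}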

> Creating external table in Spark SQL modifies "hoodie.properties"
> -
>
> Key: HUDI-4184
> URL: https://issues.apache.org/jira/browse/HUDI-4184
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Sagar Sumit
>Priority: Critical
>
> My setup was like following:
>  # There's a table existing in one AWS account
>  # I'm trying to access that table from Spark SQL from _another_ AWS account 
> that only has Read permissions to the bucket with the table.
>  # Now when issuing "CREATE TABLE" Spark SQL command it fails b/c Hudi tries 
> to modify "hoodie.properties" file for whatever reason, even though i'm not 
> modifying the table and just trying to create table in the catalog.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-3921) Fixed schema evolution cannot work with HUDI-3855

2022-04-20 Thread Tao Meng (Jira)
Tao Meng created HUDI-3921:
--

 Summary: Fixed schema evolution cannot work with HUDI-3855
 Key: HUDI-3921
 URL: https://issues.apache.org/jira/browse/HUDI-3921
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Tao Meng
 Fix For: 0.11.0


Fix the issue that schema evolution cannot work with HUDI-3855.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-1816) when query incr view of hudi table by using spark-sql, the query result is wrong

2022-03-30 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng closed HUDI-1816.
--
Resolution: Not A Problem

Not a problem; closing it.

> when query incr view of hudi table by using spark-sql, the query result is 
> wrong
> 
>
> Key: HUDI-1816
> URL: https://issues.apache.org/jira/browse/HUDI-1816
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.8.0
> Environment: spark2.4.5   hive 3.1.1   hadoop 3.1.1
>Reporter: tao meng
>Priority: Critical
> Fix For: 0.11.0
>
>
> test step1:
> create a partitioned hudi table (mor / cow)
> val base_data = spark.read.parquet("/tmp/tb_base")
> val upsert_data = spark.read.parquet("/tmp/tb_upsert")
> base_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, 
> MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, 
> "col2").option(RECORDKEY_FIELD_OPT_KEY, 
> "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, 
> "col0").option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, 
> "bulk_insert").option(HIVE_SYNC_ENABLED_OPT_KEY, 
> "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, 
> "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY,
>  "testdb").option(HIVE_TABLE_OPT_KEY, 
> "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, 
> "false").option("hoodie.bulkinsert.shuffle.parallelism", 
> 4).option("hoodie.insert.shuffle.parallelism", 
> 4).option("hoodie.upsert.shuffle.parallelism", 
> 4).option("hoodie.delete.shuffle.parallelism", 
> 4).option("hoodie.datasource.write.hive_style_partitioning", 
> "true").option(TABLE_NAME, 
> "tb_test_mor_par").mode(Overwrite).save(s"/tmp/testdb/tb_test_mor_par")
> upsert_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, 
> MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, 
> "col2").option(RECORDKEY_FIELD_OPT_KEY, 
> "primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, 
> "col0").option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, 
> "upsert").option(HIVE_SYNC_ENABLED_OPT_KEY, 
> "true").option(HIVE_PARTITION_FIELDS_OPT_KEY, 
> "col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY,
>  "testdb").option(HIVE_TABLE_OPT_KEY, 
> "tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, 
> "false").option("hoodie.bulkinsert.shuffle.parallelism", 
> 4).option("hoodie.insert.shuffle.parallelism", 
> 4).option("hoodie.upsert.shuffle.parallelism", 
> 4).option("hoodie.delete.shuffle.parallelism", 
> 4).option("hoodie.datasource.write.hive_style_partitioning", 
> "true").option(TABLE_NAME, 
> "tb_test_mor_par").mode(Append).save(s"/tmp/testdb/tb_test_mor_par")
> query incr view by sparksql:
> set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
> set hoodie.tb_test_mor_par.consume.max.commits=3;
> select 
> _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 from 
> testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' order 
> by primary_key;
> +-------------------+-----------+----+----+------------+----+
> |_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
> +-------------------+-----------+----+----+------------+----+
> |20210420155738     |20         |77  |sC  |158788760400|739 |
> |20210420155738     |21         |66  |ps  |160979049700|61  |
> |20210420155738     |22         |47  |1P  |158460042900|835 |
> |20210420155738     |23         |36  |5K  |160763480800|538 |
> |20210420155738     |24         |1   |BA  |160685711300|775 |
> |20210420155738     |24         |101 |BA  |160685711300|775 |
> |20210420155738     |24         |100 |BA  |160685711300|775 |
> |20210420155738     |24         |102 |BA  |160685711300|775 |
> +-------------------+-----------+----+----+------------+----+
>  
> primary key 24 is repeated.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3408) fixed the bug that BUCKET_INDEX cannot process special characters

2022-03-29 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng closed HUDI-3408.
--
Resolution: Won't Fix

This is not a common problem; closing it.

> fixed the bug that BUCKET_INDEX cannot process special characters
> -
>
> Key: HUDI-3408
> URL: https://issues.apache.org/jira/browse/HUDI-3408
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.10.1
> Environment: spark3.1.1
>Reporter: Tao Meng
>Priority: Major
>  Labels: pull-request-available
>
> BucketIdentifier uses split(":") to split recordKeyName and recordKeyValue; if the 
> current recordKeyValue is ":::", the split above gives a wrong result.
> test("test Bucket") {
> withTempDir { tmp =>
> Seq("mor").foreach { tableType =>
> val tableName = generateTableName
> val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
> spark.sql(
> s"""
> |create table $tableName (
> | id int, comb int, col0 int, col1 bigint, col2 float, col3 double, col4 
> decimal(10,4), col5 string, col6 date, col7 timestamp, col8 boolean, col9 
> binary, par date
> |) using hudi
> | location '$tablePath'
> | options (
> | type = '$tableType',
> | primaryKey = 'id,col0,col5',
> | preCombineField = 'comb',
> | hoodie.index.type = 'BUCKET',
> | hoodie.bucket.index.num.buckets = '900'
> | )
> | partitioned by (par)
> """.stripMargin)
> spark.sql(
> s"""
> | insert into $tableName values
> | (1,1,11,11,101.01,1001.0001,11.0001,':::','2021-12-25','2021-12-25 
> 12:01:01',true,'a01','2021-12-25'),
> | 
> (2,2,12,12,102.02,1002.0002,12.0002,'a02','2021-12-25','2021-12-25
>  12:02:02',true,'a02','2021-12-25'),
> | 
> (3,3,13,13,103.03,1003.0003,13.0003,'a03','2021-12-25','2021-12-25
>  12:03:03',false,'a03','2021-12-25'),
> | 
> (4,4,14,14,104.04,1004.0004,14.0004,'a04','2021-12-26','2021-12-26
>  12:04:04',true,'a04','2021-12-26'),
> | 
> (5,5,15,15,105.05,1005.0005,15.0005,'a05','2021-12-26','2021-12-26
>  12:05:05',false,'a05','2021-12-26')
> |""".stripMargin)
> }
> }
> }
>  
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
>     at 
> org.apache.hudi.index.bucket.BucketIdentifier.lambda$getBucketId$2(BucketIdentifier.java:46)
>     at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
>     at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
>     at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>     at 
> java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>     at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>     at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>     at 
> org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:46)
>     at 
> org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:36)
>     at 
> org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:80)
>     at 
> org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:70)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:125)
>     ... 25 more
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3742) Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression

2022-03-29 Thread Tao Meng (Jira)
Tao Meng created HUDI-3742:
--

 Summary: Enable parquet enableVectorizedReader for spark incremental read to prevent perf regression
 Key: HUDI-3742
 URL: https://issues.apache.org/jira/browse/HUDI-3742
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark
Reporter: Tao Meng
 Fix For: 0.11.0


Currently we disable the parquet enableVectorizedReader for MOR incremental reads

and set "spark.sql.parquet.recordLevelFilter.enabled" = "true" to achieve data filtering,

which is slow.
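
For illustration only, the two Spark settings involved in the trade-off described above (this snippet just shows the knobs, it is not the Hudi change itself):
{code:java}
// Vectorized parquet reading is fast; record-level filtering in the parquet reader is slow.
// The goal of this issue is to keep the vectorized reader enabled for incremental reads.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "false")
{code}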



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3719) High performance costs of AvroSerializer in Datasource writing

2022-03-26 Thread Tao Meng (Jira)
Tao Meng created HUDI-3719:
--

 Summary: High performance costs of AvroSerializer in Datasource 
writing
 Key: HUDI-3719
 URL: https://issues.apache.org/jira/browse/HUDI-3719
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark
Reporter: Tao Meng
 Fix For: 0.11.0


https://github.com/apache/hudi/issues/5107



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column

2022-03-16 Thread Tao Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Meng updated HUDI-3646:
---
Description: 
Currently, when we use Spark SQL to update a Hudi table, we find that Hudi changes the nullability attribute of a column.

For example:
{code:java}
// code placeholder
 val tableName = generateTableName
 val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
 // create table
 spark.sql(
   s"""
  |create table $tableName (
  |  id int,
  |  name string,
  |  price double,
  |  ts long
  |) using hudi
  | location '$tablePath'
  | options (
  |  type = '$tableType',
  |  primaryKey = 'id',
  |  preCombineField = 'ts'
  | )
""".stripMargin)
 // insert data to table
 spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
 spark.sql(s"select * from $tableName").printSchema()

 // update data
 spark.sql(s"update $tableName set price = 20 where id = 1")
 spark.sql(s"select * from $tableName").printSchema() {code}
 

 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 *|-- price: double (nullable = true)*
 |-- ts: long (nullable = true)

 

 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 *|-- price: double (nullable = false )*
 |-- ts: long (nullable = true)
 
The nullable attribute of price has been changed to false; this is not the result we want.
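
A minimal sketch of how the flipped flag can be detected and relaxed on the Spark side (illustrative only, not the Hudi fix; df stands for any DataFrame read from the table):
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

// List the columns that are currently marked non-nullable, e.g. to spot the flipped price flag.
def nonNullableColumns(df: DataFrame): Seq[String] =
  df.schema.fields.collect { case StructField(name, _, false, _) => name }

// Rebuild the DataFrame with every column marked nullable again.
// Relaxing nullability is always safe; tightening it is not.
def relaxNullability(df: DataFrame): DataFrame = {
  val relaxed = StructType(df.schema.fields.map(_.copy(nullable = true)))
  df.sparkSession.createDataFrame(df.rdd, relaxed)
}
{code}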

  was:
now, when we use sparksql to update hudi table, we find that  hudi will change 
the nullability attribute of a column

eg:

```

test("Test Update Table") {
withTempDir { tmp =>
Seq("cow", "mor").foreach {tableType =>
val tableName = generateTableName
val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
// create table
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long
|) using hudi
| location '$tablePath'
| options (
| type = '$tableType',
| primaryKey = 'id',
| preCombineField = 'ts'
| )
""".stripMargin)
// insert data to table
spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
spark.sql(s"select * from $tableName").printSchema()
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 10.0, 1000)
)

// update data
spark.sql(s"update $tableName set price = 20 where id = 1")
spark.sql(s"select * from $tableName").printSchema()
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 20.0, 1000)
)

// update data
spark.sql(s"update $tableName set price = price * 2 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 40.0, 1000)
)
}
}
}
}

```

 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 *|-- price: double (nullable = true)*
 |-- ts: long (nullable = true)

 
|-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 *|-- price: double (nullable = false)*
 |-- ts: long (nullable = true)

the nullable attribute of price has been changed to false, This is not the 
result we want


> The Hudi update syntax should not modify the nullability attribute of a column
> --
>
> Key: HUDI-3646
> URL: https://issues.apache.org/jira/browse/HUDI-3646
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Affects Versions: 0.10.1
> Environment: spark3.1.2
>Reporter: Tao Meng
>Priority: Minor
> Fix For: 0.12.0
>
>
> now, when we use sparksql to update hudi table, we find that  hudi will 
> change the nullability attribute of a column
> eg:
> {code:java}
> // code placeholder
>  val tableName = generateTableName
>  val tablePath = s"${new Path(tmp.getCanonicalPath, 
> tableName).toUri.toString}"
>  // create table
>  spark.sql(
>s"""
>   |create table $tableName (
>   |  id int,

[jira] [Created] (HUDI-3646) The Hudi update syntax should not modify the nullability attribute of a column

2022-03-16 Thread Tao Meng (Jira)
Tao Meng created HUDI-3646:
--

 Summary: The Hudi update syntax should not modify the nullability 
attribute of a column
 Key: HUDI-3646
 URL: https://issues.apache.org/jira/browse/HUDI-3646
 Project: Apache Hudi
  Issue Type: Bug
  Components: spark-sql
Affects Versions: 0.10.1
 Environment: spark3.1.2
Reporter: Tao Meng
 Fix For: 0.12.0


Currently, when we use Spark SQL to update a Hudi table, we find that Hudi changes the nullability attribute of a column.

For example:

```

test("Test Update Table") {
withTempDir { tmp =>
Seq("cow", "mor").foreach {tableType =>
val tableName = generateTableName
val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
// create table
spark.sql(
s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long
|) using hudi
| location '$tablePath'
| options (
| type = '$tableType',
| primaryKey = 'id',
| preCombineField = 'ts'
| )
""".stripMargin)
// insert data to table
spark.sql(s"insert into $tableName select 1, 'a1', 10, 1000")
spark.sql(s"select * from $tableName").printSchema()
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 10.0, 1000)
)

// update data
spark.sql(s"update $tableName set price = 20 where id = 1")
spark.sql(s"select * from $tableName").printSchema()
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 20.0, 1000)
)

// update data
spark.sql(s"update $tableName set price = price * 2 where id = 1")
checkAnswer(s"select id, name, price, ts from $tableName")(
Seq(1, "a1", 40.0, 1000)
)
}
}
}
}

```

 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 *|-- price: double (nullable = true)*
 |-- ts: long (nullable = true)

 
|-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 *|-- price: double (nullable = false)*
 |-- ts: long (nullable = true)

The nullable attribute of price has been changed to false; this is not the result we want.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3603) Support read DateType for hive2/hive3

2022-03-10 Thread Tao Meng (Jira)
Tao Meng created HUDI-3603:
--

 Summary: Support read  DateType  for hive2/hive3
 Key: HUDI-3603
 URL: https://issues.apache.org/jira/browse/HUDI-3603
 Project: Apache Hudi
  Issue Type: Bug
  Components: hive
Affects Versions: 0.10.1
Reporter: Tao Meng
 Fix For: 0.11.0


Currently Hudi only supports reading DateType for Hive 2; we should support reading DateType for both Hive 2 and Hive 3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy

2022-03-06 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501953#comment-17501953
 ] 

Tao Meng commented on HUDI-3355:


[~suryaprasanna] 

If you have free time, can you try this PR: 
[https://github.com/apache/hudi/pull/4962]

preWrite: we record the pending clustering commits.

preCommit: we pick the clustering commits that completed during this operation's write window and check them for conflicts.

With this PR, C5 should fail, since it conflicts with C3.
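
A rough sketch of that check, modelled with plain data instead of Hudi's timeline classes (the Instant case class and its fields are made up for illustration):
{code:java}
// Illustrative model of the preWrite/preCommit conflict check described above.
case class Instant(ts: String, action: String, completedAt: Option[Long])

// preWrite: remember which clustering (replacecommit) instants are still pending.
def pendingClustering(timeline: Seq[Instant]): Set[String] =
  timeline.filter(i => i.action == "replacecommit" && i.completedAt.isEmpty).map(_.ts).toSet

// preCommit: any of those clustering instants that completed while this write was running
// must be checked for file-group conflicts before the write is allowed to commit.
def clusteringToCheck(timeline: Seq[Instant],
                      pendingAtStart: Set[String],
                      writeStartedAt: Long): Seq[Instant] =
  timeline.filter(i => pendingAtStart.contains(i.ts) && i.completedAt.exists(_ >= writeStartedAt))
{code}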

> Issue with out of order commits in the timeline when ingestion writers using 
> SparkAllowUpdateStrategy
> -
>
> Key: HUDI-3355
> URL: https://issues.apache.org/jira/browse/HUDI-3355
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Assignee: tao meng
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Out of order commits can happen between two commits C1 and C2. If timestamp 
> of C2 is greater than C1's and completed before C1.
> In our use case, we are running clustering in async, and want ingestion 
> writers to be given preference over clustering.
> Following are the configs used by the ingestion writer
> {noformat}
> "hoodie.clustering.updates.strategy": 
> "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy"
> "hoodie.clustering.rollback.pending.replacecommit.on.conflict": 
> false{noformat}
> This would allow ingestion writers to ignore pending replacecommits on the 
> timeline and continue writing.
>  
> Consider the following scenario
> {code:java}
> At instant1
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.inflight -> Started 
> {code}
> {code:java}
> At instant2
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit -> Completed {code}
> {code:java}
> At instant3
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight -> Started{code}
> {code:java}
> At instant4
> C1.commit
> C2.commit
> C3.replacecommit -> Completed
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight(continuing) {code}
> {code:java}
> At instant5
> C1.commit
> C2.commit
> C3.replacecommit
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.commit -> Completed (It has conflict with C3 but since it has lower 
> timestamp than C4, C3 is not considered during conflict resolution){code}
>  
> Here, the lastSuccessfulCommit value that is seen by C5 is C4, even though 
> the C3 is the one that is committed last.
> Ideally when sorting the timeline we should consider the transition times. 
> So, timeline should look something like,
> {code:java}
> C1.commit
> C2.commit
> C4.commit(lastSuccessfulCommit seen by C5)
> C3.replacecommit 
> C5.inflight{code}
> So, in this case when the C5 is about to complete, it will consider all the 
> commits that are completed after C4 which will be C3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy

2022-03-04 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501653#comment-17501653
 ] 

Tao Meng commented on HUDI-3355:


[~suryaprasanna] 

I tried to understand the problem; do you mean that:

1) start time: c3 (clustering) < c4 < c5

2) c4 does not conflict with c3 (clustering), so c4 can commit successfully

3) c5 conflicts with c3 (but Hudi does not check that), so c5 commits successfully (in fact c5 should fail)

 

 

> Issue with out of order commits in the timeline when ingestion writers using 
> SparkAllowUpdateStrategy
> -
>
> Key: HUDI-3355
> URL: https://issues.apache.org/jira/browse/HUDI-3355
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Assignee: tao meng
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Out of order commits can happen between two commits C1 and C2. If timestamp 
> of C2 is greater than C1's and completed before C1.
> In our use case, we are running clustering in async, and want ingestion 
> writers to be given preference over clustering.
> Following are the configs used by the ingestion writer
> {noformat}
> "hoodie.clustering.updates.strategy": 
> "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy"
> "hoodie.clustering.rollback.pending.replacecommit.on.conflict": 
> false{noformat}
> This would allow ingestion writers to ignore pending replacecommits on the 
> timeline and continue writing.
>  
> Consider the following scenario
> {code:java}
> At instant1
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.inflight -> Started 
> {code}
> {code:java}
> At instant2
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit -> Completed {code}
> {code:java}
> At instant3
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight -> Started{code}
> {code:java}
> At instant4
> C1.commit
> C2.commit
> C3.replacecommit -> Completed
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight(continuing) {code}
> {code:java}
> At instant5
> C1.commit
> C2.commit
> C3.replacecommit
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.commit -> Completed (It has conflict with C3 but since it has lower 
> timestamp than C4, C3 is not considered during conflict resolution){code}
>  
> Here, the lastSuccessfulCommit value that is seen by C5 is C4, even though 
> the C3 is the one that is committed last.
> Ideally when sorting the timeline we should consider the transition times. 
> So, timeline should look something like,
> {code:java}
> C1.commit
> C2.commit
> C4.commit(lastSuccessfulCommit seen by C5)
> C3.replacecommit 
> C5.inflight{code}
> So, in this case when the C5 is about to complete, it will consider all the 
> commits that are completed after C4 which will be C3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3355) Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy

2022-03-04 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501645#comment-17501645
 ] 

Tao Meng commented on HUDI-3355:


no problem

> Issue with out of order commits in the timeline when ingestion writers using 
> SparkAllowUpdateStrategy
> -
>
> Key: HUDI-3355
> URL: https://issues.apache.org/jira/browse/HUDI-3355
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Surya Prasanna Yalla
>Assignee: tao meng
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Out of order commits can happen between two commits C1 and C2. If timestamp 
> of C2 is greater than C1's and completed before C1.
> In our use case, we are running clustering in async, and want ingestion 
> writers to be given preference over clustering.
> Following are the configs used by the ingestion writer
> {noformat}
> "hoodie.clustering.updates.strategy": 
> "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy"
> "hoodie.clustering.rollback.pending.replacecommit.on.conflict": 
> false{noformat}
> This would allow ingestion writers to ignore pending replacecommits on the 
> timeline and continue writing.
>  
> Consider the following scenario
> {code:java}
> At instant1
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.inflight -> Started 
> {code}
> {code:java}
> At instant2
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit -> Completed {code}
> {code:java}
> At instant3
> C1.commit
> C2.commit
> C3.replacecommit.inflight 
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight -> Started{code}
> {code:java}
> At instant4
> C1.commit
> C2.commit
> C3.replacecommit -> Completed
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.inflight(continuing) {code}
> {code:java}
> At instant5
> C1.commit
> C2.commit
> C3.replacecommit
> C4.commit (lastSuccessfulCommit seen by C5)
> C5.commit -> Completed (It has conflict with C3 but since it has lower 
> timestamp than C4, C3 is not considered during conflict resolution){code}
>  
> Here, the lastSuccessfulCommit value that is seen by C5 is C4, even though 
> the C3 is the one that is committed last.
> Ideally when sorting the timeline we should consider the transition times. 
> So, timeline should look something like,
> {code:java}
> C1.commit
> C2.commit
> C4.commit(lastSuccessfulCommit seen by C5)
> C3.replacecommit 
> C5.inflight{code}
> So, in this case when the C5 is about to complete, it will consider all the 
> commits that are completed after C4 which will be C3.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-02-18 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494502#comment-17494502
 ] 

Tao Meng commented on HUDI-2762:


[~alexey.kudinkin]  This is Hive's problem: Hive filters out all files whose names start with ".".

See org.apache.hadoop.hive.common.FileUtils in Hive.
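
For reference, a hedged sketch of the kind of hidden-file filtering Hive applies (illustrative, not the actual FileUtils code): paths whose name starts with "." or "_" are treated as hidden and skipped.
{code:java}
import org.apache.hadoop.fs.{Path, PathFilter}

// Illustrative filter matching the behaviour described above: hidden files are skipped.
val hiddenFileFilter: PathFilter = new PathFilter {
  override def accept(p: Path): Boolean = {
    val name = p.getName
    !name.startsWith(".") && !name.startsWith("_")
  }
}
{code}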

 

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive
>Reporter: Rajesh Mahindra
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert-only log files. Both _ro and _rt tables are returning 0 
> rows. However, HMS does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-2762) Ensure hive can query insert only logs in MOR

2022-02-18 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494502#comment-17494502
 ] 

Tao Meng edited comment on HUDI-2762 at 2/18/22, 9:59 AM:
--

[~alexey.kudinkin]  This is Hive's problem: Hive filters out all files whose names start with ".".

See org.apache.hadoop.hive.common.FileUtils in Hive.

 


was (Author: mengtao):
[~alexey.kudinkin]  i This problem is hive's problem. Hive will filter out all 
the file which startwith .

 see org.apache.hadoop.hive.common.FileUtils in hive

 

> Ensure hive can query insert only logs in MOR
> -
>
> Key: HUDI-2762
> URL: https://issues.apache.org/jira/browse/HUDI-2762
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive
>Reporter: Rajesh Mahindra
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, we are able to query MOR tables that have base parquet files with 
> inserts and log files with updates. However, we are currently unable to query 
> tables with insert only log files. Both _ro and _rt tables are returning 0 
> rows. However, hms does create the table and partitions for the table. 
>  
> One sample table is here:
> [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/=us-east-2]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3408) fixed the bug that BUCKET_INDEX cannot process special characters

2022-02-10 Thread Tao Meng (Jira)
Tao Meng created HUDI-3408:
--

 Summary: fixed the bug that BUCKET_INDEX cannot process special 
characters
 Key: HUDI-3408
 URL: https://issues.apache.org/jira/browse/HUDI-3408
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Affects Versions: 0.10.1
 Environment: spark3.1.1

Reporter: Tao Meng


BucketIdentifier uses split(":") to split the record key into recordKeyName and
recordKeyValue; if the current recordKeyValue is ":::", the split above gives a
wrong result, as sketched below.
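
A minimal sketch of the failure and of one possible fix (illustrative only, not
the actual BucketIdentifier code); the reproduction test and the resulting stack
trace follow below:

{code:scala}
// For a complex key the field is encoded as "name:value". Splitting on every
// ":" drops the value entirely when the value itself is ":::", because Java's
// split() discards trailing empty strings -- hence the
// ArrayIndexOutOfBoundsException: 1 seen in the stack trace.
val field = "col5::::"          // key name "col5", key value ":::"

field.split(":")                // Array("col5")        -> element 1 is missing
field.split(":", 2)             // Array("col5", ":::") -> name and full value preserved
{code}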

test("test Bucket") {
withTempDir { tmp =>
Seq("mor").foreach { tableType =>
val tableName = generateTableName
val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
spark.sql(
s"""
|create table $tableName (
| id int, comb int, col0 int, col1 bigint, col2 float, col3 double, col4 
decimal(10,4), col5 string, col6 date, col7 timestamp, col8 boolean, col9 
binary, par date
|) using hudi
| location '$tablePath'
| options (
| type = '$tableType',
| primaryKey = 'id,col0,col5',
| preCombineField = 'comb',
| hoodie.index.type = 'BUCKET',
| hoodie.bucket.index.num.buckets = '900'
| )
| partitioned by (par)
""".stripMargin)
spark.sql(
s"""
| insert into $tableName values
| (1,1,11,11,101.01,1001.0001,11.0001,':::','2021-12-25','2021-12-25 
12:01:01',true,'a01','2021-12-25'),
| 
(2,2,12,12,102.02,1002.0002,12.0002,'a02','2021-12-25','2021-12-25 
12:02:02',true,'a02','2021-12-25'),
| 
(3,3,13,13,103.03,1003.0003,13.0003,'a03','2021-12-25','2021-12-25 
12:03:03',false,'a03','2021-12-25'),
| 
(4,4,14,14,104.04,1004.0004,14.0004,'a04','2021-12-26','2021-12-26 
12:04:04',true,'a04','2021-12-26'),
| 
(5,5,15,15,105.05,1005.0005,15.0005,'a05','2021-12-26','2021-12-26 
12:05:05',false,'a05','2021-12-26')
|""".stripMargin)
}
}
}

 

Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at 
org.apache.hudi.index.bucket.BucketIdentifier.lambda$getBucketId$2(BucketIdentifier.java:46)
    at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
    at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at 
java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at 
org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:46)
    at 
org.apache.hudi.index.bucket.BucketIdentifier.getBucketId(BucketIdentifier.java:36)
    at 
org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:80)
    at 
org.apache.hudi.index.bucket.SparkBucketIndex$1.computeNext(SparkBucketIndex.java:70)
    at 
org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:125)
    ... 25 more

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3347) Updating table schema fails w/ hms mode w/ schema evolution

2022-02-08 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489291#comment-17489291
 ] 

Tao Meng commented on HUDI-3347:


[~shivnarayan] 

This problem seems to be caused by a wrong version of the Hive jar package.

I tried to reproduce the problem with hive2.3/hive3.1, but no problem happened.

I checked out the Hive source code; the interface
org.apache.hadoop.hive.metastore.IMetaStoreClient.alter_table_with_environmentContext
has not changed since hive2.0.

Could you please paste your Hive/Spark version numbers?

> Updating table schema fails w/ hms mode w/ schema evolution
> ---
>
> Key: HUDI-3347
> URL: https://issues.apache.org/jira/browse/HUDI-3347
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive-sync
>Reporter: sivabalan narayanan
>Assignee: tao meng
>Priority: Critical
> Fix For: 0.11.0
>
>
> When table schema got upgraded with a new batch of write, hms mode sync 
> fails. 
>  
> steps to reproduce using our docker demo set up. 
> I used 0.10.0 to test this out.
> adhoc-1
> {code:java}
> $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] 
> --driver-class-path $HADOOP_CONF_DIR --conf 
> spark.sql.hive.convertMetastoreParquet=false --deploy-mode client 
> --driver-memory 1G --executor-memory 3G --num-executors 1 --packages 
> org.apache.spark:spark-avro_2.11:2.4.4 {code}
> {code:java}
> import java.sql.Timestamp
> import spark.implicits._
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val df1 = Seq(
>         ("row1", 1, "part1" ,1578283932000L ),
>         ("row2", 1, "part1", 1578283942000L)
>       ).toDF("row", "ppath", "preComb","eventTime")
>  df1.write.format("hudi").
>         options(getQuickstartWriteConfigs).
>         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>         option(RECORDKEY_FIELD_OPT_KEY, "row").
>         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>         option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>         option("hoodie.datasource.hive_sync.mode","hms").
>         option("hoodie.datasource.hive_sync.database","default").
>         option("hoodie.datasource.hive_sync.table","timestamp_tbl1").
>         
> option("hoodie.datasource.hive_sync.partition_fields","year,month,day").
>         option("hoodie.datasource.hive_sync.enable","true").
>         option(TABLE_NAME, "timestamp_tbl1").
>         mode(Overwrite).
>         save("/tmp/hudi_timestamp_tbl1")
>  {code}
> // evol schema
> {code:java}
> val df2 = Seq(
>         ("row1", 1, "part1" ,1678283932000L, "abcd" ),
>         ("row2", 1, "part1", 1678283942000L, "defg")
>       ).toDF("row", "ppath", "preComb", "eventTime", "randomStr")
>  df2.write.format("hudi").
>         options(getQuickstartWriteConfigs).
>         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>         option(RECORDKEY_FIELD_OPT_KEY, "row").
>         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>         option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>         option("hoodie.datasource.hive_sync.mode","hms").
>         option("hoodie.datasource.hive_sync.database","default").
>         option("hoodie.datasource.hive_sync.table","timestamp_tbl1").
>         
> option("hoodie.datasource.hive_sync.partition_fields","year,month,day").
>         option("hoodie.datasource.hive_sync.enable","true").
>         option(TABLE_NAME, "timestamp_tbl1").
>         mode(Append).
>         save("/tmp/hudi_timestamp_tbl1")
>  {code}
> stacktrace
> {code:java}
> scala>  df2.write.format("hudi").
>      |         options(getQuickstartWriteConfigs).
>      |         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>      |         option(RECORDKEY_FIELD_OPT_KEY, "row").
>      |         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>      |         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>      |         
> 

[jira] [Comment Edited] (HUDI-3347) Updating table schema fails w/ hms mode w/ schema evolution

2022-01-31 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484721#comment-17484721
 ] 

Tao Meng edited comment on HUDI-3347 at 1/31/22, 2:51 PM:
--

[~shivnarayan]   yes, will do it.  Pls assign it to me, thanks


was (Author: mengtao):
yes, will do it.

> Updating table schema fails w/ hms mode w/ schema evolution
> ---
>
> Key: HUDI-3347
> URL: https://issues.apache.org/jira/browse/HUDI-3347
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive-sync
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.11.0
>
>
> When table schema got upgraded with a new batch of write, hms mode sync 
> fails. 
>  
> steps to reproduce using our docker demo set up. 
> I used 0.10.0 to test this out.
> adhoc-1
> {code:java}
> $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] 
> --driver-class-path $HADOOP_CONF_DIR --conf 
> spark.sql.hive.convertMetastoreParquet=false --deploy-mode client 
> --driver-memory 1G --executor-memory 3G --num-executors 1 --packages 
> org.apache.spark:spark-avro_2.11:2.4.4 {code}
> {code:java}
> import java.sql.Timestamp
> import spark.implicits._
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val df1 = Seq(
>         ("row1", 1, "part1" ,1578283932000L ),
>         ("row2", 1, "part1", 1578283942000L)
>       ).toDF("row", "ppath", "preComb","eventTime")
>  df1.write.format("hudi").
>         options(getQuickstartWriteConfigs).
>         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>         option(RECORDKEY_FIELD_OPT_KEY, "row").
>         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>         option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>         option("hoodie.datasource.hive_sync.mode","hms").
>         option("hoodie.datasource.hive_sync.database","default").
>         option("hoodie.datasource.hive_sync.table","timestamp_tbl1").
>         
> option("hoodie.datasource.hive_sync.partition_fields","year,month,day").
>         option("hoodie.datasource.hive_sync.enable","true").
>         option(TABLE_NAME, "timestamp_tbl1").
>         mode(Overwrite).
>         save("/tmp/hudi_timestamp_tbl1")
>  {code}
> // evol schema
> {code:java}
> val df2 = Seq(
>         ("row1", 1, "part1" ,1678283932000L, "abcd" ),
>         ("row2", 1, "part1", 1678283942000L, "defg")
>       ).toDF("row", "ppath", "preComb", "eventTime", "randomStr")
>  df2.write.format("hudi").
>         options(getQuickstartWriteConfigs).
>         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>         option(RECORDKEY_FIELD_OPT_KEY, "row").
>         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>         option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>         option("hoodie.datasource.hive_sync.mode","hms").
>         option("hoodie.datasource.hive_sync.database","default").
>         option("hoodie.datasource.hive_sync.table","timestamp_tbl1").
>         
> option("hoodie.datasource.hive_sync.partition_fields","year,month,day").
>         option("hoodie.datasource.hive_sync.enable","true").
>         option(TABLE_NAME, "timestamp_tbl1").
>         mode(Append).
>         save("/tmp/hudi_timestamp_tbl1")
>  {code}
> stacktrace
> {code:java}
> scala>  df2.write.format("hudi").
>      |         options(getQuickstartWriteConfigs).
>      |         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>      |         option(RECORDKEY_FIELD_OPT_KEY, "row").
>      |         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>      |         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>      |         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>      |         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>      |         
> option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>      |     

[jira] [Commented] (HUDI-3347) Updating table schema fails w/ hms mode w/ schema evolution

2022-01-31 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17484721#comment-17484721
 ] 

Tao Meng commented on HUDI-3347:


yes, will do it.

> Updating table schema fails w/ hms mode w/ schema evolution
> ---
>
> Key: HUDI-3347
> URL: https://issues.apache.org/jira/browse/HUDI-3347
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive-sync
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.11.0
>
>
> When table schema got upgraded with a new batch of write, hms mode sync 
> fails. 
>  
> steps to reproduce using our docker demo set up. 
> I used 0.10.0 to test this out.
> adhoc-1
> {code:java}
> $SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] 
> --driver-class-path $HADOOP_CONF_DIR --conf 
> spark.sql.hive.convertMetastoreParquet=false --deploy-mode client 
> --driver-memory 1G --executor-memory 3G --num-executors 1 --packages 
> org.apache.spark:spark-avro_2.11:2.4.4 {code}
> {code:java}
> import java.sql.Timestamp
> import spark.implicits._
> import org.apache.hudi.QuickstartUtils._
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceReadOptions._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val df1 = Seq(
>         ("row1", 1, "part1" ,1578283932000L ),
>         ("row2", 1, "part1", 1578283942000L)
>       ).toDF("row", "ppath", "preComb","eventTime")
>  df1.write.format("hudi").
>         options(getQuickstartWriteConfigs).
>         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>         option(RECORDKEY_FIELD_OPT_KEY, "row").
>         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>         option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>         option("hoodie.datasource.hive_sync.mode","hms").
>         option("hoodie.datasource.hive_sync.database","default").
>         option("hoodie.datasource.hive_sync.table","timestamp_tbl1").
>         
> option("hoodie.datasource.hive_sync.partition_fields","year,month,day").
>         option("hoodie.datasource.hive_sync.enable","true").
>         option(TABLE_NAME, "timestamp_tbl1").
>         mode(Overwrite).
>         save("/tmp/hudi_timestamp_tbl1")
>  {code}
> // evol schema
> {code:java}
> val df2 = Seq(
>         ("row1", 1, "part1" ,1678283932000L, "abcd" ),
>         ("row2", 1, "part1", 1678283942000L, "defg")
>       ).toDF("row", "ppath", "preComb", "eventTime", "randomStr")
>  df2.write.format("hudi").
>         options(getQuickstartWriteConfigs).
>         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>         option(RECORDKEY_FIELD_OPT_KEY, "row").
>         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>         option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>         option("hoodie.datasource.hive_sync.mode","hms").
>         option("hoodie.datasource.hive_sync.database","default").
>         option("hoodie.datasource.hive_sync.table","timestamp_tbl1").
>         
> option("hoodie.datasource.hive_sync.partition_fields","year,month,day").
>         option("hoodie.datasource.hive_sync.enable","true").
>         option(TABLE_NAME, "timestamp_tbl1").
>         mode(Append).
>         save("/tmp/hudi_timestamp_tbl1")
>  {code}
> stacktrace
> {code:java}
> scala>  df2.write.format("hudi").
>      |         options(getQuickstartWriteConfigs).
>      |         option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
>      |         option(RECORDKEY_FIELD_OPT_KEY, "row").
>      |         option(PARTITIONPATH_FIELD_OPT_KEY, "ppath").
>      |         
> option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>      |         
> option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
>      |         
> option("hoodie.deltastreamer.keygen.timebased.output.dateformat","/MM/dd").
>      |         
> option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
>      |         option("hoodie.datasource.hive_sync.mode","hms").
>      |         option("hoodie.datasource.hive_sync.database","default").
>      |  

[jira] [Commented] (HUDI-2873) Support optimize data layout by sql and make the build more fast

2022-01-18 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477743#comment-17477743
 ] 

Tao Meng commented on HUDI-2873:


[~shibei]  Do you have WeChat? Please add me: 1037817390

> Support optimize data layout by sql and make the build more fast
> 
>
> Key: HUDI-2873
> URL: https://issues.apache.org/jira/browse/HUDI-2873
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Performance, spark
>Reporter: tao meng
>Assignee: shibei
>Priority: Critical
>  Labels: sev:high
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2873) Support optimize data layout by sql and make the build more fast

2022-01-17 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477214#comment-17477214
 ] 

Tao Meng commented on HUDI-2873:


[~alexey.kudinkin]  [~shibei] 

1) Support optimizing the data layout via Spark SQL, just like Delta Lake:
OPTIMIZE xx_table ZORDER/HILBERT BY col1, col2;

2) Introduce a new write operation to rewrite table data directly; at present,
the performance of the clustering operation is slightly worse than that of a
direct overwrite.

 

 

> Support optimize data layout by sql and make the build more fast
> 
>
> Key: HUDI-2873
> URL: https://issues.apache.org/jira/browse/HUDI-2873
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Performance, spark
>Reporter: tao meng
>Priority: Major
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2645) Rewrite Zoptimize and other files in scala into Java

2022-01-17 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477205#comment-17477205
 ] 

Tao Meng commented on HUDI-2645:


[~shibei]  vc means we had better remove the Scala files in the spark-client
module. We have already rewritten most of the z-optimize code in Java; only
RangeSample is still a Scala file.

> Rewrite Zoptimize and other files in scala into Java
> 
>
> Key: HUDI-2645
> URL: https://issues.apache.org/jira/browse/HUDI-2645
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3237) ALTER TABLE column type change fails select query

2022-01-13 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17475398#comment-17475398
 ] 

Tao Meng commented on HUDI-3237:


[~biyan900...@gmail.com]  [~xushiyan]  I agree that we should disable this
ability to change a column's type.

I am currently working on RFC-33; this feature is already supported in that work.

> ALTER TABLE column type change fails select query
> -
>
> Key: HUDI-3237
> URL: https://issues.apache.org/jira/browse/HUDI-3237
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.11.0
>
> Attachments: image-2022-01-13-17-04-09-038.png
>
>
> {code:sql}
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> DESC cow_nonpt_nonpcf_tbl;
> -- shows id int
> ALTER TABLE cow_nonpt_nonpcf_tbl change column id id bigint;
> DESC cow_nonpt_nonpcf_tbl;
> -- shows id bigint
> -- this works fine so far
> select * from cow_nonpt_nonpcf_tbl;
> -- throws exception
> {code}
> {code}
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> file:///opt/spark-warehouse/cow_nonpt_nonpcf_tbl/ff3c68e6-84d4-4a8a-8bc8-cc58736847aa-0_0-7-7_20220112182401452.parquet.
>  Column: [id], Expected: bigint, Found: INT32
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> ... 20 more
> {code}
> reported while testing on 0.10.1-rc1 (spark 3.0.3, 3.1.2)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2874) hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use hive/presto

2022-01-12 Thread Tao Meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17474530#comment-17474530
 ] 

Tao Meng commented on HUDI-2874:


[~shivnarayan]  yes, we have already fixed this issue in 0.10

> hudi should remove the temp file which create by 
> HoodieMergedLogRecordScanner, when we use hive/presto
> --
>
> Key: HUDI-2874
> URL: https://issues.apache.org/jira/browse/HUDI-2874
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.9.0
> Environment: hive3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Priority: Major
> Fix For: 0.11.0
>
>
> When we use Hive/Presto to query a MOR table, Hudi should remove the temp
> files created by HoodieMergedLogRecordScanner.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3164) CTAS fails w/ UnsupportedOperationException when trying to modify immutable map in DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp

2022-01-04 Thread tao meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468646#comment-17468646
 ] 

tao meng commented on HUDI-3164:


[~shivnarayan]  leesf has already solved this problem; see
https://issues.apache.org/jira/browse/HUDI-3140

> CTAS fails w/ UnsupportedOperationException when trying to modify immutable 
> map in DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp
> -
>
> Key: HUDI-3164
> URL: https://issues.apache.org/jira/browse/HUDI-3164
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.11.0, 0.10.1
>
>
> CTAS fails w/ UnsupportedOperationException when trying to modify immutable 
> map in DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp. 
> with spark3.2 master. 
>  
> {code:java}
> val s = """
> create table catalog_sales 
> USING HUDI
> options (
>   type = 'cow',
>   primaryKey = 'cs_item_sk,cs_order_number'
> )
> LOCATION 'file:///tmp/catalog_sales_hudi'
> PARTITIONED BY (cs_sold_date_sk)
> AS SELECT * FROM catalog_sales_ext2  {code}
> stacktrace:
> {code:java}
> java.lang.UnsupportedOperationException
>   at java.util.Collections$UnmodifiableMap.put(Collections.java:1459)
>   at 
> org.apache.hudi.DataSourceUtils.mayBeOverwriteParquetWriteLegacyFormatProp(DataSourceUtils.java:323)
>   at 
> org.apache.hudi.spark3.internal.DefaultSource.getTable(DefaultSource.java:59)
>   at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:83)
>   at 
> org.apache.spark.sql.DataFrameWriter.getTable$1(DataFrameWriter.scala:280)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:296)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:478)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:109)
>   at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:91)
>  {code}
>  
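
A hedged sketch of the general fix pattern mentioned above (HUDI-3140 contains
the actual change; the config key below is only illustrative): copy the
unmodifiable options into a mutable map before overwriting any entry, instead of
calling put() on the map that was handed in.

{code:scala}
import java.util.{Collections, HashMap => JHashMap}

val handedIn: java.util.Map[String, String] =
  Collections.unmodifiableMap(new JHashMap[String, String]())

// handedIn.put("spark.sql.parquet.writeLegacyFormat", "true")
//   -> java.lang.UnsupportedOperationException, as in the stack trace above

val mutableProps = new JHashMap[String, String](handedIn)        // defensive copy
mutableProps.put("spark.sql.parquet.writeLegacyFormat", "true")  // safe to overwrite
{code}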



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3096) fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark

2021-12-21 Thread Tao Meng (Jira)
Tao Meng created HUDI-3096:
--

 Summary: fixed the bug that  the cow table(contains decimalType) 
write by flink cannot be read by spark
 Key: HUDI-3096
 URL: https://issues.apache.org/jira/browse/HUDI-3096
 Project: Apache Hudi
  Issue Type: Bug
  Components: Flink Integration
Affects Versions: 0.10.0
 Environment: flink  1.13.1
spark 3.1.1
Reporter: Tao Meng
 Fix For: 0.11.0


Now, Flink writes DecimalType as byte[] (binary).

When Spark reads that decimal type, if Spark finds that the precision of the
decimal is small, it treats the column as int/long, which causes the following
error:

 

Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet 
column cannot be converted in file 
hdfs://x/tmp/hudi/hudi_x/46d44c57-aa43-41e2-a8aa-76dcc9dac7e4_0-4-0_20211221201230.parquet.
 Column: [c7], Expected: decimal(10,4), Found: BINARY
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:517)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2059) When log exists in mor table, clustering is triggered. The query result shows that the update record in log is lost

2021-12-13 Thread tao meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458858#comment-17458858
 ] 

tao meng commented on HUDI-2059:


[~shivnarayan]  as HUDI-2170 is merged, it's OK to close this

> When log exists in mor table,  clustering is triggered. The query result 
> shows that the update record in log is lost
> 
>
> Key: HUDI-2059
> URL: https://issues.apache.org/jira/browse/HUDI-2059
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.8.0
> Environment: hadoop 3.1.1
> spark3.1.1/spark2.4.5
> hive3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available, sev:high
> Fix For: 0.11.0
>
>
> When a log file exists in a MOR table and clustering is triggered, the query
> result shows that the update records in the log are lost.
> The reason for this problem: Hudi uses HoodieFileSliceReader to read the table
> data and then does the clustering. HoodieFileSliceReader calls
> HoodieMergedLogRecordScanner.processNextRecord to merge the updated values with
> the old values; when that function is called the old values are kept and the
> updated values are discarded, which is wrong.
> test step:
> // step1 : create hudi mor table
> val df = spark.range(0, 1000).toDF("keyid")
>  .withColumn("col3", expr("keyid"))
>  .withColumn("age", lit(1))
>  .withColumn("p", lit(2))
> df.write.format("hudi").
>  option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL).
>  option(PRECOMBINE_FIELD_OPT_KEY, "col3").
>  option(RECORDKEY_FIELD_OPT_KEY, "keyid").
>  option(PARTITIONPATH_FIELD_OPT_KEY, "p").
>  option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert").
>  option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
> classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName).
>  option("hoodie.insert.shuffle.parallelism", "4").
>  option("hoodie.upsert.shuffle.parallelism", "4").
>  option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
>  .mode(SaveMode.Overwrite).save(basePath)
> // step2, update age where keyid < 5 to produce log files
> val df1 = spark.range(0, 5).toDF("keyid")
>  .withColumn("col3", expr("keyid"))
>  .withColumn("age", lit(1 + 1000))
>  .withColumn("p", lit(2))
> df1.write.format("hudi").
>  option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL).
>  option(PRECOMBINE_FIELD_OPT_KEY, "col3").
>  option(RECORDKEY_FIELD_OPT_KEY, "keyid").
>  option(PARTITIONPATH_FIELD_OPT_KEY, "p").
>  option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert").
>  option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
> classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName).
>  option("hoodie.insert.shuffle.parallelism", "4").
>  option("hoodie.upsert.shuffle.parallelism", "4").
>  option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
>  .mode(SaveMode.Append).save(basePath)
> // step3, do cluster inline
> val df2 = spark.range(6, 10).toDF("keyid")
>  .withColumn("col3", expr("keyid"))
>  .withColumn("age", lit(1 + 2000))
>  .withColumn("p", lit(2))
> df2.write.format("hudi").
>  option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL).
>  option(PRECOMBINE_FIELD_OPT_KEY, "col3").
>  option(RECORDKEY_FIELD_OPT_KEY, "keyid").
>  option(PARTITIONPATH_FIELD_OPT_KEY, "p").
>  option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert").
>  option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
> classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName).
>  option("hoodie.insert.shuffle.parallelism", "4").
>  option("hoodie.upsert.shuffle.parallelism", "4").
>  option("hoodie.parquet.small.file.limit", "0").
>  option("hoodie.clustering.inline", "true").
>  option("hoodie.clustering.inline.max.commits", "1").
>  option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> "1073741824").
>  option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
>  option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
> Long.MaxValue.toString)
>  .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
>  .mode(SaveMode.Append).save(basePath)
> spark.read.format("hudi")
>  .load(basePath).select("age").where("keyid = 0").show(100, false)
> +---+
> |age|
> +---+
> |1 |
> +---+
> the result is wrong, since we update the value of age to 1001 at step 2.
>  
>  
>  
>  
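
A minimal sketch of the intended merge semantics (illustrative; this is not the
actual HoodieMergedLogRecordScanner code, and the field names follow the test
above): when the same key shows up in both the base file and a log block, keep
the record with the larger precombine value instead of unconditionally keeping
the old one.

{code:scala}
case class Rec(keyid: Long, col3: Long, age: Int)

// Prefer the incoming (log) record when its precombine value is >= the existing one.
def merge(existing: Rec, incoming: Rec): Rec =
  if (incoming.col3 >= existing.col3) incoming else existing

val base   = Rec(keyid = 0, col3 = 0, age = 1)      // written at step 1
val update = Rec(keyid = 0, col3 = 0, age = 1001)   // log update from step 2

merge(base, update).age   // 1001 -- the step-2 update survives the clustering read
{code}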



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3001) clean up temp marker directory when finish bootstrap operation.

2021-12-13 Thread tao meng (Jira)
tao meng created HUDI-3001:
--

 Summary: clean up  temp  marker directory  when finish bootstrap 
operation.
 Key: HUDI-3001
 URL: https://issues.apache.org/jira/browse/HUDI-3001
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 0.10.0
 Environment: spark2.4.5
hadoop 3.1.1
Reporter: tao meng
 Fix For: 0.11.0


Now, when the bootstrap operation finishes, the temp marker directory is not
deleted.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner

2021-12-08 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2966:
---
Priority: Minor  (was: Major)

> Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
> ---
>
> Key: HUDI-2966
> URL: https://issues.apache.org/jira/browse/HUDI-2966
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: tao meng
>Priority: Minor
> Fix For: 0.11.0
>
>
> Add a TaskCompletionListener for HoodieMergeOnReadRDD to close the logScanner
> when the query is completed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner

2021-12-08 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2966:
---
Priority: Major  (was: Minor)

> Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
> ---
>
> Key: HUDI-2966
> URL: https://issues.apache.org/jira/browse/HUDI-2966
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: tao meng
>Priority: Major
> Fix For: 0.11.0
>
>
> Add a TaskCompletionListener for HoodieMergeOnReadRDD to close the logScanner
> when the query is completed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner

2021-12-08 Thread tao meng (Jira)
tao meng created HUDI-2966:
--

 Summary: Add TaskCompletionListener for HoodieMergeOnReadRDD to 
close logScanner
 Key: HUDI-2966
 URL: https://issues.apache.org/jira/browse/HUDI-2966
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: tao meng
 Fix For: 0.11.0


Add a TaskCompletionListener for HoodieMergeOnReadRDD to close the logScanner
when the query is completed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2958) Automatically set spark.sql.parquet.writelegacyformat. When using bulkinsert to insert data will contains decimal Type.

2021-12-08 Thread tao meng (Jira)
tao meng created HUDI-2958:
--

 Summary: Automatically set spark.sql.parquet.writelegacyformat. 
When using bulkinsert to insert data will contains decimal Type.
 Key: HUDI-2958
 URL: https://issues.apache.org/jira/browse/HUDI-2958
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: tao meng
 Fix For: 0.11.0


Now, by default, ParquetWriteSupport writes DecimalType to parquet as
int32/int64 when the precision of the DecimalType is < Decimal.MAX_LONG_DIGITS(),
but AvroParquetReader, which is used by HoodieParquetReader, cannot read
int32/int64 back as DecimalType. This leads to the following error:

Caused by: java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
    at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
    at 
org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:75)
    ..
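
A hedged sketch of the workaround this ticket is about (the ticket proposes
wiring the flag automatically inside Hudi; the snippet below only shows the
session-level flag being set by hand, with an illustrative app name):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bulk-insert-decimal")
  .getOrCreate()

// Force Spark to write decimals in the legacy (fixed-length binary) Parquet
// layout so that the Avro-based Hudi reader can read them back.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
{code}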



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2958) Automatically set spark.sql.parquet.writelegacyformat; When using bulkinsert to insert data which contains decimal Type.

2021-12-08 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2958:
---
Summary: Automatically set spark.sql.parquet.writelegacyformat; When using 
bulkinsert to insert data which contains decimal Type.  (was: Automatically set 
spark.sql.parquet.writelegacyformat. When using bulkinsert to insert data will 
contains decimal Type.)

> Automatically set spark.sql.parquet.writelegacyformat; When using bulkinsert 
> to insert data which contains decimal Type.
> 
>
> Key: HUDI-2958
> URL: https://issues.apache.org/jira/browse/HUDI-2958
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: tao meng
>Priority: Minor
> Fix For: 0.11.0
>
>
> Now, by default, ParquetWriteSupport writes DecimalType to parquet as
> int32/int64 when the precision of the DecimalType is < Decimal.MAX_LONG_DIGITS(),
> but AvroParquetReader, which is used by HoodieParquetReader, cannot read
> int32/int64 back as DecimalType. This leads to the following error:
> Caused by: java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>     at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
>     at 
> org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:75)
>     ..



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2901) Fixed the bug clustering jobs are not running in parallel

2021-12-01 Thread tao meng (Jira)
tao meng created HUDI-2901:
--

 Summary: Fixed the bug clustering jobs are not running in parallel
 Key: HUDI-2901
 URL: https://issues.apache.org/jira/browse/HUDI-2901
 Project: Apache Hudi
  Issue Type: Bug
  Components: Performance
Affects Versions: 0.9.0
 Environment: spark2.4.5
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.11.0


Fixed the bug that clustering jobs are not running in parallel.

[https://github.com/apache/hudi/issues/4135]

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2876) hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use presto

2021-11-27 Thread tao meng (Jira)
tao meng created HUDI-2876:
--

 Summary: hudi should remove the temp file which create by 
HoodieMergedLogRecordScanner, when we use presto
 Key: HUDI-2876
 URL: https://issues.apache.org/jira/browse/HUDI-2876
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
 Environment: hive3.1.1
hadoop3.1.1
Reporter: tao meng
 Fix For: 0.11.0


Now, when we read a MOR table with Hive/Presto,
HoodieMergedLogRecordScanner creates a temp dir for reading the logs. This
directory is deleted only when the JVM exits.

For Presto/Hive local queries, the JVM does not exit; as more and more queries
happen, the temp dir keeps growing.
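
A hedged sketch of the cleanup direction (the helper name, base directory and
use of commons-io are assumptions, not the actual Hudi change): delete the
scanner's spill directory as soon as the scan finishes, instead of relying on
JVM-exit hooks that never fire in long-lived Hive/Presto processes.

{code:scala}
import java.io.File
import org.apache.commons.io.FileUtils

def withScannerTempDir[T](baseDir: String)(body: File => T): T = {
  val tmpDir = new File(baseDir, s"hudi-log-scan-${System.nanoTime()}")
  tmpDir.mkdirs()
  try body(tmpDir)
  finally FileUtils.deleteQuietly(tmpDir)   // eager cleanup, not deleteOnExit()
}
{code}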



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2874) hudi should remove the temp file which create by HoodieMergedLogRecordScanner, when we use hive/presto

2021-11-27 Thread tao meng (Jira)
tao meng created HUDI-2874:
--

 Summary: hudi should remove the temp file which create by 
HoodieMergedLogRecordScanner, when we use hive/presto
 Key: HUDI-2874
 URL: https://issues.apache.org/jira/browse/HUDI-2874
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.9.0
 Environment: hive3.1.1
hadoop3.1.1
Reporter: tao meng
 Fix For: 0.11.0


When we use Hive/Presto to query a MOR table, Hudi should remove the temp files
created by HoodieMergedLogRecordScanner.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2873) support optimize data layout by sql and make the build more fast

2021-11-26 Thread tao meng (Jira)
tao meng created HUDI-2873:
--

 Summary: support optimize data layout by sql and make the build 
more fast
 Key: HUDI-2873
 URL: https://issues.apache.org/jira/browse/HUDI-2873
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Performance, Spark Integration
Reporter: tao meng
 Fix For: 0.11.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2778) Optimize statistics collection related codes and add more docs for z-order

2021-11-16 Thread tao meng (Jira)
tao meng created HUDI-2778:
--

 Summary: Optimize statistics collection related codes  and add 
more docs for z-order
 Key: HUDI-2778
 URL: https://issues.apache.org/jira/browse/HUDI-2778
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: tao meng
 Fix For: 0.10.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2758) remove redundant code in the HoodieRealtimeInputFormatUtils.getRealtimeSplits

2021-11-14 Thread tao meng (Jira)
tao meng created HUDI-2758:
--

 Summary: remove redundant code in the 
HoodieRealtimeInputFormatUtils.getRealtimeSplits
 Key: HUDI-2758
 URL: https://issues.apache.org/jira/browse/HUDI-2758
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Hive Integration
Reporter: tao meng
 Fix For: 0.10.0


remove redundant code in the HoodieRealtimeInputFormatUtils.getRealtimeSplits



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2697) Minor changes about hbase index config.

2021-11-05 Thread tao meng (Jira)
tao meng created HUDI-2697:
--

 Summary: Minor changes about  hbase index config.
 Key: HUDI-2697
 URL: https://issues.apache.org/jira/browse/HUDI-2697
 Project: Apache Hudi
  Issue Type: Bug
  Components: configs
Affects Versions: 0.9.0
Reporter: tao meng
 Fix For: 0.10.0


Now, when we use the HBase index:
According to the official documentation, hoodie.hbase.index.update.partition.path
and hoodie.index.hbase.put.batch.size.autocompute are optional; however, when we
do not specify the above parameters, an NPE happens:
java.lang.NullPointerException
at 
org.apache.hudi.config.HoodieWriteConfig.getHbaseIndexUpdatePartitionPath(HoodieWriteConfig.java:1353)
at 
org.apache.hudi.index.hbase.SparkHoodieHBaseIndex.lambda$locationTagFunction$eda54cbe$1(SparkHoodieHBaseIndex.java:212)
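
A small illustrative sketch of why the NPE appears and how a default avoids it
(the accessor shape is an assumption; only the config key comes from the report):
reading an optional boolean property that was never set yields null, and
converting that null to a primitive boolean throws.

{code:scala}
val props = new java.util.Properties()

// NPE: getProperty returns null when the optional key was never set
def updatePartitionPathUnsafe: Boolean =
  props.getProperty("hoodie.hbase.index.update.partition.path").toBoolean

// Safe: fall back to a documented default when the key is absent
def updatePartitionPathSafe: Boolean =
  Option(props.getProperty("hoodie.hbase.index.update.partition.path"))
    .map(_.toBoolean)
    .getOrElse(false)
{code}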




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2676) Hudi should synchronize owner information to hudi _rt/_ro table。

2021-11-02 Thread tao meng (Jira)
tao meng created HUDI-2676:
--

 Summary: Hudi should synchronize owner information to hudi _rt/_ro 
table。
 Key: HUDI-2676
 URL: https://issues.apache.org/jira/browse/HUDI-2676
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Affects Versions: 0.9.0
 Environment: hudi 0.9.0
spark3.1.1
hive3.1.1
hadoop3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


The Hudi _rt/_ro tables are missing owner information.
If permission protection is enabled for the query engine, verification of the
protection group will fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2674) hudi hive reader should not print read values

2021-11-02 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2674:
---
Summary: hudi hive reader should not print read values  (was: hudi hive 
reader should not log read values)

> hudi hive reader should not print read values
> -
>
> Key: HUDI-2674
> URL: https://issues.apache.org/jira/browse/HUDI-2674
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.9.0
> Environment: hudi 0.9.0
> hive 3.1.1
> hadoop 3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Critical
> Fix For: 0.10.0
>
>
> Now, when we use Hive to query a Hudi table and set
> hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;
> all read values are printed. This can lead to performance and data security
> problems,
> e.g.:
> xxx 20:10:45,045 | INFO | main | Reading from record reader | 
> HoodieCombineRealtimeRecordReader.java:69
> xx 20:10:45,045 | INFO | main | "values_0.158268513314199_10": 
> \{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"16","type5":"IntWritable","value6":"16jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"}
>  | HoodieCombineRealtimeRecordReader.java:70
> xxx 20:10:45,045 | INFO | main | Reading from record reader | 
> HoodieCombineRealtimeRecordReader.java:69
> xxx 20:10:45,045 | INFO | main | "values_0.16924293134429924_10": 
> \{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"96","type5":"IntWritable","value6":"96jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"}
>  | HoodieCombineRealtimeRecordReader.java:70
> 2021-11-02 20:10:45,045 | INFO | main | Reading from record reader | 
> HoodieCombineRealtimeRecordReader.java:69



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2674) hudi hive reader should not log read values

2021-11-02 Thread tao meng (Jira)
tao meng created HUDI-2674:
--

 Summary: hudi hive reader should not log read values
 Key: HUDI-2674
 URL: https://issues.apache.org/jira/browse/HUDI-2674
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.9.0
 Environment: hudi 0.9.0
hive 3.1.1
hadoop 3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


Now, when we use Hive to query a Hudi table and set
hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

all read values are printed. This can lead to performance and data security
problems,

e.g.:

xxx 20:10:45,045 | INFO | main | Reading from record reader | 
HoodieCombineRealtimeRecordReader.java:69
xx 20:10:45,045 | INFO | main | "values_0.158268513314199_10": 
\{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"16","type5":"IntWritable","value6":"16jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"}
 | HoodieCombineRealtimeRecordReader.java:70
xxx 20:10:45,045 | INFO | main | Reading from record reader | 
HoodieCombineRealtimeRecordReader.java:69
xxx 20:10:45,045 | INFO | main | "values_0.16924293134429924_10": 
\{"value0":"20211102192749","type0":"Text","value1":"null","type1":"unknown","value2":"null","type2":"unknown","value3":"null","type3":"unknown","value4":"null","type4":"unknown","value5":"96","type5":"IntWritable","value6":"96jack","type6":"Text","value7":"null","type7":"unknown","value8":"null","type8":"unknown","value9":"null","type9":"unknown"}
 | HoodieCombineRealtimeRecordReader.java:70
2021-11-02 20:10:45,045 | INFO | main | Reading from record reader | 
HoodieCombineRealtimeRecordReader.java:69
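
A hedged sketch of the fix direction (the logger facade and class name are
assumptions, not the actual patch): serialize and log the record values only
when debug logging is enabled, so INFO-level runs neither pay the rendering
cost nor leak row contents.

{code:scala}
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("HoodieCombineRealtimeRecordReader")

def logRecord(render: => String): Unit = {
  if (log.isDebugEnabled) {                 // skip the expensive rendering at INFO
    log.debug("Reading from record reader: {}", render)
  }
}
{code}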



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2560) Introduce id_based schema to support full schema evolution

2021-10-15 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2560:
---
Description: Introduce id_based schema to support full schema evolution.  
(was: +Introduce id_based schema to support full schema evolution+)

> Introduce id_based schema to support full schema evolution
> --
>
> Key: HUDI-2560
> URL: https://issues.apache.org/jira/browse/HUDI-2560
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: tao meng
>Priority: Major
> Fix For: 0.10.0
>
>
> Introduce id_based schema to support full schema evolution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2560) Introduce id_based schema to support full schema evolution

2021-10-15 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2560:
---
Description: +Introduce id_based schema to support full schema evolution+

> Introduce id_based schema to support full schema evolution
> --
>
> Key: HUDI-2560
> URL: https://issues.apache.org/jira/browse/HUDI-2560
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: tao meng
>Priority: Major
> Fix For: 0.10.0
>
>
> +Introduce id_based schema to support full schema evolution+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2560) Introduce id_based schema to support full schema evolution

2021-10-15 Thread tao meng (Jira)
tao meng created HUDI-2560:
--

 Summary: Introduce id_based schema to support full schema evolution
 Key: HUDI-2560
 URL: https://issues.apache.org/jira/browse/HUDI-2560
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Common Core
Reporter: tao meng
 Fix For: 0.10.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2429) Full schema evolution

2021-09-14 Thread tao meng (Jira)
tao meng created HUDI-2429:
--

 Summary: Full schema evolution
 Key: HUDI-2429
 URL: https://issues.apache.org/jira/browse/HUDI-2429
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Common Core
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-23 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2214:
---
Description: 
residual temporary files after clustering are not cleaned up

// test step

step1: do clustering

val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
val inputDF1: Dataset[Row] = 
spark.read.json(spark.sparkContext.parallelize(records1, 2))
inputDF1.write.format("org.apache.hudi")
 .options(commonOpts)
 .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), 
DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
 .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
 // option for clustering
 .option("hoodie.parquet.small.file.limit", "0")
 .option("hoodie.clustering.inline", "true")
 .option("hoodie.clustering.inline.max.commits", "1")
 .option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824")
 .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
 .option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
Long.MaxValue.toString)
 .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
String.valueOf(12 *1024 * 1024L))
 .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, begin_lon")
 .mode(SaveMode.Overwrite)
 .save(basePath)

step2: check the temp dir, we find 
/tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty

/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208

is not cleaned up.

 

 

 

 

  was:
residual temporary files after clustering are not cleaned up

 

 

 


> residual temporary files after clustering are not cleaned up
> 
>
> Key: HUDI-2214
> URL: https://issues.apache.org/jira/browse/HUDI-2214
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Cleaner
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
> Fix For: 0.10.0
>
>
> residual temporary files after clustering are not cleaned up
> // test step
> step1: do clustering
> val records1 = recordsToStrings(dataGen.generateInserts("001", 1000)).toList
> val inputDF1: Dataset[Row] = 
> spark.read.json(spark.sparkContext.parallelize(records1, 2))
> inputDF1.write.format("org.apache.hudi")
>  .options(commonOpts)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY.key(), 
> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY.key(), 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  // option for clustering
>  .option("hoodie.parquet.small.file.limit", "0")
>  .option("hoodie.clustering.inline", "true")
>  .option("hoodie.clustering.inline.max.commits", "1")
>  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> "1073741824")
>  .option("hoodie.clustering.plan.strategy.small.file.limit", "629145600")
>  .option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
> Long.MaxValue.toString)
>  .option("hoodie.clustering.plan.strategy.target.file.max.bytes", 
> String.valueOf(12 *1024 * 1024L))
>  .option("hoodie.clustering.plan.strategy.sort.columns", "begin_lat, 
> begin_lon")
>  .mode(SaveMode.Overwrite)
>  .save(basePath)
> step2: check the temp dir, we find 
> /tmp/junit1835474867260509758/dataset/.hoodie/.temp/ is not empty
> {color:#FF}/tmp/junit1835474867260509758/dataset/.hoodie/.temp/20210723171208
>  {color}
> is not cleaned up.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2214) residual temporary files after clustering are not cleaned up

2021-07-23 Thread tao meng (Jira)
tao meng created HUDI-2214:
--

 Summary: residual temporary files after clustering are not cleaned 
up
 Key: HUDI-2214
 URL: https://issues.apache.org/jira/browse/HUDI-2214
 Project: Apache Hudi
  Issue Type: Bug
  Components: Cleaner
Affects Versions: 0.8.0
 Environment: spark3.1.1
hadoop3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


residual temporary files after clustering are not cleaned up

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2116) sync 10w partitions to hive by using HiveSyncTool lead to the oom of hive MetaStore

2021-07-01 Thread tao meng (Jira)
tao meng created HUDI-2116:
--

 Summary:  sync 10w partitions to hive by using HiveSyncTool lead 
to the oom of hive MetaStore
 Key: HUDI-2116
 URL: https://issues.apache.org/jira/browse/HUDI-2116
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.8.0
 Environment: hive3.1.1
hadoop 3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


When we sync 100,000 (10w) partitions to hive using HiveSyncTool, the hive 
MetaStore runs out of memory.
 
here is a stress test for HiveSyncTool

env: 

hive metastore -Xms16G -Xmx16G

hive.metastore.client.socket.timeout=10800

 
||partitionNum||time consumed||
|100|37s|
|1000|168s|
|5000|1830s|
|10000|timeout|
|100000|hive metastore oom|

HiveSyncTool syncs all partitions to the hive metastore at once. When the 
partition count is large, this puts a lot of pressure on the hive metastore; for 
a large number of partitions we should support batch sync (a minimal sketch 
follows).
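
A minimal sketch of batched syncing, assuming a helper addPartitionsToHive that 
registers one batch of partitions with the metastore (hypothetical; in Hudi this 
would map onto the HiveSyncTool DDL call). Only the batching logic is 
illustrated:

def syncPartitionsInBatches(partitionPaths: Seq[String], batchSize: Int = 1000)
                           (addPartitionsToHive: Seq[String] => Unit): Unit = {
  // send at most `batchSize` partitions per metastore request instead of all 100k at once
  partitionPaths.grouped(batchSize).foreach(batch => addPartitionsToHive(batch))
}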

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2102) support hilbert curve for hudi

2021-06-29 Thread tao meng (Jira)
tao meng created HUDI-2102:
--

 Summary: support hilbert curve for hudi
 Key: HUDI-2102
 URL: https://issues.apache.org/jira/browse/HUDI-2102
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


support hilbert curve for hudi to optimize hudi queries



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2101) support z-order for hudi

2021-06-29 Thread tao meng (Jira)
tao meng created HUDI-2101:
--

 Summary: support z-order for hudi
 Key: HUDI-2101
 URL: https://issues.apache.org/jira/browse/HUDI-2101
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


support z-order for hudi to optimize queries; a minimal illustration follows.
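
For intuition only, a minimal sketch of a two-column z-value built by bit 
interleaving (not Hudi's actual implementation); sorting rows by this value 
clusters the data so that filters on either column prune more files:

def zValue(x: Int, y: Int): Long = {
  var z = 0L
  var i = 0
  while (i < 32) {
    z |= ((x >>> i) & 1L) << (2 * i)       // even bit positions come from x
    z |= ((y >>> i) & 1L) << (2 * i + 1)   // odd bit positions come from y
    i += 1
  }
  z
}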



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2100) Support Space curve for hudi

2021-06-29 Thread tao meng (Jira)
tao meng created HUDI-2100:
--

 Summary: Support Space curve for hudi
 Key: HUDI-2100
 URL: https://issues.apache.org/jira/browse/HUDI-2100
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Spark Integration
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.10.0


support space curves to optimize the clustering (file layout) of hudi files and 
improve query performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2099) hive lock whose state is WAITING should be released, otherwise this hive lock will be locked forever

2021-06-29 Thread tao meng (Jira)
tao meng created HUDI-2099:
--

 Summary:  hive lock whose state is WAITING should be released,  
otherwise this hive lock will be locked forever
 Key: HUDI-2099
 URL: https://issues.apache.org/jira/browse/HUDI-2099
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Affects Versions: 0.8.0
 Environment: spark3.1.1
hive3.1.1
hadoop3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.8.0


When acquiring the hive lock fails and the returned lock state is WAITING, we 
should release this WAITING lock; otherwise the hive lock stays locked forever. 
A minimal sketch of the release follows.
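
A minimal sketch of the intended behaviour against the Hive metastore client API 
(the Hudi lock provider wiring is omitted): when the response is not ACQUIRED, a 
WAITING lock is explicitly unlocked before giving up.

import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.{LockRequest, LockResponse, LockState}

def tryAcquire(client: IMetaStoreClient, request: LockRequest): Option[LockResponse] = {
  val response = client.lock(request)
  if (response.getState == LockState.ACQUIRED) {
    Some(response)
  } else {
    // the lock was only enqueued (WAITING): release it so it cannot block later writers forever
    if (response.getState == LockState.WAITING) {
      client.unlock(response.getLockid)
    }
    None
  }
}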

test step:

use a hive lock to control concurrent writes to hudi; let's call this lock 
hive_lock.

Start three writers that write to the hudi table concurrently using hive_lock; 
one of the writers will fail to acquire the hive lock due to contention:

*Exception in thread "main" org.apache.hudi.exception.HoodieLockException: 
Unable to acquire lock, lock object LockResponse(lockid:76, state:WAITING)*

 

Start another writer to write to the hudi table using the same hive_lock; we 
then find that hive_lock is locked forever and there is no way to acquire it:

*Exception in thread "main" org.apache.hudi.exception.HoodieLockException: 
Unable to acquire lock, lock object LockResponse(lockid:87, state:WAITING)*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2098) add Hdfs file lock for HUDI

2021-06-29 Thread tao meng (Jira)
tao meng created HUDI-2098:
--

 Summary: add Hdfs file lock for HUDI
 Key: HUDI-2098
 URL: https://issues.apache.org/jira/browse/HUDI-2098
 Project: Apache Hudi
  Issue Type: Bug
  Components: Usability, Utilities
Affects Versions: 0.8.0
 Environment: spark3.1.1
hive3.1.1
hadoop3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.9.0


Hudi currently supports hive/zk locks for concurrent writes; we introduce a new 
lock type: an hdfs file lock.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2098) add Hdfs file lock for HUDI

2021-06-29 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-2098:
---
Component/s: (was: Utilities)

> add Hdfs file lock for HUDI
> ---
>
> Key: HUDI-2098
> URL: https://issues.apache.org/jira/browse/HUDI-2098
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Usability
>Affects Versions: 0.8.0
> Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
>Reporter: tao meng
>Assignee: tao meng
>Priority: Minor
>  Labels: features
> Fix For: 0.9.0
>
>
> now hudi support hive/zk lock for concurrency write,  we introduce a new lock 
> type hdfs lock
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2090) when hudi metadata is enabled, use different user to query table, the query will failed

2021-06-28 Thread tao meng (Jira)
tao meng created HUDI-2090:
--

 Summary: when  hudi metadata is enabled,  use different user to 
query table, the query will failed
 Key: HUDI-2090
 URL: https://issues.apache.org/jira/browse/HUDI-2090
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Affects Versions: 0.8.0
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.9.0


When the hudi metadata table is enabled and a different user queries the table, 
the query fails.
 
The permissions of the temporary directory generated by DiskBasedMap are 
incorrect: the directory is only accessible to the user who performed the 
current operation, and other users have no permission to access it, which leads 
to this problem. A minimal sketch of creating a world-accessible spill directory 
follows.
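
A minimal sketch of the fix direction, assuming the spill directory is created 
with permissions that allow other OS users to access it (not the actual Hudi 
patch; names are illustrative):

import java.io.File
import java.nio.file.{Files, Path}
import java.nio.file.attribute.PosixFilePermissions

def createSharedSpillDir(parent: String, name: String): File = {
  val dir: Path = Files.createDirectories(new File(parent, name).toPath)
  // widen the mode explicitly; the default mkdir mode minus umask is often owner-only (drwx------)
  Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxrwxrwx"))
  dir.toFile
}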

test step:

step1: create a hudi table with the metadata table enabled.

step2: create two users (omm, user2).

step3:

1) use omm to query the hudi table.

DiskBasedMap will generate view_map with permissions drwx------ (owner only).

2) then use user2 to query the hudi table.

Now user2 has no permission to access the view_map created by omm, and the 
following exception is thrown:

     org.apache.hudi.exception.HoodieIOException: IOException when creating 
ExternalSplillableMap at /tmp/view_map

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2089) fix the bug that metatable cannot support non_partition table

2021-06-28 Thread tao meng (Jira)
tao meng created HUDI-2089:
--

 Summary: fix the bug that metatable cannot support non_partition 
table
 Key: HUDI-2089
 URL: https://issues.apache.org/jira/browse/HUDI-2089
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Affects Versions: 0.8.0
 Environment: spark3.1.1
hive3.1.1
hadoop 3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.9.0


We found that when we enable the metadata table for a non-partitioned hudi 
table, the following error occurs:

org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata 
table.org.apache.hudi.exception.HoodieMetadataException: Error syncing to 
metadata table.
 at 
org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:447)
 at 
org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:433)
 at 
org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:187)

We use hudi 0.8, but we also reproduce this problem with the latest hudi code.

test step:

val df = spark.range(0, 1000).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("age", lit(1))
 .withColumn("p", lit(2))

df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL).
 option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "").
 option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert").
 option("hoodie.insert.shuffle.parallelism", "4").
 option("hoodie.metadata.enable", "true").
 option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Overwrite).save(basePath)

// upsert same record again
df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL).
 option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "").
 option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert").
 option("hoodie.insert.shuffle.parallelism", "4").
 option("hoodie.metadata.enable", "true").
 option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Append).save(basePath)

 

org.apache.hudi.exception.HoodieMetadataException: Error syncing to metadata 
table.org.apache.hudi.exception.HoodieMetadataException: Error syncing to 
metadata table.
 at 
org.apache.hudi.client.SparkRDDWriteClient.syncTableMetadata(SparkRDDWriteClient.java:447)
 at 
org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:433)
 at 
org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:187)
 at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121) 
at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:564)
 at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:230) 
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:162) at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2086) redo the logical of mor_incremental_view for hive

2021-06-28 Thread tao meng (Jira)
tao meng created HUDI-2086:
--

 Summary: redo the logical of mor_incremental_view for hive
 Key: HUDI-2086
 URL: https://issues.apache.org/jira/browse/HUDI-2086
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
 Environment: spark3.1.1
hive3.1.1
hadoop3.1.1
os: suse
Reporter: tao meng


There are currently several problems with the mor incremental view for hive.

For example:

1) *hudi cannot read the latest incremental data that is stored in log files*

Consider: create a mor table with bulk_insert, and then do an upsert on this 
table.

Now we want to query the latest incremental data via hive/sparksql; however, the 
latest incremental data is stored in log files, so the query returns nothing.

step1: prepare data

val df = spark.sparkContext.parallelize(0 to 20, 2).map(x => testCase(x, 
x+"jack", Random.nextInt(2))).toDF()
 .withColumn("col3", expr("keyid + 3000"))
 .withColumn("p", lit(1))

step2: do bulk_insert

mergePartitionTable(df, 4, "default", "inc", tableType = 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step3: do upsert

mergePartitionTable(df, 4, "default", "inc", tableType = 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step4:  check the lastest commit time and do query

spark.sql("set hoodie.inc.consume.mode=INCREMENTAL")
spark.sql("set hoodie.inc.consume.max.commits=1")
spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935")
spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` > 
'20210628103935' order by keyid").show(100, false)

+-++
|keyid|col3|
+-++
+-++

 

2) *if we do insert_overwrite/insert_overwrite_table on a hudi mor table, the 
incremental query result is wrong when we query the data written before the 
insert_overwrite/insert_overwrite_table*

step1: do bulk_insert 

mergePartitionTable(df, 4, "default", "overInc", tableType = 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

now the commits is

[20210628160614.deltacommit ]

step2: do insert_overwrite_table

mergePartitionTable(df, 4, "default", "overInc", tableType = 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table")

now the commits is

[20210628160614.deltacommit, 20210628160923.replacecommit ]

step3: query the data before insert_overwrite_table

spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL")
spark.sql("set hoodie.overInc.consume.max.commits=1")
spark.sql("set hoodie.overInc.consume.start.timestamp=0")
spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` > '0' 
order by keyid").show(100, false)

+-++
|keyid|col3|
+-++
+-++

 

3) *hive/presto/flink cannot read file groups which have only log files*

When we use the hbase/inmemory index, the mor table produces log files instead 
of parquet base files, but hive/presto currently cannot read those file groups 
because they contain only log files.

*HUDI-2048* mentions this problem.

 

However, when we use the spark data source to execute an incremental query, none 
of the problems above occur. Keeping the logic of the mor incremental view for 
hive the same as the spark data source is therefore necessary.

We redo the logic of the mor incremental view for hive to solve the above 
problems and keep it consistent with the spark data source; for contrast, the 
spark data source incremental query is shown below.
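
For contrast, the spark data source incremental query that behaves correctly 
(option keys as used elsewhere in this report; basePath stands for the location 
of the "inc" table created above and is an assumption here):

spark.read.format("hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "20210628103935")
  .load(basePath)            // basePath: location of the "inc" table created above (assumption)
  .select("keyid", "col3")
  .orderBy("keyid")
  .show(100, false)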

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2086) redo the logical of mor_incremental_view for hive

2021-06-28 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng reassigned HUDI-2086:
--

Assignee: tao meng

> redo the logical of mor_incremental_view for hive
> -
>
> Key: HUDI-2086
> URL: https://issues.apache.org/jira/browse/HUDI-2086
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
> Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
> os: suse
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>
> now ,There are some problems with mor_incremental_view for hive。
> For example,
> 1):*hudi cannot read the lastest incremental datas which are stored by logs*
> think that:  create a mor table with bulk_insert, and then do upsert for this 
> table, 
> no we want to query the latest incremental data by hive/sparksql,   however 
> the lastest incremental datas are stored by logs,   when we do query nothings 
> will return
> step1: prepare data
> val df = spark.sparkContext.parallelize(0 to 20, 2).map(x => testCase(x, 
> x+"jack", Random.nextInt(2))).toDF()
>  .withColumn("col3", expr("keyid + 3000"))
>  .withColumn("p", lit(1))
> step2: do bulk_insert
> mergePartitionTable(df, 4, "default", "inc", tableType = 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> step3: do upsert
> mergePartitionTable(df, 4, "default", "inc", tableType = 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
> step4:  check the lastest commit time and do query
> spark.sql("set hoodie.inc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.inc.consume.max.commits=1")
> spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935")
> spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` > 
> '20210628103935' order by keyid").show(100, false)
> +-++
> |keyid|col3|
> +-++
> +-++
>  
> 2):*if we do insert_over_write/insert_over_write_table for hudi mor table, 
> the incr query result is wrong when we want to query the data before 
> insert_overwrite/insert_overwrite_table*
> step1: do bulk_insert 
> mergePartitionTable(df, 4, "default", "overInc", tableType = 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> now the commits is
> [20210628160614.deltacommit ]
> step2: do insert_overwrite_table
> mergePartitionTable(df, 4, "default", "overInc", tableType = 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table")
> now the commits is
> [20210628160614.deltacommit, 20210628160923.replacecommit ]
> step3: query the data before insert_overwrite_table
> spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.overInc.consume.max.commits=1")
> spark.sql("set hoodie.overInc.consume.start.timestamp=0")
> spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` > 
> '0' order by keyid").show(100, false)
> +-++
> |keyid|col3|
> +-++
> +-++
>  
> 3) *hive/presto/flink  cannot read  file groups which has only logs*
> when we use hbase/inmemory as index, mor table will produce log files instead 
> of parquet file, but now hive/presto cannot read those files since those 
> files are log files.
> *HUDI-2048* mentions this problem.
>  
> however when we use spark data source to executre incremental query, there is 
> no such problem above。keep the logical of mor_incremental_view for hive as 
> the same logicl as spark dataSource is necessary。
> we redo the logical of mor_incremental_view for hive,to solve above problems 
> and keep the logical of mor_incremental_view  as the same logicl as spark 
> dataSource
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2059) When log exists in mor table, clustering is triggered. The query result shows that the update record in log is lost

2021-06-22 Thread tao meng (Jira)
tao meng created HUDI-2059:
--

 Summary: When log exists in mor table,  clustering is triggered. 
The query result shows that the update record in log is lost
 Key: HUDI-2059
 URL: https://issues.apache.org/jira/browse/HUDI-2059
 Project: Apache Hudi
  Issue Type: Bug
Affects Versions: 0.8.0
 Environment: hadoop 3.1.1
spark3.1.1/spark2.4.5
hive3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.9.0


When log files exist in a mor table and clustering is triggered, the query 
result shows that the update records in the logs are lost.

The reason for this problem is that hoodie uses HoodieFileSliceReader to read 
the table data and then performs clustering. HoodieFileSliceReader calls 
HoodieMergedLogRecordScanner.processNextRecord to merge the updated values with 
the old values, but in that call the old value is kept and the updated value is 
discarded, which is wrong. A sketch of the intended merge semantics follows.
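
An illustrative sketch of the intended merge semantics (not the actual Hudi 
code): when the same key is seen again, keep the record with the larger 
precombine/ordering value instead of unconditionally keeping the old one.

def mergeRecords[T](oldRec: T, newRec: T)(orderingValue: T => Long): T = {
  // prefer the newer record when its ordering (precombine) value is at least as large
  if (orderingValue(newRec) >= orderingValue(oldRec)) newRec else oldRec
}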

test step:

// step1 : create hudi mor table
val df = spark.range(0, 1000).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("age", lit(1))
 .withColumn("p", lit(2))

df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL).
 option(PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(PARTITIONPATH_FIELD_OPT_KEY, "p").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert").
 option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName).
 option("hoodie.insert.shuffle.parallelism", "4").
 option("hoodie.upsert.shuffle.parallelism", "4").
 option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Overwrite).save(basePath)

// step2, update age where keyid < 5 to produce log files
val df1 = spark.range(0, 5).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("age", lit(1 + 1000))
 .withColumn("p", lit(2))
df1.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL).
 option(PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(PARTITIONPATH_FIELD_OPT_KEY, "p").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert").
 option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName).
 option("hoodie.insert.shuffle.parallelism", "4").
 option("hoodie.upsert.shuffle.parallelism", "4").
 option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Append).save(basePath)

// step3, do cluster inline
val df2 = spark.range(6, 10).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("age", lit(1 + 2000))
 .withColumn("p", lit(2))

df2.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL).
 option(PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(PARTITIONPATH_FIELD_OPT_KEY, "p").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert").
 option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, 
classOf[org.apache.hudi.keygen.ComplexKeyGenerator].getName).
 option("hoodie.insert.shuffle.parallelism", "4").
 option("hoodie.upsert.shuffle.parallelism", "4").
 option("hoodie.parquet.small.file.limit", "0").
 option("hoodie.clustering.inline", "true").
 option("hoodie.clustering.inline.max.commits", "1").
 option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
 option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
 option("hoodie.clustering.plan.strategy.max.bytes.per.group", 
Long.MaxValue.toString)
 .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Append).save(basePath)

spark.read.format("hudi")
 .load(basePath).select("age").where("keyid = 0").show(100, false)

+---+
|age|
+---+
|1  |
+---+

The result is wrong, since we updated the value of age to 1001 in step 2.

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2058) support incremental query for insert_overwrite_table/insert_overwrite operation on cow table

2021-06-22 Thread tao meng (Jira)
tao meng created HUDI-2058:
--

 Summary: support incremental query for 
insert_overwrite_table/insert_overwrite operation on cow table
 Key: HUDI-2058
 URL: https://issues.apache.org/jira/browse/HUDI-2058
 Project: Apache Hudi
  Issue Type: Bug
  Components: Incremental Pull
Affects Versions: 0.8.0
 Environment: hadoop 3.1.1
spark3.1.1
hive 3.1.1
Reporter: tao meng
Assignee: tao meng
 Fix For: 0.9.0


 When an incremental query spans multiple commits before and after a 
replacecommit, the query result contains the data of the old (replaced) files. 
Note: the mor table is fine; only the cow table has this problem.
 
When querying the incremental view of a cow table, the replacecommit is ignored, 
which leads to the wrong result; a conceptual sketch of the missing filtering 
step follows.
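
A conceptual sketch of the missing step, assuming the incremental file listing 
knows which file ids were replaced by replacecommits in the queried range 
(replacedFileIds below is a hypothetical input, not a Hudi API):

def filterReplaced(candidateFiles: Seq[(String, String)],   // (fileId, filePath)
                   replacedFileIds: Set[String]): Seq[String] =
  // drop every file group that a replacecommit in the range has superseded
  candidateFiles.collect { case (fileId, path) if !replacedFileIds.contains(fileId) => path }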
 
 
test step:
step1:  create dataFrame
val df = spark.range(0, 10).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("age", lit(1))
 .withColumn("p", lit(2))
 
step2:  insert df to a empty hoodie table
df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL).
 option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "").
 option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert").
 option("hoodie.insert.shuffle.parallelism", "4").
 option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Overwrite).save(basePath)
 
step3: do insert_overwrite
df.write.format("hudi").
 option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL).
 option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "col3").
 option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "keyid").
 option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "").
 option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator").
 option(DataSourceWriteOptions.OPERATION_OPT_KEY, "insert_overwrite_table").
 option("hoodie.insert.shuffle.parallelism", "4").
 option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
 .mode(SaveMode.Append).save(basePath)
 
step4: query incrematal table 
spark.read.format("hudi").option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
 .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "")
 .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY, currentCommits(0))
 .load(basePath).select("keyid").orderBy("keyid").show(100, false)
 
result:   the result contains old data
+-+
|keyid|
+-+
|0 |
|0 |
|1 |
|1 |
|2 |
|2 |
|3 |
|3 |
|4 |
|4 |
|5 |
|5 |
|6 |
|6 |
|7 |
|7 |
|8 |
|8 |
|9 |
|9 |
+-+
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1676) Support SQL with spark3

2021-06-08 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng closed HUDI-1676.
--
Resolution: Fixed

> Support SQL with spark3
> ---
>
> Key: HUDI-1676
> URL: https://issues.apache.org/jira/browse/HUDI-1676
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> 1、support CTAS for spark3
> 3、support INSERT for spark3
> 4、support merge、update、delete without RowKey constraint for spark3
> 5、support dataSourceV2 for spark3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1676) Support SQL with spark3

2021-06-08 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng reassigned HUDI-1676:
--

Assignee: tao meng

> Support SQL with spark3
> ---
>
> Key: HUDI-1676
> URL: https://issues.apache.org/jira/browse/HUDI-1676
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: tao meng
>Assignee: tao meng
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> 1、support CTAS for spark3
> 3、support INSERT for spark3
> 4、support merge、update、delete without RowKey constraint for spark3
> 5、support dataSourceV2 for spark3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1676) Support SQL with spark3

2021-06-08 Thread tao meng (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359310#comment-17359310
 ] 

tao meng commented on HUDI-1676:


[~pzw2018] great works 

> Support SQL with spark3
> ---
>
> Key: HUDI-1676
> URL: https://issues.apache.org/jira/browse/HUDI-1676
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.9.0
>
>
> 1、support CTAS for spark3
> 3、support INSERT for spark3
> 4、support merge、update、delete without RowKey constraint for spark3
> 5、support dataSourceV2 for spark3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1817) when query incr view of hudi table by using spark-sql. the result is wrong

2021-04-20 Thread tao meng (Jira)
tao meng created HUDI-1817:
--

 Summary: when query incr view of hudi table by using spark-sql. 
the result is wrong
 Key: HUDI-1817
 URL: https://issues.apache.org/jira/browse/HUDI-1817
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.8.0
 Environment: spark2.4.5   hive 3.1.1   hadoop 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


create hudi table (mor or cow)

 

val base_data = spark.read.parquet("/tmp/tb_base")
val upsert_data = spark.read.parquet("/tmp/tb_upsert")

base_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, 
MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, 
"col2").option(RECORDKEY_FIELD_OPT_KEY, 
"primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, 
"col0").option(KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, 
"bulk_insert").option(HIVE_SYNC_ENABLED_OPT_KEY, 
"true").option(HIVE_PARTITION_FIELDS_OPT_KEY, 
"col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY,
 "testdb").option(HIVE_TABLE_OPT_KEY, 
"tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, 
"false").option("hoodie.bulkinsert.shuffle.parallelism", 
4).option("hoodie.insert.shuffle.parallelism", 
4).option("hoodie.upsert.shuffle.parallelism", 
4).option("hoodie.delete.shuffle.parallelism", 
4).option("hoodie.datasource.write.hive_style_partitioning", 
"true").option(TABLE_NAME, 
"tb_test_mor_par").mode(Overwrite).save(s"/tmp/testdb/tb_test_mor_par")

upsert_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, 
MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, 
"col2").option(RECORDKEY_FIELD_OPT_KEY, 
"primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, 
"col0").option(KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, 
"upsert").option(HIVE_SYNC_ENABLED_OPT_KEY, 
"true").option(HIVE_PARTITION_FIELDS_OPT_KEY, 
"col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY,
 "testdb").option(HIVE_TABLE_OPT_KEY, 
"tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, 
"false").option("hoodie.bulkinsert.shuffle.parallelism", 
4).option("hoodie.insert.shuffle.parallelism", 
4).option("hoodie.upsert.shuffle.parallelism", 
4).option("hoodie.delete.shuffle.parallelism", 
4).option("hoodie.datasource.write.hive_style_partitioning", 
"true").option(TABLE_NAME, 
"tb_test_mor_par").mode(Append).save(s"/tmp/testdb/tb_test_mor_par")

query incr view by sparksql:

set hoodie.tb_test_mor_par.consume.mode=INCREMENTAL;
set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
set hoodie.tb_test_mor_par.consume.max.commits=3;
select _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 
from testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' 
order by primary_key;

+-------------------+-----------+----+----+------------+----+
|_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
+-------------------+-----------+----+----+------------+----+
|20210420155738     |20         |77  |sC  |158788760400|739 |
|20210420155738     |21         |66  |ps  |160979049700|61  |
|20210420155738     |22         |47  |1P  |158460042900|835 |
|20210420155738     |23         |36  |5K  |160763480800|538 |
|20210420155738     |24         |1   |BA  |160685711300|775 |
|20210420155738     |24         |101 |BA  |160685711300|775 |
|20210420155738     |24         |100 |BA  |160685711300|775 |
|20210420155738     |24         |102 |BA  |160685711300|775 |
+-------------------+-----------+----+----+------------+----+

 

the primary_key is repeated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1816) when query incr view of hudi table by using spark-sql, the query result is wrong

2021-04-20 Thread tao meng (Jira)
tao meng created HUDI-1816:
--

 Summary: when query incr view of hudi table by using spark-sql, 
the query result is wrong
 Key: HUDI-1816
 URL: https://issues.apache.org/jira/browse/HUDI-1816
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.8.0
 Environment: spark2.4.5   hive 3.1.1hadoop 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


test step1:

create a partitioned hudi table (mor / cow)

val base_data = spark.read.parquet("/tmp/tb_base")
val upsert_data = spark.read.parquet("/tmp/tb_upsert")

base_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, 
MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, 
"col2").option(RECORDKEY_FIELD_OPT_KEY, 
"primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, 
"col0").option(KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, 
"bulk_insert").option(HIVE_SYNC_ENABLED_OPT_KEY, 
"true").option(HIVE_PARTITION_FIELDS_OPT_KEY, 
"col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY,
 "testdb").option(HIVE_TABLE_OPT_KEY, 
"tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, 
"false").option("hoodie.bulkinsert.shuffle.parallelism", 
4).option("hoodie.insert.shuffle.parallelism", 
4).option("hoodie.upsert.shuffle.parallelism", 
4).option("hoodie.delete.shuffle.parallelism", 
4).option("hoodie.datasource.write.hive_style_partitioning", 
"true").option(TABLE_NAME, 
"tb_test_mor_par").mode(Overwrite).save(s"/tmp/testdb/tb_test_mor_par")

upsert_data.write.format("hudi").option(TABLE_TYPE_OPT_KEY, 
MOR_TABLE_TYPE_OPT_VAL).option(PRECOMBINE_FIELD_OPT_KEY, 
"col2").option(RECORDKEY_FIELD_OPT_KEY, 
"primary_key").option(PARTITIONPATH_FIELD_OPT_KEY, 
"col0").option(KEYGENERATOR_CLASS_OPT_KEY, 
"org.apache.hudi.keygen.SimpleKeyGenerator").option(OPERATION_OPT_KEY, 
"upsert").option(HIVE_SYNC_ENABLED_OPT_KEY, 
"true").option(HIVE_PARTITION_FIELDS_OPT_KEY, 
"col0").option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
"org.apache.hudi.hive.MultiPartKeysValueExtractor").option(HIVE_DATABASE_OPT_KEY,
 "testdb").option(HIVE_TABLE_OPT_KEY, 
"tb_test_mor_par").option(HIVE_USE_JDBC_OPT_KEY, 
"false").option("hoodie.bulkinsert.shuffle.parallelism", 
4).option("hoodie.insert.shuffle.parallelism", 
4).option("hoodie.upsert.shuffle.parallelism", 
4).option("hoodie.delete.shuffle.parallelism", 
4).option("hoodie.datasource.write.hive_style_partitioning", 
"true").option(TABLE_NAME, 
"tb_test_mor_par").mode(Append).save(s"/tmp/testdb/tb_test_mor_par")

query incr view by sparksql:
set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
set hoodie.tb_test_mor_par.consume.max.commits=3;
select _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 
from testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' 
order by primary_key;

+-------------------+-----------+----+----+------------+----+
|_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
+-------------------+-----------+----+----+------------+----+
|20210420155738     |20         |77  |sC  |158788760400|739 |
|20210420155738     |21         |66  |ps  |160979049700|61  |
|20210420155738     |22         |47  |1P  |158460042900|835 |
|20210420155738     |23         |36  |5K  |160763480800|538 |
|20210420155738     |24         |1   |BA  |160685711300|775 |
|20210420155738     |24         |101 |BA  |160685711300|775 |
|20210420155738     |24         |100 |BA  |160685711300|775 |
|20210420155738     |24         |102 |BA  |160685711300|775 |
+-------------------+-----------+----+----+------------+----+

 

primary key 24 is repeated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1783) support Huawei Cloud Object Storage

2021-04-09 Thread tao meng (Jira)
tao meng created HUDI-1783:
--

 Summary: support Huawei  Cloud Object Storage
 Key: HUDI-1783
 URL: https://issues.apache.org/jira/browse/HUDI-1783
 Project: Apache Hudi
  Issue Type: Bug
  Components: Common Core
Reporter: tao meng
 Fix For: 0.9.0


add support  for Huawei Cloud Object Storage



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1722:
---
Description: 
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0038_m_00_0: Error: 
java.lang.NullPointerException2021-03-25 20:23:14,014 | INFO  | AsyncDispatcher 
event handler | Diagnostics report from attempt_1615883368881_0038_m_00_0: 
Error: java.lang.NullPointerException at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
 at 
org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191) 
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52) at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37) at 
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

 

Obviously, this is an intermittent problem: if reader2 runs first, the hoodie 
additional projection columns are added to the jobConf and the query succeeds.

spark-sql can avoid this problem by setting spark.hadoop.cloneConf=true (shown 
below; not recommended by spark); hive, however, has no way to avoid this 
problem.
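
For reference, this workaround has to be in place before the SparkContext is 
created, e.g. spark-sql --conf spark.hadoop.cloneConf=true, or when building the 
session programmatically (a sketch; it only helps spark, not hive):

val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.hadoop.cloneConf", "true")   // asks HadoopRDD to clone the JobConf instead of sharing it
  .enableHiveSupport()
  .getOrCreate()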

 test step:

step1:

val df = spark.range(0, 10).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String] ("sb1", "rz")))
 .withColumn("a2", lit(Array[String] ("sb1", "rz")))

// create hoodie table hive_14b

 merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

notice:  bulk_insert will produce 4 files in hoodie table

step2:

 val df = spark.range(9, 12).toDF("keyid")
 .withColumn("col3", expr("keyid"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String] ("sb1", "rz")))
 .withColumn("a2", lit(Array[String] ("sb1", "rz")))

// upsert table 

merge(df, 4, "default", "hive_14b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

 now :  we have four base files and one log file in hoodie table

step3: 

spark-sql/beeline: 

 select count(col3) from hive_14b_rt;

then the query failed.

  was:
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 

[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1722:
---
Description: 
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
 Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) 
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) 
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) 
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)

Obviously, this is an occasional problem。 if reader2 run first, hoodie 
additional projection columns will be added to jobConf and in this case the 
query will be ok

sparksql can avoid this problem by set  spark.hadoop.cloneConf=true which is 
not recommended in spark, however hive has no way to avoid this problem。

 

 

 

  was:
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
 Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 

[jira] [Updated] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1722:
---
Description: 
HUDI-892 introduce this problem。
 this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
 Consider the following questions:
 we have four getRecordReaders: 
 reader1(its hoodieRealtimeSplit contains no log files)
 reader2 (its hoodieRealtimeSplit contains log files)
 reader3(its hoodieRealtimeSplit contains log files)
 reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
 which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
 Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) 
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) 
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) 
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)

Obviously, this is an occasional problem。 if reader2 run first, hoodie 
additional projection columns will be added to jobConf and in this case the 
query will be ok

sparksql can avoid this problem by set  spark.hadoop.cloneConf=true which is 
not recommended in spark, however hive has no way to avoid this problem。

 

  was:
HUDI-892 introduce this problem。
this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
Consider the following questions:
we have four getRecordReaders: 
reader1(its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3(its hoodieRealtimeSplit contains log files)
reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 

[jira] [Created] (HUDI-1722) hive beeline/spark-sql query specified field on mor table occur NPE

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1722:
--

 Summary: hive beeline/spark-sql  query specified field on mor 
table occur NPE
 Key: HUDI-1722
 URL: https://issues.apache.org/jira/browse/HUDI-1722
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration, Spark Integration
Affects Versions: 0.7.0
 Environment: spark2.4.5, hadoop3.1.1, hive 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


HUDI-892 introduce this problem。
this pr skip adding projection columns if there are no log files in the 
hoodieRealtimeSplit。 but this pr donnot consider that multiple getRecordReaders 
share same jobConf。
Consider the following questions:
we have four getRecordReaders: 
reader1(its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3(its hoodieRealtimeSplit contains log files)
reader4(its hoodieRealtimeSplit contains no log files)

now reader1 run first, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf will be set to be true, and no hoodie additional projection columns 
will be added to jobConf (see 
HoodieParquetRealtimeInputFormat.addProjectionToJobConf)

reader2 run later, since HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in 
jobConf is set to be true, no hoodie additional projection columns will be 
added to jobConf. (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf)
which lead to the result that _hoodie_record_key would be missing and merge 
step would throw exceptions

Caused by: java.io.IOException: java.lang.NullPointerException
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:611)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150)
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296)
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252)
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)
 ... 24 more
Caused by: java.lang.NullPointerException
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:93)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
 
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
 
 at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:578)
 at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:518)
 at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:150) 
 at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:3296) 
 at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:252) 
 at 
org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:522)


Obviously, this is an occasional problem。 if reader2 run first, hoodie 
additional projection columns will be added to jobConf and in this case the 
query will be ok



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1719:
---
Description: 
now hudi use HoodieCombineHiveInputFormat to achieve Incremental query of the 
mor table.

when we have some small files in different partitions, 
HoodieCombineHiveInputFormat  will combine those small file readers.   
HoodieCombineHiveInputFormat  build partition field base on  the first file 
reader in it, however now HoodieCombineHiveInputFormat  holds other file 
readers which come from different partitions.

When switching readers, we should  update ioctx

test env:

spark2.4.5, hadoop 3.1.1, hive 3.1.1

test step:

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// create hudi table which has three  level partitions p,p1,p2

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

 

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert current table

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300; // this timestamp is 
smaller than the earliest commit, so we can query all commits

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;

query result:

+----+-----+-----+-------+
| p  | p1  | p2  | keyid |
+----+-----+-----+-------+
| 0  | 0   | 6   | 0     |
| 0  | 0   | 6   | 1     |
| 0  | 0   | 6   | 2     |
| 0  | 0   | 6   | 3     |
| 0  | 0   | 6   | 4     |
| 0  | 0   | 6   | 4     |
| 0  | 0   | 6   | 0     |
| 0  | 0   | 6   | 3     |
| 0  | 0   | 6   | 2     |
| 0  | 0   | 6   | 1     |
+----+-----+-----+-------+

This result is wrong: in the second step we inserted new data with p2=7 into the 
table, yet the query result contains no rows with p2=7; every row has p2=6.

 

 

  was:
now hudi use HoodieCombineHiveInputFormat to achieve Incremental query of the 
mor table.

when we have some small files in different partitions, 
HoodieCombineHiveInputFormat  will combine those small file readers.   
HoodieCombineHiveInputFormat  build partition field base on  the first file 
reader in it, however now HoodieCombineHiveInputFormat  holds other file 
readers which come from different partitions.

When switching readers, we should  update ioctx

test env:

spark2.4.5, hadoop 3.1.1, hive 3.1.1

test step:

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// create hudi table which has three  level partitions p,p1,p2

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

 

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert current table

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;

query result:

++-+-++ 
| p | p1 | p2 | keyid | 
++-+-++ 
| 0 | 0 | 6 | 0 | 
| 0 | 0 | 6 | 1 | 
| 0 | 0 | 6 | 2 | 
| 0 | 0 | 6 | 3 | 
| 0 | 0 | 6 | 4 | 
| 0 | 0 | 6 | 4 | 
| 0 | 0 | 6 | 0 | 
| 0 | 0 | 6 | 3 | 
| 0 | 0 | 6 | 2 | 
| 0 | 0 | 6 | 1 | 
++-+-++

this result is wrong, since the second step we insert new data in table which 
p2=7, however in the query result we cannot find p2=7, all p2= 6

 

 


> hive on spark/mr,Incremental query of the mor table, the partition field is 
> incorrect
> -
>
> Key: HUDI-1719
> URL: https://issues.apache.org/jira/browse/HUDI-1719
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive 

[jira] [Updated] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1718:
---
Description: 
HoodieCombineHiveInputFormat uses "," to join multiple partition columns, while Hive 
uses "/" to join them. Because of this mismatch the combined value is treated as a 
single field name (see the SchemaParseException on "p,p1,p2" below), so 
HoodieCombineHiveInputFormat's logic should be modified; a sketch of the idea follows.
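
As a minimal, purely illustrative sketch (PartitionFieldNames is not a Hudi class), 
the essential step is to split on whichever delimiter was actually used before 
creating one field per partition column:

{code:java}
import java.util.Arrays;
import java.util.List;

// Illustrative only: normalise "p,p1,p2" or "p/p1/p2" into separate column names
// so that each partition column becomes its own field instead of one illegal name.
class PartitionFieldNames {
  static List<String> parse(String configuredValue) {
    return Arrays.asList(configuredValue.split("[,/]"));
  }
}
{code}
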
 test env

spark2.4.5, hadoop 3.1.1, hive 3.1.1

 

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert df,   partition by p,p1,p2

  merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert table hive8b

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step3:

start hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this timestamp 
is smaller than the earliest commit, so we can query all commits

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300'

 

2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2
2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2
 at 
org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
org.apache.hudi.org.apache.avro.Schema$Field.<init>(Schema.java:403) at 
org.apache.hudi.org.apache.avro.Schema$Field.<init>(Schema.java:396) at 
org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
 at 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:67)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:53)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:47)
 at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

 

  was:
HoodieCombineHiveInputFormat use "," to join mutil partitions, however hive use 
"/" to join muit1 partitions. there exists some gap, so modify 
HoodieCombineHiveInputFormat's logical
test env

spark2.4.5, hadoop 3.1.1, hive 3.1.1

 

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert df,   partition by p,p1,p2

  merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))

[jira] [Created] (HUDI-1720) when query incr view of mor table which has many delete records use sparksql/hive-beeline, StackOverflowError

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1720:
--

 Summary: when query incr view of  mor table which has many delete 
records use sparksql/hive-beeline,  StackOverflowError
 Key: HUDI-1720
 URL: https://issues.apache.org/jira/browse/HUDI-1720
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration, Spark Integration
Affects Versions: 0.7.0, 0.8.0
Reporter: tao meng
 Fix For: 0.9.0


RealtimeCompactedRecordReader.next currently handles delete records by recursion, see:

[https://github.com/apache/hudi/blob/6e803e08b1328b32a5c3a6acd8168fdabc8a1e50/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L106]

However, when the log file contains many delete records, this recursive logic in 
RealtimeCompactedRecordReader.next leads to a StackOverflowError.
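
A minimal sketch of the obvious remedy, replacing the recursive self-call with a loop 
so that skipping N deleted records needs constant stack depth; the helper methods 
below are stand-ins for the real parquet/log merge logic, not actual Hudi methods:

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;

// Illustrative only: iterate instead of recursing when a record was deleted in the log.
abstract class IterativeDeleteSkippingReader {
  abstract boolean nextBaseFileRecord(NullWritable key, ArrayWritable value) throws IOException;
  abstract boolean isDeletedInLog(ArrayWritable value);
  abstract void mergeLogRecordIfPresent(ArrayWritable value);

  public boolean next(NullWritable key, ArrayWritable value) throws IOException {
    while (nextBaseFileRecord(key, value)) {
      if (!isDeletedInLog(value)) {
        mergeLogRecordIfPresent(value);
        return true;               // live record found
      }
      // deleted record: keep scanning at constant stack depth
    }
    return false;                  // base file exhausted
  }
}
{code}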

test step:

step1:

val df = spark.range(0, 100).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert 100w row (keyid from 0 to 100)

merge(df, 4, "default", "hive_9b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 90).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// delete 90w row (keyid from 0 to 90)

delete(df, 4, "default", "hive_9b")

step3:

query on beeline/spark-sql :  select count(col3)  from hive_9b_rt

2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:1038
2021-03-25 15:33:29,029 | INFO  | main | RECORDS_OUT_OPERATOR_RS_3:1, RECORDS_OUT_INTERMEDIATE:1,  | Operator.java:1038
2021-03-25 15:33:29,029 | ERROR | main | Error running child : java.lang.StackOverflowError at 
org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:83) at 
org.apache.parquet.column.values.plain.BinaryPlainValuesReader.readBytes(BinaryPlainValuesReader.java:39)
 at 
org.apache.parquet.column.impl.ColumnReaderBase$2$6.read(ColumnReaderBase.java:344)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.readValue(ColumnReaderBase.java:503)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:30)
 at 
org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:409)
 at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
 at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
 at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:159)
 at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.next(ParquetRecordReaderWrapper.java:41)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:84)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:106)
 at 

[jira] [Created] (HUDI-1719) hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1719:
--

 Summary: hive on spark/mr,Incremental query of the mor table, the 
partition field is incorrect
 Key: HUDI-1719
 URL: https://issues.apache.org/jira/browse/HUDI-1719
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.7.0, 0.8.0
 Environment: spark2.4.5, hadoop 3.1.1, hive 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


Hudi currently uses HoodieCombineHiveInputFormat to implement incremental queries of 
the mor table.

When there are small files in different partitions, HoodieCombineHiveInputFormat 
combines their readers. It builds the partition field based on the first file reader 
it holds, even though it also holds file readers that come from other partitions.

When switching readers, we should update the ioctx accordingly.

test env:

spark2.4.5, hadoop 3.1.1, hive 3.1.1

test step:

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// create hudi table which has three  level partitions p,p1,p2

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

 

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert current table

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300' and `keyid` < 5;

query result:

+----+-----+-----+-------+
| p  | p1  | p2  | keyid |
+----+-----+-----+-------+
| 0  | 0   | 6   | 0     |
| 0  | 0   | 6   | 1     |
| 0  | 0   | 6   | 2     |
| 0  | 0   | 6   | 3     |
| 0  | 0   | 6   | 4     |
| 0  | 0   | 6   | 4     |
| 0  | 0   | 6   | 0     |
| 0  | 0   | 6   | 3     |
| 0  | 0   | 6   | 2     |
| 0  | 0   | 6   | 1     |
+----+-----+-----+-------+

This result is wrong: in the second step we inserted new data with p2=7 into the 
table, yet the query result contains no rows with p2=7; every row has p2=6.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1718) when query incr view of mor table which has Multi level partitions, the query failed

2021-03-25 Thread tao meng (Jira)
tao meng created HUDI-1718:
--

 Summary: when query incr view of  mor table which has Multi level 
partitions, the query failed
 Key: HUDI-1718
 URL: https://issues.apache.org/jira/browse/HUDI-1718
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.7.0, 0.8.0
Reporter: tao meng
 Fix For: 0.9.0


HoodieCombineHiveInputFormat uses "," to join multiple partition columns, while Hive 
uses "/" to join them. Because of this mismatch the combined value is treated as a 
single field name (see the SchemaParseException on "p,p1,p2" below), so 
HoodieCombineHiveInputFormat's logic should be modified.
test env

spark2.4.5, hadoop 3.1.1, hive 3.1.1

 

step1:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(6))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// bulk_insert df,   partition by p,p1,p2

  merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

step2:

val df = spark.range(0, 1).toDF("keyid")
 .withColumn("col3", expr("keyid + 1000"))
 .withColumn("p", lit(0))
 .withColumn("p1", lit(0))
 .withColumn("p2", lit(7))
 .withColumn("a1", lit(Array[String]("sb1", "rz")))
 .withColumn("a2", lit(Array[String]("sb1", "rz")))

// upsert table hive8b

merge(df, 4, "default", "hive_8b", 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

step3:

start hive beeline:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

set hoodie.hive_8b.consume.mode=INCREMENTAL;

set hoodie.hive_8b.consume.max.commits=3;

set hoodie.hive_8b.consume.start.timestamp=20210325141300;  // this timestamp 
is smaller than the earliest commit, so we can query all commits

2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2
2021-03-25 14:14:36,036 | INFO  | AsyncDispatcher event handler | Diagnostics 
report from attempt_1615883368881_0028_m_00_3: Error: 
org.apache.hudi.org.apache.avro.SchemaParseException: Illegal character in: p,p1,p2
 at 
org.apache.hudi.org.apache.avro.Schema.validateName(Schema.java:1151) at 
org.apache.hudi.org.apache.avro.Schema.access$200(Schema.java:81) at 
org.apache.hudi.org.apache.avro.Schema$Field.<init>(Schema.java:403) at 
org.apache.hudi.org.apache.avro.Schema$Field.<init>(Schema.java:396) at 
org.apache.hudi.avro.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:268)
 at 
org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils.addPartitionFields(HoodieRealtimeRecordReaderUtils.java:286)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:98)
 at 
org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:67)
 at 
org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:53)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:70)
 at 
org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:47)
 at 
org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:123)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getRecordReader(HoodieCombineHiveInputFormat.java:975)
 at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getRecordReader(HoodieCombineHiveInputFormat.java:556)
 at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175) 
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:349) at 
org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

select `p`, `p1`, `p2`,`keyid` from hive_8b_rt where 
`_hoodie_commit_time`>'20210325141300'

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1688) hudi write should uncache rdd, when the write operation is finished

2021-03-14 Thread tao meng (Jira)
tao meng created HUDI-1688:
--

 Summary: hudi write should uncache rdd, when the write operation 
is finished
 Key: HUDI-1688
 URL: https://issues.apache.org/jira/browse/HUDI-1688
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Affects Versions: 0.7.0
Reporter: tao meng
 Fix For: 0.8.0


Hudi currently improves write performance by caching the necessary RDDs; however, 
when the write operation is finished, those cached RDDs are never uncached, which 
wastes a lot of memory.

[https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115]

https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214
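
A minimal, self-contained sketch of the requested behaviour (the class and method 
names are illustrative, not the actual executor code): persist the RDD that the write 
path reuses, and unpersist it in a finally block once the commit is done so the 
executors' storage memory is released before the next batch.

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Illustrative only: release the cached RDD as soon as the write is finished.
class CachedWriteExample {
  static <T> long writeAndRelease(JavaRDD<T> inputRecords) {
    JavaRDD<T> cached = inputRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    try {
      // ... the write path reuses `cached` several times (workload profile, write, commit)
      return cached.count();
    } finally {
      cached.unpersist(); // the missing step described above
    }
  }
}
{code}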

In our environment:

step1: insert 100GB of data into the hudi table with spark (ok)

step2: insert another 100GB of data into the hudi table with spark again (OOM)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1676) Support SQL with spark3

2021-03-09 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1676:
---
Summary: Support SQL with spark3  (was: Support sql with spark3)

> Support SQL with spark3
> ---
>
> Key: HUDI-1676
> URL: https://issues.apache.org/jira/browse/HUDI-1676
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: tao meng
>Priority: Major
> Fix For: 0.8.0
>
>
> 1、support CTAS for spark3
> 3、support INSERT for spark3
> 4、support merge、update、delete without RowKey constraint for spark3
> 5、support dataSourceV2 for spark3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1677) Support Clustering and Metatable for SQL performance

2021-03-09 Thread tao meng (Jira)
tao meng created HUDI-1677:
--

 Summary: Support Clustering and Metatable for SQL performance
 Key: HUDI-1677
 URL: https://issues.apache.org/jira/browse/HUDI-1677
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: tao meng
 Fix For: 0.8.0


1. support Metatable to improve SQL write performance

2. support Clustering to improve SQL read performance



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1676) Support sql with spark3

2021-03-09 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1676:
---
Summary: Support sql with spark3  (was: support sql with spark3)

> Support sql with spark3
> ---
>
> Key: HUDI-1676
> URL: https://issues.apache.org/jira/browse/HUDI-1676
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: tao meng
>Priority: Major
> Fix For: 0.8.0
>
>
> 1、support CTAS for spark3
> 3、support INSERT for spark3
> 4、support merge、update、delete without RowKey constraint for spark3
> 5、support dataSourceV2 for spark3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1676) support sql with spark3

2021-03-09 Thread tao meng (Jira)
tao meng created HUDI-1676:
--

 Summary: support sql with spark3
 Key: HUDI-1676
 URL: https://issues.apache.org/jira/browse/HUDI-1676
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: tao meng
 Fix For: 0.8.0


1. support CTAS for spark3

3. support INSERT for spark3

4. support merge, update, delete without RowKey constraint for spark3

5. support dataSourceV2 for spark3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1675) Externalize all Hudi configurations

2021-03-09 Thread tao meng (Jira)
tao meng created HUDI-1675:
--

 Summary: Externalize all Hudi configurations
 Key: HUDI-1675
 URL: https://issues.apache.org/jira/browse/HUDI-1675
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: tao meng
 Fix For: 0.8.0


# Externalize all Hudi configurations (separate configuration file)
 # Save table related properties into hoodie.properties file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType

2021-03-04 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1662:
---
Description: 
step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable

df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))

merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")

step2: prepare update DataFrame with DateType, and upsert into HudiMorTable

 df_update = sql("select * from 
huditest.bulkinsert_mor_10g_rt").withColumn("date", 
lit(Date.valueOf("2020-11-11")))

merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")

 

step3: use hive beeline / spark-sql to query the mor _rt table

Use beeline/spark-sql to execute the statement select * from 
huditest.bulkinsert_mor_10g_rt where primary_key = 1000;

then the following error occurs:

_java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast 
to org.apache.hadoop.hive.serde2.io.DateWritableV2_

 
  
 Root cause analysis:

hudi uses the avro format to store log files, and avro stores DateType as INT (the 
type is INT but the logicalType is date).

When hudi reads the log file and converts an avro INT record to a writable, the 
logicalType is not respected, so the date value is cast to an IntWritable.

see: 
[https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]

Modification plan: when casting the avro INT type to a writable, the logicalType must 
be considered:

case INT:
  if (schema.getLogicalType() != null &&
      schema.getLogicalType().getName().equals("date")) {
    return new DateWritable((Integer) value);
  } else {
    return new IntWritable((Integer) value);
  }
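
For clarity, the same check exercised in isolation as a small, hypothetical example 
(AvroDateConversionSketch and avroIntToWritable are illustrative names, not actual 
Hudi methods; like the snippet above it uses DateWritable, which may need to become 
DateWritableV2 on Hive 3):

{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.hadoop.hive.serde2.io.DateWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

// Illustrative only: convert the raw avro int (days since epoch) according to the
// field's logicalType instead of always producing an IntWritable.
class AvroDateConversionSketch {
  static Writable avroIntToWritable(Integer value, Schema fieldSchema) {
    if (fieldSchema.getLogicalType() != null
        && "date".equals(fieldSchema.getLogicalType().getName())) {
      return new DateWritable(value); // days since epoch -> hive DATE
    }
    return new IntWritable(value);    // plain int column
  }

  public static void main(String[] args) {
    Schema dateSchema = LogicalTypes.date().addToSchema(Schema.create(Schema.Type.INT));
    // prints DateWritable, i.e. the value is no longer handed to hive as an int
    System.out.println(avroIntToWritable(18576, dateSchema).getClass().getSimpleName());
  }
}
{code}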

 

 

 

 

  was:
step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable

df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))

merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")

step2: prepare update DataFrame with DateType, and upsert into HudiMorTable

 df_update = sql("select * from 
huditest.bulkinsert_mor_10g_rt").withColumn("date", 
lit(Date.valueOf("2020-11-11")))

merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")

 

step3: use hive-beeeline/ spark-sql query mor_rt table

use beeline/spark-sql   execute   statement select * from 
huditest.bulkinsert_mor_10g_rt where primary_key = 1000;

then the follow error will occur:

_java.lang.ClassCastExceoption: org.apache.hadoop.io.IntWritable cannot be cast 
to org.apache.hadoop.hive.serde2.io.DateWritableV2_

 
 
Root cause analysis:

hudi use avro format to store log file, avro store DateType as INT(Type is INT 
but logcialType is date)。

when hudi read log file and convert avro INT type record to 
writable,logicalType is not respected which lead the dateType will cast to 
IntWritable。

seem: 
[https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
 
Modification plan: when cast avro INT type  to writable,  logicalType must be  
considerd
if (schema.getLogicalType() != null && schema.getLogicalType().getName() == 
"date") {
 return new DateWritable((Integer) value);
} else {
 return new IntWritable((Integer) value);
}
 

 

 

 

 


>  Failed to query real-time view use hive/spark-sql when hudi mor table 
> contains dateType
> 
>
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.7.0
> Environment: hive 3.1.1
> spark 2.4.5
> hadoop 3.1.1
> suse os
>Reporter: tao meng
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable
> df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
> merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
> step2: prepare update DataFrame with DateType, and upsert into HudiMorTable
>  df_update = sql("select * from 
> huditest.bulkinsert_mor_10g_rt").withColumn("date", 
> lit(Date.valueOf("2020-11-11")))
> merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
>  
> step3: use hive-beeeline/ spark-sql query mor_rt table
> use beeline/spark-sql   execute   statement select * from 
> huditest.bulkinsert_mor_10g_rt where primary_key = 1000;
> then the follow error will occur:
> _java.lang.ClassCastExceoption: org.apache.hadoop.io.IntWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
>  
>   
>  Root cause analysis:
> hudi use avro format to store log file, avro store DateType as INT(Type is 
> INT but logcialType is date)。
> when hudi read log file and convert avro INT type record to 
> writable,logicalType is not respected which lead the dateType will cast 

[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType

2021-03-04 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1662:
---
Description: 
step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable

df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))

merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")

step2: prepare update DataFrame with DateType, and upsert into HudiMorTable

 df_update = sql("select * from 
huditest.bulkinsert_mor_10g_rt").withColumn("date", 
lit(Date.valueOf("2020-11-11")))

merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")

 

step3: use hive-beeeline/ spark-sql query mor_rt table

use beeline/spark-sql   execute   statement select * from 
huditest.bulkinsert_mor_10g_rt where primary_key = 1000;

then the follow error will occur:

_java.lang.ClassCastExceoption: org.apache.hadoop.io.IntWritable cannot be cast 
to org.apache.hadoop.hive.serde2.io.DateWritableV2_

 
 
Root cause analysis:

hudi use avro format to store log file, avro store DateType as INT(Type is INT 
but logcialType is date)。

when hudi read log file and convert avro INT type record to 
writable,logicalType is not respected which lead the dateType will cast to 
IntWritable。

seem: 
[https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
 
Modification plan: when cast avro INT type  to writable,  logicalType must be  
considerd
if (schema.getLogicalType() != null && schema.getLogicalType().getName() == 
"date") {
 return new DateWritable((Integer) value);
} else {
 return new IntWritable((Integer) value);
}
 

 

 

 

 

  was:
step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable

df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))

merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")

step2: prepare update DataFrame with DateType, and upsert into HudiMorTable

 df_update = sql("select * from 
huditest.bulkinsert_mor_10g_rt").withColumn("date", 
lit(Date.valueOf("2020-11-11")))

merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")

 

step3: use hive-beeeline/ spark-sql query mor_rt table

!image-2021-03-05-10-06-11-949.png!


>  Failed to query real-time view use hive/spark-sql when hudi mor table 
> contains dateType
> 
>
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.7.0
> Environment: hive 3.1.1
> spark 2.4.5
> hadoop 3.1.1
> suse os
>Reporter: tao meng
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable
> df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
> merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
> step2: prepare update DataFrame with DateType, and upsert into HudiMorTable
>  df_update = sql("select * from 
> huditest.bulkinsert_mor_10g_rt").withColumn("date", 
> lit(Date.valueOf("2020-11-11")))
> merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
>  
> step3: use hive-beeeline/ spark-sql query mor_rt table
> use beeline/spark-sql   execute   statement select * from 
> huditest.bulkinsert_mor_10g_rt where primary_key = 1000;
> then the follow error will occur:
> _java.lang.ClassCastExceoption: org.apache.hadoop.io.IntWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.DateWritableV2_
>  
>  
> Root cause analysis:
> hudi use avro format to store log file, avro store DateType as INT(Type is 
> INT but logcialType is date)。
> when hudi read log file and convert avro INT type record to 
> writable,logicalType is not respected which lead the dateType will cast to 
> IntWritable。
> seem: 
> [https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java#L169]
>  
> Modification plan: when cast avro INT type  to writable,  logicalType must be 
>  considerd
> if (schema.getLogicalType() != null && schema.getLogicalType().getName() == 
> "date") {
>  return new DateWritable((Integer) value);
> } else {
>  return new IntWritable((Integer) value);
> }
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1662) Failed to query real-time view use hive/spark-sql when hudi mor table contains dateType

2021-03-04 Thread tao meng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tao meng updated HUDI-1662:
---
Description: 
step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable

df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))

merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")

step2: prepare update DataFrame with DateType, and upsert into HudiMorTable

 df_update = sql("select * from 
huditest.bulkinsert_mor_10g_rt").withColumn("date", 
lit(Date.valueOf("2020-11-11")))

merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")

 

step3: use hive-beeeline/ spark-sql query mor_rt table

!image-2021-03-05-10-06-11-949.png!

  was:
step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable

df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))

merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")


step2: prepare update DataFrame with DateType, and upsert into HudiMorTable

 df_update = sql("select * from 
huditest.bulkinsert_mor_10g_rt").withColumn("date", 
lit(Date.valueOf("2020-11-11")))

merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")

 

step3: use hive-beeeline/ spark-sql query mor_rt table

!file:///C:/Users/m00443775/AppData/Roaming/eSpace_Desktop/UserData/m00443775/imagefiles/847287BC-86FF-4EFF-9BDC-C2A60FFD3F47.png|id=847287BC-86FF-4EFF-9BDC-C2A60FFD3F47,vspace=3!


>  Failed to query real-time view use hive/spark-sql when hudi mor table 
> contains dateType
> 
>
> Key: HUDI-1662
> URL: https://issues.apache.org/jira/browse/HUDI-1662
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Affects Versions: 0.7.0
> Environment: hive 3.1.1
> spark 2.4.5
> hadoop 3.1.1
> suse os
>Reporter: tao meng
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> step1: prepare raw DataFrame with DateType, and insert it to HudiMorTable
> df_raw.withColumn("date", lit(Date.valueOf("2020-11-10")))
> merge(df_raw, "bulk_insert", "huditest.bulkinsert_mor_10g")
> step2: prepare update DataFrame with DateType, and upsert into HudiMorTable
>  df_update = sql("select * from 
> huditest.bulkinsert_mor_10g_rt").withColumn("date", 
> lit(Date.valueOf("2020-11-11")))
> merge(df_update, "upsert", "huditest.bulkinsert_mor_10g")
>  
> step3: use hive-beeeline/ spark-sql query mor_rt table
> !image-2021-03-05-10-06-11-949.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

