[jira] [Assigned] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding reassigned HUDI-1181:
--

Assignee: Wenning Ding

> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>
> When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
> does not display the decimal value correctly; instead, it displays the value 
> as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
>  
> {code:java}
> optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
>  
> Hudi then converts it into the following Avro decimal type:
> {code:java}
> {
> "name" : "OBJ_ID",
> "type" : [ {
>   "type" : "fixed",
>   "name" : "fixed",
>   "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
>   "size" : 16,
>   "logicalType" : "decimal",
>   "precision" : 38,
>   "scale" : 0
> }, "null" ]
> }
> {code}
> This decimal field is stored as a fixed-length byte array. In the read phase, 
> Hudi converts the byte array back into a readable decimal value through this 
> [converter|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58].
> The problem is that when a decimal field is used as the record key, Hudi reads 
> the value from the Avro GenericRecord and converts it directly into a String 
> (see 
> [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).
> As a result, the _hoodie_record_key field shows something like 
> LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need 
> to handle this special case by converting the byte array back to a decimal 
> before converting it to a String.
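
For context, the byte array shown above is just the decimal's unscaled value in Avro's 
fixed encoding. A minimal, self-contained sketch (plain Java, not Hudi code) of what 
those bytes decode to:

{code:java}
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalBytesDemo {
  public static void main(String[] args) {
    // The byte array Hudi currently prints as the record key (from the example above).
    byte[] unscaled = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71};
    // Avro stores a decimal as its unscaled value in big-endian two's-complement form,
    // so for DECIMAL(38,0) the readable value is simply:
    BigDecimal value = new BigDecimal(new BigInteger(unscaled), 0);
    System.out.println(value); // 422076345, i.e. what the record key should display
  }
}
{code}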



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] zhedoubushishi opened a new pull request #1953: [HUDI-1181] Fix decimal type display issue for record key field

2020-08-11 Thread GitBox


zhedoubushishi opened a new pull request #1953:
URL: https://github.com/apache/hudi/pull/1953


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, Hudi does 
not display the decimal value correctly; instead, it displays the value as a byte array.
   
   During the write phase, Hudi saves the Parquet source data into an Avro GenericRecord. 
For example, suppose the source Parquet data has a column with a decimal type:
   ```
   optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
   ```
   Hudi then converts it into the following Avro decimal type:
   ```
   {
   "name" : "LN_LQDN_OBJ_ID",
   "type" : [ {
 "type" : "fixed",
 "name" : "fixed",
 "namespace" : "hoodie.hudi_ln_lqdn.hudi_ln_lqdn_record.LN_LQDN_OBJ_ID",
 "size" : 16,
 "logicalType" : "decimal",
 "precision" : 38,
 "scale" : 0
   }, "null" ]
   }
   ```
   This decimal field is stored as a fixed-length byte array. In the read phase, Hudi 
converts the byte array back into a readable decimal value through this 
[converter](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58).
   
   The problem is that when a decimal field is used as the record key, Hudi reads the 
value from the Avro GenericRecord and converts it directly into ```String``` (see 
[here](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76)).
 As a result, the ```_hoodie_record_key``` field shows something like 
```LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]```.
   
   So we need to handle this special case and convert the byte array back to a decimal 
before converting it to ```String```.
   
   
   ## Brief change log
   
   Similar to what we did for Date type columns 
(https://github.com/apache/hudi/commit/2d040145810b8b14c59c5882f9115698351039d1#diff-21f77fb372831d468dab018505592e12),
 I added logic to handle decimal type columns.
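   
   A minimal sketch of the idea (illustrative only, not the exact patch; it assumes the 
field schema passed in is the resolved, non-union fixed schema):
   ```java
   import java.math.BigDecimal;
   import org.apache.avro.Conversions;
   import org.apache.avro.LogicalTypes;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericFixed;
   
   public class DecimalKeySketch {
     static String toKeyString(Object fieldVal, Schema fieldSchema) {
       // If the value is an Avro fixed with a decimal logical type, convert it to
       // BigDecimal first so the record key shows a readable number, not raw bytes.
       if (fieldVal instanceof GenericFixed
           && fieldSchema.getLogicalType() instanceof LogicalTypes.Decimal) {
         BigDecimal decimal = new Conversions.DecimalConversion()
             .fromFixed((GenericFixed) fieldVal, fieldSchema, fieldSchema.getLogicalType());
         return decimal.toPlainString();
       }
       // Fall back to the existing behavior for all other types.
       return String.valueOf(fieldVal);
     }
   }
   ```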
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
 - *Added a decimal test case in TestDataSourceUtils.java to verify the 
change.*
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1181:
-
Labels: pull-request-available  (was: )

> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>
> When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
> does not display the decimal value correctly; instead, it displays the value 
> as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
>  
> {code:java}
> optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
>  
> Hudi then converts it into the following Avro decimal type:
> {code:java}
> {
> "name" : "OBJ_ID",
> "type" : [ {
>   "type" : "fixed",
>   "name" : "fixed",
>   "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
>   "size" : 16,
>   "logicalType" : "decimal",
>   "precision" : 38,
>   "scale" : 0
> }, "null" ]
> }
> {code}
> This decimal field is stored as a fixed-length byte array. In the read phase, 
> Hudi converts the byte array back into a readable decimal value through this 
> [converter|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58].
> The problem is that when a decimal field is used as the record key, Hudi reads 
> the value from the Avro GenericRecord and converts it directly into a String 
> (see 
> [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).
> As a result, the _hoodie_record_key field shows something like 
> LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need 
> to handle this special case by converting the byte array back to a decimal 
> before converting it to a String.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-1181:
---
Description: 
When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
does not display the decimal value correctly; instead, it displays the value as 
a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

 
{code:java}
optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
 

Hudi then converts it into the following Avro decimal type:
{code:java}
{
"name" : "OBJ_ID",
"type" : [ {
  "type" : "fixed",
  "name" : "fixed",
  "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
  "size" : 16,
  "logicalType" : "decimal",
  "precision" : 38,
  "scale" : 0
}, "null" ]
}
{code}
This decimal field is stored as a fixed-length byte array. In the read phase, 
Hudi converts the byte array back into a readable decimal value through this 
converter.

The problem is that when a decimal field is used as the record key, Hudi reads 
the value from the Avro GenericRecord and converts it directly into a String (see 
[here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).

As a result, the _hoodie_record_key field shows something like 
LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need to 
handle this special case by converting the byte array back to a decimal before 
converting it to a String.

  was:
When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
does not display the decimal value correctly; instead, it displays the value as 
a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

 
{code:java}
optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
 

Hudi then converts it into the following Avro decimal type:
{code:java}
{
"name" : "LN_LQDN_OBJ_ID",
"type" : [ {
  "type" : "fixed",
  "name" : "fixed",
  "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
  "size" : 16,
  "logicalType" : "decimal",
  "precision" : 38,
  "scale" : 0
}, "null" ]
}
{code}
This decimal field is stored as a fixed-length byte array. In the read phase, 
Hudi converts the byte array back into a readable decimal value through this 
converter.

The problem is that when a decimal field is used as the record key, Hudi reads 
the value from the Avro GenericRecord and converts it directly into a String (see 
[here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).

As a result, the _hoodie_record_key field shows something like 
LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need to 
handle this special case by converting the byte array back to a decimal before 
converting it to a String.


> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>
> When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
> does not display the decimal value correctly; instead, it displays the value 
> as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
>  
> {code:java}
> optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
>  
> Hudi then converts it into the following Avro decimal type:
> {code:java}
> {
> "name" : "OBJ_ID",
> "type" : [ {
>   "type" : "fixed",
>   "name" : "fixed",
>   "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
>   "size" : 16,
>   "logicalType" : "decimal",
>   "precision" : 38,
>   "scale" : 0
> }, "null" ]
> }
> {code}
> This decimal field is stored as a fixed-length byte array. In the read phase, 
> Hudi converts the byte array back into a readable decimal value through this 
> converter.
> The problem is that when a decimal field is used as the record key, Hudi reads 
> the value from the Avro GenericRecord and converts it directly into a String 
> (see 
> [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).
> As a result, the _hoodie_record_key field shows something like 
> LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71].

[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-1181:
---
Description: 
When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
does not display the decimal value correctly; instead, it displays the value as 
a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

 
{code:java}
optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
 

Hudi then converts it into the following Avro decimal type:
{code:java}
{
"name" : "OBJ_ID",
"type" : [ {
  "type" : "fixed",
  "name" : "fixed",
  "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
  "size" : 16,
  "logicalType" : "decimal",
  "precision" : 38,
  "scale" : 0
}, "null" ]
}
{code}
This decimal field is stored as a fixed-length byte array. In the read phase, 
Hudi converts the byte array back into a readable decimal value through this 
[converter|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58].

The problem is that when a decimal field is used as the record key, Hudi reads 
the value from the Avro GenericRecord and converts it directly into a String (see 
[here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).

As a result, the _hoodie_record_key field shows something like 
LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need to 
handle this special case by converting the byte array back to a decimal before 
converting it to a String.

  was:
When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
does not display the decimal value correctly; instead, it displays the value as 
a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

 
{code:java}
optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
 

Hudi then converts it into the following Avro decimal type:
{code:java}
{
"name" : "OBJ_ID",
"type" : [ {
  "type" : "fixed",
  "name" : "fixed",
  "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
  "size" : 16,
  "logicalType" : "decimal",
  "precision" : 38,
  "scale" : 0
}, "null" ]
}
{code}
This decimal field is stored as a fixed-length byte array. In the read phase, 
Hudi converts the byte array back into a readable decimal value through this 
converter.

The problem is that when a decimal field is used as the record key, Hudi reads 
the value from the Avro GenericRecord and converts it directly into a String (see 
[here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).

As a result, the _hoodie_record_key field shows something like 
LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need to 
handle this special case by converting the byte array back to a decimal before 
converting it to a String.


> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>
> When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
> does not display the decimal value correctly; instead, it displays the value 
> as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
>  
> {code:java}
> optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
>  
> Hudi then converts it into the following Avro decimal type:
> {code:java}
> {
> "name" : "OBJ_ID",
> "type" : [ {
>   "type" : "fixed",
>   "name" : "fixed",
>   "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
>   "size" : 16,
>   "logicalType" : "decimal",
>   "precision" : 38,
>   "scale" : 0
> }, "null" ]
> }
> {code}
> This decimal field is stored as a fixed-length byte array. In the read phase, 
> Hudi converts the byte array back into a readable decimal value through this 
> [converter|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58].
> The problem is that when a decimal field is used as the record key, Hudi reads 
> the value from the Avro GenericRecord and converts it directly into a String 
> (see 
> [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).
> 

[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-1181:
---
Description: 
When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
does not display the decimal value correctly; instead, it displays the value as 
a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

 
{code:java}
optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
 

Hudi then converts it into the following Avro decimal type:
{code:java}
{
"name" : "LN_LQDN_OBJ_ID",
"type" : [ {
  "type" : "fixed",
  "name" : "fixed",
  "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
  "size" : 16,
  "logicalType" : "decimal",
  "precision" : 38,
  "scale" : 0
}, "null" ]
}
{code}
This decimal field is stored as a fixed-length byte array. In the read phase, 
Hudi converts the byte array back into a readable decimal value through this 
converter.

The problem is that when a decimal field is used as the record key, Hudi reads 
the value from the Avro GenericRecord and converts it directly into a String (see 
[here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).

As a result, the _hoodie_record_key field shows something like 
LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need to 
handle this special case by converting the byte array back to a decimal before 
converting it to a String.

  was:
When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
Hudi does not display the decimal value correctly; instead, it displays the 
value as a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

{

optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
 }
 Then Hudi will convert it into the following avro decimal type:


> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>
> When using the *fixed_len_byte_array* decimal type as a Hudi record key, Hudi 
> does not display the decimal value correctly; instead, it displays the value 
> as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
>  
> {code:java}
> optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code}
>  
> Hudi then converts it into the following Avro decimal type:
> {code:java}
> {
> "name" : "LN_LQDN_OBJ_ID",
> "type" : [ {
>   "type" : "fixed",
>   "name" : "fixed",
>   "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID",
>   "size" : 16,
>   "logicalType" : "decimal",
>   "precision" : 38,
>   "scale" : 0
> }, "null" ]
> }
> {code}
> This decimal field is stored as a fixed-length byte array. In the read phase, 
> Hudi converts the byte array back into a readable decimal value through this 
> converter.
> The problem is that when a decimal field is used as the record key, Hudi reads 
> the value from the Avro GenericRecord and converts it directly into a String 
> (see 
> [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]).
> As a result, the _hoodie_record_key field shows something like 
> LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. We need 
> to handle this special case by converting the byte array back to a decimal 
> before converting it to a String.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-1181:
---
Description: 
When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
Hudi does not display the decimal value correctly; instead, it displays the 
value as a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:

{

optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
 }
 Then Hudi will convert it into the following avro decimal type:

  was:
When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
Hudi does not display the decimal value correctly; instead, it displays the 
value as a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:
{
optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
}
Then Hudi will convert it into the following avro decimal type:



> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>
> When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
> Hudi does not display the decimal value correctly; instead, it displays the 
> value as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
> {
> optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
>  }
>  Then Hudi will convert it into the following avro decimal type:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-1181:
---
Description: 
When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
Hudi does not display the decimal value correctly; instead, it displays the 
value as a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:
{
optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
}
Then Hudi will convert it into the following avro decimal type:


  was:
When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
Hudi does not display the decimal value correctly; instead, it displays the 
value as a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:
```
optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
```
Then Hudi will convert it into the following avro decimal type:



> Decimal type display issue for record key field
> ---
>
> Key: HUDI-1181
> URL: https://issues.apache.org/jira/browse/HUDI-1181
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>
> When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
> Hudi does not display the decimal value correctly; instead, it displays the 
> value as a byte array.
> During the write phase, Hudi saves the Parquet source data into an Avro 
> GenericRecord. For example, suppose the source Parquet data has a column with 
> a decimal type:
> {
> optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
> }
> Then Hudi will convert it into the following avro decimal type:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1181) Decimal type display issue for record key field

2020-08-11 Thread Wenning Ding (Jira)
Wenning Ding created HUDI-1181:
--

 Summary: Decimal type display issue for record key field
 Key: HUDI-1181
 URL: https://issues.apache.org/jira/browse/HUDI-1181
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Wenning Ding


When using the ```fixed_len_byte_array``` decimal type as a Hudi record key, 
Hudi does not display the decimal value correctly; instead, it displays the 
value as a byte array.

During the write phase, Hudi saves the Parquet source data into an Avro 
GenericRecord. For example, suppose the source Parquet data has a column with a 
decimal type:
```
optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0));
```
Then Hudi will convert it into the following avro decimal type:




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] tooptoop4 commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

2020-08-11 Thread GitBox


tooptoop4 commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-672595024


   @bschell which PR fixes it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1180) Upgrade HBase to 2.3.3

2020-08-11 Thread Wenning Ding (Jira)
Wenning Ding created HUDI-1180:
--

 Summary: Upgrade HBase to 2.3.3
 Key: HUDI-1180
 URL: https://issues.apache.org/jira/browse/HUDI-1180
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Wenning Ding


Trying to upgrade HBase to 2.3.3, but I ran into several issues.

According to the Hadoop version support matrix 
([http://hbase.apache.org/book.html#hadoop]), we also need to upgrade Hadoop to 
2.8.5+.

There are several API conflicts between HBase 2.2.3 and HBase 1.2.3 that we need 
to resolve first. After resolving the conflicts, I am able to compile, but I then 
ran into a tricky Jetty version issue during testing:
{code:java}
[ERROR] TestHBaseIndex.testDelete()  Time elapsed: 4.705 s  <<< ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


[ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate()  Time elapsed: 0.174 s  
<<< ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


[ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback()  Time 
elapsed: 0.076 s  <<< ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


[ERROR] TestHBaseIndex.testSmallBatchSize()  Time elapsed: 0.122 s  <<< ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


[ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate()  Time elapsed: 0.16 
s  <<< ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


[ERROR] TestHBaseIndex.testTotalGetsBatching()  Time elapsed: 1.771 s  <<< 
ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


[ERROR] TestHBaseIndex.testTotalPutsBatching()  Time elapsed: 0.082 s  <<< 
ERROR!
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V


34206 [Thread-260] WARN  
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner  - DirectoryScanner: 
shutdown has been called
34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
localhost/127.0.0.1:55924] WARN  
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager  - 
IncrementalBlockReportManager interrupted
34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
localhost/127.0.0.1:55924] WARN  
org.apache.hadoop.hdfs.server.datanode.DataNode  - Ending block pool service 
for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid 
cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924
34246 
[refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506]
 WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
to refresh disk information: sleep interrupted
34247 
[refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506]
 WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
to refresh disk information: sleep interrupted
37192 [HBase-Metrics2-1] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  - 
Cannot locate configuration: tried 
hadoop-metrics2-datanode.properties,hadoop-metrics2.properties
43904 [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] 
WARN  org.apache.zookeeper.ClientCnxn  - Session 0x173dfeb0c8b0004 for server 
null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Errors: 
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[INFO] 
[ERROR] Tests run: 10, Failures: 0, Errors: 7, Skipped: 0
[INFO] 
{code}
Currently, Hudi and its dependency Javalin depend on Jetty 9.4.x, but HBase 
depends on Jetty 9.3.x. The two have incompatible APIs, and the conflict could 
not be easily resolved.
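
A hedged diagnostic sketch (not part of any proposed change) for checking which 
Jetty SessionHandler API actually ends up on the test classpath:
{code:java}
import java.security.CodeSource;
import org.eclipse.jetty.server.session.SessionHandler;

public class JettyVersionCheck {
  public static void main(String[] args) {
    // Print which jar the SessionHandler class was loaded from.
    CodeSource src = SessionHandler.class.getProtectionDomain().getCodeSource();
    System.out.println("SessionHandler loaded from: "
        + (src == null ? "unknown" : src.getLocation()));
    // SessionHandler.setHttpOnly(boolean) exists in Jetty 9.4.x but not in 9.3.x,
    // which matches the NoSuchMethodError in the test output above.
    try {
      SessionHandler.class.getMethod("setHttpOnly", boolean.class);
      System.out.println("setHttpOnly(boolean) present: Jetty 9.4.x API");
    } catch (NoSuchMethodException e) {
      System.out.println("setHttpOnly(boolean) missing: Jetty 9.3.x (or older) API");
    }
  }
}
{code}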



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #367

2020-08-11 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.58 KB...]
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${scala.binary.version}:[unknown-version],
 

 line 27, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the 

[GitHub] [hudi] vinothchandar commented on issue #1837: [SUPPORT]S3 file listing causing compaction to get eventually slow

2020-08-11 Thread GitBox


vinothchandar commented on issue #1837:
URL: https://github.com/apache/hudi/issues/1837#issuecomment-672560925


   And this is the last such place (cleaner and rollback are all incremental now). 
cc @prashantwason to the rescue ;)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #827: java.lang.ClassNotFoundException: com.uber.hoodie.hadoop.HoodieInputFormat

2020-08-11 Thread GitBox


bvaradar commented on issue #827:
URL: https://github.com/apache/hudi/issues/827#issuecomment-672558523


   @saumyasuhagiya : This is a very old ticket about the Hudi 0.4.x version. Are 
you using 0.4.x or 0.5.x? If you are on a newer version, please open a new ticket 
with complete context.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] saumyasuhagiya edited a comment on issue #827: java.lang.ClassNotFoundException: com.uber.hoodie.hadoop.HoodieInputFormat

2020-08-11 Thread GitBox


saumyasuhagiya edited a comment on issue #827:
URL: https://github.com/apache/hudi/issues/827#issuecomment-672555379


   @malanb5 @n3nash I have tried that as well, but it is still failing. I am using 
the Hudi Spark bundle and the above dependency on a Databricks cluster.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] saumyasuhagiya commented on issue #827: java.lang.ClassNotFoundException: com.uber.hoodie.hadoop.HoodieInputFormat

2020-08-11 Thread GitBox


saumyasuhagiya commented on issue #827:
URL: https://github.com/apache/hudi/issues/827#issuecomment-672555379


   @malanb5 @n3nash I have tried that as well, but it is still failing. I am using 
the Hudi Spark bundle and the above dependency.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar opened a new pull request #1952: [Not For Merging] Debug integ Tests

2020-08-11 Thread GitBox


bvaradar opened a new pull request #1952:
URL: https://github.com/apache/hudi/pull/1952


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468975269



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
##
@@ -125,49 +130,58 @@ public TimestampBasedKeyGenerator(TypedProperties config, 
String partitionPathFi
 
   @Override
   public String getPartitionPath(GenericRecord record) {
-Object partitionVal = HoodieAvroUtils.getNestedFieldVal(record, 
partitionPathField, true);
+Object partitionVal = HoodieAvroUtils.getNestedFieldVal(record, 
getPartitionPathFields().get(0), true);
 if (partitionVal == null) {
   partitionVal = 1L;
 }
+try {
+  return getPartitionPath(partitionVal);
+} catch (Exception e) {
+  throw new HoodieDeltaStreamerException("Unable to parse input partition 
field :" + partitionVal, e);
+}
+  }
 
+  /**
+   * Parse and fetch partition path based on data type.
+   *
+   * @param partitionVal partition path object value fetched from record/row
+   * @return the parsed partition path based on data type
+   * @throws ParseException on any parse exception
+   */
+  private String getPartitionPath(Object partitionVal) throws ParseException {
 DateTimeFormatter partitionFormatter = 
DateTimeFormat.forPattern(outputDateFormat);
 if (this.outputDateTimeZone != null) {
   partitionFormatter = partitionFormatter.withZone(outputDateTimeZone);
 }
-
-try {
-  long timeMs;
-  if (partitionVal instanceof Double) {
-timeMs = convertLongTimeToMillis(((Double) partitionVal).longValue());
-  } else if (partitionVal instanceof Float) {
-timeMs = convertLongTimeToMillis(((Float) partitionVal).longValue());
-  } else if (partitionVal instanceof Long) {
-timeMs = convertLongTimeToMillis((Long) partitionVal);
-  } else if (partitionVal instanceof CharSequence) {
-DateTime parsedDateTime = 
inputFormatter.parseDateTime(partitionVal.toString());
-if (this.outputDateTimeZone == null) {
-  // Use the timezone that came off the date that was passed in, if it 
had one
-  partitionFormatter = 
partitionFormatter.withZone(parsedDateTime.getZone());
-}
-
-timeMs = 
inputFormatter.parseDateTime(partitionVal.toString()).getMillis();
-  } else {
-throw new HoodieNotSupportedException(
-"Unexpected type for partition field: " + 
partitionVal.getClass().getName());
+long timeMs;

Review comment:
   Note to reviewer: I removed the outer try/catch and moved it to the caller. 
Other than that, there are no code changes.

##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
##
@@ -22,30 +22,27 @@
 import org.apache.hudi.common.config.TypedProperties;
 
 import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
 
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
-import java.util.stream.Collectors;
 
 /**
- * Key generator for deletes using global indices. Global index deletes do not 
require partition value
- * so this key generator avoids using partition value for generating HoodieKey.
+ * Key generator for deletes using global indices. Global index deletes do not 
require partition value so this key generator avoids using partition value for 
generating HoodieKey.
  */
 public class GlobalDeleteKeyGenerator extends BuiltinKeyGenerator {
 
   private static final String EMPTY_PARTITION = "";
 
-  protected final List recordKeyFields;
-
   public GlobalDeleteKeyGenerator(TypedProperties config) {
 super(config);
-this.recordKeyFields = 
Arrays.stream(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+this.recordKeyFields = 
Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));

Review comment:
   This line is different from what I see before this patch. It is an 
optimization, but just to be safe, we can keep it as is. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468884350



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
##
@@ -177,4 +191,26 @@ private long convertLongTimeToMillis(Long partitionVal) {
 }
 return MILLISECONDS.convert(partitionVal, timeUnit);
   }
+
+  @Override
+  public String getRecordKey(Row row) {
+return RowKeyGeneratorHelper.getRecordKeyFromRow(row, 
getRecordKeyFields(), getRecordKeyPositions(), false);
+  }
+
+  @Override
+  public String getPartitionPath(Row row) {
+Object fieldVal = null;
+Object partitionPathFieldVal =  
RowKeyGeneratorHelper.getNestedFieldVal(row, 
getPartitionPathPositions().get(getPartitionPathFields().get(0)));

Review comment:
   Yes, this extends SimpleKeyGenerator. Also, we have a special case for the 
partition path in case we don't find the field; I couldn't find a better way to 
do it. The position will return -1, and when parsing the actual Row, we will 
return DEFAULT_PARTITION_PATH.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468883660



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java
##
@@ -55,21 +51,22 @@ public SimpleKeyGenerator(TypedProperties props, String 
partitionPathField) {
 
   @Override
   public String getRecordKey(GenericRecord record) {
-return KeyGenUtils.getRecordKey(record, recordKeyField);
+return KeyGenUtils.getRecordKey(record, getRecordKeyFields().get(0));

Review comment:
   We wanted to have the same behavior as getKey(). We don't throw an exception 
in the constructor if the record key is not found; we throw only when 
getKey(GenericRecord record) is called.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

2020-08-11 Thread GitBox


bvaradar commented on issue #1813:
URL: https://github.com/apache/hudi/issues/1813#issuecomment-672433664


   @jcunhafonte : @bschell confirmed it works on master. Can you try using 
master or wait for 0.6? (The release should happen in a week's time.)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar closed issue #1813: ERROR HoodieDeltaStreamer: Got error running delta sync once.

2020-08-11 Thread GitBox


bvaradar closed issue #1813:
URL: https://github.com/apache/hudi/issues/1813


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan closed HUDI-1146.


> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Assignee: Balaji Varadarajan
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR: restarting the 
> DeltaStreamer process crashes; that is, the second run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The second run crashes with the error below (it does not crash if we delete 
> the output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> This looks to be caused by this line:
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
> The "orElse" block here doesn't seem to make sense: if "transformed" is empty, 
> then "dataAndCheckpoint" will likely have a null schema provider.
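
To illustrate the failure mode in isolation, here is a minimal, hypothetical sketch 
(the class and field names are invented; this is not the actual DeltaSync code): when 
the transformed batch is empty, the orElse fallback carries no schema provider, and 
asking for one later throws the error above.

{code:java}
import java.util.Optional;

public class OrElsePitfall {
  static class InputBatch {
    final Object schemaProvider; // null when the source read produced no usable schema
    InputBatch(Object schemaProvider) {
      this.schemaProvider = schemaProvider;
    }
    Object getSchemaProvider() {
      if (schemaProvider == null) {
        throw new IllegalStateException("Please provide a valid schema provider class!");
      }
      return schemaProvider;
    }
  }

  public static void main(String[] args) {
    Optional<InputBatch> transformed = Optional.empty(); // e.g. no new records this round
    InputBatch dataAndCheckpoint = new InputBatch(null); // source batch without a schema provider
    // The orElse fallback hands back a batch whose schema provider is null,
    // so this call reproduces the crash described in the ticket.
    transformed.orElse(dataAndCheckpoint).getSchemaProvider();
  }
}
{code}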



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan closed HUDI-1091.


> Handle empty input batch gracefully in ParquetDFSSource
> ---
>
> Key: HUDI-1091
> URL: https://issues.apache.org/jira/browse/HUDI-1091
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1813]
>  Looking at 0.5.3, it is possible the below exception can happen when running 
> in standalone mode and the next batch to write is empty.
> ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
> org.apache.hudi.exception.HoodieException: Please provide a valid schema 
> provider class! at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-1146.
--
Resolution: Fixed

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Assignee: Balaji Varadarajan
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR: restarting the 
> DeltaStreamer process crashes; that is, the second run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The second run crashes with the error below (it does not crash if we delete 
> the output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> This looks to be caused by this line:
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
> The "orElse" block here doesn't seem to make sense: if "transformed" is empty, 
> then "dataAndCheckpoint" will likely have a null schema provider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1146:
-
Status: In Progress  (was: Open)

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Assignee: Balaji Varadarajan
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR: restarting the 
> DeltaStreamer process crashes; that is, the second run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The second run crashes with the error below (it does not crash if we delete 
> the output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> This looks to be caused by this line:
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
> The "orElse" block here doesn't seem to make sense: if "transformed" is empty, 
> then "dataAndCheckpoint" will likely have a null schema provider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1146:
-
Status: Open  (was: New)

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Assignee: Balaji Varadarajan
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR: restarting the 
> DeltaStreamer process crashes; that is, the second run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The second run crashes with the error below (it does not crash if we delete 
> the output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]}}
> The "orElse" block here doesn't seem to make sense: if "transformed" is 
> empty, then "dataAndCheckpoint" will likely have a null schema provider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1146:


Assignee: Balaji Varadarajan

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Assignee: Balaji Varadarajan
>Priority: Major
>
> DeltaStreamer issue, seen with both COW and MOR: restarting the 
> DeltaStreamer process crashes, i.e. the 2nd run does nothing.
> Steps:
>  Run a Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> by the step above
>  The 2nd run crashes with the error below (it does not crash if we delete the 
> output parquet files)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]}}
> The "orElse" block here doesn't seem to make sense: if "transformed" is 
> empty, then "dataAndCheckpoint" will likely have a null schema provider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-08-11 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175918#comment-17175918
 ] 

Balaji Varadarajan commented on HUDI-1091:
--

[~bschell] confirmed this is resolved in master. Resolving this ticket.

> Handle empty input batch gracefully in ParquetDFSSource
> ---
>
> Key: HUDI-1091
> URL: https://issues.apache.org/jira/browse/HUDI-1091
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1813]
>  Looking at 0.5.3, it is possible the below exception can happen when running 
> in standalone mode and the next batch to write is empty.
> ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
> org.apache.hudi.exception.HoodieException: Please provide a valid schema 
> provider class! at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-1091.
--
Resolution: Fixed

> Handle empty input batch gracefully in ParquetDFSSource
> ---
>
> Key: HUDI-1091
> URL: https://issues.apache.org/jira/browse/HUDI-1091
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1813]
>  Looking at 0.5.3, it is possible the below exception can happen when running 
> in standalone mode and the next batch to write is empty.
> ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
> org.apache.hudi.exception.HoodieException: Please provide a valid schema 
> provider class! at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1091) Handle empty input batch gracefully in ParquetDFSSource

2020-08-11 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1091:
-
Status: In Progress  (was: Open)

> Handle empty input batch gracefully in ParquetDFSSource
> ---
>
> Key: HUDI-1091
> URL: https://issues.apache.org/jira/browse/HUDI-1091
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1813]
>  Looking at 0.5.3, it is possible the below exception can happen when running 
> in standalone mode and the next batch to write is empty.
> ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down 
> org.apache.hudi.exception.HoodieException: Please provide a valid schema 
> provider class! at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #1837: [SUPPORT]S3 file listing causing compaction to get eventually slow

2020-08-11 Thread GitBox


bvaradar commented on issue #1837:
URL: https://github.com/apache/hudi/issues/1837#issuecomment-672430487


   Thanks @steveloughran : Good to know. We are looking at an approach using 
consolidated metadata (RFC-15) to avoid file listing in the first place. 
@umehrot2 : What are your thoughts on this? Do you think this would 
significantly help the S3 case in the interim? We currently do list all 
partitions for compaction scheduling. I am wondering if this is worth looking 
at in the 0.6.1 timeframe. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1902: [SUPPORT] Hudi dont put the same day in the same file

2020-08-11 Thread GitBox


bvaradar commented on issue #1902:
URL: https://github.com/apache/hudi/issues/1902#issuecomment-672406725


   With bulk insert, the parallelism configuration determines the lower bound 
on the number of files. Since you started with bulk insert, you are seeing 
that many files. Hudi upsert/insert will route "new records" (with 
new record keys) to these small files. So, if there are new records in the same 
partition, you will see those small files growing.
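   For context, a hedged sketch (Java, Spark DataSource API) of the knobs being discussed; the option keys are the ones that appear elsewhere in this thread, while the values are illustrative and not a complete option set:
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   public class HudiWriteSketch {
     // Bulk insert parallelism bounds the minimum file count; the small-file limit lets
     // later upserts/inserts route new records into existing small files.
     public static void write(Dataset<Row> df, String basePath) {
       df.write()
         .format("org.apache.hudi")
         .option("hoodie.datasource.write.table.name", "order")
         .option("hoodie.datasource.write.operation", "bulk_insert")
         .option("hoodie.bulkinsert.shuffle.parallelism", "8")
         .option("hoodie.parquet.small.file.limit", "943718400")
         .option("hoodie.parquet.max.file.size", "1073741824")
         .mode(SaveMode.Append)
         .save(basePath);
     }
   }
   ```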
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468934922



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieRowParquetWriteSupport.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.bloom.HoodieDynamicBoundedBloomFilter;
+import org.apache.parquet.hadoop.api.WriteSupport;
+import org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.HashMap;
+
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY;
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE;
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_MAX_RECORD_KEY_FOOTER;
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_MIN_RECORD_KEY_FOOTER;
+
+/**
+ * Hoodie Write Support for directly writing Row to Parquet.
+ */
+public class HoodieRowParquetWriteSupport extends ParquetWriteSupport {
+
+  private Configuration hadoopConf;
+  private BloomFilter bloomFilter;
+  private String minRecordKey;
+  private String maxRecordKey;
+
+  public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter) {
+super();
+Configuration hadoopConf = new Configuration(conf);
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", "false");

Review comment:
   Check lines 94 to 104 of 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
 . Or was your ask just about hardcoding these configs? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468932578



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured 
for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();

Review comment:
   StructType is just the schema; for recordKey fields and partition 
paths, we parse the StructType and store the chain of positions (if nested). I 
don't think we can get away without storing positions. 
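   To make the idea concrete, a minimal sketch (not the actual RowKeyGeneratorHelper code) of resolving a dotted field name like "a.b.c" to a chain of positions against a StructType; it assumes every intermediate level is itself a struct:
   ```java
   import java.util.ArrayList;
   import java.util.List;

   import org.apache.spark.sql.types.DataType;
   import org.apache.spark.sql.types.StructType;

   public class NestedFieldPositionSketch {
     public static List<Integer> positionsFor(StructType schema, String dottedField) {
       List<Integer> positions = new ArrayList<>();
       DataType current = schema;
       for (String part : dottedField.split("\\.")) {
         StructType struct = (StructType) current;   // assumes every intermediate level is a struct
         int idx = struct.fieldIndex(part);          // throws IllegalArgumentException if the field is missing
         positions.add(idx);
         current = struct.fields()[idx].dataType();
       }
       return positions;  // e.g. "a.b.c" -> [index of a, index of b within a, index of c within b]
     }
   }
   ```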
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #1902: [SUPPORT] Hudi dont put the same day in the same file

2020-08-11 Thread GitBox


rubenssoto commented on issue #1902:
URL: https://github.com/apache/hudi/issues/1902#issuecomment-672378827


   Hi,
   With bulk_insert my data was organized very well, so I started a streaming 
job with upsert on the same data.
   
   (screenshot) https://user-images.githubusercontent.com/36298331/89960776-4f892280-dc16-11ea-9ab7-843f67e5961b.png
   
   
   Why didn't upsert keep the files organized? It's the same set of Hudi options; I only 
changed 
   hoodie.datasource.write.operation



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1936: Hudi Query Error

2020-08-11 Thread GitBox


umehrot2 commented on issue #1936:
URL: https://github.com/apache/hudi/issues/1936#issuecomment-672374251


   @harishchanderramesh What is the configured s3 path of your hudi table? 
Does it start with `s3a://` or `s3://`? If it starts with `s3a`, you may want 
to try using `s3://` once.
   
   If that's not the cause, this would need deeper investigation by AWS support, 
and it's not something that we can or should drive over here, as this most 
likely does not have to do with Hudi. Please open an AWS support ticket and 
they should be able to drive the investigation of this issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468901418



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured 
for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1 converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+  throw new HoodieKeyException("Unable to find field names for record key 
or partition path in cfg");
+}
+return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List getRecordKeyFieldNames() {
+// For nested columns, pick top level column name
+return getRecordKeyFields().stream().map(k -> {
+  int idx = k.indexOf('.');
+  return idx > 0 ? k.substring(0, idx) : k;
+}).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
+// parse simple fields
+getRecordKeyFields().stream()
+.filter(f -> !(f.contains(".")))
+.forEach(f -> recordKeyPositions.put(f, 
Collections.singletonList((Integer) (structType.getFieldIndex(f).get();
+// parse nested fields
+getRecordKeyFields().stream()
+.filter(f -> f.contains("."))
+.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+// parse simple fields
+if (getPartitionPathFields() != null) {
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  .forEach(f -> partitionPathPositions.put(f,
+  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get();
+  // parse nested fields
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
+  .forEach(f -> partitionPathPositions.put(f,
+  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+}
+this.structName = structName;
+this.structType = structType;
+this.recordNamespace = recordNamespace;
+  }
+
+  /**
+   * Fetch record key from {@link Row}.
+   * @param row instance of {@link Row} from which record key is requested.
+   * @return the record key of interest from {@link Row}.
+   */
+  @Override
+  public String getRecordKey(Row row) {
+if (null != converterFn) {

Review comment:
   On 2nd thought, yes, it makes sense to move this to KeyGenerator, and 
that's why we had the default impl of re-using getRecord(). So that all existing 
customers can still 

[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468900813



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured 
for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1 converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+  throw new HoodieKeyException("Unable to find field names for record key 
or partition path in cfg");
+}
+return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List getRecordKeyFieldNames() {
+// For nested columns, pick top level column name
+return getRecordKeyFields().stream().map(k -> {
+  int idx = k.indexOf('.');
+  return idx > 0 ? k.substring(0, idx) : k;
+}).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
+// parse simple fields
+getRecordKeyFields().stream()
+.filter(f -> !(f.contains(".")))
+.forEach(f -> recordKeyPositions.put(f, 
Collections.singletonList((Integer) (structType.getFieldIndex(f).get();
+// parse nested fields
+getRecordKeyFields().stream()
+.filter(f -> f.contains("."))
+.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+// parse simple fields
+if (getPartitionPathFields() != null) {
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  .forEach(f -> partitionPathPositions.put(f,
+  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get();
+  // parse nested fields
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
+  .forEach(f -> partitionPathPositions.put(f,
+  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+}
+this.structName = structName;
+this.structType = structType;
+this.recordNamespace = recordNamespace;
+  }
+
+  /**
+   * Fetch record key from {@link Row}.
+   * @param row instance of {@link Row} from which record key is requested.
+   * @return the record key of interest from {@link Row}.
+   */
+  @Override
+  public String getRecordKey(Row row) {
+if (null != converterFn) {

Review comment:
   I guess this could be a bug. I am just now realizing we don't have tests for 
this; we have tests for all built-in key generators, but not for this. Will get 
it done by tonight. 





[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468900813



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured 
for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1 converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+  throw new HoodieKeyException("Unable to find field names for record key 
or partition path in cfg");
+}
+return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List getRecordKeyFieldNames() {
+// For nested columns, pick top level column name
+return getRecordKeyFields().stream().map(k -> {
+  int idx = k.indexOf('.');
+  return idx > 0 ? k.substring(0, idx) : k;
+}).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
+// parse simple fields
+getRecordKeyFields().stream()
+.filter(f -> !(f.contains(".")))
+.forEach(f -> recordKeyPositions.put(f, 
Collections.singletonList((Integer) (structType.getFieldIndex(f).get();
+// parse nested fields
+getRecordKeyFields().stream()
+.filter(f -> f.contains("."))
+.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+// parse simple fields
+if (getPartitionPathFields() != null) {
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  .forEach(f -> partitionPathPositions.put(f,
+  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get();
+  // parse nested fields
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
+  .forEach(f -> partitionPathPositions.put(f,
+  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+}
+this.structName = structName;
+this.structType = structType;
+this.recordNamespace = recordNamespace;
+  }
+
+  /**
+   * Fetch record key from {@link Row}.
+   * @param row instance of {@link Row} from which record key is requested.
+   * @return the record key of interest from {@link Row}.
+   */
+  @Override
+  public String getRecordKey(Row row) {
+if (null != converterFn) {

Review comment:
   I guess this could be a bug. I am just now realizing we don't have tests for 
this; we have tests for all built-in key generators, but not for this. Will get 
it done by tonight. sorry 

[GitHub] [hudi] rubenssoto commented on issue #1902: [SUPPORT] Hudi dont put the same day in the same file

2020-08-11 Thread GitBox


rubenssoto commented on issue #1902:
URL: https://github.com/apache/hudi/issues/1902#issuecomment-672306115


   Thank you so much for your help, it worked.
   
   Last question: Hudi organized the data into files very well, but created some 
small files. Is there any way to solve this?
   
   (screenshot) https://user-images.githubusercontent.com/36298331/89953414-6bd09380-dc05-11ea-8fe9-8167ac6a91b9.png
   
   {
  "hoodie.datasource.write.recordkey.field":"created_date_brt,id",
  "hoodie.datasource.write.table.name":"order",
  "hoodie.datasource.write.operation":"bulk_insert",
  "hoodie.datasource.write.partitionpath.field":"partitionpath",
  "hoodie.datasource.write.hive_style_partitioning":"true",
  "hoodie.combine.before.insert":"true",
  "hoodie.combine.before.upsert":"false",
  "hoodie.datasource.write.precombine.field":"LineCreatedTimestamp",
  "hoodie.datasource.write.keygenerator.class":"org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.parquet.small.file.limit":943718400,
  "hoodie.parquet.max.file.size":1073741824,
  "hoodie.parquet.block.size":1073741824,
  "hoodie.copyonwrite.record.size.estimate":512,
  "hoodie.cleaner.commits.retained":5,
  "hoodie.datasource.hive_sync.enable":"true",
  "hoodie.datasource.hive_sync.database":"datalake_raw",
  "hoodie.datasource.hive_sync.table":"order",
  "hoodie.datasource.hive_sync.partition_fields":"partitionpath",
  "hoodie.datasource.hive_sync.partition_extractor_class":"org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.jdbcurl":"jdbc:hive2://ip-10-0-82-196.us-west-2.compute.internal:1000",
  "hoodie.insert.shuffle.parallelism":1500,
  "hoodie.bulkinsert.shuffle.parallelism":8,
  "hoodie.upsert.shuffle.parallelism":1500
   }



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468879180



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
##
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.model;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.CalendarInterval;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Internal Row implementation for Hoodie Row. It wraps an {@link InternalRow} 
and keeps meta columns locally. But the {@link InternalRow}
+ * does include the meta columns as well just that {@link HoodieInternalRow} 
will intercept queries for meta columns and serve from its
+ * copy rather than fetching from {@link InternalRow}.
+ */
+public class HoodieInternalRow extends InternalRow {
+
+  private String commitTime;
+  private String commitSeqNumber;
+  private String recordKey;
+  private String partitionPath;
+  private String fileName;
+  private InternalRow row;
+
+  public HoodieInternalRow(String commitTime, String commitSeqNumber, String 
recordKey, String partitionPath,
+  String fileName, InternalRow row) {
+this.commitTime = commitTime;
+this.commitSeqNumber = commitSeqNumber;
+this.recordKey = recordKey;
+this.partitionPath = partitionPath;
+this.fileName = fileName;
+this.row = row;
+  }
+
+  @Override
+  public int numFields() {
+return row.numFields();
+  }
+
+  @Override
+  public void setNullAt(int i) {
+if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
+  switch (i) {
+case 0: {
+  this.commitTime = null;
+  break;
+}
+case 1: {
+  this.commitSeqNumber = null;
+  break;
+}
+case 2: {
+  this.recordKey = null;
+  break;
+}
+case 3: {
+  this.partitionPath = null;
+  break;
+}
+case 4: {
+  this.fileName = null;
+  break;
+}
+default: throw new IllegalArgumentException("Not expected");
+  }
+} else {
+  row.setNullAt(i);

Review comment:
   Even I had the same doubt when I started reviewing this at first; that's 
why I added some javadocs for this class. The row will have the meta columns as well, 
it is just that the meta columns will not be fetched from the row but from the instance 
variables in this class. @bvaradar did some analysis before arriving at this.
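   To illustrate that interception pattern, a minimal sketch (not the actual HoodieInternalRow; it assumes the five meta columns sit at ordinals 0-4) where meta-column reads are served from local copies and everything else is delegated to the wrapped row:
   ```java
   import org.apache.spark.sql.catalyst.InternalRow;
   import org.apache.spark.unsafe.types.UTF8String;

   public class MetaColumnInterceptSketch {
     private static final int NUM_META_COLUMNS = 5;
     private final String[] metaValues;   // commitTime, commitSeqNumber, recordKey, partitionPath, fileName
     private final InternalRow row;       // wrapped row, which also reserves slots for the meta columns

     public MetaColumnInterceptSketch(String[] metaValues, InternalRow row) {
       this.metaValues = metaValues;
       this.row = row;
     }

     public UTF8String getUTF8String(int ordinal) {
       if (ordinal < NUM_META_COLUMNS) {
         return UTF8String.fromString(metaValues[ordinal]);  // served from the local copy
       }
       return row.getUTF8String(ordinal);                    // delegated to the wrapped row
     }
   }
   ```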





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468885750



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured 
for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1 converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+  throw new HoodieKeyException("Unable to find field names for record key 
or partition path in cfg");
+}
+return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List getRecordKeyFieldNames() {
+// For nested columns, pick top level column name
+return getRecordKeyFields().stream().map(k -> {
+  int idx = k.indexOf('.');
+  return idx > 0 ? k.substring(0, idx) : k;
+}).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
+// parse simple fields
+getRecordKeyFields().stream()
+.filter(f -> !(f.contains(".")))
+.forEach(f -> recordKeyPositions.put(f, 
Collections.singletonList((Integer) (structType.getFieldIndex(f).get();
+// parse nested fields
+getRecordKeyFields().stream()
+.filter(f -> f.contains("."))
+.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+// parse simple fields
+if (getPartitionPathFields() != null) {
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  .forEach(f -> partitionPathPositions.put(f,
+  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get();
+  // parse nested fields
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
+  .forEach(f -> partitionPathPositions.put(f,
+  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+}
+this.structName = structName;
+this.structType = structType;
+this.recordNamespace = recordNamespace;
+  }
+
+  /**
+   * Fetch record key from {@link Row}.
+   * @param row instance of {@link Row} from which record key is requested.
+   * @return the record key of interest from {@link Row}.
+   */
+  @Override
+  public String getRecordKey(Row row) {
+if (null != converterFn) {

Review comment:
   hmmm, not sure on this. I will reconcile w/ Balaji on this. 





This is an automated message from the Apache Git 

[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468885195



##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieRowParquetWriteSupport.java
##
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io.storage;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.bloom.HoodieDynamicBoundedBloomFilter;
+import org.apache.parquet.hadoop.api.WriteSupport;
+import org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.HashMap;
+
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY;
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_BLOOM_FILTER_TYPE_CODE;
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_MAX_RECORD_KEY_FOOTER;
+import static 
org.apache.hudi.avro.HoodieAvroWriteSupport.HOODIE_MIN_RECORD_KEY_FOOTER;
+
+/**
+ * Hoodie Write Support for directly writing Row to Parquet.
+ */
+public class HoodieRowParquetWriteSupport extends ParquetWriteSupport {
+
+  private Configuration hadoopConf;
+  private BloomFilter bloomFilter;
+  private String minRecordKey;
+  private String maxRecordKey;
+
+  public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter) {
+super();
+Configuration hadoopConf = new Configuration(conf);
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", "false");

Review comment:
   Nope, we need to fix this. The built-in ParquetWriteSupport expects 
these two params to be set. I will double-check once again to confirm this. 
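   For reference, a hedged sketch of setting those params on the Hadoop conf handed to the write support; the first key appears in the diff above, while the second (outputTimestampType) is an assumption based on Spark's ParquetWriteSupport and should be verified against the Spark version in use:
   ```java
   import org.apache.hadoop.conf.Configuration;

   public class ParquetWriteSupportConfSketch {
     public static Configuration withWriteSupportDefaults(Configuration conf) {
       Configuration hadoopConf = new Configuration(conf);
       // Shown in the diff above.
       hadoopConf.set("spark.sql.parquet.writeLegacyFormat", "false");
       // Assumption: Spark's ParquetWriteSupport also reads the output timestamp type;
       // verify the exact key and value against the Spark version in use.
       hadoopConf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS");
       return hadoopConf;
     }
   }
   ```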





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468884350



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/TimestampBasedKeyGenerator.java
##
@@ -177,4 +191,26 @@ private long convertLongTimeToMillis(Long partitionVal) {
 }
 return MILLISECONDS.convert(partitionVal, timeUnit);
   }
+
+  @Override
+  public String getRecordKey(Row row) {
+return RowKeyGeneratorHelper.getRecordKeyFromRow(row, 
getRecordKeyFields(), getRecordKeyPositions(), false);
+  }
+
+  @Override
+  public String getPartitionPath(Row row) {
+Object fieldVal = null;
+Object partitionPathFieldVal =  
RowKeyGeneratorHelper.getNestedFieldVal(row, 
getPartitionPathPositions().get(getPartitionPathFields().get(0)));

Review comment:
   Yes, we have a special case for the partition path; I couldn't find a better 
way to do it. The position lookup will return -1, and when parsing the actual Row 
we will return DEFAULT_PARTITION_PATH. 
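   A minimal sketch of the guard being described; the DEFAULT_PARTITION_PATH value and the helper itself are illustrative, not the actual RowKeyGeneratorHelper:
   ```java
   import java.util.List;

   public class PartitionPathGuardSketch {
     // Assumption: "default" mirrors Hudi's default partition path constant; illustrative only.
     private static final String DEFAULT_PARTITION_PATH = "default";

     // Returns the resolved partition value, or the default when the field could not be
     // located (signalled here by a -1 position, as described in the comment above).
     public static Object partitionValueOrDefault(List<Integer> positions, Object resolvedValue) {
       if (positions == null || positions.contains(-1) || resolvedValue == null) {
         return DEFAULT_PARTITION_PATH;
       }
       return resolvedValue;
     }
   }
   ```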





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468883660



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java
##
@@ -55,21 +51,22 @@ public SimpleKeyGenerator(TypedProperties props, String 
partitionPathField) {
 
   @Override
   public String getRecordKey(GenericRecord record) {
-return KeyGenUtils.getRecordKey(record, recordKeyField);
+return KeyGenUtils.getRecordKey(record, getRecordKeyFields().get(0));

Review comment:
   We wanted to have the same experience as getKey(): we don't throw an exception in 
the constructor if the record key is not found; we throw only when getKey(GenericRecord 
record) is called. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468882653



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured 
for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1 converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+  throw new HoodieKeyException("Unable to find field names for record key 
or partition path in cfg");
+}
+return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List getRecordKeyFieldNames() {
+// For nested columns, pick top level column name
+return getRecordKeyFields().stream().map(k -> {
+  int idx = k.indexOf('.');
+  return idx > 0 ? k.substring(0, idx) : k;
+}).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
+// parse simple fields
+getRecordKeyFields().stream()
+.filter(f -> !(f.contains(".")))
+.forEach(f -> recordKeyPositions.put(f, 
Collections.singletonList((Integer) (structType.getFieldIndex(f).get();
+// parse nested fields
+getRecordKeyFields().stream()
+.filter(f -> f.contains("."))
+.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+// parse simple fields
+if (getPartitionPathFields() != null) {
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  .forEach(f -> partitionPathPositions.put(f,
+  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get();
+  // parse nested fields
+  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
+  .forEach(f -> partitionPathPositions.put(f,
+  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+}
+this.structName = structName;
+this.structType = structType;
+this.recordNamespace = recordNamespace;
+  }
+
+  /**
+   * Fetch record key from {@link Row}.
+   * @param row instance of {@link Row} from which record key is requested.
+   * @return the record key of interest from {@link Row}.
+   */
+  @Override
+  public String getRecordKey(Row row) {
+if (null != converterFn) {

Review comment:
   When I was doing the rebase, I saw that getRecordKeyFieldNames in 
KeyGenerator was throwing UnsupportedOperationException. Hence I went with the 
same for these methods too. So didn't 

[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468881817



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();

Review comment:
   yes, you could do that. I vaguely remember running into some issues and 
then I went with positions. Don't remember exactly. Might have to code it up to 
check. 
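A minimal sketch of the two options being weighed here, using Spark's public Row/StructType API; the class and method names below are illustrative only and not Hudi code:

```java
// Hypothetical comparison of the two approaches discussed above: resolving a
// record-key field from a Spark Row by name on every call, versus resolving the
// position once from the StructType and using positional access on the hot path.
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

public class FieldLookupSketch {

  // Looks the field up by name for each record; simple, but repeats the
  // name-to-index resolution on every row.
  public static Object byName(Row row, String field) {
    return row.getAs(field);
  }

  // Resolves the index once up front (e.g. in an init step)...
  public static int resolveOnce(StructType structType, String field) {
    return structType.fieldIndex(field);
  }

  // ...and then uses positional access for every row.
  public static Object byPosition(Row row, int position) {
    return row.get(position);
  }
}
```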









[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468879180



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java
##
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.model;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.spark.sql.catalyst.InternalRow;
+import org.apache.spark.sql.catalyst.util.ArrayData;
+import org.apache.spark.sql.catalyst.util.MapData;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.unsafe.types.CalendarInterval;
+import org.apache.spark.unsafe.types.UTF8String;
+
+/**
+ * Internal Row implementation for Hoodie Row. It wraps an {@link InternalRow} and keeps the meta columns locally. The wrapped
+ * {@link InternalRow} does include the meta columns as well; {@link HoodieInternalRow} simply intercepts queries for the meta
+ * columns and serves them from its local copy rather than fetching them from the wrapped {@link InternalRow}.
+ */
+public class HoodieInternalRow extends InternalRow {
+
+  private String commitTime;
+  private String commitSeqNumber;
+  private String recordKey;
+  private String partitionPath;
+  private String fileName;
+  private InternalRow row;
+
+  public HoodieInternalRow(String commitTime, String commitSeqNumber, String 
recordKey, String partitionPath,
+  String fileName, InternalRow row) {
+this.commitTime = commitTime;
+this.commitSeqNumber = commitSeqNumber;
+this.recordKey = recordKey;
+this.partitionPath = partitionPath;
+this.fileName = fileName;
+this.row = row;
+  }
+
+  @Override
+  public int numFields() {
+return row.numFields();
+  }
+
+  @Override
+  public void setNullAt(int i) {
+if (i < HoodieRecord.HOODIE_META_COLUMNS.size()) {
+  switch (i) {
+case 0: {
+  this.commitTime = null;
+  break;
+}
+case 1: {
+  this.commitSeqNumber = null;
+  break;
+}
+case 2: {
+  this.recordKey = null;
+  break;
+}
+case 3: {
+  this.partitionPath = null;
+  break;
+}
+case 4: {
+  this.fileName = null;
+  break;
+}
+default: throw new IllegalArgumentException("Not expected");
+  }
+} else {
+  row.setNullAt(i);

Review comment:
   Even I had the same doubt when I started reviewing this. That's why I added 
some java docs for this class. The row will have the meta columns as well; it is 
just that the meta columns will not be fetched from the row but from memory. 
@bvaradar did some analysis before arriving at this.
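A minimal sketch of the interception pattern described above, assuming five meta columns served from locally kept strings; this is illustrative only and not the actual HoodieInternalRow implementation:

```java
// Illustrative sketch: ordinals below the assumed meta-column count are answered
// from the local copy, everything else is delegated to the wrapped InternalRow.
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.unsafe.types.UTF8String;

class MetaColumnInterceptingRow {
  private static final int META_COLUMN_COUNT = 5; // assumed number of Hudi meta columns

  private final String[] metaColumns;   // commit time, seq no, record key, partition path, file name
  private final InternalRow wrapped;    // full row, which also contains the meta columns

  MetaColumnInterceptingRow(String[] metaColumns, InternalRow wrapped) {
    this.metaColumns = metaColumns;
    this.wrapped = wrapped;
  }

  // Meta columns are served from the local copy; all other ordinals are delegated.
  UTF8String getUTF8String(int ordinal) {
    if (ordinal < META_COLUMN_COUNT) {
      return UTF8String.fromString(metaColumns[ordinal]);
    }
    return wrapped.getUTF8String(ordinal);
  }
}
```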









[GitHub] [hudi] vinothchandar commented on pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


vinothchandar commented on pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#issuecomment-672288068


   in some sense, due to the bundling changes, this feels very last minute to 
validate more. We have to rely on 1-2 rounds of RC testing to weed things out.







[GitHub] [hudi] vinothchandar commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


vinothchandar commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468873425



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -159,5 +172,24 @@
   avro
   compile
 
+
+

Review comment:
   ah. did not realize that we live in this world now. yeah makes sense. is 
hbase 2.x not happening? anyways, we need to shade guava. 









[jira] [Updated] (HUDI-1179) Add Row tests to all key generator test classes

2020-08-11 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1179:
-
Fix Version/s: 0.6.1

> Add Row tests to all key generator test classes
> ---
>
> Key: HUDI-1179
> URL: https://issues.apache.org/jira/browse/HUDI-1179
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.1
>
>






[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468868307



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1<Object, Object> converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+    if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+    return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List<String> getRecordKeyFieldNames() {
+    // For nested columns, pick top level column name
+    return getRecordKeyFields().stream().map(k -> {
+      int idx = k.indexOf('.');
+      return idx > 0 ? k.substring(0, idx) : k;
+    }).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {

Review comment:
   responded elsewhere. we could move this to getRecordKey(Row) and 
getPartitionPath(Row) if need be. 









[jira] [Created] (HUDI-1179) Add Row tests to all key generator test classes

2020-08-11 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1179:
-

 Summary: Add Row tests to all key generator test classes
 Key: HUDI-1179
 URL: https://issues.apache.org/jira/browse/HUDI-1179
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: sivabalan narayanan








[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468867821



##
File path: 
hudi-spark/src/test/scala/org/apache/hudi/TestDataSourceDefaults.scala
##
@@ -34,13 +36,28 @@ import org.scalatest.Assertions.fail
 class TestDataSourceDefaults {
 
   val schema = SchemaTestUtil.getComplexEvolvedSchema
+  val structType = AvroConversionUtils.convertAvroSchemaToStructType(schema)

Review comment:
   https://issues.apache.org/jira/browse/HUDI-1179









[jira] [Assigned] (HUDI-1179) Add Row tests to all key generator test classes

2020-08-11 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-1179:
-

Assignee: sivabalan narayanan

> Add Row tests to all key generator test classes
> ---
>
> Key: HUDI-1179
> URL: https://issues.apache.org/jira/browse/HUDI-1179
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>






[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468866921



##
File path: hudi-client/src/main/java/org/apache/hudi/keygen/KeyGenerator.java
##
@@ -51,4 +53,32 @@ protected KeyGenerator(TypedProperties config) {
 throw new UnsupportedOperationException("Bootstrap not supported for key 
generator. "
 + "Please override this method in your custom key generator.");
   }
+
+  /**
+   * Initializes {@link KeyGenerator} for {@link Row} based operations.
+   * @param structType struct type of the dataset.
+   * @param structName struct name of the dataset.
+   * @param recordNamespace record namespace of the dataset.
+   */
+  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {

Review comment:
   Yes, we could do that. Since HoodieDatasetBulkInsertHelper is the only class 
that calls into getRecordKey(Row) and getPartitionPath(Row), it should have 
access to the structType and other args.
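A rough sketch of what folding that initialization into the Row-based accessor could look like, assuming the field position can be resolved lazily from the Row's own schema; the class and method names here are hypothetical, not taken from the PR:

```java
// Hypothetical lazy-initialization variant of the Row-based key lookup discussed
// above: the StructType-derived state is built on first use rather than through a
// separate initializeRowKeyGenerator(...) call.
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

abstract class LazyRowKeyGeneratorSketch {
  private volatile StructType cachedStructType;
  private volatile int recordKeyPosition = -1;

  protected abstract String recordKeyField();

  String getRecordKey(Row row) {
    // Resolve and cache the field position the first time a Row is seen.
    if (cachedStructType == null || recordKeyPosition < 0) {
      cachedStructType = row.schema();
      recordKeyPosition = cachedStructType.fieldIndex(recordKeyField());
    }
    return String.valueOf(row.get(recordKeyPosition));
  }
}
```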









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468866012



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava
+  commons-lang:commons-lang

Review comment:
   I am able to shade as well as relocate `commons-lang` and `protobuf`









[GitHub] [hudi] bvaradar commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


bvaradar commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468857201



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -670,7 +670,9 @@ public Builder withPath(String basePath) {
 }
 
 public Builder withSchema(String schemaStr) {
-  props.setProperty(AVRO_SCHEMA, schemaStr);
+  if (null != schemaStr) {

Review comment:
   For Bulk Insert V2, we are passing null in createHoodieConfig(...). Maybe 
we can change createHoodieConfig() to not call withSchema() for bulk insert 
V2.
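A small sketch of the alternative being suggested, moving the null check to the call site instead of inside the builder; the helper method below is hypothetical, while the builder calls mirror the ones in the hunk above:

```java
// Illustrative only: skip withSchema() at the call site when no schema is
// available (e.g. the bulk insert V2 path), rather than null-checking inside
// the builder itself.
import org.apache.hudi.config.HoodieWriteConfig;

final class CreateConfigSketch {
  private CreateConfigSketch() {}

  static HoodieWriteConfig.Builder baseBuilder(String basePath, String schemaStr) {
    HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder().withPath(basePath);
    if (schemaStr != null) {
      // Only set the schema when the caller actually has one.
      builder = builder.withSchema(schemaStr);
    }
    return builder;
  }
}
```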









[GitHub] [hudi] nsivabalan edited a comment on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-11 Thread GitBox


nsivabalan edited a comment on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-672262343


   @bschell @vinothchandar : I gave it a shot on this. I don't have permission 
to push to your branch to update this PR. 
   Diff1: adding support to spark 3. but does not upgrade spark version in 
pom.xml : https://github.com/apache/hudi/pull/1950
   Diff2: also upgrades spark version to 3.0.0 : 
https://github.com/apache/hudi/pull/1951
   Diff2 results in [compilation 
failure](https://github.com/apache/hudi/pull/1951#issuecomment-672261179) as 
one of the classes that Hoodie uses(SparkHadoopUtil) is not accessible anymore.







[GitHub] [hudi] nsivabalan edited a comment on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-11 Thread GitBox


nsivabalan edited a comment on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-672262343


   @bschell : I gave it a shot on this. I don't have permission to push to your 
branch to update this PR. 
   Diff1: adding support to spark 3. but does not upgrade spark version in 
pom.xml : https://github.com/apache/hudi/pull/1950
   Diff2: also upgrades spark version to 3.0.0 : 
https://github.com/apache/hudi/pull/1951
   Diff2 results in [compilation 
failure](https://github.com/apache/hudi/pull/1951#issuecomment-672261179) as 
one of the classes that Hoodie uses(SparkHadoopUtil) is not accessible anymore.







[GitHub] [hudi] nsivabalan edited a comment on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-11 Thread GitBox


nsivabalan edited a comment on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-672262343


   @bschell : I gave it a shot on this. I don't have permission to push to your 
branch to update this PR. 
   Diff1: adding support to spark 3. but does not upgrade spark version in 
pom.xml : https://github.com/apache/hudi/pull/1950
   Diff2: also upgrades spark version to 3.0.0 : 
https://github.com/apache/hudi/pull/1951
   Diff2 results in [compilation 
failure](https://github.com/apache/hudi/pull/1951#issuecomment-672261179) as one of 
the classes that Hoodie uses (SparkHadoopUtil) is not accessible anymore.







[GitHub] [hudi] nsivabalan commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-11 Thread GitBox


nsivabalan commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-672262343


   @bschell : I gave it a shot on this. I don't have permission to push to your 
branch to update this PR. 
   Diff1: adding support to spark 3. but does not upgrade spark version in 
pom.xml : https://github.com/apache/hudi/pull/1950
   Diff2: also upgrades spark version to 3.0.0 : 
https://github.com/apache/hudi/pull/1951
   Diff2 results in compilation failure as one of the classes that Hoodie 
uses (SparkHadoopUtil) is not accessible anymore.







[GitHub] [hudi] nsivabalan edited a comment on pull request #1951: [WIP HUDI 1040 Part2] Upgrading to spark 3.0.0

2020-08-11 Thread GitBox


nsivabalan edited a comment on pull request #1951:
URL: https://github.com/apache/hudi/pull/1951#issuecomment-672261179


   Compilation error:
   ```
   [ERROR] 
/Users/sivabala/Documents/personal/projects/siva_hudi/hudi_aug2020/hudi/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:41:
 error: object SparkHadoopUtil in package deploy cannot be accessed in package 
org.apache.spark.deploy
   [ERROR]   val globPaths = SparkHadoopUtil.get.globPathIfNecessary(fs, 
qualified)
   [ERROR]   ^
   [ERROR] 
/Users/sivabala/Documents/personal/projects/siva_hudi/hudi_aug2020/hudi/hudi-spark/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala:118:
 error: not found: value SparkHadoopUtil
   [ERROR] SparkHadoopUtil.get.addCredentials(jobConf)
   [ERROR] ^
   [WARNING] three warnings found
   [ERROR] two errors found
   ```
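As the compiler output says, SparkHadoopUtil is not accessible outside org.apache.spark.deploy in Spark 3. One possible workaround, offered here purely as an assumption and not as what the PR ends up doing, is to expand globs with the plain Hadoop FileSystem API instead:

```java
// Sketch of a glob-expansion helper that avoids SparkHadoopUtil.globPathIfNecessary
// by using Hadoop's FileSystem.globStatus directly. Semantics are approximate.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class GlobPathsSketch {
  private GlobPathsSketch() {}

  static List<Path> globPathIfNecessary(FileSystem fs, Path qualified) throws IOException {
    FileStatus[] statuses = fs.globStatus(qualified);
    if (statuses == null) {
      // globStatus may return null for a non-existent, non-glob path; fall back to the input.
      return Arrays.asList(qualified);
    }
    List<Path> paths = new ArrayList<>();
    for (FileStatus status : statuses) {
      paths.add(status.getPath());
    }
    return paths;
  }
}
```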







[GitHub] [hudi] nsivabalan commented on pull request #1951: [WIP HUDI 1040 Part2] Upgrading to spark 3.0.0

2020-08-11 Thread GitBox


nsivabalan commented on pull request #1951:
URL: https://github.com/apache/hudi/pull/1951#issuecomment-672261179


   Compilation error:
   ```
   [INFO] BUILD FAILURE
   [INFO] 

   [INFO] Total time:  36.866 s
   [INFO] Finished at: 2020-08-11T16:22:01-04:00
   [INFO] 

   [ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.3.1:compile (scala-compile-first) on 
project hudi-spark_2.12: wrap: org.apache.commons.exec.ExecuteException: 
Process exited with an error: 1 (Exit value: 1) -> [Help 1]
   [ERROR] 
   [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
   [ERROR] Re-run Maven using the -X switch to enable full debug logging.
   [ERROR] 
   [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
   [ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
   [ERROR] 
   [ERROR] After correcting the problems, you can resume the build with the 
command
   [ERROR]   mvn  -rf :hudi-spark_2.12
   ```







[GitHub] [hudi] nsivabalan opened a new pull request #1951: [WIP HUDI 1040 Part2] Upgrading to spark 3.0.0

2020-08-11 Thread GitBox


nsivabalan opened a new pull request #1951:
URL: https://github.com/apache/hudi/pull/1951


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] nsivabalan opened a new pull request #1950: [WIP HUDI 1040] Supporting Spark 3

2020-08-11 Thread GitBox


nsivabalan opened a new pull request #1950:
URL: https://github.com/apache/hudi/pull/1950


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-11 Thread Brandon Scheller (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175818#comment-17175818
 ] 

Brandon Scheller commented on HUDI-1146:


Fixed by https://github.com/apache/hudi/pull/1921

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue — happens with both COW or MOR - Restarting the 
> DeltaStreamer Process crashes, that is, 2nd Run does nothing.
> Steps:
>  Run Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> due to step above
>  2nd run crashes with the below error ( it does not crash if we delete the 
> output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense: if "transformed" is 
> empty, then it is likely that "dataAndCheckpoint" will have a null schema provider





[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468819625



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava

Review comment:
   Stack trace why we need guava shading:
   ```
   java.lang.NoSuchMethodError: 
com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;
at 
org.apache.hudi.org.apache.hadoop.hbase.io.hfile.LruBlockCache.toString(LruBlockCache.java:704)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at 
org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.toString(CacheConfig.java:502)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at 
org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:260)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:181)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.access$400(HFileBootstrapIndex.java:76)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.partitionIndexReader(HFileBootstrapIndex.java:272)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.fetchBootstrapIndexInfo(HFileBootstrapIndex.java:262)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.initIndexInfo(HFileBootstrapIndex.java:252)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.<init>(HFileBootstrapIndex.java:243)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:191)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$addFilesToView$2(AbstractTableFileSystemView.java:128)
at java.util.HashMap.forEach(HashMap.java:1289)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:125)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:131)
at 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.filterFileStatusForSnapshotMode(HoodieInputFormatUtils.java:376)
at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:119)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:370)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:263)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:95)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:192)
at 
com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
at 
com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
at 
com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
at 
com.facebook.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   ```
   
   This happens because `hbase 1.2.3` needs `guava 12.0.1` whereas `presto` is 
on `guava 26.0`.









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468818565



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -159,5 +172,24 @@
   avro
   compile
 
+
+

Review comment:
   Here is the stack trace why we need to shade guava:
   ```
   java.lang.NoSuchMethodError: 
com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;
at 
org.apache.hudi.org.apache.hadoop.hbase.io.hfile.LruBlockCache.toString(LruBlockCache.java:704)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at 
org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.toString(CacheConfig.java:502)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at 
org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:260)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:181)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.access$400(HFileBootstrapIndex.java:76)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.partitionIndexReader(HFileBootstrapIndex.java:272)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.fetchBootstrapIndexInfo(HFileBootstrapIndex.java:262)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.initIndexInfo(HFileBootstrapIndex.java:252)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.<init>(HFileBootstrapIndex.java:243)
at 
org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:191)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$addFilesToView$2(AbstractTableFileSystemView.java:128)
at java.util.HashMap.forEach(HashMap.java:1289)
at 
org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:125)
at 
org.apache.hudi.common.table.view.HoodieTableFileSystemView.<init>(HoodieTableFileSystemView.java:131)
at 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.filterFileStatusForSnapshotMode(HoodieInputFormatUtils.java:376)
at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:119)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:370)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:263)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:95)
at 
com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:192)
at 
com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
at 
com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
at 
com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
at 
com.facebook.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   ```
   
   This happens because `hbase 1.2.3` needs `guava 12.0.1` whereas `presto` is 
on `guava 26.0`.









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468799779



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava
+  commons-lang:commons-lang

Review comment:
   Let me quickly try shading them again, and remember what issues I was 
running into.









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468799403



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava
+  commons-lang:commons-lang

Review comment:
   I think shading protobuf causes other issues. Also, these dependencies 
were not part of the presto runtime, which is why I did not feel the need to shade 
them.









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468797799



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -159,5 +172,24 @@
   avro
   compile
 
+
+

Review comment:
   I remember it's due to a conflict between the version `hbase` needs vs what 
`presto` has. Most of the issues are coming in because of trying to package 
`hbase` and make it work through presto.









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468797076



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava

Review comment:
   These dependencies are fixes for runtime issues. `guava` is there in the 
class-path but there is a runtime mismatch. The main problem is coming in due 
to `hbase` dependencies, and they conflict with the `presto` versions. Let me 
try to provide the stack traces if that helps.
   
   About the licensing:
   - https://mvnrepository.com/artifact/com.yammer.metrics/metrics-core
   - https://mvnrepository.com/artifact/com.google.guava/guava
   - https://mvnrepository.com/artifact/commons-lang/commons-lang
   - https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java => 
this seems to be **BSD License**. So is this going to be an issue ?









[GitHub] [hudi] vinothchandar commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


vinothchandar commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468745957



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  protected List<String> recordKeyFields;
+  protected List<String> partitionPathFields;
+
+  private Map<String, List<Integer>> recordKeyPositions = new HashMap<>();
+  private Map<String, List<Integer>> partitionPathPositions = new HashMap<>();
+
+  private transient Function1<Object, Object> converterFn = null;
+  protected StructType structType;
+  private String structName;
+  private String recordNamespace;
+
+  protected BuiltinKeyGenerator(TypedProperties config) {
+super(config);
+  }
+
+  /**
+   * Generate a record Key out of provided generic record.
+   */
+  public abstract String getRecordKey(GenericRecord record);
+
+  /**
+   * Generate a partition path out of provided generic record.
+   */
+  public abstract String getPartitionPath(GenericRecord record);
+
+  /**
+   * Generate a Hoodie Key out of provided generic record.
+   */
+  public final HoodieKey getKey(GenericRecord record) {
+    if (getRecordKeyFields() == null || getPartitionPathFields() == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+    return new HoodieKey(getRecordKey(record), getPartitionPath(record));
+  }
+
+  @Override
+  public final List<String> getRecordKeyFieldNames() {
+    // For nested columns, pick top level column name
+    return getRecordKeyFields().stream().map(k -> {
+      int idx = k.indexOf('.');
+      return idx > 0 ? k.substring(0, idx) : k;
+    }).collect(Collectors.toList());
+  }
+
+  @Override
+  public void initializeRowKeyGenerator(StructType structType, String structName, String recordNamespace) {
+    // parse simple fields
+    getRecordKeyFields().stream()
+        .filter(f -> !(f.contains(".")))
+        .forEach(f -> recordKeyPositions.put(f, Collections.singletonList((Integer) (structType.getFieldIndex(f).get()))));
+    // parse nested fields
+    getRecordKeyFields().stream()
+        .filter(f -> f.contains("."))
+        .forEach(f -> recordKeyPositions.put(f, RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+    // parse simple fields
+    if (getPartitionPathFields() != null) {
+      getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> !(f.contains(".")))
+          .forEach(f -> partitionPathPositions.put(f,
+              Collections.singletonList((Integer) (structType.getFieldIndex(f).get()))));
+      // parse nested fields
+      getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> f.contains("."))
+          .forEach(f -> partitionPathPositions.put(f,
+              RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, false)));
+    }
+    this.structName = structName;
+    this.structType = structType;
+    this.recordNamespace = recordNamespace;
+  }
+
+  /**
+   * Fetch record key from {@link Row}.
+   * @param row instance of {@link Row} from which record key is requested.
+   * @return the record key of interest from {@link Row}.
+   */
+  @Override
+  public String getRecordKey(Row row) {
+if (null != converterFn) {

Review comment:
   as far as I can tell, this is private and set to null by default and not 
assigned anywhere else. so we will never pass `if (null != ..)` check. I think 
this should be if 

[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468783171



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
##
@@ -38,6 +40,9 @@
   public static final String HOODIE_CONSUME_MODE_PATTERN = 
"hoodie.%s.consume.mode";
   public static final String HOODIE_START_COMMIT_PATTERN = 
"hoodie.%s.consume.start.timestamp";
   public static final String HOODIE_MAX_COMMIT_PATTERN = 
"hoodie.%s.consume.max.commits";
+  public static final Set<String> VIRTUAL_COLUMN_NAMES = CollectionUtils.createImmutableSet(

Review comment:
   This is a concept of Hive metadata, and not specific to parquet.
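For anyone unfamiliar with the term: Hive exposes synthetic virtual columns alongside a table's real columns, and a set like VIRTUAL_COLUMN_NAMES lets readers filter them out of projections. A tiny sketch; the column names below are common Hive virtual columns given as examples and are not claimed to match the exact contents of the Hudi set:

```java
// Illustrative filtering of Hive virtual columns from a requested-column list.
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class VirtualColumnFilterSketch {
  private static final Set<String> VIRTUAL_COLUMNS = new HashSet<>(Arrays.asList(
      "INPUT__FILE__NAME", "BLOCK__OFFSET__INSIDE__FILE", "ROW__OFFSET__INSIDE__BLOCK"));

  // Keep only the real table columns when building the projection pushed to the reader.
  static List<String> realColumns(List<String> requestedColumns) {
    return requestedColumns.stream()
        .filter(c -> !VIRTUAL_COLUMNS.contains(c))
        .collect(Collectors.toList());
  }
}
```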









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468782242



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
##
@@ -63,6 +62,7 @@
  * that does not correspond to a hoodie table then they are passed in as is 
(as what FileInputFormat.listStatus()
  * would do). The JobConf could have paths from multipe Hoodie/Non-Hoodie 
tables
  */
+@UseRecordReaderFromInputFormat

Review comment:
   Yes this is the main change required to integrate with presto. Rest are 
just fixes for the various runtime issues I ran into trying to get this working.









[GitHub] [hudi] umehrot2 commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


umehrot2 commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468782242



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
##
@@ -63,6 +62,7 @@
  * that does not correspond to a hoodie table then they are passed in as is 
(as what FileInputFormat.listStatus()
  * would do). The JobConf could have paths from multipe Hoodie/Non-Hoodie 
tables
  */
+@UseRecordReaderFromInputFormat

Review comment:
   Yes, the main change required to integrate with presto is this. The rest are 
just fixes for the various runtime issues I ran into trying to get this working.









[GitHub] [hudi] bhasudha opened a new pull request #1949: [MINOR] Fix release script for onetime uploading of gpgkeys

2020-08-11 Thread GitBox


bhasudha opened a new pull request #1949:
URL: https://github.com/apache/hudi/pull/1949


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   This fixes the `preparation_before_release.sh` to take in apache account 
creds for  svn checkout and commit operations. 
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] bvaradar commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


bvaradar commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468723660



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava

Review comment:
   Unless, there is a strong reason, we should avoid bundling it.









[GitHub] [hudi] bschell commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-11 Thread GitBox


bschell commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-672067962


   @vinothchandar While this works, the reflection does hurt performance as 
this is a frequently used path. I was looking into better options to 
work around the performance hit.







[GitHub] [hudi] bvaradar commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


bvaradar commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468702061



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava

Review comment:
   @umehrot2  Are these jars not provided by Presto runtime itself ? 
Version mismatch with guava had been a regular issue earlier but I see that you 
have shaded them. 
   
   Please also check the licensing of the new dependencies. Are they Apache ?
   

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
##
@@ -38,6 +40,9 @@
   public static final String HOODIE_CONSUME_MODE_PATTERN = 
"hoodie.%s.consume.mode";
   public static final String HOODIE_START_COMMIT_PATTERN = 
"hoodie.%s.consume.start.timestamp";
   public static final String HOODIE_MAX_COMMIT_PATTERN = 
"hoodie.%s.consume.max.commits";
+  public static final Set VIRTUAL_COLUMN_NAMES = 
CollectionUtils.createImmutableSet(

Review comment:
   Is this specific to Parquet or Hive in general? If this is specific to 
Parquet, can you move it to ParquetUtils?

##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -76,7 +76,12 @@
   org.apache.hbase:hbase-common
   org.apache.hbase:hbase-protocol
   org.apache.hbase:hbase-server
+  org.apache.hbase:hbase-annotations
   org.apache.htrace:htrace-core
+  com.yammer.metrics:metrics-core
+  com.google.guava:guava
+  commons-lang:commons-lang

Review comment:
   Can you also shade commons-lang and protobuf ?

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
##
@@ -63,6 +62,7 @@
  * that does not correspond to a hoodie table then they are passed in as is 
(as what FileInputFormat.listStatus()
  * would do). The JobConf could have paths from multipe Hoodie/Non-Hoodie 
tables
  */
+@UseRecordReaderFromInputFormat

Review comment:
   Is this the main change ?









[GitHub] [hudi] vinothchandar commented on pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-11 Thread GitBox


vinothchandar commented on pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#issuecomment-672053354


   @bschell is this tested and ready to go? would like to get it into the RC if 
possible
   







[GitHub] [hudi] vinothchandar commented on a change in pull request #1944: [HUDI-1174] Changes for bootstrapped tables to work with presto

2020-08-11 Thread GitBox


vinothchandar commented on a change in pull request #1944:
URL: https://github.com/apache/hudi/pull/1944#discussion_r468693275



##
File path: packaging/hudi-presto-bundle/pom.xml
##
@@ -159,5 +172,24 @@
   avro
   compile
 
+
+

Review comment:
   why are we bundling guava?  we can just use presto's guava classes right?









[GitHub] [hudi] tooptoop4 opened a new issue #1948: [SUPPORT] DMS example complains about dfs-source.properties

2020-08-11 Thread GitBox


tooptoop4 opened a new issue #1948:
URL: https://github.com/apache/hudi/issues/1948


   /home/ec2-user/spark_home/bin/spark-submit --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars 
"/home/ec2-user/spark-avro_2.11-2.4.6.jar" --master spark://redact:7077 
--deploy-mode client /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar 
--table-type COPY_ON_WRITE --source-ordering-field dms_timestamp --source-class 
org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path 
s3a://redact/my/finaltbl --target-table mytestdms --transformer-class 
org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class 
org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf 
hoodie.datasource.write.recordkey.field=id --hoodie-conf 
hoodie.datasource.write.partitionpath.field=id --hoodie-conf 
hoodie.deltastreamer.source.dfs.root=s3a://redact/my/dms/test
   
   ```
   2020-08-11 15:11:43,418 [main] ERROR org.apache.hudi.common.util.DFSPropertiesConfiguration - Error reading in properies from dfs
   java.io.FileNotFoundException: File file:/home/ec2-user/src/test/resources/delta-streamer-config/dfs-source.properties does not exist
   at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
   at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
   at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
   at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
   at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:787)
   at org.apache.hudi.common.util.DFSPropertiesConfiguration.visitFile(DFSPropertiesConfiguration.java:87)
   at org.apache.hudi.common.util.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:60)
   at org.apache.hudi.common.util.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:64)
   at org.apache.hudi.utilities.UtilHelpers.readConfig(UtilHelpers.java:118)
   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:451)
   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:97)
   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:91)
   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:380)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```
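   
   For anyone hitting the same error: the trace shows DeltaStreamer falling back to a default properties path that does not exist on this machine; pointing the job at an explicit, existing properties file (the --props option in recent Hudi versions, worth double-checking against yours) avoids the fallback. Below is a small standalone sanity check of whatever path you intend to use, written against the plain Hadoop FileSystem API; the path shown is a hypothetical placeholder, not a real Hudi default.
   
   ```java
   import java.net.URI;
   
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   // Standalone check that a DeltaStreamer properties file actually exists before
   // launching spark-submit. The default URI below is a placeholder for illustration.
   public class CheckDeltaStreamerProps {
     public static void main(String[] args) throws Exception {
       String propsUri = args.length > 0 ? args[0]
           : "file:///home/ec2-user/config/dfs-source.properties"; // hypothetical location
       Configuration conf = new Configuration();
       FileSystem fs = FileSystem.get(URI.create(propsUri), conf);
       Path propsPath = new Path(propsUri);
       if (fs.exists(propsPath)) {
         System.out.println("Found properties file: " + fs.getFileStatus(propsPath).getPath());
       } else {
         System.out.println("Missing properties file: " + propsPath
             + " -- create it or point the job at an existing one.");
       }
     }
   }
   ```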
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468656221



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -105,6 +104,22 @@ private[hudi] object HoodieSparkSqlWriter {
 } else {
   // Handle various save modes
   handleSaveModes(mode, basePath, tableConfig, tblName, operation, fs)
+  // Create the table if not present
+  if (!tableExists) {
+val tableMetaClient = 
HoodieTableMetaClient.initTableType(sparkContext.hadoopConfiguration, path.get,
+  HoodieTableType.valueOf(tableType), tblName, "archived", 
parameters(PAYLOAD_CLASS_OPT_KEY),
+  null.asInstanceOf[String])
+tableConfig = tableMetaClient.getTableConfig
+  }
+
+  // short-circuit if bulk_insert via row is enabled.
+  // scalastyle:off
+  if (operation.equalsIgnoreCase(BULK_INSERT_DATASET_OPERATION_OPT_VAL)) {
+val (success, commitTime: common.util.Option[String]) = 
bulkInsertAsRow(sqlContext, parameters, df, tblName,
+   
 basePath, path, instantTime)
+return (success, commitTime, common.util.Option.of(""), 
hoodieWriteClient.orNull, tableConfig)

Review comment:
   nit: can you make the 3rd arg Option.empty? When I put up the PR, I got compilation issues and hence returned an empty string. I tested Option.empty locally with the latest change and compilation now succeeds.
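   
   To make the nit concrete, a tiny sketch of the difference being discussed, using the of(...) and empty() factory methods of Hudi's Option type that already appear in this thread; the surrounding method names are made up for illustration.
   
   ```java
   import org.apache.hudi.common.util.Option;
   
   // Illustrates the nit above: returning Option.empty() instead of Option.of("")
   // when there is no meaningful value to hand back to the caller.
   public class OptionReturnSketch {
   
     // Option.of("") is "present", so callers must special-case the empty string.
     static Option<String> noValueAsEmptyString() {
       return Option.of("");
     }
   
     // Option.empty() lets callers rely on isPresent() as the single check.
     static Option<String> noValueAsEmpty() {
       return Option.empty();
     }
   
     public static void main(String[] args) {
       System.out.println(noValueAsEmptyString().isPresent()); // true, despite carrying no value
       System.out.println(noValueAsEmpty().isPresent());       // false, as intended
     }
   }
   ```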





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468656221



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -105,6 +104,22 @@ private[hudi] object HoodieSparkSqlWriter {
 } else {
   // Handle various save modes
   handleSaveModes(mode, basePath, tableConfig, tblName, operation, fs)
+  // Create the table if not present
+  if (!tableExists) {
+val tableMetaClient = 
HoodieTableMetaClient.initTableType(sparkContext.hadoopConfiguration, path.get,
+  HoodieTableType.valueOf(tableType), tblName, "archived", 
parameters(PAYLOAD_CLASS_OPT_KEY),
+  null.asInstanceOf[String])
+tableConfig = tableMetaClient.getTableConfig
+  }
+
+  // short-circuit if bulk_insert via row is enabled.
+  // scalastyle:off
+  if (operation.equalsIgnoreCase(BULK_INSERT_DATASET_OPERATION_OPT_VAL)) {
+val (success, commitTime: common.util.Option[String]) = 
bulkInsertAsRow(sqlContext, parameters, df, tblName,
+   
 basePath, path, instantTime)
+return (success, commitTime, common.util.Option.of(""), 
hoodieWriteClient.orNull, tableConfig)

Review comment:
   nit: can you make the 3rd arg Option.empty? When I put up the PR, I got compilation issues and hence returned empty. I tested Option.empty locally with the latest change and compilation now succeeds.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes

2020-08-11 Thread GitBox


wangxianghu commented on pull request #1901:
URL: https://github.com/apache/hudi/pull/1901#issuecomment-671970849


   @yanghua this PR is ready for review now :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1900: [HUDI-531]Add java doc for hudi test suite general classes

2020-08-11 Thread GitBox


wangxianghu commented on pull request #1900:
URL: https://github.com/apache/hudi/pull/1900#issuecomment-671971132


   @yanghua this PR is ready for review now :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar closed pull request #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-08-11 Thread GitBox


vinothchandar closed pull request #1512:
URL: https://github.com/apache/hudi/pull/1512


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


vinothchandar commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468565839



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.AvroConversionHelper;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Function1;
+
+/**
+ * Base class for all the built-in key generators. Contains methods structured for
+ * code reuse amongst them.
+ */
+public abstract class BuiltinKeyGenerator extends KeyGenerator {
+
+  private List<String> recordKeyFields;

Review comment:
   You mean having all the variables here? Why did we need that change? Not sure if the simple and complex key generators should share these, though.
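   
   As a reference point for the simple-vs-complex question, a stripped-down sketch of the pattern under discussion: a base class that owns the configured record-key fields, with a single-field and a multi-field variant differing only in how the key string is assembled. The class and method names here are illustrative, not the actual Hudi classes, and a plain Map stands in for the Avro/Row record.
   
   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   // Base generator owns the field list; subclasses only decide how to render the key.
   abstract class SketchKeyGenerator {
     protected final List<String> recordKeyFields;
   
     protected SketchKeyGenerator(List<String> recordKeyFields) {
       this.recordKeyFields = recordKeyFields;
     }
   
     abstract String getRecordKey(Map<String, Object> record);
   }
   
   // "Simple": exactly one record-key field, rendered as its bare value.
   class SimpleSketchKeyGenerator extends SketchKeyGenerator {
     SimpleSketchKeyGenerator(String field) {
       super(Arrays.asList(field));
     }
   
     @Override
     String getRecordKey(Map<String, Object> record) {
       return String.valueOf(record.get(recordKeyFields.get(0)));
     }
   }
   
   // "Complex": several fields, rendered as field:value pairs, similar to what shows
   // up in _hoodie_record_key for composite keys.
   class ComplexSketchKeyGenerator extends SketchKeyGenerator {
     ComplexSketchKeyGenerator(List<String> fields) {
       super(fields);
     }
   
     @Override
     String getRecordKey(Map<String, Object> record) {
       return recordKeyFields.stream()
           .map(f -> f + ":" + record.get(f))
           .collect(Collectors.joining(","));
     }
   
     public static void main(String[] args) {
       Map<String, Object> rec = new HashMap<>();
       rec.put("OBJ_ID", 42);
       rec.put("region", "eu");
       System.out.println(new ComplexSketchKeyGenerator(Arrays.asList("OBJ_ID", "region")).getRecordKey(rec));
       // -> OBJ_ID:42,region:eu
     }
   }
   ```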





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1178) Test Flakiness in CI

2020-08-11 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-1178:
-

Assignee: Balaji Varadarajan

> Test Flakiness in CI
> 
>
> Key: HUDI-1178
> URL: https://issues.apache.org/jira/browse/HUDI-1178
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Code Cleanup
>Affects Versions: 0.6.1
>Reporter: sivabalan narayanan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This particular test fails intermittently in CI. 
>  
> [ERROR] Failures: 
> [ERROR] 
> ITTestHoodieSanity.testRunHoodieJavaAppOnSinglePartitionKeyCOWTable:50->testRunHoodieJavaApp:158->ITTestBase.executeCommandStringInDocker:212->ITTestBase.executeCommandInDocker:191
>  Command ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, 
> --hive-sync, --table-path, 
> hdfs://namenode/docker_hoodie_single_partition_key_cow_test, --hive-url, 
> jdbc:hive2://hiveserver:1, --table-type, COPY_ON_WRITE, --hive-table, 
> docker_hoodie_single_partition_key_cow_test]) expected to succeed. Exit (255) 
> ==> expected: <0> but was: <255>
>  
> job link: [https://travis-ci.org/github/nsivabalan/hudi/jobs/716545898]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1178) Test Flakiness in CI (ITTestHoodieSanity.testRunHoodieJavaAppOnSinglePartitionKeyCOWTable)

2020-08-11 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1178:
--
Summary: Test Flakiness in CI 
(ITTestHoodieSanity.testRunHoodieJavaAppOnSinglePartitionKeyCOWTable)  (was: 
Test Flakiness in CI)

> Test Flakiness in CI 
> (ITTestHoodieSanity.testRunHoodieJavaAppOnSinglePartitionKeyCOWTable)
> --
>
> Key: HUDI-1178
> URL: https://issues.apache.org/jira/browse/HUDI-1178
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Code Cleanup
>Affects Versions: 0.6.1
>Reporter: sivabalan narayanan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This particular test fails intermittently in CI. 
>  
> [ERROR] Failures: 
> [ERROR] 
> ITTestHoodieSanity.testRunHoodieJavaAppOnSinglePartitionKeyCOWTable:50->testRunHoodieJavaApp:158->ITTestBase.executeCommandStringInDocker:212->ITTestBase.executeCommandInDocker:191
>  Command ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, 
> --hive-sync, --table-path, 
> hdfs://namenode/docker_hoodie_single_partition_key_cow_test, --hive-url, 
> jdbc:hive2://hiveserver:1, --table-type, COPY_ON_WRITE, --hive-table, 
> docker_hoodie_single_partition_key_cow_test]) expected to succeed. Exit (255) 
> ==> expected: <0> but was: <255>
>  
> job link: [https://travis-ci.org/github/nsivabalan/hudi/jobs/716545898]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1178) Test Flakiness in CI

2020-08-11 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1178:
-

 Summary: Test Flakiness in CI
 Key: HUDI-1178
 URL: https://issues.apache.org/jira/browse/HUDI-1178
 Project: Apache Hudi
  Issue Type: Bug
  Components: Code Cleanup
Affects Versions: 0.6.1
Reporter: sivabalan narayanan


This particular test fails intermittently in CI. 

 

[ERROR] Failures: 
[ERROR] 
ITTestHoodieSanity.testRunHoodieJavaAppOnSinglePartitionKeyCOWTable:50->testRunHoodieJavaApp:158->ITTestBase.executeCommandStringInDocker:212->ITTestBase.executeCommandInDocker:191
 Command ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
--table-path, hdfs://namenode/docker_hoodie_single_partition_key_cow_test, 
--hive-url, jdbc:hive2://hiveserver:1, --table-type, COPY_ON_WRITE, 
--hive-table, docker_hoodie_single_partition_key_cow_test]) expected to 
succeed. Exit (255) ==> expected: <0> but was: <255>

 

job link: [https://travis-ci.org/github/nsivabalan/hudi/jobs/716545898]

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan edited a comment on pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan edited a comment on pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#issuecomment-671892573


   https://github.com/apache/hudi/pull/1834#discussion_r461939866
   : because this is for Row, whereas the existing WriteStats is for HoodieRecords. Guess we should have templatized this too.
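   
   A minimal sketch of the "templatize it" thought above: one stats holder generic over the record type, so a Row-based writer and a HoodieRecord-based writer could share the same bookkeeping instead of maintaining two parallel classes. The class and method names are assumptions for illustration, not Hudi's actual write-stats types.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // One write-stats container parameterized by the record type it tracks.
   class WriteStatsSketch<R> {
     private final List<R> failedRecords = new ArrayList<>();
     private long totalRecords = 0;
   
     void markWritten(R record) {
       totalRecords++;
     }
   
     void markFailed(R record) {
       failedRecords.add(record);
     }
   
     long getTotalRecords() {
       return totalRecords;
     }
   
     List<R> getFailedRecords() {
       return failedRecords;
     }
   }
   ```
   
   With that shape, a WriteStatsSketch<Row> and a WriteStatsSketch<HoodieRecord> would be the same class, which is roughly the duplication the comment is pointing at.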



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#issuecomment-671892573


   https://github.com/apache/hudi/pull/1834#discussion_r461939866
   : because this is for Row, whereas the existing WriteStats is for HoodieRecords.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-11 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r468512749



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
##
@@ -54,12 +51,17 @@ public String getPartitionPath(GenericRecord record) {
   }
 
   @Override
-  public List<String> getRecordKeyFields() {
-return recordKeyFields;
+  public List<String> getPartitionPathFields() {
+return new ArrayList<>();
   }
 
   @Override
-  public List<String> getPartitionPathFields() {
-return new ArrayList<>();
+  public String getRecordKeyFromRow(Row row) {
+return RowKeyGeneratorHelper.getRecordKeyFromRow(row, 
getRecordKeyFields(), getRecordKeyPositions(), true);
+  }
+
+  @Override
+  public String getPartitionPathFromRow(Row row) {

Review comment:
   Yes, makes sense. We will revisit after the 0.6.0 release.
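   
   For readers skimming the hunk: the interesting property of the global-delete generator is that keys come from the configured record-key fields only, while the partition path is deliberately left empty, so a delete keyed this way does not depend on which partition the record currently sits in. A compact sketch of that behavior, with a plain Map standing in for the Row and all names invented for illustration (the real code goes through RowKeyGeneratorHelper):
   
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   // Sketch of global-delete key behavior: composite record key, empty partition path.
   class GlobalDeleteKeySketch {
     private final List<String> recordKeyFields;
   
     GlobalDeleteKeySketch(List<String> recordKeyFields) {
       this.recordKeyFields = recordKeyFields;
     }
   
     String getRecordKey(Map<String, Object> row) {
       return recordKeyFields.stream()
           .map(f -> f + ":" + row.get(f))
           .collect(Collectors.joining(","));
     }
   
     String getPartitionPath(Map<String, Object> row) {
       return ""; // partition is intentionally ignored for global deletes
     }
   }
   ```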





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1177) fix key generator bug

2020-08-11 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-1177:

Affects Version/s: 0.6.0

> fix key generator bug 
> --
>
> Key: HUDI-1177
> URL: https://issues.apache.org/jira/browse/HUDI-1177
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1177) fix key generator bug

2020-08-11 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-1177:

Status: Open  (was: New)

> fix key generator bug 
> --
>
> Key: HUDI-1177
> URL: https://issues.apache.org/jira/browse/HUDI-1177
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1177) fix key generator bug

2020-08-11 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-1177:

Status: In Progress  (was: Open)

> fix key generator bug 
> --
>
> Key: HUDI-1177
> URL: https://issues.apache.org/jira/browse/HUDI-1177
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

