[GitHub] [hudi] shenbinglife opened a new issue #2857: [SUPPORT] How to compile package hudi ?

2021-04-20 Thread GitBox


shenbinglife opened a new issue #2857:
URL: https://github.com/apache/hudi/issues/2857


   How do I compile and package Hudi? 
   
   mvn package -DskipTests -Dskip.tests=true
   
   [INFO] Scanning for projects...
   [INFO] 
   [INFO] ------------------------< org.apache.hudi:hudi >------------------------
   [INFO] Building Hudi 0.7.0
   [INFO] --------------------------------[ pom ]---------------------------------
   [INFO] 
   [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven-version) @ 
hudi ---
   [INFO] 
   [INFO] --- maven-remote-resources-plugin:1.5:process 
(process-resource-bundles) @ hudi ---
   [INFO] 
   [INFO] --- maven-checkstyle-plugin:3.0.0:check (default) @ hudi ---
   [INFO] Starting audit...
   Audit done.
   [INFO] 
   [INFO] --- maven-site-plugin:3.7.1:attach-descriptor (attach-descriptor) @ 
hudi ---
   [INFO] No site descriptor found: nothing to attach.
   [INFO] 
   [INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ hudi ---
   [INFO] 

   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time:  4.984 s
   [INFO] Finished at: 2021-04-21T14:41:35+08:00
   [INFO] ------------------------------------------------------------------------

   
   Process finished with exit code 0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1818) Validate and check the option 'write.precombine.field' for Flink writer

2021-04-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

谢波 reassigned HUDI-1818:


Assignee: 谢波

> Validate and check the option 'write.precombine.field' for Flink writer
> ---
>
> Key: HUDI-1818
> URL: https://issues.apache.org/jira/browse/HUDI-1818
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: 谢波
>Priority: Major
> Fix For: 0.9.0
>
>
> Validate that the option 'write.precombine.field' exists in the table schema when 
> creating the table source; if it does not exist, tell the user to configure this 
> option with the right field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1818) Validate and check the option 'write.precombine.field' for Flink writer

2021-04-20 Thread Danny Chen (Jira)
Danny Chen created HUDI-1818:


 Summary: Validate and check the option 'write.precombine.field' 
for Flink writer
 Key: HUDI-1818
 URL: https://issues.apache.org/jira/browse/HUDI-1818
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Flink Integration
Reporter: Danny Chen
 Fix For: 0.9.0


Validate that the option 'write.precombine.field' exists in the table schema when 
creating the table source; if it does not exist, tell the user to configure this 
option with the right field.
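
As an illustration of the requested behavior, here is a minimal sketch of such a check (plain Scala; the helper name and the way the schema field names are obtained are hypothetical placeholders, not the actual Hudi Flink code):

```scala
// Sketch only: reject a precombine field that is missing from the table schema.
// `fieldNames` and `precombineField` are hypothetical inputs.
def validatePrecombineField(fieldNames: Seq[String], precombineField: String): Unit = {
  if (!fieldNames.contains(precombineField)) {
    throw new IllegalArgumentException(
      s"Option 'write.precombine.field' is set to '$precombineField', but the table schema " +
        s"only contains [${fieldNames.mkString(", ")}]. Please configure it with an existing field.")
  }
}

validatePrecombineField(Seq("uuid", "name", "ts"), "ts")   // passes
// validatePrecombineField(Seq("uuid", "name"), "ts")      // would fail with a clear message
```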



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1415) Read Hoodie Table As Spark DataSource Table

2021-04-20 Thread pengzhiwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengzhiwei updated HUDI-1415:
-
Status: Open  (was: New)

> Read Hoodie Table As Spark DataSource Table 
> 
>
> Key: HUDI-1415
> URL: https://issues.apache.org/jira/browse/HUDI-1415
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.9.0
>
>
>  Currently Hudi can sync the metadata to the Hive metastore using HiveSyncTool. 
> The table description synced to Hive looks like this:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_insert0`(
>   `_hoodie_commit_time` string, 
>   `_hoodie_commit_seqno` string, 
>   `_hoodie_record_key` string, 
>   `_hoodie_partition_path` string, 
>   `_hoodie_file_name` string, 
>   `id` int, 
>   `name` string, 
>   `price` double,
>   `version` int, 
>   `dt` string)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/hudi/tbl_price_insert0'
> TBLPROPERTIES (
>   'last_commit_time_sync'='20201124105009', 
>   'transient_lastDdlTime'='1606186222')
> {code}
> When we query this table using Spark SQL, Spark treats it as a Hive table, not a 
> Spark data source table, and converts it to a parquet LogicalRelation in 
> HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi table 
> just like a plain parquet data source. This leads to incorrect query results if 
> the user has not set spark.sql.hive.convertMetastoreParquet=false.
> In order to query a Hudi table as a data source table in Spark, more table 
> properties and serde properties must be added to the Hive metastore, like the 
> following:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_cow0`(
>   `_hoodie_commit_time` string, 
>   `_hoodie_commit_seqno` string, 
>   `_hoodie_record_key` string, 
>   `_hoodie_partition_path` string, 
>   `_hoodie_file_name` string, 
>   `id` int, 
>   `name` string, 
>   `price` double,
>   `version` int)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> WITH SERDEPROPERTIES ( 
>   'path'='/tmp/hudi/tbl_price_cow0') 
> STORED AS INPUTFORMAT 
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/hudi/tbl_price_cow0'
> TBLPROPERTIES (
>   'last_commit_time_sync'='20201124120532', 
>   'spark.sql.sources.provider'='hudi', 
>   'spark.sql.sources.schema.numParts'='1', 
>   
> 'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
>  
>   'transient_lastDdlTime'='1606190729')
> {code}
> These are the missing table properties:
> {code:java}
> spark.sql.sources.provider= 'hudi'
> spark.sql.sources.schema.numParts = 'xx'
> spark.sql.sources.schema.part.{num} ='xx'
> spark.sql.sources.schema.numPartCols = 'xx'
> spark.sql.sources.schema.partCol.{num} = 'xx'{code}
> and serde property:
> {code:java}
> 'path'='/path/to/hudi'
> {code}
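
For context, a minimal Spark sketch of the workaround mentioned above (disabling spark.sql.hive.convertMetastoreParquet when the data source table properties are absent); the session setup and the table name are assumptions based on the example DDL, not part of the original report:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-hudi-via-hive-metastore")
  .enableHiveSupport()
  // Until the table carries the spark.sql.sources.* properties, this flag keeps Spark
  // from swapping HoodieParquetInputFormat for its native parquet reader.
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .getOrCreate()

spark.sql("select * from tbl_price_insert0").show()
```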



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1817) when query incr view of hudi table by using spark-sql. the result is wrong

2021-04-20 Thread tao meng (Jira)
tao meng created HUDI-1817:
--

 Summary: when query incr view of hudi table by using spark-sql. 
the result is wrong
 Key: HUDI-1817
 URL: https://issues.apache.org/jira/browse/HUDI-1817
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.8.0
 Environment: spark2.4.5   hive 3.1.1   hadoop 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


create hudi table (mor or cow)

 

val base_data = spark.read.parquet("/tmp/tb_base")
val upsert_data = spark.read.parquet("/tmp/tb_upsert")

base_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "bulk_insert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Overwrite)
  .save(s"/tmp/testdb/tb_test_mor_par")

upsert_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Append)
  .save(s"/tmp/testdb/tb_test_mor_par")

query incr view by sparksql:

set hoodie.tb_test_mor_par.consume.mode=INCREMENTAL;
set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
set hoodie.tb_test_mor_par.consume.max.commits=3;
select _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 
from testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' 
order by primary_key;

+-------------------+-----------+----+----+------------+----+
|_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
+-------------------+-----------+----+----+------------+----+
|20210420155738     |20         |77  |sC  |158788760400|739 |
|20210420155738     |21         |66  |ps  |160979049700|61  |
|20210420155738     |22         |47  |1P  |158460042900|835 |
|20210420155738     |23         |36  |5K  |160763480800|538 |
|20210420155738     |24         |1   |BA  |160685711300|775 |
|20210420155738     |24         |101 |BA  |160685711300|775 |
|20210420155738     |24         |100 |BA  |160685711300|775 |
|20210420155738     |24         |102 |BA  |160685711300|775 |
+-------------------+-----------+----+----+------------+----+

 

the primary_key is repeated.
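
As a side note, a quick way to surface such duplicates from the same session (a sketch reusing the table and filter from the query above; `spark` is the active SparkSession):

```scala
// Any primary_key returned more than once by the incremental view indicates
// the duplication shown in the result above.
spark.sql(
  """select primary_key, count(*) as cnt
    |from testdb.tb_test_mor_par_rt
    |where _hoodie_commit_time > '20210420145330'
    |group by primary_key
    |having count(*) > 1""".stripMargin).show()
```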



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1816) when query incr view of hudi table by using spark-sql, the query result is wrong

2021-04-20 Thread tao meng (Jira)
tao meng created HUDI-1816:
--

 Summary: when query incr view of hudi table by using spark-sql, 
the query result is wrong
 Key: HUDI-1816
 URL: https://issues.apache.org/jira/browse/HUDI-1816
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Affects Versions: 0.8.0
 Environment: spark2.4.5   hive 3.1.1   hadoop 3.1.1
Reporter: tao meng
 Fix For: 0.9.0


test step1:

create a partitioned hudi table (mor / cow)

val base_data = spark.read.parquet("/tmp/tb_base")
val upsert_data = spark.read.parquet("/tmp/tb_upsert")

base_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "bulk_insert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Overwrite)
  .save(s"/tmp/testdb/tb_test_mor_par")

upsert_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Append)
  .save(s"/tmp/testdb/tb_test_mor_par")

query incr view by sparksql:
set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
set hoodie.tb_test_mor_par.consume.max.commits=3;
select _hoodie_commit_time,primary_key,col0,col1,col2,col3,col4,col5,col6,col7 
from testdb.tb_test_mor_par_rt where _hoodie_commit_time > '20210420145330' 
order by primary_key;

+-------------------+-----------+----+----+------------+----+
|_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
+-------------------+-----------+----+----+------------+----+
|20210420155738     |20         |77  |sC  |158788760400|739 |
|20210420155738     |21         |66  |ps  |160979049700|61  |
|20210420155738     |22         |47  |1P  |158460042900|835 |
|20210420155738     |23         |36  |5K  |160763480800|538 |
|20210420155738     |24         |1   |BA  |160685711300|775 |
|20210420155738     |24         |101 |BA  |160685711300|775 |
|20210420155738     |24         |100 |BA  |160685711300|775 |
|20210420155738     |24         |102 |BA  |160685711300|775 |
+-------------------+-----------+----+----+------------+----+

 

primary key 24 is repeated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on issue #2830: [SUPPORT]same _hoodie_record_key has duplicates data

2021-04-20 Thread GitBox


nsivabalan commented on issue #2830:
URL: https://github.com/apache/hudi/issues/2830#issuecomment-823752462


   Oh, I see you are using GLOBAL_BLOOM as your index. Can you tell us which 
version of Hudi you are using, along with other environment details? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on issue #2852: [SUPPORT] Read Hudi Table from Hive - Hive Sync clarification

2021-04-20 Thread GitBox


nsivabalan edited a comment on issue #2852:
URL: https://github.com/apache/hudi/issues/2852#issuecomment-823751580


   Guess the documentation you have linked actually talks about the usage. 
   ```This will ensure the input format classes with its dependencies are 
available for query planning & execution.```
   
   @bvaradar @n3nash can add more info if required. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2852: [SUPPORT] Read Hudi Table from Hive - Hive Sync clarification

2021-04-20 Thread GitBox


nsivabalan commented on issue #2852:
URL: https://github.com/apache/hudi/issues/2852#issuecomment-823751580


   Guess the documentation you have linked actually talks about the usage. 
   ```This will ensure the input format classes with its dependencies are 
available for query planning & execution.```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2855: [SUPPORT] hudi-utilities documentation

2021-04-20 Thread GitBox


nsivabalan commented on issue #2855:
URL: https://github.com/apache/hudi/issues/2855#issuecomment-823749990


   Yes, HoodieDeltaStreamer, which is in hudi-utilities-bundle, is heavily used by 
many users. 
   https://issues.apache.org/jira/browse/HUDI-1815
   @bvaradar : can you briefly go over what is offered in hudi-utilities for end 
users? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1815) Add readme to each bundle to give a brief intro about each bundle

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1815:
--
Labels: docs sev:normal  (was: )

> Add readme to each bundle to give a brief intro about each bundle
> -
>
> Key: HUDI-1815
> URL: https://issues.apache.org/jira/browse/HUDI-1815
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: docs, sev:normal
>
> Neither hudi-utilities-bundle nor hudi-spark-bundle has a readme explaining its 
> purpose. 
> Add a readme with some details about each bundle. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1815) Add readme to each bundle to give a brief intro about each bundle

2021-04-20 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1815:
-

 Summary: Add readme to each bundle to give a brief intro about 
each bundle
 Key: HUDI-1815
 URL: https://issues.apache.org/jira/browse/HUDI-1815
 Project: Apache Hudi
  Issue Type: Task
  Components: Docs
Reporter: sivabalan narayanan


Neither hudi-utilities-bundle nor hudi-spark-bundle has a readme explaining its 
purpose. 

Add a readme with some details about each bundle. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on issue #2850: [SUPPORT] S3 files skipped by HoodieDeltaStreamer on s3 bucket in continuous mode

2021-04-20 Thread GitBox


nsivabalan commented on issue #2850:
URL: https://github.com/apache/hudi/issues/2850#issuecomment-823748401


   CC @xushiyan @bvaradar @n3nash 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2850: [SUPPORT] S3 files skipped by HoodieDeltaStreamer on s3 bucket in continuous mode

2021-04-20 Thread GitBox


nsivabalan commented on issue #2850:
URL: https://github.com/apache/hudi/issues/2850#issuecomment-823747511


   We know of one bug at the moment with DeltaStreamer where, if multiple files are 
present with the same modification time, DeltaStreamer could skip some of them: 
https://issues.apache.org/jira/browse/HUDI-1723 
https://github.com/apache/hudi/pull/2845 
   Do you think yours falls into this category? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1814) Non partitioned table for Flink writer

2021-04-20 Thread Danny Chen (Jira)
Danny Chen created HUDI-1814:


 Summary: Non partitioned table for Flink writer
 Key: HUDI-1814
 URL: https://issues.apache.org/jira/browse/HUDI-1814
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Docs
Reporter: Danny Chen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-commenter edited a comment on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


codecov-commenter edited a comment on pull request #2853:
URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (1e379c9) into 
[master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (62bb9e1) will **increase** coverage by `17.08%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff               @@
   ##             master     #2853       +/-   ##
   ===============================================
   + Coverage     52.60%    69.68%   +17.08%     
   + Complexity     3709       373     -3336     
   ===============================================
     Files           485        54      -431     
     Lines         23224      1996    -21228     
     Branches       2465       236     -2229     
   ===============================================
   - Hits          12216      1391    -10825     
   + Misses         9929       473     -9456     
   + Partials       1079       132      -947     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.68% <ø> (-0.11%)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==)
 | `62.50% <0.00%> (-2.72%)` | `9.00% <0.00%> (ø%)` | |
   | 
[.../apache/hudi/timeline/service/TimelineService.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvVGltZWxpbmVTZXJ2aWNlLmphdmE=)
 | | | |
   | 
[.../main/scala/org/apache/hudi/HoodieSparkUtils.scala](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrVXRpbHMuc2NhbGE=)
 | | | |
   | 
[...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=)
 | | | |
   | 
[...org/apache/hudi/common/model/TableServiceType.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1RhYmxlU2VydmljZVR5cGUuamF2YQ==)
 | | | |
   | 
[...ava/org/apache/hudi/common/util/DateTimeUtils.java](https://codecov.io/gh/apache/hudi/

[GitHub] [hudi] wk888 commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException

2021-04-20 Thread GitBox


wk888 commented on issue #2834:
URL: https://github.com/apache/hudi/issues/2834#issuecomment-823728826


   @yanghua 
   I can find the .hoodie file in HDFS:
   
   
![image](https://user-images.githubusercontent.com/16316415/115488269-cf2ebd00-a28c-11eb-85ac-73ed631b6f31.png)
   
   But from the error log you can see it looks for the file under 
/tmp/hive/root/1c7ec12e-4953-4913-bf9f-a09372b51609/.hoodie, which seems to be the 
Hive tmp directory, so it can't find the .hoodie file.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


codecov-commenter edited a comment on pull request #2853:
URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (1e379c9) into 
[master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (62bb9e1) will **decrease** coverage by `43.23%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2853       +/-   ##
   =============================================
   - Coverage     52.60%    9.36%   -43.24%     
   + Complexity     3709       48     -3661     
   =============================================
     Files           485       54      -431     
     Lines         23224     1996    -21228     
     Branches       2465      236     -2229     
   =============================================
   - Hits          12216      187    -12029     
   + Misses         9929     1796     -8133     
   + Partials       1079       13     -1066     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.36% <ø> (-60.43%)` | `48.00 <ø> (-325.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS

[GitHub] [hudi] garyli1019 commented on a change in pull request #2847: [HUDI-1769]Add download page to the site

2021-04-20 Thread GitBox


garyli1019 commented on a change in pull request #2847:
URL: https://github.com/apache/hudi/pull/2847#discussion_r617151804



##
File path: docs/_pages/download.cn.md
##
@@ -7,29 +7,29 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
 ## Release 0.8.0
-* Source Release : [Apache Hudi 0.8.0 Source 
Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))
+* Source Release : [Apache Hudi 0.8.0 Source 
Release](https://www.apache.org/dyn/closer.lua/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))

Review comment:
   probably. www.apache.org/dyn/closer.lua was also mentioned in the 
instruction email sent by the owner of annou...@apache.org, so I think this 
should be the right one to put on the site.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] MyLanPangzi closed pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


MyLanPangzi closed pull request #2853:
URL: https://github.com/apache/hudi/pull/2853


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE

2021-04-20 Thread GitBox


xiarixiaoyao commented on pull request #2722:
URL: https://github.com/apache/hudi/pull/2722#issuecomment-823714627


   @lw309637554  @nsivabalan  Thanks for your review. I will try 
testHoodieRealtimeCombineHoodieInputFormat in another PR, since 
it has nothing to do with this problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1744) [Rollback] rollback fail on mor table when the partition path hasn't any files

2021-04-20 Thread lrz (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lrz resolved HUDI-1744.
---
Resolution: Fixed

> [Rollback] rollback fail on mor table when the partition path hasn't any files
> --
>
> Key: HUDI-1744
> URL: https://issues.apache.org/jira/browse/HUDI-1744
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: lrz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When rolling back on a MOR table, if the partition path does not have any files, 
> an exception will be thrown because rdd.flatMap is called with 0 as numPartitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-04-20 Thread GitBox


xiarixiaoyao commented on a change in pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#discussion_r617137325



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
##
@@ -170,7 +170,7 @@ protected HoodieCombineFileInputFormatShim 
createInputFormatShim() {
 if (job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, 
"").isEmpty()) {
  List<String> partitions = new ArrayList<>(part.getPartSpec().keySet());
   if (!partitions.isEmpty()) {
-String partitionStr = String.join(",", partitions);

Review comment:
   @nsivabalan   just see the function initObjectInspector in 
MapOperator.java (my Hive version is 3.1.1):
   
   // Next check if this table has partitions and if so
   // get the list of partition names as well as allocate
   // the serdes for the partition columns
   **line 189**String pcols = 
overlayedProps.getProperty(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS);
   
   **line 191**if (pcols != null && pcols.length() > 0) {
   **line 192**String[] partKeys = pcols.trim().split("/");
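
In other words, a toy illustration of the mismatch under discussion (plain Scala; the partition names are illustrative):

```scala
val partitions = Seq("year", "month")

// Joining with "," while MapOperator splits on "/" collapses everything into one bogus key:
val wrong = partitions.mkString(",").split("/")   // Array("year,month")

// Joining with "/" matches the split above and yields the expected partition keys:
val right = partitions.mkString("/").split("/")   // Array("year", "month")
```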




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException

2021-04-20 Thread GitBox


yanghua commented on issue #2834:
URL: https://github.com/apache/hudi/issues/2834#issuecomment-823705049


   @wk888 OK, I reviewed the code, at `TableNotFoundException.java:53`, the 
path you provided triggered `FileNotFoundException | IllegalArgumentException`. 
Did you make sure the path exists?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


yanghua commented on pull request #2853:
URL: https://github.com/apache/hudi/pull/2853#issuecomment-823702360


   Hi @MyLanPangzi Would you please recheck the Travis? If it was not caused by 
your change, then please retrigger the CI. thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wk888 commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException

2021-04-20 Thread GitBox


wk888 commented on issue #2834:
URL: https://github.com/apache/hudi/issues/2834#issuecomment-823700483


   @yanghua it seems it has no privilege to create the database, not to create the 
table, and the table was created successfully.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2849: [SUPPORT] - org.apache.hudi.exception.HoodieIOException: Could not load Hoodie properties from file:/tmp/hudi_trips_cow/.hoodie/hoodie.properties

2021-04-20 Thread GitBox


nsivabalan commented on issue #2849:
URL: https://github.com/apache/hudi/issues/2849#issuecomment-823641624


   Can you clean up the base path once and retry? 
   rm -rf  
   Sometimes, there could be some residues. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table

2021-04-20 Thread GitBox


nsivabalan commented on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-823619094


   great job on the patch 👍  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table

2021-04-20 Thread GitBox


vinothchandar commented on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-823611327


   This is a great contribution. Thanks @pengzhiwei2018 ! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283)

2021-04-20 Thread uditme
This is an automated email from the ASF dual-hosted git repository.

uditme pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new aacb8be  [HUDI-1415] Read Hoodie Table As Spark DataSource Table 
(#2283)
aacb8be is described below

commit aacb8be5213a64a3cc9ddd791e2321526517d044
Author: pengzhiwei 
AuthorDate: Wed Apr 21 05:21:38 2021 +0800

[HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283)
---
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  4 +
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 99 ++
 .../functional/HoodieSparkSqlWriterSuite.scala | 44 ++
 .../main/java/org/apache/hudi/dla/DLASyncTool.java |  5 +-
 .../java/org/apache/hudi/dla/HoodieDLAClient.java  |  7 +-
 .../java/org/apache/hudi/hive/HiveSyncConfig.java  | 52 +++-
 .../java/org/apache/hudi/hive/HiveSyncTool.java| 12 ++-
 .../org/apache/hudi/hive/HoodieHiveClient.java | 27 +-
 .../org/apache/hudi/hive/util/ConfigUtils.java | 73 
 .../org/apache/hudi/hive/util/HiveSchemaUtil.java  | 26 +-
 .../org/apache/hudi/hive/TestHiveSyncTool.java | 58 -
 .../hudi/sync/common/AbstractSyncHoodieClient.java | 16 +++-
 .../functional/TestHoodieDeltaStreamer.java|  7 ++
 13 files changed, 382 insertions(+), 48 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index 4c76f5f..4643da5 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -353,6 +353,9 @@ object DataSourceWriteOptions {
   val HIVE_IGNORE_EXCEPTIONS_OPT_KEY = 
"hoodie.datasource.hive_sync.ignore_exceptions"
   val HIVE_SKIP_RO_SUFFIX = "hoodie.datasource.hive_sync.skip_ro_suffix"
   val HIVE_SUPPORT_TIMESTAMP = "hoodie.datasource.hive_sync.support_timestamp"
+  val HIVE_TABLE_PROPERTIES = "hoodie.datasource.hive_sync.table_properties"
+  val HIVE_TABLE_SERDE_PROPERTIES = 
"hoodie.datasource.hive_sync.serde_properties"
+  val HIVE_SYNC_AS_DATA_SOURCE_TABLE = 
"hoodie.datasource.hive_sync.sync_as_datasource"
 
   // DEFAULT FOR HIVE SPECIFIC CONFIGS
   val DEFAULT_HIVE_SYNC_ENABLED_OPT_VAL = "false"
@@ -372,6 +375,7 @@ object DataSourceWriteOptions {
   val DEFAULT_HIVE_IGNORE_EXCEPTIONS_OPT_KEY = "false"
   val DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL = "false"
   val DEFAULT_HIVE_SUPPORT_TIMESTAMP = "false"
+  val DEFAULT_HIVE_SYNC_AS_DATA_SOURCE_TABLE = "true"
 
   // Async Compaction - Enabled by default for MOR
   val ASYNC_COMPACT_ENABLE_OPT_KEY = 
"hoodie.datasource.compaction.async.enable"
diff --git 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
index 340ac14..3a5b51e 100644
--- 
a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
+++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
@@ -36,6 +36,7 @@ import org.apache.hudi.common.util.{CommitUtils, 
ReflectionUtils}
 import org.apache.hudi.config.HoodieBootstrapConfig.{BOOTSTRAP_BASE_PATH_PROP, 
BOOTSTRAP_INDEX_CLASS_PROP, DEFAULT_BOOTSTRAP_INDEX_CLASS}
 import org.apache.hudi.config.HoodieWriteConfig
 import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.hive.util.ConfigUtils
 import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool}
 import org.apache.hudi.internal.DataSourceInternalWriterHelper
 import org.apache.hudi.sync.common.AbstractSyncTool
@@ -44,7 +45,10 @@ import org.apache.spark.SPARK_VERSION
 import org.apache.spark.SparkContext
 import org.apache.spark.api.java.JavaSparkContext
 import org.apache.spark.rdd.RDD
-import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
+import org.apache.spark.sql.internal.SQLConf
+import 
org.apache.spark.sql.internal.StaticSQLConf.SCHEMA_STRING_LENGTH_THRESHOLD
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode, SparkSession}
 
 import scala.collection.JavaConversions._
 import scala.collection.mutable.ListBuffer
@@ -220,7 +224,8 @@ private[hudi] object HoodieSparkSqlWriter {
 
   // Check for errors and commit the write.
   val (writeSuccessful, compactionInstant) =
-commitAndPerformPostOperations(writeResult, parameters, writeClient, 
tableConfig, jsc,
+commitAndPerformPostOperations(sqlContext.sparkSession, df.schema,
+  writeResult, parameters, writeClient, tableConfig, jsc,
   TableInstantInfo(basePath, instantTime, commitActionType, operation))
 
   def unpersistRdd(rdd: RD

[GitHub] [hudi] umehrot2 merged pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table

2021-04-20 Thread GitBox


umehrot2 merged pull request #2283:
URL: https://github.com/apache/hudi/pull/2283


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion

2021-04-20 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325964#comment-17325964
 ] 

sivabalan narayanan commented on HUDI-1343:
---

[~liujinhui] [~vbalaji]: Do you folks think this is still required after 
this fix: [https://github.com/apache/hudi/pull/2765]. That fix changes 
AvroConversionUtils.convertStructTypeToAvroSchema() to ensure null is the first 
entry in the union and the default value is set to null if a field is nullable in 
the Spark StructType. 

I mean, we have enabled the post schema processor by default, so I wanted to 
double check whether it's still applicable. 
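
To make the union-ordering point concrete, a small Avro SchemaBuilder sketch of the shape that fix produces (null first in the union, default null); the record and field names just mirror the example schemas quoted below:

```scala
import org.apache.avro.SchemaBuilder

// Nullable long field with null listed first and a null default, which is the
// ordering a nullable Spark StructType field should map to after the fix.
val schema = SchemaBuilder.record("formatted_debezium_payload").fields()
  .name("_ts_ms").`type`().unionOf().nullType().and().longType().endUnion().nullDefault()
  .endRecord()

println(schema.toString(true))
```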

> Add standard schema postprocessor which would rewrite the schema using 
> spark-avro conversion
> 
>
> Key: HUDI-1343
> URL: https://issues.apache.org/jira/browse/HUDI-1343
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When we use Transformer, the final Schema which we use to convert avro record 
> to bytes is auto generated by spark. This could be different (due to the way 
> Avro treats it) from the target schema that is being used to write (as the 
> target schema could be coming from Schema Registry). 
>  
> For example : 
> Schema generated by spark-avro when converting Row to avro
> {
>   "type" : "record",
>   "name" : "hoodie_source",
>   "namespace" : "hoodie.source",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "long", "null" ]
>   }, {
>     "name" : "_op",
>     "type" : "string"
>   }, {
>     "name" : "inc_id",
>     "type" : "int"
>   }, {
>     "name" : "year",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "violation_code",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "flag",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long"
>   } ]
> }
>  
> is not compatible with the Avro Schema:
>  
> {
>   "type" : "record",
>   "name" : "formatted_debezium_payload",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "null", "long" ],
>     "default" : null
>   }, {
>     "name" : "_op",
>     "type" : "string",
>     "default" : null
>   }, {
>     "name" : "inc_id",
>     "type" : "int",
>     "default" : null
>   }, {
>     "name" : "year",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "violation_code",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "flag",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long",
>     "default" : null
>   } ]
> }
>  
> Note that the type order is different for individual fields : 
> "type" : [ "null", "string" ], vs  "type" : [ "string", "null" ]
> Unexpectedly, Avro decoding fails when bytes written with the first schema are 
> read using the second schema.
>  
> One way to fix this is to use the configured target schema when generating record 
> bytes, but this is not easy without breaking the record payload constructor API 
> used by DeltaStreamer. 
> The other option is to apply a post-processor on the target schema to make it 
> consistent with the Transformer-generated records.
>  
> This ticket is to use the latter approach of creating a standard schema 
> post-processor and adding it by default when Transformer is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2021-04-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325963#comment-17325963
 ] 

Vinoth Chandar commented on HUDI-648:
-

I see it linked now. 

I queued the PR up for review. 

> Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction 
> writes
> 
>
> Key: HUDI-648
> URL: https://issues.apache.org/jira/browse/HUDI-648
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available, sev:normal, user-support-issues
> Attachments: image-2021-03-03-11-40-21-083.png
>
>
> We would like a way to hand the erroring records from writing or compaction 
> back to the users, in a separate table or log. This needs to work generically 
> across all the different writer paths.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] satishkotha commented on pull request #2809: [HUDI-1789] Support reading older snapshots

2021-04-20 Thread GitBox


satishkotha commented on pull request #2809:
URL: https://github.com/apache/hudi/pull/2809#issuecomment-823443803


   @jsbali added few comments. Can you also check why CI is failing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2809: [HUDI-1789] Support reading older snapshots

2021-04-20 Thread GitBox


satishkotha commented on a change in pull request #2809:
URL: https://github.com/apache/hudi/pull/2809#discussion_r616870583



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieInputFormatUtils.java
##
@@ -438,11 +437,20 @@ public static HoodieMetadataConfig 
buildMetadataConfig(Configuration conf) {
 if (LOG.isDebugEnabled()) {
   LOG.debug("Hoodie Metadata initialized with completed commit instant 
as :" + metaClient);
 }
-
 HoodieTimeline timeline = 
HoodieHiveUtils.getTableTimeline(metaClient.getTableConfig().getTableName(), 
job, metaClient);
+

Review comment:
   can we combine this into getTableTimeline method? 
HoodieHiveUtils.getTableTimeline is anyway getting all the config. So i think 
that provides better abstraction to get relevant timeline.

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
##
@@ -137,4 +139,16 @@ public static HoodieTimeline getTableTimeline(final String 
tableName, final JobC
 // by default return all completed commits.
 return 
metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
   }
+
+  public static Option<String> getSnapshotMaxCommitTime(JobConf job, String tableName) {
+String maxCommitTime = job.get(getSnapshotMaxCommitKey(tableName));
+if (maxCommitTime != null) {

Review comment:
   consider using !StringUtils.isNullorEmpty() or just simply return 
Option.ofNullable(maxCommitTime)

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
##
@@ -68,6 +69,7 @@
   public static final String HOODIE_STOP_AT_COMPACTION_PATTERN = 
"hoodie.%s.ro.stop.at.compaction";
   public static final String INCREMENTAL_SCAN_MODE = "INCREMENTAL";
   public static final String SNAPSHOT_SCAN_MODE = "SNAPSHOT";
+  public static final String HOODIE_SNAPSHOT_CONSUME_COMMIT_PATTERN = 
"hoodie.%s.consume.snapshot.time";

Review comment:
   Do you think we can reuse existing config? perhaps HOODIE_CONSUME_COMMIT?

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
##
@@ -137,4 +139,16 @@ public static HoodieTimeline getTableTimeline(final String 
tableName, final JobC
 // by default return all completed commits.
 return 
metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
   }
+
+  public static Option<String> getSnapshotMaxCommitTime(JobConf job, String tableName) {

Review comment:
   nit: consider adding javadoc for all public methods (I know we dont 
follow this consistently, but would be great to add for all new code)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2021-04-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325961#comment-17325961
 ] 

Vinoth Chandar commented on HUDI-251:
-

Please also feel free to take over the RFC as well. I can give you perms

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Affects Versions: 0.9.0
>Reporter: Taher Koitawala
>Assignee: Sagar Sumit
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, 
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaStreamer should accept something like jdbc-source.properties where users 
> can define the RDBMS connection properties along with a timestamp column and 
> an interval which allows users to express how frequently HUDI should check 
> with RDBMS data source for new inserts or updates.
> Details are documented in RFC-14
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2021-04-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325958#comment-17325958
 ] 

Vinoth Chandar commented on HUDI-251:
-

On 2, I think we have to enforce some sorting when limiting (if you are pulling 
very incrementally, hopefully it won't be as bad). Given we persist the checkpoint, 
we will pick the maximum value of the `ckpt` column each time, and we should be 
okay. 

>where ckpt > last_ckpt order by ckpt desc limit x

yes. we are on the same page. We have to sort and paginate like this.
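
A rough sketch of that pull-and-checkpoint idea with Spark's JDBC source (the connection details, source table, and `ckpt` column are placeholders from this discussion, not the actual DeltaStreamer JDBC source; `spark` is an active SparkSession):

```scala
import org.apache.spark.sql.functions.max

val lastCkpt = "2021-04-20 00:00:00"   // previously persisted checkpoint

// Pull only rows past the last checkpoint, sorted and bounded as discussed above.
val batch = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/sourcedb")   // placeholder connection
  .option("user", "user")
  .option("password", "password")
  .option("dbtable",
    s"(select * from src_table where ckpt > '$lastCkpt' order by ckpt limit 10000) incr")
  .load()

// The next checkpoint is the maximum ckpt value observed in this batch.
val nextCkpt = batch.agg(max("ckpt")).collect().head.get(0)
```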

 

>Can you please elaborate more on the tailing mechanism?

What I meant was that there could be scenarios where we could still miss data in 
this JDBC-based approach. We should clearly document these.

For example, as we fetch `ckpt > 10`, there could be a long-running transaction that 
just committed an earlier `ckpt=8` value. We would just fetch all records from 10 
and move on. Let's also think through other issues like this. I think it's okay, 
since everybody understands JDBC pulling is more for convenience than anything, and 
it works correctly when you don't run into these cases. Does that make sense?

 

 

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Affects Versions: 0.9.0
>Reporter: Taher Koitawala
>Assignee: Purushotham Pushpavanthar
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, 
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaStreamer should accept something like jdbc-source.properties where users 
> can define the RDBMS connection properties along with a timestamp column and 
> an interval which allows users to express how frequently HUDI should check 
> with the RDBMS data source for new inserts or updates.
> Details are documented in RFC-14
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2021-04-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-251:
---

Assignee: Sagar Sumit  (was: Purushotham Pushpavanthar)

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Affects Versions: 0.9.0
>Reporter: Taher Koitawala
>Assignee: Sagar Sumit
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, 
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaStreamer should accept something like jdbc-source.properties where users 
> can define the RDBMS connection properties along with a timestamp column and 
> an interval which allows users to express how frequently HUDI should check 
> with the RDBMS data source for new inserts or updates.
> Details are documented in RFC-14
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] satishkotha merged pull request #2773: [HUDI-1764] Add Hudi-CLI support for clustering

2021-04-20 Thread GitBox


satishkotha merged pull request #2773:
URL: https://github.com/apache/hudi/pull/2773


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1764] Add Hudi-CLI support for clustering (#2773)

2021-04-20 Thread satish
This is an automated email from the ASF dual-hosted git repository.

satish pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 3253079  [HUDI-1764] Add Hudi-CLI support for clustering (#2773)
3253079 is described below

commit 3253079507d74f6d52e78ad7f88b297daf969455
Author: Jintao Guan 
AuthorDate: Tue Apr 20 09:46:42 2021 -0700

[HUDI-1764] Add Hudi-CLI support for clustering (#2773)

* tmp base

* update

* update unit test

* update

* update

* update CLI parameters

* linting

* update doSchedule in HoodieClusteringJob

* update

* update diff according to comments
---
 .../hudi/cli/commands/ClusteringCommand.java   | 107 +
 .../org/apache/hudi/cli/commands/SparkMain.java|  43 -
 .../apache/hudi/utilities/HoodieClusteringJob.java |   5 +-
 3 files changed, 153 insertions(+), 2 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ClusteringCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ClusteringCommand.java
new file mode 100644
index 000..092f927
--- /dev/null
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ClusteringCommand.java
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.commands.SparkMain.SparkCommand;
+import org.apache.hudi.cli.utils.InputStreamConsumer;
+import org.apache.hudi.cli.utils.SparkUtil;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.launcher.SparkLauncher;
+import org.apache.spark.util.Utils;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+import scala.collection.JavaConverters;
+
+@Component
+public class ClusteringCommand implements CommandMarker {
+
+  private static final Logger LOG = 
LogManager.getLogger(ClusteringCommand.class);
+
+  @CliCommand(value = "clustering schedule", help = "Schedule Clustering")
+  public String scheduleClustering(
+  @CliOption(key = "sparkMemory", help = "Spark executor memory",
+  unspecifiedDefaultValue = "1G") final String sparkMemory,
+  @CliOption(key = "propsFilePath", help = "path to properties file on 
localfs or dfs with configurations for hoodie client for clustering",
+  unspecifiedDefaultValue = "") final String propsFilePath,
+  @CliOption(key = "hoodieConfigs", help = "Any configuration that can be 
set in the properties file can be passed here in the form of an array",
+  unspecifiedDefaultValue = "") final String[] configs) throws 
Exception {
+HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
+boolean initialized = HoodieCLI.initConf();
+HoodieCLI.initFS(initialized);
+
+String sparkPropertiesPath =
+
Utils.getDefaultPropertiesFile(JavaConverters.mapAsScalaMapConverter(System.getenv()).asScala());
+SparkLauncher sparkLauncher = SparkUtil.initLauncher(sparkPropertiesPath);
+
+// First get a clustering instant time and pass it to spark launcher for 
scheduling clustering
+String clusteringInstantTime = HoodieActiveTimeline.createNewInstantTime();
+
+sparkLauncher.addAppArgs(SparkCommand.CLUSTERING_SCHEDULE.toString(), 
client.getBasePath(),
+client.getTableConfig().getTableName(), clusteringInstantTime, 
sparkMemory, propsFilePath);
+UtilHelpers.validateAndAddProperties(configs, sparkLauncher);
+Process process = sparkLauncher.launch();
+InputStreamConsumer.captureOutput(process);
+int exitCode = process.waitFor();
+if (exitCode != 0) {
+  return "Failed to schedule clustering for " + clusteringInstantTime;
+}
+return "Succ

[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-04-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325954#comment-17325954
 ] 

Vinoth Chandar commented on HUDI-1138:
--

[~309637554] Please let me know if you are interested in taking a swing at 
this. 

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even as you can argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for it 
> every month), and we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore if we can improve the current marker file mechanism, which 
> creates one marker file per data file written, by 
> delegating the createMarker() call to the driver/timeline server and having it 
> write marker metadata into a single file handle that is flushed for 
> durability guarantees.
>  
> P.S.: I was tempted to think the Spark listener mechanism could help us deal with 
> failed tasks, but it has no guarantees. The writer job could die without 
> deleting a partial file; i.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1138) Re-implement marker files via timeline server

2021-04-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325952#comment-17325952
 ] 

Vinoth Chandar edited comment on HUDI-1138 at 4/20/21, 4:41 PM:


yes. basic idea here is to 

 cc [~309637554]

0) Maintain the marker file list, in a single file called `markers` under 
.hoodie/temp// (or whatever path we write this today) 

1) Add a new endpoint to the timeline server, `/createMarkerFile`, which 
returns 200 only if it successfully reads the `markers` file, adds an entry to 
it, and overwrites `markers` on the underlying cloud storage.

2) We employ some batching here, such that we can batch all requests that 
arrive in a 100-500ms window in a single overwrite operation. 

 

I think this will work really well (based on similar things I have done 
before). wdyt? 

Before this, we should also study how effective the current parallelization is. 
So hacking up a PoC to see the perf gains would be an interesting first step.

 


was (Author: vc):
yes. basic idea here is to 

 

0) Maintain the marker file list, in a single file called `markers` under 
.hoodie/temp// (or whatever path we write this today) 

1) Add a new endpoint to the timeline server, `/createMarkerFile`, which 
returns 200 only if it successfully reads the `markers` file, adds an entry to 
it, and overwrites `markers` on the underlying cloud storage.

2) We employ some batching here, such that we can batch all requests that 
arrive in a 100-500ms window in a single overwrite operation. 

 

I think this will work really well (based on similar things I have done 
before). wdyt? 

Before this, we should also study how effective the current parallelization is. 
So hacking up a PoC to see the perf gains would be an interesting first step.

 

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even as you can argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for it 
> every month), and we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore if we can improve the current marker file mechanism, which 
> creates one marker file per data file written, by 
> delegating the createMarker() call to the driver/timeline server and having it 
> write marker metadata into a single file handle that is flushed for 
> durability guarantees.
>  
> P.S.: I was tempted to think the Spark listener mechanism could help us deal with 
> failed tasks, but it has no guarantees. The writer job could die without 
> deleting a partial file; i.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-04-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325952#comment-17325952
 ] 

Vinoth Chandar commented on HUDI-1138:
--

yes. basic idea here is to 

 

0) Maintain the marker file list, in a single file called `markers` under 
.hoodie/temp// (or whatever path we write this today) 

1) Add a new endpoint to the timeline server, `/createMarkerFile`, which 
returns 200 only if it successfully reads the `markers` file, adds an entry to 
it, and overwrites `markers` on the underlying cloud storage.

2) We employ some batching here, such that we can batch all requests that 
arrive in a 100-500ms window in a single overwrite operation. 

 

I think this will work really well (based on similar things I have done 
before). wdyt? 

Before this, we should also study how effective the current parallelization is. 
So hacking up a PoC to see the perf gains would be an interesting first step.
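
As a minimal sketch of the batching idea above (illustrative only, not the 
actual Hudi timeline-server code; the class name, file layout and flush 
interval are assumptions):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchedMarkerWriterSketch {
  private final Set<String> markers = ConcurrentHashMap.newKeySet(); // all marker names seen so far
  private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
  private final Path markersFile;
  private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

  public BatchedMarkerWriterSketch(String markerDir) {
    this.markersFile = Paths.get(markerDir, "markers");
    // Batch everything that arrived in the last ~300ms into a single overwrite of `markers`.
    flusher.scheduleWithFixedDelay(this::flush, 300, 300, TimeUnit.MILLISECONDS);
  }

  // What the `/createMarkerFile` endpoint would call per request; a real endpoint would hold
  // the 200 response until the batch containing this marker has been durably flushed.
  public void createMarker(String markerName) {
    pending.add(markerName);
  }

  private synchronized void flush() {
    boolean dirty = false;
    String name;
    while ((name = pending.poll()) != null) {
      dirty |= markers.add(name);
    }
    if (!dirty) {
      return;
    }
    try {
      // One overwrite of the consolidated `markers` file per batch window.
      Files.write(markersFile, String.join("\n", markers).getBytes(StandardCharsets.UTF_8));
    } catch (Exception e) {
      throw new RuntimeException("Failed to flush markers file", e);
    }
  }
}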

 

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even as you can argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for it 
> every month), and we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore if we can improve the current marker file mechanism, which 
> creates one marker file per data file written, by 
> delegating the createMarker() call to the driver/timeline server and having it 
> write marker metadata into a single file handle that is flushed for 
> durability guarantees.
>  
> P.S.: I was tempted to think the Spark listener mechanism could help us deal with 
> failed tasks, but it has no guarantees. The writer job could die without 
> deleting a partial file; i.e. it can improve things, but can't provide 
> guarantees. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1652:
--
Status: Patch Available  (was: In Progress)

> DiskBasedMap:As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing
> ---
>
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.6.0
>Reporter: wangmeng
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> We encountered a problem in the hudi production environment, which is very 
> similar to the HUDI-945 problem.
>  *Software environment:* spark 2.4.5, hudi 0.6
>  *Scenario:* consume Kafka data and write hudi, using spark streaming 
> (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
>  *Reason analysis:* ExternalSpillableMap is used in HoodieMergeHandle class, 
> and DiskBasedMap is used to flush overflowed data to the disk. But the file 
> stream can only be closed and deleted by the hook when the jvm exits. When 
> the clear method is executed in the program, the stream is not closed and the 
> file is not deleted. As a result, over time, more and more file handles are 
> still held, leading to errors. This error is similar to Hudi-945.
>  
> *Software environment:* spark 2.4.5, hudi 0.6 
> *Scenario:* consume kafka data and write it into hudi, using spark streaming (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps growing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file streams are only closed, and the files deleted, by a shutdown hook when the JVM exits; when the clear method runs, the stream is not closed and the file is not deleted. So as time goes on, more and more file handles remain held, eventually causing errors. This error is quite similar to HUDI-945.
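
To illustrate the leak pattern described above, a small standalone sketch (not 
Hudi's actual DiskBasedMap code; all names are made up): relying only on a JVM 
shutdown hook keeps one open handle per spill file for the life of a streaming 
executor, so the fix direction is to close the stream and delete the file when 
the map is cleared.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SpillFileSketch implements AutoCloseable {
  private final File spillFile;
  private final FileOutputStream out;

  public SpillFileSketch() throws IOException {
    this.spillFile = File.createTempFile("spill-", ".data");
    this.out = new FileOutputStream(spillFile);
    // Relying only on this hook leaks the handle until the JVM exits (the reported behavior).
    Runtime.getRuntime().addShutdownHook(new Thread(this::closeQuietly));
  }

  // The fix direction: release the handle and delete the file as soon as the map is cleared.
  @Override
  public void close() {
    closeQuietly();
  }

  private void closeQuietly() {
    try {
      out.close();
    } catch (IOException ignored) {
      // best effort
    }
    spillFile.delete();
  }
}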



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-1652.
---
Fix Version/s: 0.7.0
   Resolution: Fixed

> DiskBasedMap:As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing
> ---
>
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.6.0
>Reporter: wangmeng
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:critical, user-support-issues
> Fix For: 0.7.0
>
>
> We encountered a problem in the hudi production environment, which is very 
> similar to the HUDI-945 problem.
>  *Software environment:* spark 2.4.5, hudi 0.6
>  *Scenario:* consume Kafka data and write hudi, using spark streaming 
> (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
>  *Reason analysis:* ExternalSpillableMap is used in HoodieMergeHandle class, 
> and DiskBasedMap is used to flush overflowed data to the disk. But the file 
> stream can only be closed and deleted by the hook when the jvm exits. When 
> the clear method is executed in the program, the stream is not closed and the 
> file is not deleted. As a result, over time, more and more file handles are 
> still held, leading to errors. This error is similar to Hudi-945.
>  
> *Software environment:* spark 2.4.5, hudi 0.6 
> *Scenario:* consume kafka data and write it into hudi, using spark streaming (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps growing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file streams are only closed, and the files deleted, by a shutdown hook when the JVM exits; when the clear method runs, the stream is not closed and the file is not deleted. So as time goes on, more and more file handles remain held, eventually causing errors. This error is quite similar to HUDI-945.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1652:
--
Status: In Progress  (was: Open)

> DiskBasedMap:As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing
> ---
>
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.6.0
>Reporter: wangmeng
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> We encountered a problem in the hudi production environment, which is very 
> similar to the HUDI-945 problem.
>  *Software environment:* spark 2.4.5, hudi 0.6
>  *Scenario:* consume Kafka data and write hudi, using spark streaming 
> (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
>  *Reason analysis:* ExternalSpillableMap is used in HoodieMergeHandle class, 
> and DiskBasedMap is used to flush overflowed data to the disk. But the file 
> stream can only be closed and deleted by the hook when the jvm exits. When 
> the clear method is executed in the program, the stream is not closed and the 
> file is not deleted. As a result, over time, more and more file handles are 
> still held, leading to errors. This error is similar to Hudi-945.
>  
> *Software environment:* spark 2.4.5, hudi 0.6 
> *Scenario:* consume kafka data and write it into hudi, using spark streaming (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps growing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file streams are only closed, and the files deleted, by a shutdown hook when the JVM exits; when the clear method runs, the stream is not closed and the file is not deleted. So as time goes on, more and more file handles remain held, eventually causing errors. This error is quite similar to HUDI-945.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-1652:
-

Assignee: Balaji Varadarajan

> DiskBasedMap:As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing
> ---
>
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.6.0
>Reporter: wangmeng
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> We encountered a problem in the hudi production environment, which is very 
> similar to the HUDI-945 problem.
>  *Software environment:* spark 2.4.5, hudi 0.6
>  *Scenario:* consume Kafka data and write hudi, using spark streaming 
> (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
>  *Reason analysis:* ExternalSpillableMap is used in HoodieMergeHandle class, 
> and DiskBasedMap is used to flush overflowed data to the disk. But the file 
> stream can only be closed and deleted by the hook when the jvm exits. When 
> the clear method is executed in the program, the stream is not closed and the 
> file is not deleted. As a result, over time, more and more file handles are 
> still held, leading to errors. This error is similar to Hudi-945.
>  
> *Software environment:* spark 2.4.5, hudi 0.6 
> *Scenario:* consume kafka data and write it into hudi, using spark streaming (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps growing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file streams are only closed, and the files deleted, by a shutdown hook when the JVM exits; when the clear method runs, the stream is not closed and the file is not deleted. So as time goes on, more and more file handles remain held, eventually causing errors. This error is quite similar to HUDI-945.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1652:
--
Status: Closed  (was: Patch Available)

> DiskBasedMap:As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing
> ---
>
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.6.0
>Reporter: wangmeng
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> We encountered a problem in the hudi production environment, which is very 
> similar to the HUDI-945 problem.
>  *Software environment:* spark 2.4.5, hudi 0.6
>  *Scenario:* consume Kafka data and write hudi, using spark streaming 
> (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
>  *Reason analysis:* ExternalSpillableMap is used in HoodieMergeHandle class, 
> and DiskBasedMap is used to flush overflowed data to the disk. But the file 
> stream can only be closed and deleted by the hook when the jvm exits. When 
> the clear method is executed in the program, the stream is not closed and the 
> file is not deleted. As a result, over time, more and more file handles are 
> still held, leading to errors. This error is similar to Hudi-945.
>  
> *Software environment:* spark 2.4.5, hudi 0.6 
> *Scenario:* consume kafka data and write it into hudi, using spark streaming (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps growing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file streams are only closed, and the files deleted, by a shutdown hook when the JVM exits; when the clear method runs, the stream is not closed and the file is not deleted. So as time goes on, more and more file handles remain held, eventually causing errors. This error is quite similar to HUDI-945.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing

2021-04-20 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reopened HUDI-1652:
---

> DiskBasedMap:As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing
> ---
>
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Affects Versions: 0.6.0
>Reporter: wangmeng
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> We encountered a problem in the hudi production environment, which is very 
> similar to the HUDI-945 problem.
>  *Software environment:* spark 2.4.5, hudi 0.6
>  *Scenario:* consume Kafka data and write hudi, using spark streaming 
> (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by 
> the executor process is increasing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
>  *Reason analysis:* ExternalSpillableMap is used in HoodieMergeHandle class, 
> and DiskBasedMap is used to flush overflowed data to the disk. But the file 
> stream can only be closed and deleted by the hook when the jvm exits. When 
> the clear method is executed in the program, the stream is not closed and the 
> file is not deleted. As a result, over time, more and more file handles are 
> still held, leading to errors. This error is similar to Hudi-945.
>  
> *Software environment:* spark 2.4.5, hudi 0.6 
> *Scenario:* consume kafka data and write it into hudi, using spark streaming (non-StructuredStreaming).
>  *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps growing.
> "
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
>  /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
>  /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> "
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file streams are only closed, and the files deleted, by a shutdown hook when the JVM exits; when the clear method runs, the stream is not closed and the file is not deleted. So as time goes on, more and more file handles remain held, eventually causing errors. This error is quite similar to HUDI-945.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch asf-site updated: Travis CI build asf-site

2021-04-20 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 4f86e1d  Travis CI build asf-site
4f86e1d is described below

commit 4f86e1d7d5f030bee450d5e8f6a760337fa6977b
Author: CI 
AuthorDate: Tue Apr 20 16:32:11 2021 +

Travis CI build asf-site
---
 content/assets/js/lunr/lunr-store.js|  2 +-
 content/blog/hudi-key-generators/index.html | 21 -
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/content/assets/js/lunr/lunr-store.js 
b/content/assets/js/lunr/lunr-store.js
index d73f822..02f99da 100644
--- a/content/assets/js/lunr/lunr-store.js
+++ b/content/assets/js/lunr/lunr-store.js
@@ -1680,7 +1680,7 @@ var store = [{
 "url": "https://hudi.apache.org/blog/hudi-clustering-intro/";,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
 "title": "Apache Hudi Key Generators",
-"excerpt":"Every record in Hudi is uniquely identified by a HoodieKey, 
which is a pair of record key and partition path where the record belongs to. 
Hudi has imposed this constraint so that updates and deletes can be applied to 
the record of interest. Hudi relies on the partition path 
field...","categories": ["blog"],
+"excerpt":"Every record in Hudi is uniquely identified by a primary 
key, which is a pair of record key and partition path where the record belongs 
to. Using primary keys, Hudi can impose a) partition level uniqueness integrity 
constraint b) enable fast updates and deletes on records. One should choose 
the...","categories": ["blog"],
 "tags": [],
 "url": "https://hudi.apache.org/blog/hudi-key-generators/";,
 "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{
diff --git a/content/blog/hudi-key-generators/index.html 
b/content/blog/hudi-key-generators/index.html
index 88d128c..6f87f06 100644
--- a/content/blog/hudi-key-generators/index.html
+++ b/content/blog/hudi-key-generators/index.html
@@ -197,18 +197,21 @@
 }
   
 
-Every record in Hudi is uniquely identified by a HoodieKey, which 
is a pair of record key and partition path where the 
-record belongs to. Hudi has imposed this constraint so that updates and 
deletes can be applied to the record of interest. 
-Hudi relies on the partition path field to partition your dataset and records 
within a partition have unique record keys. 
-Since uniqueness is guaranteed only within the partition, there could be 
records with same record keys across different 
-partitions. One should choose the partition field wisely as it could be a 
determining factor for your ingestion and 
-query latency.
+Every record in Hudi is uniquely identified by a primary key, which 
is a pair of record key and partition path where
+the record belongs to. Using primary keys, Hudi can impose a) partition level 
uniqueness integrity constraint
+b) enable fast updates and deletes on records. One should choose the 
partitioning scheme wisely as it could be a
+determining factor for your ingestion and query latency.
+
+In general, Hudi supports both partitioned and global indexes. For a 
dataset with partitioned index(which is most
+commonly used), each record is uniquely identified by a pair of record key and 
partition path. But for a dataset with
+global index, each record is uniquely identified by just the record key. There 
won’t be any duplicate record keys across
+partitions.
 
 Key Generators
 
-Hudi exposes a number of out of the box key generators that customers can 
use based on their need. Or can have their 
-own implementation for the KeyGenerator. This blog goes over all different 
types of key generators that are readily 
-available to use.
+Hudi provides several key generators out of the box that users can use 
based on their need, while having a pluggable
+implementation for users to implement and use their own KeyGenerator. This 
blog goes over all different types of key 
+generators that are readily available to use.
 
 https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java";>Here
 is the interface for KeyGenerator in Hudi for your reference.


[GitHub] [hudi] nsivabalan commented on a change in pull request #2847: [HUDI-1769]Add download page to the site

2021-04-20 Thread GitBox


nsivabalan commented on a change in pull request #2847:
URL: https://github.com/apache/hudi/pull/2847#discussion_r616847950



##
File path: docs/_pages/download.cn.md
##
@@ -7,29 +7,29 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
 ## Release 0.8.0
-* Source Release : [Apache Hudi 0.8.0 Source 
Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))
+* Source Release : [Apache Hudi 0.8.0 Source 
Release](https://www.apache.org/dyn/closer.lua/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))

Review comment:
   probably it chooses a mirror location closer to your geo location. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2847: [HUDI-1769]Add download page to the site

2021-04-20 Thread GitBox


nsivabalan commented on a change in pull request #2847:
URL: https://github.com/apache/hudi/pull/2847#discussion_r616847043



##
File path: docs/_pages/download.cn.md
##
@@ -7,29 +7,29 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 
 ## Release 0.8.0
-* Source Release : [Apache Hudi 0.8.0 Source 
Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))
+* Source Release : [Apache Hudi 0.8.0 Source 
Release](https://www.apache.org/dyn/closer.lua/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))

Review comment:
   https://www.apache.org/dyn/closer.lua/hudi/ redirects me to 
https://apache.osuosl.org/hudi/ fyi. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [MINOR] Fixing key generators blog content (#2739)

2021-04-20 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 860abd0  [MINOR] Fixing key generators blog content (#2739)
860abd0 is described below

commit 860abd04cbb3e78265ba9a300bb0cd849fff7e44
Author: Sivabalan Narayanan 
AuthorDate: Tue Apr 20 12:17:30 2021 -0400

[MINOR] Fixing key generators blog content (#2739)
---
 docs/_posts/2021-02-13-hudi-key-generators.md | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/docs/_posts/2021-02-13-hudi-key-generators.md 
b/docs/_posts/2021-02-13-hudi-key-generators.md
index 5076ec6..fc3faa8 100644
--- a/docs/_posts/2021-02-13-hudi-key-generators.md
+++ b/docs/_posts/2021-02-13-hudi-key-generators.md
@@ -5,18 +5,21 @@ author: shivnarayan
 category: blog
 ---
 
-Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of 
record key and partition path where the 
-record belongs to. Hudi has imposed this constraint so that updates and 
deletes can be applied to the record of interest. 
-Hudi relies on the partition path field to partition your dataset and records 
within a partition have unique record keys. 
-Since uniqueness is guaranteed only within the partition, there could be 
records with same record keys across different 
-partitions. One should choose the partition field wisely as it could be a 
determining factor for your ingestion and 
-query latency.
+Every record in Hudi is uniquely identified by a primary key, which is a pair 
of record key and partition path where
+the record belongs to. Using primary keys, Hudi can impose a) partition level 
uniqueness integrity constraint
+b) enable fast updates and deletes on records. One should choose the 
partitioning scheme wisely as it could be a
+determining factor for your ingestion and query latency.
+
+In general, Hudi supports both partitioned and global indexes. For a dataset 
with partitioned index(which is most
+commonly used), each record is uniquely identified by a pair of record key and 
partition path. But for a dataset with
+global index, each record is uniquely identified by just the record key. There 
won't be any duplicate record keys across
+partitions.
 
 ## Key Generators
 
-Hudi exposes a number of out of the box key generators that customers can use 
based on their need. Or can have their 
-own implementation for the KeyGenerator. This blog goes over all different 
types of key generators that are readily 
-available to use.
+Hudi provides several key generators out of the box that users can use based 
on their need, while having a pluggable
+implementation for users to implement and use their own KeyGenerator. This 
blog goes over all different types of key 
+generators that are readily available to use.
 
 
[Here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java)
 is the interface for KeyGenerator in Hudi for your reference.
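
To connect the above to usage, a minimal write sketch showing how the record key 
and partition path fields feed a key generator (the table path and the 
"uuid"/"country" field names are hypothetical; the option keys are the standard 
Spark datasource write options, and other options such as the precombine field 
are omitted):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class KeyGeneratorWriteSketch {
  public static void write(Dataset<Row> df) {
    df.write().format("hudi")
        // record key + partition path together form the HoodieKey / primary key
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.partitionpath.field", "country")
        .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.SimpleKeyGenerator")
        .option("hoodie.table.name", "demo_table")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/demo_table");
  }
}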


[GitHub] [hudi] nsivabalan merged pull request #2739: [MINOR] Fixing key generators blog content

2021-04-20 Thread GitBox


nsivabalan merged pull request #2739:
URL: https://github.com/apache/hudi/pull/2739


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] PavelPetukhov opened a new issue #2856: [SUPPORT] Metrics Prometheus pushgateway

2021-04-20 Thread GitBox


PavelPetukhov opened a new issue #2856:
URL: https://github.com/apache/hudi/issues/2856


   I have discovered that you've added prometheus related changes 
   like here https://issues.apache.org/jira/browse/HUDI-210
   
   But unfortunately there is no documentation related to pushing hudi metrics 
to Prometheus Push Gateway
   https://hudi.apache.org/docs/metrics.html#hoodiemetrics
   
   What parameters should be set in order to do that?
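   
   For reference, a hedged sketch of the write-config properties involved (the key 
   names are my assumption of Hudi's Prometheus pushgateway reporter config added 
   with HUDI-210; please verify them against the Hudi version in use):
   
   hoodie.metrics.on=true
   hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY
   hoodie.metrics.pushgateway.host=localhost
   hoodie.metrics.pushgateway.port=9091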
   
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.7
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : hdfs
   
   * Running on Docker? (yes/no) : no
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] raphaelauv opened a new issue #2855: [SUPPORT] hudi-utilities documentation

2021-04-20 Thread GitBox


raphaelauv opened a new issue #2855:
URL: https://github.com/apache/hudi/issues/2855


   **Describe the problem you faced**
   
   The hudi-utilities are used in the [Docker 
Demo](https://hudi.apache.org/docs/docker_demo.html), but there is no 
documentation on their purpose or whether they can be considered prod-ready jobs.
   
   **Expected behavior**
   A README inside the hudi-utilities folder, or some lines in the documentation, 
explaining their purpose and whether they can be considered prod-ready jobs.
   
   Thank you all


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #2847: [HUDI-1769]Add download page to the site

2021-04-20 Thread GitBox


garyli1019 commented on pull request #2847:
URL: https://github.com/apache/hudi/pull/2847#issuecomment-823322655


   > @garyli1019 the download links are pointing to dist.apache.org tar balls??
   > 
   > while we are at it, can we also update release cwiki page with updating 
this for each release?
   
   @vinothchandar fixed. Changed the link to the mirror site, like other apache 
projects, and added a step for updating `download.md` to the README instructions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1809) Flink merge on read input split uses wrong base file path for default merge type

2021-04-20 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-1809.
--
Resolution: Fixed

d6d52c60636ae6a0c16469fa6761d0080fddf72f

> Flink merge on read input split uses wrong base file path for default merge 
> type
> 
>
> Key: HUDI-1809
> URL: https://issues.apache.org/jira/browse/HUDI-1809
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Should use the base file path instead of the table path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-1809] Flink merge on read input split uses wrong base file path for default merge type (#2846)

2021-04-20 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d6d52c6  [HUDI-1809] Flink merge on read input split uses wrong base 
file path for default merge type (#2846)
d6d52c6 is described below

commit d6d52c60636ae6a0c16469fa6761d0080fddf72f
Author: Danny Chan 
AuthorDate: Tue Apr 20 21:27:09 2021 +0800

[HUDI-1809] Flink merge on read input split uses wrong base file path for 
default merge type (#2846)
---
 .../table/format/mor/MergeOnReadInputFormat.java   | 38 +-
 .../org/apache/hudi/util/RowDataProjection.java| 61 ++
 .../apache/hudi/table/format/TestInputFormat.java  |  1 +
 3 files changed, 88 insertions(+), 12 deletions(-)

diff --git 
a/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
 
b/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
index 12bebdf..1186cff 100644
--- 
a/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
+++ 
b/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
@@ -29,6 +29,7 @@ import org.apache.hudi.table.format.FormatUtils;
 import org.apache.hudi.table.format.cow.ParquetColumnarRowSplitReader;
 import org.apache.hudi.table.format.cow.ParquetSplitReaderUtil;
 import org.apache.hudi.util.AvroToRowDataConverters;
+import org.apache.hudi.util.RowDataProjection;
 import org.apache.hudi.util.RowDataToAvroConverters;
 import org.apache.hudi.util.StreamerUtil;
 import org.apache.hudi.util.StringToRowDataConverter;
@@ -63,6 +64,7 @@ import java.util.stream.IntStream;
 
 import static 
org.apache.flink.table.data.vector.VectorizedColumnBatch.DEFAULT_SIZE;
 import static 
org.apache.flink.table.filesystem.RowPartitionComputer.restorePartValueFromType;
+import static 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_COMMIT_TIME_COL_POS;
 import static 
org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS;
 import static org.apache.hudi.table.format.FormatUtils.buildAvroRecordBySchema;
 
@@ -180,7 +182,7 @@ public class MergeOnReadInputFormat
   new Schema.Parser().parse(this.tableState.getAvroSchema()),
   new Schema.Parser().parse(this.tableState.getRequiredAvroSchema()),
   this.requiredPos,
-  getFullSchemaReader(split.getTablePath()));
+  getFullSchemaReader(split.getBasePath().get()));
 } else {
   throw new HoodieException("Unable to select an Iterator to read the 
Hoodie MOR File Split for "
   + "file path: " + split.getBasePath()
@@ -337,7 +339,7 @@ public class MergeOnReadInputFormat
 // efficient.
 if (split.getInstantRange().isPresent()) {
   // based on the fact that commit time is always the first field
-  String commitTime = curAvroRecord.get().get(0).toString();
+  String commitTime = 
curAvroRecord.get().get(HOODIE_COMMIT_TIME_COL_POS).toString();
   if (!split.getInstantRange().get().isInRange(commitTime)) {
 // filter out the records that are not in range
 return hasNext();
@@ -431,6 +433,11 @@ public class MergeOnReadInputFormat
 // iterator for log files
 private final Iterator iterator;
 
+// add the flag because the flink ParquetColumnarRowSplitReader is buggy:
+// method #reachedEnd() returns false after it returns true.
+// refactor it out once FLINK-22370 is resolved.
+private boolean readLogs = false;
+
 private RowData currentRecord;
 
 SkipMergeIterator(ParquetColumnarRowSplitReader reader, Iterator 
iterator) {
@@ -440,10 +447,11 @@ public class MergeOnReadInputFormat
 
 @Override
 public boolean reachedEnd() throws IOException {
-  if (!this.reader.reachedEnd()) {
+  if (!readLogs && !this.reader.reachedEnd()) {
 currentRecord = this.reader.nextRecord();
 return false;
   }
+  readLogs = true;
   if (this.iterator.hasNext()) {
 currentRecord = this.iterator.next();
 return false;
@@ -479,6 +487,12 @@ public class MergeOnReadInputFormat
 private final AvroToRowDataConverters.AvroToRowDataConverter 
avroToRowDataConverter;
 private final GenericRecordBuilder recordBuilder;
 
+private final RowDataProjection projection;
+// add the flag because the flink ParquetColumnarRowSplitReader is buggy:
+// method #reachedEnd() returns false after it returns true.
+// refactor it out once FLINK-22370 is resolved.
+private boolean readLogs = false;
+
 private Set keyToSkip = new HashSet<>();
 
 private RowData currentRecord;
@@ -501,11 +515,12 @@ public class MergeOnReadInputFormat
   this.recordBuilder = new GenericRecordBuilder(requiredSchema);
   this.rowDataToAvroConve

[GitHub] [hudi] yanghua merged pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


yanghua merged pull request #2846:
URL: https://github.com/apache/hudi/pull/2846


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2739: [MINOR] Fixing key generators blog content

2021-04-20 Thread GitBox


nsivabalan commented on a change in pull request #2739:
URL: https://github.com/apache/hudi/pull/2739#discussion_r616674663



##
File path: docs/_posts/2021-02-13-hudi-key-generators.md
##
@@ -5,18 +5,16 @@ author: shivnarayan
 category: blog
 ---
 
-Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of 
record key and partition path where the 
-record belongs to. Hudi has imposed this constraint so that updates and 
deletes can be applied to the record of interest. 
-Hudi relies on the partition path field to partition your dataset and records 
within a partition have unique record keys. 
-Since uniqueness is guaranteed only within the partition, there could be 
records with same record keys across different 
-partitions. One should choose the partition field wisely as it could be a 
determining factor for your ingestion and 
-query latency.
+Every record in Hudi is uniquely identified by a primary key, which is a pair 
of record key and partition path where
+the record belongs to. Using primary keys, Hudi can impose a) partition level 
uniqueness integrity constraint
+b) enable fast updates and deletes on records. One should choose the 
partitioning scheme wisely as it could be a
+determining factor for your ingestion and query latency.
 
 ## Key Generators
 
-Hudi exposes a number of out of the box key generators that customers can use 
based on their need. Or can have their 
-own implementation for the KeyGenerator. This blog goes over all different 
types of key generators that are readily 
-available to use.
+Hudi provides several key generators out of the box that customers can use 
based on their need while having a pluggable

Review comment:
   ok.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2767: [HUDI-1761] Adding support for Test your own Schema with QuickStart

2021-04-20 Thread GitBox


nsivabalan commented on pull request #2767:
URL: https://github.com/apache/hudi/pull/2767#issuecomment-823266697


   If not in the source bundle, somewhere in the util packages would help. For 
new customers who are looking to try out hudi, it would be easy to sanity check 
whether their schema works w/ hudi end to end. If not, they might have to 
manually generate data or read from elsewhere and inject it into hudi. This is 
just an off-the-shelf option to test any complex schemas. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2776: [HUDI-1768] spark datasource support schema validate add column

2021-04-20 Thread GitBox


nsivabalan commented on pull request #2776:
URL: https://github.com/apache/hudi/pull/2776#issuecomment-823264108


   yes, this is still valid. 
   @lw309637554 : ping me here once the PR is ready to be reviewed again. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-04-20 Thread GitBox


nsivabalan commented on pull request #2720:
URL: https://github.com/apache/hudi/pull/2720#issuecomment-823254679


   @xiarixiaoyao : LGTM. Ignore the disabled test for now. Can you add a UT for 
the fix? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-04-20 Thread GitBox


nsivabalan commented on a change in pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#discussion_r616621197



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
##
@@ -170,7 +170,7 @@ protected HoodieCombineFileInputFormatShim 
createInputFormatShim() {
 if (job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, 
"").isEmpty()) {
   List partitions = new 
ArrayList<>(part.getPartSpec().keySet());
   if (!partitions.isEmpty()) {
-String partitionStr = String.join(",", partitions);

Review comment:
   I am just getting started with understanding the query side (and hence not 
very conversant). I tried looking in the hive repo for 
[CombineHiveInputFormat](https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java)
 to find the usage of the delimiter, but couldn't find this piece of code. Would 
you mind pointing me to the file where I can find this code snippet in the hive repo? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1747) Deltastreamer incremental read is not working on the MOR table

2021-04-20 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325773#comment-17325773
 ] 

sivabalan narayanan commented on HUDI-1747:
---

awesome, thanks. 

> Deltastreamer incremental read is not working on the MOR table
> --
>
> Key: HUDI-1747
> URL: https://issues.apache.org/jira/browse/HUDI-1747
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Vinoth Govindarajan
>Priority: Critical
>  Labels: sev:critical
>
> I was trying to read the MOR HUDI table incrementally using delta streamer; 
> while doing that, I ran into this issue where it says:
> {code:java}
> Found recursive reference in Avro schema, which can not be processed by 
> Spark:{code}
> Spark Version: 2.4
> Hudi Version: 0.7.0-SNAPSHOT or the latest master
>  
> Full Stack Trace:
> {code:java}
> Found recursive reference in Avro schema, which can not be processed by Spark:
> {
>   "type" : "record",
>   "name" : "meta",
>   "fields" : [ {
> "name" : "verified",
> "type" : [ "null", "boolean" ],
> "default" : null
>   }, {
> "name" : "zip",
> "type" : [ "null", "string" ],
> "default" : null
>   }, {
> "name" : "lname",
> "type" : [ "null", "string" ],
> "default" : null
>   }]
> }
>   
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:75)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:95)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
>   at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105)
>   at 
> org.apache.spark.sql

[GitHub] [hudi] nsivabalan commented on a change in pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed

2021-04-20 Thread GitBox


nsivabalan commented on a change in pull request #2716:
URL: https://github.com/apache/hudi/pull/2716#discussion_r616621197



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
##
@@ -170,7 +170,7 @@ protected HoodieCombineFileInputFormatShim 
createInputFormatShim() {
 if (job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, 
"").isEmpty()) {
   List partitions = new 
ArrayList<>(part.getPartSpec().keySet());
   if (!partitions.isEmpty()) {
-String partitionStr = String.join(",", partitions);

Review comment:
   I am just getting started with understanding the query side (and hence am not 
very conversant). I tried looking in the hive repo for CombineHiveInputFormat to 
find the usage of the delimiter, but couldn't find this piece of code. Would you 
mind pointing me to the file where I can find this code snippet in the hive repo? 
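   For context, here is a tiny self-contained illustration of why the join 
delimiter matters for multi-level partitions. It assumes, without having 
verified it against the hive repo, that downstream readers split the 
partition-columns table property on "/"; the class and values below are purely 
illustrative and not the fix in this PR.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionColumnsJoinSketch {
  public static void main(String[] args) {
    // A table partitioned by three levels.
    List<String> partitions = Arrays.asList("year", "month", "day");

    // Joined with ",": a reader that splits the property on "/" (assumption)
    // would see one bogus column name, so multi-level partitioning breaks.
    String commaJoined = String.join(",", partitions);   // "year,month,day"

    // Joined with "/": each partition level survives the round trip.
    String slashJoined = String.join("/", partitions);   // "year/month/day"

    System.out.println(commaJoined);
    System.out.println(slashJoined);
  }
}
```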




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org








[GitHub] [hudi] tooptoop4 commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

2021-04-20 Thread GitBox


tooptoop4 commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-823247153


   
https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer commented on pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

2021-04-20 Thread GitBox


sbernauer commented on pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#issuecomment-823214232


   @sathyaprakashg @n3nash and others, thanks for your work! I have rebased the 
commit onto the current master and resolved all the conflicts here: 
https://github.com/sbernauer/hudi/commit/b383883742ad63899fa43584ab7a10cd72d533fe
   @sathyaprakashg, this may help you while rebasing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect

2021-04-20 Thread GitBox


nsivabalan commented on pull request #2720:
URL: https://github.com/apache/hudi/pull/2720#issuecomment-823203510


   @xiarixiaoyao: I was asking Raymond (@xushiyan) why this test is disabled. 
From the git history, I found that he was the one who disabled the test, and I 
wanted to get more info from him. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2845: [HUDI-1723] Fix path selector listing files with the same mod date

2021-04-20 Thread GitBox


nsivabalan commented on a change in pull request #2845:
URL: https://github.com/apache/hudi/pull/2845#discussion_r616593119



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java
##
@@ -121,28 +121,30 @@ public static DFSPathSelector 
createSourceSelector(TypedProperties props,
   
eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime));

Review comment:
   Don't we need any fix in listEligibleFiles()? That method filters files 
based on mod time > checkpoint time; I thought the fix was to make this mod time 
>**=** checkpoint time.

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java
##
@@ -121,28 +121,30 @@ public static DFSPathSelector 
createSourceSelector(TypedProperties props,
   
eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime));
   // Filter based on checkpoint & input size, if needed
   long currentBytes = 0;
-  long maxModificationTime = Long.MIN_VALUE;
+  long newCheckpointTime = lastCheckpointTime;
   List filteredFiles = new ArrayList<>();
   for (FileStatus f : eligibleFiles) {
-if (currentBytes + f.getLen() >= sourceLimit) {
+if (currentBytes + f.getLen() >= sourceLimit && 
f.getModificationTime() > newCheckpointTime) {

Review comment:
   Won't this lead to overflowing the source limit? In the sense that this could lead to 
reading ```2*sourceLimit``` or even ```10*sourceLimit```; we never know. 

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java
##
@@ -121,28 +121,30 @@ public static DFSPathSelector 
createSourceSelector(TypedProperties props,
   
eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime));
   // Filter based on checkpoint & input size, if needed
   long currentBytes = 0;
-  long maxModificationTime = Long.MIN_VALUE;
+  long newCheckpointTime = lastCheckpointTime;
   List filteredFiles = new ArrayList<>();
   for (FileStatus f : eligibleFiles) {
-if (currentBytes + f.getLen() >= sourceLimit) {
+if (currentBytes + f.getLen() >= sourceLimit && 
f.getModificationTime() > newCheckpointTime) {

Review comment:
   I guess I get the gist now, and why we don't need any fix in 
listEligibleFiles.
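   For readers following along, here is a minimal, self-contained sketch of the 
selection loop being discussed, assuming an ascending mod-time sort and a last 
checkpoint time; the class, field names and tie-breaking rule are illustrative 
and not the final code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

public class PathSelectorSketch {

  // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus with only the
  // two fields the selection logic cares about.
  static class FileInfo {
    final String path;
    final long modTime;
    final long len;

    FileInfo(String path, long modTime, long len) {
      this.path = path;
      this.modTime = modTime;
      this.len = len;
    }
  }

  /**
   * Picks files newer than the last checkpoint and stops once the size budget
   * is reached, but keeps accepting files that share the modification time of
   * the last accepted file so a single mod-time "bucket" is never split
   * across two runs.
   */
  static List<FileInfo> select(List<FileInfo> sortedByModTime, long lastCheckpointTime, long sourceLimit) {
    long currentBytes = 0;
    long newCheckpointTime = lastCheckpointTime;
    List<FileInfo> picked = new ArrayList<>();
    for (FileInfo f : sortedByModTime) {
      if (f.modTime <= lastCheckpointTime) {
        continue; // already consumed in a previous run
      }
      // Stop only when the budget is exceeded AND this file's mod time is
      // strictly newer than the last accepted one; this is why a run can read
      // more than sourceLimit when many files share one modification time.
      // (In this simplified sketch, a single oversized first file would yield
      // an empty batch.)
      if (currentBytes + f.len >= sourceLimit && f.modTime > newCheckpointTime) {
        break;
      }
      currentBytes += f.len;
      newCheckpointTime = f.modTime;
      picked.add(f);
    }
    return picked;
  }
}
```

   The intent is that the checkpoint only advances at a mod-time boundary, so 
files sharing the last processed modification time are never dropped, at the 
cost of occasionally overshooting sourceLimit.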




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


codecov-commenter edited a comment on pull request #2846:
URL: https://github.com/apache/hudi/pull/2846#issuecomment-822274991


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@4e050cc`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#section-missing-base-commit).
   > The diff coverage is `82.14%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2846/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff@@
   ## master#2846   +/-   ##
   =
 Coverage  ?   52.99%   
 Complexity? 3726   
   =
 Files ?  486   
 Lines ?23247   
 Branches  ? 2469   
   =
 Hits  ?12320   
 Misses? 9846   
 Partials  ? 1081   
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (?)` | `215.00 <ø> (?)` | |
   | hudiclient | `∅ <ø> (?)` | `0.00 <ø> (?)` | |
   | hudicommon | `50.68% <ø> (?)` | `1976.00 <ø> (?)` | |
   | hudiflink | `59.00% <82.14%> (?)` | `534.00 <5.00> (?)` | |
   | hudihadoopmr | `33.33% <ø> (?)` | `198.00 <ø> (?)` | |
   | hudisparkdatasource | `72.11% <ø> (?)` | `237.00 <ø> (?)` | |
   | hudisync | `45.70% <ø> (?)` | `131.00 <ø> (?)` | |
   | huditimelineservice | `64.36% <ø> (?)` | `62.00 <ø> (?)` | |
   | hudiutilities | `69.79% <ø> (?)` | `373.00 <ø> (?)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../hudi/table/format/mor/MergeOnReadInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvbW9yL01lcmdlT25SZWFkSW5wdXRGb3JtYXQuamF2YQ==)
 | `75.00% <69.23%> (ø)` | `18.00 <0.00> (?)` | |
   | 
[...n/java/org/apache/hudi/util/RowDataProjection.java](https://codecov.io/gh/apache/hudi/pull/2846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1Jvd0RhdGFQcm9qZWN0aW9uLmphdmE=)
 | `93.33% <93.33%> (ø)` | `5.00 <5.00> (?)` | |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 closed pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


danny0405 closed pull request #2846:
URL: https://github.com/apache/hudi/pull/2846


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


codecov-commenter edited a comment on pull request #2853:
URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (1e379c9) into 
[master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (62bb9e1) will **decrease** coverage by `43.21%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2853   +/-   ##
   
   - Coverage 52.60%   9.38%   -43.22% 
   + Complexity 3709  48 -3661 
   
 Files   485  54  -431 
 Lines 232241993-21231 
 Branches   2465 235 -2230 
   
   - Hits  12216 187-12029 
   + Misses 99291793 -8136 
   + Partials   1079  13 -1066 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.38% <ø> (-60.42%)` | `48.00 <ø> (-325.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <

[GitHub] [hudi] codecov-commenter commented on pull request #2854: [HUDI-1771] Propagate CDC format for hoodie

2021-04-20 Thread GitBox


codecov-commenter commented on pull request #2854:
URL: https://github.com/apache/hudi/pull/2854#issuecomment-823120634


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2854](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (aeca8e7) into 
[master](https://codecov.io/gh/apache/hudi/commit/9a288ccbebf1aee3164e7bc472a3e795bb83652b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (9a288cc) will **increase** coverage by `17.20%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2854/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2854   +/-   ##
   =
   + Coverage 52.58%   69.79%   +17.20% 
   + Complexity 3708  373 -3335 
   =
 Files   485   54  -431 
 Lines 23227 1993-21234 
 Branches   2466  235 -2231 
   =
   - Hits  12215 1391-10824 
   + Misses 9934  471 -9463 
   + Partials   1078  131  -947 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.79% <ø> (ø)` | `373.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...ava/org/apache/hudi/cli/commands/UtilsCommand.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1V0aWxzQ29tbWFuZC5qYXZh)
 | | | |
   | 
[...hadoop/realtime/RealtimeCompactedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lQ29tcGFjdGVkUmVjb3JkUmVhZGVyLmphdmE=)
 | | | |
   | 
[...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZUlucHV0Rm9ybWF0VXRpbHMuamF2YQ==)
 | | | |
   | 
[.../hudi/common/table/timeline/dto/FileStatusDTO.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GaWxlU3RhdHVzRFRPLmphdmE=)
 | | | |
   | 
[.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkU25hcHNob3RSZWxhdGlvbi5zY2FsYQ==)
 | | | |
   | 
[...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_

[GitHub] [hudi] danny0405 commented on a change in pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


danny0405 commented on a change in pull request #2853:
URL: https://github.com/apache/hudi/pull/2853#discussion_r616499589



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
##
@@ -80,6 +80,12 @@ private FlinkOptions() {
   .defaultValue(false)
   .withDescription("Whether to bootstrap the index state from existing 
hoodie table, default false");
 
+  public static final ConfigOption INDEX_STATE_TTL = ConfigOptions
+  .key("index.state.ttl")
+  .doubleType()
+  .defaultValue(1.5D)
+  .withDescription("index state ttl in days. default is 1.5 day.");
+

Review comment:
   Index state ttl in days, default 1.5 day




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


codecov-commenter edited a comment on pull request #2846:
URL: https://github.com/apache/hudi/pull/2846#issuecomment-822274991


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@4e050cc`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#section-missing-base-commit).
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2846/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@   Coverage Diff@@
   ## master   #2846   +/-   ##
   
 Coverage  ?   9.38%   
 Complexity?  48   
   
 Files ?  54   
 Lines ?1993   
 Branches  ? 235   
   
 Hits  ? 187   
 Misses?1793   
 Partials  ?  13   
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudiutilities | `9.38% <ø> (?)` | `48.00 <ø> (?)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nevgin commented on issue #2832: [SUPPORT] Hive on Spark dont work

2021-04-20 Thread GitBox


nevgin commented on issue #2832:
URL: https://github.com/apache/hudi/issues/2832#issuecomment-823117798


   Queries run directly from Spark are handled fine.
   For Hive on Spark, following the documentation, I removed the hive*.jar 
libraries; if I do not delete them, Hive does not work with the Spark engine. 
   My guess is that Spark needs to be built against my version of Hive (2.3.8). 
   Is this true? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter commented on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


codecov-commenter commented on pull request #2853:
URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (9bd245c) into 
[master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (62bb9e1) will **decrease** coverage by `43.21%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2853   +/-   ##
   
   - Coverage 52.60%   9.38%   -43.22% 
   + Complexity 3709  48 -3661 
   
 Files   485  54  -431 
 Lines 232241993-21231 
 Branches   2465 235 -2230 
   
   - Hits  12216 187-12029 
   + Misses 99291793 -8136 
   + Partials   1079  13 -1066 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.38% <ø> (-60.42%)` | `48.00 <ø> (-325.00)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> 

[jira] [Updated] (HUDI-1812) Add explicit index state TTL option for Flink writer

2021-04-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

谢波 updated HUDI-1812:
-
Description: 
Add option:
{code:java}
public static final ConfigOption<Double> INDEX_STATE_TTL = ConfigOptions
.key("index.state.ttl")
.doubleType()
.defaultValue(1.5D)
.withDescription("index state ttl in days. default is 1.5 day.");

{code}
If the state expires but there are still updates for old records, the records 
would be recognized as INSERT instead of UPDATE thus some data duplication.

  was:
Add option:
{code:java}
public static final ConfigOption<Long> INDEX_STATE_TTL = ConfigOptions
.key("index.state.ttl")
.longType()
.defaultValue(24 * 60 * 60 * 1000L)
.withDescription("index state ttl in milliseconds. default is 1 day.");
{code}
If the state expires but there are still updates for old records, the records 
would be recognized as INSERT instead of UPDATE thus some data duplication.


> Add explicit index state TTL option for Flink writer
> 
>
> Key: HUDI-1812
> URL: https://issues.apache.org/jira/browse/HUDI-1812
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: 谢波
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Add option:
> {code:java}
> public static final ConfigOption<Double> INDEX_STATE_TTL = ConfigOptions
> .key("index.state.ttl")
> .doubleType()
> .defaultValue(1.5D)
> .withDescription("index state ttl in days. default is 1.5 day.");
> {code}
> If the state expires but there are still updates for old records, the records 
> would be recognized as INSERT instead of UPDATE thus some data duplication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


danny0405 commented on a change in pull request #2846:
URL: https://github.com/apache/hudi/pull/2846#discussion_r616460746



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
##
@@ -431,6 +433,10 @@ public void close() {
 // iterator for log files
 private final Iterator iterator;
 
+// add the flag because the flink ParquetColumnarRowSplitReader is buggy:

Review comment:
   done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1771) Propagate CDC format for hoodie

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1771:
-
Labels: pull-request-available  (was: )

> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Like what we discussed in the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using Hudi as the unified 
> storage format for the DWD and DWS layers.
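
A rough, hypothetical illustration of the use case above (this is not the API 
added by this ticket): a Flink Table API pipeline ingesting a Debezium changelog 
from Kafka and writing it to a Hudi MERGE_ON_READ table. The 'changelog.enabled' 
option is an assumed name standing in for "keep the change flags"; the paths, 
topic and schema are made up.
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcToHudiSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());

    // Source: a Debezium changelog topic, so rows carry insert/update/delete flags.
    tEnv.executeSql(
        "CREATE TABLE orders_cdc (\n"
            + "  order_id STRING,\n"
            + "  amount DOUBLE,\n"
            + "  ts TIMESTAMP(3)\n"
            + ") WITH (\n"
            + "  'connector' = 'kafka',\n"
            + "  'topic' = 'orders',\n"
            + "  'properties.bootstrap.servers' = 'localhost:9092',\n"
            + "  'format' = 'debezium-json'\n"
            + ")");

    // Sink: a MERGE_ON_READ Hudi table. 'changelog.enabled' is a hypothetical
    // option standing in for propagating the CDC change flags.
    tEnv.executeSql(
        "CREATE TABLE orders_hudi (\n"
            + "  order_id STRING,\n"
            + "  amount DOUBLE,\n"
            + "  ts TIMESTAMP(3),\n"
            + "  PRIMARY KEY (order_id) NOT ENFORCED\n"
            + ") WITH (\n"
            + "  'connector' = 'hudi',\n"
            + "  'path' = 'file:///tmp/orders_hudi',\n"
            + "  'table.type' = 'MERGE_ON_READ',\n"
            + "  'changelog.enabled' = 'true'\n"
            + ")");

    tEnv.executeSql("INSERT INTO orders_hudi SELECT * FROM orders_cdc");
  }
}
{code}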



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1812) Add explicit index state TTL option for Flink writer

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1812:
-
Labels: pull-request-available  (was: )

> Add explicit index state TTL option for Flink writer
> 
>
> Key: HUDI-1812
> URL: https://issues.apache.org/jira/browse/HUDI-1812
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: 谢波
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Add option:
> {code:java}
> public static final ConfigOption<Long> INDEX_STATE_TTL = ConfigOptions
> .key("index.state.ttl")
> .longType()
> .defaultValue(24 * 60 * 60 * 1000L)
> .withDescription("index state ttl in milliseconds. default is 1 day.");
> {code}
> If the state expires but there are still updates for old records, the records 
> would be recognized as INSERT instead of UPDATE thus some data duplication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1771) Propagate CDC format for hoodie

2021-04-20 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-1771:
-
Summary: Propagate CDC format for hoodie  (was: Keep the change flags from 
CDC source for Flink writer)

> Propagate CDC format for hoodie
> ---
>
> Key: HUDI-1771
> URL: https://issues.apache.org/jira/browse/HUDI-1771
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 0.9.0
>
>
> Like what we discussed in the dev mailing list: 
> https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E
> Keeping the change flags makes new use cases possible: using Hudi as the unified 
> storage format for the DWD and DWS layers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 opened a new pull request #2854: [HUDI-1771] Propagate CDC format for hoodie

2021-04-20 Thread GitBox


danny0405 opened a new pull request #2854:
URL: https://github.com/apache/hudi/pull/2854


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] MyLanPangzi opened a new pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer

2021-04-20 Thread GitBox


MyLanPangzi opened a new pull request #2853:
URL: https://github.com/apache/hudi/pull/2853


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Add explicit index state TTL option for Flink writer
   
   ## Brief change log
   
 - *Add INDEX_STATE_TTL to FlinkOptions*
 - *Enable TTL on org.apache.hudi.sink.partitioner.BucketAssignFunction#indexState 
(see the sketch below).*
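
   A minimal sketch (not necessarily the exact code in this change) of how the 
new option could be applied to the index state, assuming the value is 
interpreted in days and Flink's StateTtlConfig is used; the state name and 
value type below are placeholders.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class IndexStateTtlSketch {

  /**
   * Builds a descriptor for the per-record-key index state and enables TTL
   * when a positive value (in days) is configured. "indexState" and the
   * String value type are placeholders, not the real Hudi state layout.
   */
  public static ValueStateDescriptor<String> buildIndexStateDescriptor(double ttlInDays) {
    ValueStateDescriptor<String> desc = new ValueStateDescriptor<>("indexState", String.class);
    if (ttlInDays > 0) {
      StateTtlConfig ttlConfig = StateTtlConfig
          .newBuilder(Time.milliseconds((long) (ttlInDays * 24 * 60 * 60 * 1000)))
          // refresh the timer whenever the state is created or updated
          .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
          // never return expired entries, otherwise stale file locations could be used
          .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
          .build();
      desc.enableTimeToLive(ttlConfig);
    }
    return desc;
  }
}
```

   As noted in HUDI-1812, once an entry expires, a later update for that key is 
treated as a fresh INSERT, so the TTL should be set larger than the longest 
expected gap between updates.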
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1812) Add explicit index state TTL option for Flink writer

2021-04-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

谢波 updated HUDI-1812:
-
Description: 
Add option:
{code:java}
public static final ConfigOption<Long> INDEX_STATE_TTL = ConfigOptions
.key("index.state.ttl")
.longType()
.defaultValue(24 * 60 * 60 * 1000L)
.withDescription("index state ttl in milliseconds. default is 1 day.");
{code}
If the state expires but there are still updates for old records, the records 
would be recognized as INSERT instead of UPDATE thus some data duplication.

  was:
Add option:

{code:java}
public static final ConfigOption<Double> INDEX_STATE_TTL = ConfigOptions
  .key("index.state.ttl")
  .doubleType()
  .defaultValue(1.5D)// default 1.5 days
  .withDescription("Index state TTL in DAYs, default 1.5 days");
{code}

If the state expires but there are still updates for old records, the records 
would be recognized as INSERT instead of UPDATE thus some data duplication.



> Add explicit index state TTL option for Flink writer
> 
>
> Key: HUDI-1812
> URL: https://issues.apache.org/jira/browse/HUDI-1812
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: 谢波
>Priority: Major
> Fix For: 0.9.0
>
>
> Add option:
> {code:java}
> public static final ConfigOption<Long> INDEX_STATE_TTL = ConfigOptions
> .key("index.state.ttl")
> .longType()
> .defaultValue(24 * 60 * 60 * 1000L)
> .withDescription("index state ttl in milliseconds. default is 1 day.");
> {code}
> If the state expires but there are still updates for old records, the records 
> would be recognized as INSERT instead of UPDATE thus some data duplication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] yanghua commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


yanghua commented on a change in pull request #2846:
URL: https://github.com/apache/hudi/pull/2846#discussion_r616439410



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
##
@@ -431,6 +433,10 @@ public void close() {
 // iterator for log files
 private final Iterator iterator;
 
+// add the flag because the flink ParquetColumnarRowSplitReader is buggy:

Review comment:
   Sounds good. Can we add it as a comment in this PR (I mean, in this file)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


danny0405 commented on a change in pull request #2846:
URL: https://github.com/apache/hudi/pull/2846#discussion_r616437364



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
##
@@ -431,6 +433,10 @@ public void close() {
 // iterator for log files
 private final Iterator iterator;
 
+// add the flag because the flink ParquetColumnarRowSplitReader is buggy:

Review comment:
   see https://issues.apache.org/jira/browse/FLINK-22370




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…

2021-04-20 Thread GitBox


yanghua commented on a change in pull request #2846:
URL: https://github.com/apache/hudi/pull/2846#discussion_r616415317



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java
##
@@ -431,6 +433,10 @@ public void close() {
 // iterator for log files
 private final Iterator iterator;
 
+// add the flag because the flink ParquetColumnarRowSplitReader is buggy:

Review comment:
   Can we file a Jira ticket with the Flink community and paste the Jira id 
here so that we can track the progress on the Flink side?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org