[GitHub] [hudi] shenbinglife opened a new issue #2857: [SUPPORT] How to compile and package Hudi?
shenbinglife opened a new issue #2857: URL: https://github.com/apache/hudi/issues/2857

How to compile and package Hudi?

```
mvn package -DskipTests -Dskip.tests=true
[INFO] Scanning for projects...
[INFO]
[INFO] --------------------< org.apache.hudi:hudi >--------------------
[INFO] Building Hudi 0.7.0                                        [pom]
[INFO] ----------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven-version) @ hudi ---
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (process-resource-bundles) @ hudi ---
[INFO]
[INFO] --- maven-checkstyle-plugin:3.0.0:check (default) @ hudi ---
[INFO] Starting audit...
Audit done.
[INFO]
[INFO] --- maven-site-plugin:3.7.1:attach-descriptor (attach-descriptor) @ hudi ---
[INFO] No site descriptor found: nothing to attach.
[INFO]
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ hudi ---
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 4.984 s
[INFO] Finished at: 2021-04-21T14:41:35+08:00

Process finished with exit code 0
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-1818) Validate and check the option 'write.precombine.field' for Flink writer
[ https://issues.apache.org/jira/browse/HUDI-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

谢波 reassigned HUDI-1818:
------------------------

    Assignee: 谢波

> Validate and check the option 'write.precombine.field' for Flink writer
> -----------------------------------------------------------------------
>
>                 Key: HUDI-1818
>                 URL: https://issues.apache.org/jira/browse/HUDI-1818
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Flink Integration
>            Reporter: Danny Chen
>            Assignee: 谢波
>            Priority: Major
>             Fix For: 0.9.0
>
>
> Validate that the option 'write.precombine.field' exists in the table schema
> when creating the table source; if it does not, tell the user to configure
> this option with the correct field.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (HUDI-1818) Validate and check the option 'write.precombine.field' for Flink writer
Danny Chen created HUDI-1818:
--------------------------------

             Summary: Validate and check the option 'write.precombine.field' for Flink writer
                 Key: HUDI-1818
                 URL: https://issues.apache.org/jira/browse/HUDI-1818
             Project: Apache Hudi
          Issue Type: New Feature
          Components: Flink Integration
            Reporter: Danny Chen
             Fix For: 0.9.0


Validate that the option 'write.precombine.field' exists in the table schema when creating the table source; if it does not, tell the user to configure this option with the correct field.
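The requested check can be sketched roughly as follows. This is an illustrative Java sketch only; the class and method names are hypothetical and not Hudi's actual Flink writer code. It validates that the configured precombine field is present in the schema and fails with an actionable message otherwise.

```java
import java.util.List;

// Hypothetical helper illustrating the validation requested in HUDI-1818:
// fail fast if 'write.precombine.field' names a field absent from the schema.
public class PrecombineFieldValidator {
  public static void validate(String precombineField, List<String> schemaFieldNames) {
    if (!schemaFieldNames.contains(precombineField)) {
      throw new IllegalArgumentException(
          "Option 'write.precombine.field' is set to '" + precombineField
              + "', but the table schema only has fields " + schemaFieldNames
              + ". Please configure 'write.precombine.field' with an existing field.");
    }
  }
}
```

The point is the error message: rather than failing later during a write, the user is told at table-source creation time which option to fix.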
[jira] [Updated] (HUDI-1415) Read Hoodie Table As Spark DataSource Table
[ https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pengzhiwei updated HUDI-1415:
-----------------------------
    Status: Open  (was: New)

> Read Hoodie Table As Spark DataSource Table
>
>                 Key: HUDI-1415
>                 URL: https://issues.apache.org/jira/browse/HUDI-1415
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Spark Integration
>    Affects Versions: 0.9.0
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>              Labels: pull-request-available, user-support-issues
>             Fix For: 0.9.0
>
>
> Currently Hudi can sync the metadata to the Hive metastore using HiveSyncTool.
> The table description synced to Hive looks like this:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_insert0`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` int,
>   `name` string,
>   `price` double,
>   `version` int,
>   `dt` string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/hudi/tbl_price_insert0'
> TBLPROPERTIES (
>   'last_commit_time_sync'='20201124105009',
>   'transient_lastDdlTime'='1606186222')
> {code}
> When we query this table using Spark SQL, Spark treats it as a Hive table, not
> a Spark data source table, and converts it to a parquet LogicalRelation in
> HiveStrategies#RelationConversions. As a result, Spark SQL reads the Hudi
> table just like a parquet data source. This leads to an incorrect query result
> if the user forgets to set spark.sql.hive.convertMetastoreParquet=false.
> In order to query a Hudi table as a data source table in Spark, more table
> properties and serde properties must be added to the Hive meta, like the
> following:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_cow0`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` int,
>   `name` string,
>   `price` double,
>   `version` int)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='/tmp/hudi/tbl_price_cow0')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/hudi/tbl_price_cow0'
> TBLPROPERTIES (
>   'last_commit_time_sync'='20201124120532',
>   'spark.sql.sources.provider'='hudi',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
>   'transient_lastDdlTime'='1606190729')
> {code}
> These are the missing table properties:
> {code:java}
> spark.sql.sources.provider = 'hudi'
> spark.sql.sources.schema.numParts = 'xx'
> spark.sql.sources.schema.part.{num} = 'xx'
> spark.sql.sources.schema.numPartCols = 'xx'
> spark.sql.sources.schema.partCol.{num} = 'xx'{code}
> and serde property:
> {code:java}
> 'path'='/path/to/hudi'
> {code}
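The missing properties listed above can be assembled programmatically. Below is a minimal Java sketch (a hypothetical helper, not HiveSyncTool's actual code) that builds the `spark.sql.sources.*` property map from pre-serialized schema JSON parts and the partition columns; the property keys are taken verbatim from the issue text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: build the table properties Spark needs to recognize
// a Hive-synced table as a data source table with provider 'hudi'.
public class SparkSourceTableProps {
  public static Map<String, String> build(List<String> schemaJsonParts, List<String> partitionCols) {
    Map<String, String> props = new LinkedHashMap<>();
    props.put("spark.sql.sources.provider", "hudi");
    // Large schemas are split into numbered parts.
    props.put("spark.sql.sources.schema.numParts", String.valueOf(schemaJsonParts.size()));
    for (int i = 0; i < schemaJsonParts.size(); i++) {
      props.put("spark.sql.sources.schema.part." + i, schemaJsonParts.get(i));
    }
    props.put("spark.sql.sources.schema.numPartCols", String.valueOf(partitionCols.size()));
    for (int i = 0; i < partitionCols.size(); i++) {
      props.put("spark.sql.sources.schema.partCol." + i, partitionCols.get(i));
    }
    return props;
  }
}
```

In practice the schema JSON parts would come from serializing the Spark schema; the map would then be written as TBLPROPERTIES during Hive sync, alongside the `'path'` serde property.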
[jira] [Created] (HUDI-1817) when querying the incremental view of a Hudi table using Spark SQL, the result is wrong
tao meng created HUDI-1817:
------------------------------

             Summary: when querying the incremental view of a Hudi table using Spark SQL, the result is wrong
                 Key: HUDI-1817
                 URL: https://issues.apache.org/jira/browse/HUDI-1817
             Project: Apache Hudi
          Issue Type: Bug
          Components: Hive Integration
    Affects Versions: 0.8.0
         Environment: spark 2.4.5, hive 3.1.1, hadoop 3.1.1
            Reporter: tao meng
             Fix For: 0.9.0


Create a Hudi table (MOR or COW):

{code:java}
val base_data = spark.read.parquet("/tmp/tb_base")
val upsert_data = spark.read.parquet("/tmp/tb_upsert")

base_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "bulk_insert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Overwrite)
  .save(s"/tmp/testdb/tb_test_mor_par")

upsert_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Append)
  .save(s"/tmp/testdb/tb_test_mor_par")
{code}

Query the incremental view with Spark SQL:

{code:java}
set hoodie.tb_test_mor_par.consume.mode=INCREMENTAL;
set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
set hoodie.tb_test_mor_par.consume.max.commits=3;

select _hoodie_commit_time, primary_key, col0, col1, col2, col3, col4, col5, col6, col7
from testdb.tb_test_mor_par_rt
where _hoodie_commit_time > '20210420145330'
order by primary_key;
{code}

{code:java}
+-------------------+-----------+----+----+------------+----+
|_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
+-------------------+-----------+----+----+------------+----+
|20210420155738     |20         |77  |sC  |158788760400|739 |
|20210420155738     |21         |66  |ps  |160979049700|61  |
|20210420155738     |22         |47  |1P  |158460042900|835 |
|20210420155738     |23         |36  |5K  |160763480800|538 |
|20210420155738     |24         |1   |BA  |160685711300|775 |
|20210420155738     |24         |101 |BA  |160685711300|775 |
|20210420155738     |24         |100 |BA  |160685711300|775 |
|20210420155738     |24         |102 |BA  |160685711300|775 |
+-------------------+-----------+----+----+------------+----+
{code}

The primary_key 24 is repeated.
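For context, the expected upsert semantics are that at most one row survives per record key, namely the version with the highest precombine value; the repeated primary_key 24 above violates this. A minimal Java sketch of that de-duplication rule (illustrative only, not Hudi's actual merge code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of precombine semantics: for each record key, keep only
// the highest precombine value seen, so duplicates of a key collapse to one row.
public class PrecombineDedup {
  // rows: each entry is {recordKey, precombineValue-as-string}
  public static Map<String, Long> dedup(String[][] rows) {
    Map<String, Long> latest = new HashMap<>();
    for (String[] row : rows) {
      String key = row[0];
      long precombine = Long.parseLong(row[1]);
      latest.merge(key, precombine, Math::max); // keep max precombine per key
    }
    return latest;
  }
}
```

Applied to the result set above, the four rows with primary_key 24 should collapse to a single entry, which is why the query output indicates a bug.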
[jira] [Created] (HUDI-1816) when querying the incremental view of a Hudi table using Spark SQL, the query result is wrong
tao meng created HUDI-1816:
------------------------------

             Summary: when querying the incremental view of a Hudi table using Spark SQL, the query result is wrong
                 Key: HUDI-1816
                 URL: https://issues.apache.org/jira/browse/HUDI-1816
             Project: Apache Hudi
          Issue Type: Bug
          Components: Hive Integration
    Affects Versions: 0.8.0
         Environment: spark 2.4.5, hive 3.1.1, hadoop 3.1.1
            Reporter: tao meng
             Fix For: 0.9.0


Test step 1: create a partitioned Hudi table (MOR or COW):

{code:java}
val base_data = spark.read.parquet("/tmp/tb_base")
val upsert_data = spark.read.parquet("/tmp/tb_upsert")

base_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "bulk_insert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Overwrite)
  .save(s"/tmp/testdb/tb_test_mor_par")

upsert_data.write.format("hudi")
  .option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL)
  .option(PRECOMBINE_FIELD_OPT_KEY, "col2")
  .option(RECORDKEY_FIELD_OPT_KEY, "primary_key")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "col0")
  .option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(HIVE_PARTITION_FIELDS_OPT_KEY, "col0")
  .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option(HIVE_DATABASE_OPT_KEY, "testdb")
  .option(HIVE_TABLE_OPT_KEY, "tb_test_mor_par")
  .option(HIVE_USE_JDBC_OPT_KEY, "false")
  .option("hoodie.bulkinsert.shuffle.parallelism", 4)
  .option("hoodie.insert.shuffle.parallelism", 4)
  .option("hoodie.upsert.shuffle.parallelism", 4)
  .option("hoodie.delete.shuffle.parallelism", 4)
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .option(TABLE_NAME, "tb_test_mor_par")
  .mode(Append)
  .save(s"/tmp/testdb/tb_test_mor_par")
{code}

Query the incremental view with Spark SQL:

{code:java}
set hoodie.tb_test_mor_par.consume.start.timestamp=20210420145330;
set hoodie.tb_test_mor_par.consume.max.commits=3;

select _hoodie_commit_time, primary_key, col0, col1, col2, col3, col4, col5, col6, col7
from testdb.tb_test_mor_par_rt
where _hoodie_commit_time > '20210420145330'
order by primary_key;
{code}

{code:java}
+-------------------+-----------+----+----+------------+----+
|_hoodie_commit_time|primary_key|col0|col1|col6        |col7|
+-------------------+-----------+----+----+------------+----+
|20210420155738     |20         |77  |sC  |158788760400|739 |
|20210420155738     |21         |66  |ps  |160979049700|61  |
|20210420155738     |22         |47  |1P  |158460042900|835 |
|20210420155738     |23         |36  |5K  |160763480800|538 |
|20210420155738     |24         |1   |BA  |160685711300|775 |
|20210420155738     |24         |101 |BA  |160685711300|775 |
|20210420155738     |24         |100 |BA  |160685711300|775 |
|20210420155738     |24         |102 |BA  |160685711300|775 |
+-------------------+-----------+----+----+------------+----+
{code}

Primary key 24 is repeated.
[GitHub] [hudi] nsivabalan commented on issue #2830: [SUPPORT] same _hoodie_record_key has duplicate data
nsivabalan commented on issue #2830: URL: https://github.com/apache/hudi/issues/2830#issuecomment-823752462

Oh, I see you are using GLOBAL_BLOOM as your index. Can you tell us which version of Hudi you are using, along with other environment details?
[GitHub] [hudi] nsivabalan edited a comment on issue #2852: [SUPPORT] Read Hudi Table from Hive - Hive Sync clarification
nsivabalan edited a comment on issue #2852: URL: https://github.com/apache/hudi/issues/2852#issuecomment-823751580

I guess the documentation you have linked actually talks about the usage: ```This will ensure the input format classes with its dependencies are available for query planning & execution.``` @bvaradar @n3nash can add more info if required.
[GitHub] [hudi] nsivabalan commented on issue #2852: [SUPPORT] Read Hudi Table from Hive - Hive Sync clarification
nsivabalan commented on issue #2852: URL: https://github.com/apache/hudi/issues/2852#issuecomment-823751580

I guess the documentation you have linked actually talks about the usage: ```This will ensure the input format classes with its dependencies are available for query planning & execution.```
[GitHub] [hudi] nsivabalan commented on issue #2855: [SUPPORT] hudi-utilities documentation
nsivabalan commented on issue #2855: URL: https://github.com/apache/hudi/issues/2855#issuecomment-823749990

Yes, HoodieDeltaStreamer, which ships in hudi-utilities-bundle, is heavily used by many users. https://issues.apache.org/jira/browse/HUDI-1815 @bvaradar: can you briefly go over everything hudi-utilities offers for end users?
[jira] [Updated] (HUDI-1815) Add readme to each bundle to give a brief intro about each bundle
[ https://issues.apache.org/jira/browse/HUDI-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1815:
--------------------------------------
    Labels: docs sev:normal  (was: )

> Add readme to each bundle to give a brief intro about each bundle
>
>                 Key: HUDI-1815
>                 URL: https://issues.apache.org/jira/browse/HUDI-1815
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Docs
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: docs, sev:normal
>
>
> Neither hudi-utilities-bundle nor hudi-spark-bundle has a README describing
> its purpose. Add a README with some details about each.
[jira] [Created] (HUDI-1815) Add readme to each bundle to give a brief intro about each bundle
sivabalan narayanan created HUDI-1815:
-----------------------------------------

             Summary: Add readme to each bundle to give a brief intro about each bundle
                 Key: HUDI-1815
                 URL: https://issues.apache.org/jira/browse/HUDI-1815
             Project: Apache Hudi
          Issue Type: Task
          Components: Docs
            Reporter: sivabalan narayanan


Neither hudi-utilities-bundle nor hudi-spark-bundle has a README describing its purpose. Add a README with some details about each.
[GitHub] [hudi] nsivabalan commented on issue #2850: [SUPPORT] S3 files skipped by HoodieDeltaStreamer on s3 bucket in continuous mode
nsivabalan commented on issue #2850: URL: https://github.com/apache/hudi/issues/2850#issuecomment-823748401

CC @xushiyan @bvaradar @n3nash
[GitHub] [hudi] nsivabalan commented on issue #2850: [SUPPORT] S3 files skipped by HoodieDeltaStreamer on s3 bucket in continuous mode
nsivabalan commented on issue #2850: URL: https://github.com/apache/hudi/issues/2850#issuecomment-823747511

We know of one bug ATM with DeltaStreamer where, if multiple files are present with the same modification time, DeltaStreamer could skip some of them. https://issues.apache.org/jira/browse/HUDI-1723 https://github.com/apache/hudi/pull/2845 Do you think yours falls into this category?
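The failure mode described above can be illustrated with a small Java sketch (not DeltaStreamer's actual code): if a source checkpoints the last seen modification time and filters candidate files with a strict greater-than, any file that later appears with that same timestamp is never picked up.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the same-modification-time skip: filtering with a
// strict '>' against the checkpointed mod time drops files that tie with it.
public class ModTimeCheckpointSkip {
  public static List<String> newFiles(long[] modTimes, String[] names, long checkpoint) {
    List<String> picked = new ArrayList<>();
    for (int i = 0; i < names.length; i++) {
      if (modTimes[i] > checkpoint) { // strict comparison drops ties with the checkpoint
        picked.add(names[i]);
      }
    }
    return picked;
  }
}
```

If the checkpoint was taken at time T after reading one file, a second file also stamped T (written an instant later, or listed mid-write) fails the `> T` test on the next poll and is silently skipped.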
[jira] [Created] (HUDI-1814) Non partitioned table for Flink writer
Danny Chen created HUDI-1814:
--------------------------------

             Summary: Non partitioned table for Flink writer
                 Key: HUDI-1814
                 URL: https://issues.apache.org/jira/browse/HUDI-1814
             Project: Apache Hudi
          Issue Type: New Feature
          Components: Docs
            Reporter: Danny Chen
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
codecov-commenter edited a comment on pull request #2853: URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report

> Merging [#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1e379c9) into [master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (62bb9e1) will **increase** coverage by `17.08%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@              Coverage Diff              @@
##             master    #2853       +/-  ##
============================================
+ Coverage     52.60%   69.68%   +17.08%
+ Complexity     3709      373     -3336
============================================
  Files           485       54      -431
  Lines         23224     1996    -21228
  Branches       2465      236     -2229
============================================
- Hits          12216     1391    -10825
+ Misses         9929      473     -9456
+ Partials       1079      132      -947
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.68% <ø> (-0.11%)` | `373.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==) | `62.50% <0.00%> (-2.72%)` | `9.00% <0.00%> (ø%)` | |
| [.../apache/hudi/timeline/service/TimelineService.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvVGltZWxpbmVTZXJ2aWNlLmphdmE=) | | | |
| [.../main/scala/org/apache/hudi/HoodieSparkUtils.scala](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrVXRpbHMuc2NhbGE=) | | | |
| [...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=) | | | |
| [...org/apache/hudi/common/model/TableServiceType.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1RhYmxlU2VydmljZVR5cGUuamF2YQ==) | | | |
| [...ava/org/apache/hudi/common/util/DateTimeUtils.java](https://codecov.io/gh/apache/hudi/
[GitHub] [hudi] wk888 commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException
wk888 commented on issue #2834: URL: https://github.com/apache/hudi/issues/2834#issuecomment-823728826

@yanghua I can find the .hoodie file in HDFS: ![image](https://user-images.githubusercontent.com/16316415/115488269-cf2ebd00-a28c-11eb-85ac-73ed631b6f31.png) But from the error log you can see it looks for the file under /tmp/hive/root/1c7ec12e-4953-4913-bf9f-a09372b51609/.hoodie, which seems to be the Hive temp directory, so it can't find the .hoodie file.
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
codecov-commenter edited a comment on pull request #2853: URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report

> Merging [#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1e379c9) into [master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (62bb9e1) will **decrease** coverage by `43.23%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@             Coverage Diff              @@
##             master   #2853       +/-  ##
===========================================
- Coverage     52.60%   9.36%   -43.24%
+ Complexity     3709      48     -3661
===========================================
  Files           485      54      -431
  Lines         23224    1996    -21228
  Branches       2465     236     -2229
===========================================
- Hits          12216     187    -12029
+ Misses         9929    1796     -8133
+ Partials       1079      13     -1066
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.36% <ø> (-60.43%)` | `48.00 <ø> (-325.00)` | |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS
[GitHub] [hudi] garyli1019 commented on a change in pull request #2847: [HUDI-1769]Add download page to the site
garyli1019 commented on a change in pull request #2847: URL: https://github.com/apache/hudi/pull/2847#discussion_r617151804

File path: docs/_pages/download.cn.md

```diff
@@ -7,29 +7,29 @@ last_modified_at: 2019-12-30T15:59:57-04:00
 ---
 ## Release 0.8.0
-* Source Release : [Apache Hudi 0.8.0 Source Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))
+* Source Release : [Apache Hudi 0.8.0 Source Release](https://www.apache.org/dyn/closer.lua/hudi/0.8.0/hudi-0.8.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))
```

Review comment: Probably. www.apache.org/dyn/closer.lua was also mentioned in the instruction email sent by the owner of annou...@apache.org, so I think this should be the right one to put on the site.
[GitHub] [hudi] MyLanPangzi closed pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
MyLanPangzi closed pull request #2853: URL: https://github.com/apache/hudi/pull/2853
[GitHub] [hudi] xiarixiaoyao commented on pull request #2722: [HUDI-1722]hive beeline/spark-sql query specified field on mor table occur NPE
xiarixiaoyao commented on pull request #2722: URL: https://github.com/apache/hudi/pull/2722#issuecomment-823714627

@lw309637554 @nsivabalan Thanks for your review. I will address testHoodieRealtimeCombineHoodieInputFormat in another PR, since it has nothing to do with this problem.
[jira] [Resolved] (HUDI-1744) [Rollback] rollback fails on MOR table when the partition path has no files
[ https://issues.apache.org/jira/browse/HUDI-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lrz resolved HUDI-1744.
-----------------------
    Resolution: Fixed

> [Rollback] rollback fails on MOR table when the partition path has no files
>
>                 Key: HUDI-1744
>                 URL: https://issues.apache.org/jira/browse/HUDI-1744
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: lrz
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> When rolling back a MOR table, if the partition path has no files, an
> exception is thrown because rdd.flatMap is called with 0 as numPartitions.
[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
xiarixiaoyao commented on a change in pull request #2716: URL: https://github.com/apache/hudi/pull/2716#discussion_r617137325 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java ## @@ -170,7 +170,7 @@ protected HoodieCombineFileInputFormatShim createInputFormatShim() { if (job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "").isEmpty()) { List partitions = new ArrayList<>(part.getPartSpec().keySet()); if (!partitions.isEmpty()) { -String partitionStr = String.join(",", partitions); Review comment: @nsivabalan see the function initObjectInspector in MapOperator.java (my Hive version is 3.1.1): // Next check if this table has partitions and if so // get the list of partition names as well as allocate // the serdes for the partition columns **line 189** String pcols = overlayedProps.getProperty(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS); **line 191** if (pcols != null && pcols.length() > 0) { **line 192** String[] partKeys = pcols.trim().split("/");
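The review thread above turns on a separator mismatch that can be shown with plain strings, no Hive or Hudi dependency needed: Hive's MapOperator splits the partition-column property by "/" (line 192 quoted above), while the removed code joined the partition names with ",". This is a minimal sketch of the round-trip problem, not the actual fix in the PR:

```java
import java.util.Arrays;
import java.util.List;

public class PartitionKeySeparatorDemo {
    public static void main(String[] args) {
        // A multi-level partition spec, e.g. a table partitioned by year/month/day.
        List<String> partitions = Arrays.asList("year", "month", "day");

        // Old behavior: the keys were joined with ",". Hive later splits the
        // property by "/" (MapOperator line 192 above) and sees ONE garbled key.
        String joinedWithComma = String.join(",", partitions);
        String[] seenByHive = joinedWithComma.trim().split("/");
        System.out.println(seenByHive.length); // 1 -- all three keys fused together

        // Joining with "/" instead makes the round trip lossless.
        String joinedWithSlash = String.join("/", partitions);
        String[] roundTripped = joinedWithSlash.trim().split("/");
        System.out.println(roundTripped.length); // 3
    }
}
```

Single-level partitions happen to survive either delimiter, which is presumably why the bug only surfaced with multi-level partitions.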
[GitHub] [hudi] yanghua commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException
yanghua commented on issue #2834: URL: https://github.com/apache/hudi/issues/2834#issuecomment-823705049 @wk888 OK, I reviewed the code. At `TableNotFoundException.java:53`, the path you provided triggered a `FileNotFoundException` or `IllegalArgumentException`. Are you sure the path exists?
[GitHub] [hudi] yanghua commented on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
yanghua commented on pull request #2853: URL: https://github.com/apache/hudi/pull/2853#issuecomment-823702360 Hi @MyLanPangzi Would you please recheck the Travis build? If the failure was not caused by your change, please retrigger the CI. Thanks.
[GitHub] [hudi] wk888 commented on issue #2834: [SUPPORT] Help~~~org.apache.hudi.exception.TableNotFoundException
wk888 commented on issue #2834: URL: https://github.com/apache/hudi/issues/2834#issuecomment-823700483 @yanghua it seems the user has no privilege to create the database, not the table; the table itself was created successfully
[GitHub] [hudi] nsivabalan commented on issue #2849: [SUPPORT] - org.apache.hudi.exception.HoodieIOException: Could not load Hoodie properties from file:/tmp/hudi_trips_cow/.hoodie/hoodie.properties
nsivabalan commented on issue #2849: URL: https://github.com/apache/hudi/issues/2849#issuecomment-823641624 Can you clean up the base path (`rm -rf /tmp/hudi_trips_cow`) once and retry? Sometimes there can be some residue left behind.
[GitHub] [hudi] nsivabalan commented on pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table
nsivabalan commented on pull request #2283: URL: https://github.com/apache/hudi/pull/2283#issuecomment-823619094 great job on the patch 👍
[GitHub] [hudi] vinothchandar commented on pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table
vinothchandar commented on pull request #2283: URL: https://github.com/apache/hudi/pull/2283#issuecomment-823611327 This is a great contribution. Thanks @pengzhiwei2018 !
[hudi] branch master updated: [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283)
This is an automated email from the ASF dual-hosted git repository. uditme pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new aacb8be [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283) aacb8be is described below commit aacb8be5213a64a3cc9ddd791e2321526517d044 Author: pengzhiwei AuthorDate: Wed Apr 21 05:21:38 2021 +0800 [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283) --- .../scala/org/apache/hudi/DataSourceOptions.scala | 4 + .../org/apache/hudi/HoodieSparkSqlWriter.scala | 99 ++ .../functional/HoodieSparkSqlWriterSuite.scala | 44 ++ .../main/java/org/apache/hudi/dla/DLASyncTool.java | 5 +- .../java/org/apache/hudi/dla/HoodieDLAClient.java | 7 +- .../java/org/apache/hudi/hive/HiveSyncConfig.java | 52 +++- .../java/org/apache/hudi/hive/HiveSyncTool.java| 12 ++- .../org/apache/hudi/hive/HoodieHiveClient.java | 27 +- .../org/apache/hudi/hive/util/ConfigUtils.java | 73 .../org/apache/hudi/hive/util/HiveSchemaUtil.java | 26 +- .../org/apache/hudi/hive/TestHiveSyncTool.java | 58 - .../hudi/sync/common/AbstractSyncHoodieClient.java | 16 +++- .../functional/TestHoodieDeltaStreamer.java| 7 ++ 13 files changed, 382 insertions(+), 48 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala index 4c76f5f..4643da5 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala @@ -353,6 +353,9 @@ object DataSourceWriteOptions { val HIVE_IGNORE_EXCEPTIONS_OPT_KEY = "hoodie.datasource.hive_sync.ignore_exceptions" val HIVE_SKIP_RO_SUFFIX = "hoodie.datasource.hive_sync.skip_ro_suffix" val HIVE_SUPPORT_TIMESTAMP = 
"hoodie.datasource.hive_sync.support_timestamp" + val HIVE_TABLE_PROPERTIES = "hoodie.datasource.hive_sync.table_properties" + val HIVE_TABLE_SERDE_PROPERTIES = "hoodie.datasource.hive_sync.serde_properties" + val HIVE_SYNC_AS_DATA_SOURCE_TABLE = "hoodie.datasource.hive_sync.sync_as_datasource" // DEFAULT FOR HIVE SPECIFIC CONFIGS val DEFAULT_HIVE_SYNC_ENABLED_OPT_VAL = "false" @@ -372,6 +375,7 @@ object DataSourceWriteOptions { val DEFAULT_HIVE_IGNORE_EXCEPTIONS_OPT_KEY = "false" val DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL = "false" val DEFAULT_HIVE_SUPPORT_TIMESTAMP = "false" + val DEFAULT_HIVE_SYNC_AS_DATA_SOURCE_TABLE = "true" // Async Compaction - Enabled by default for MOR val ASYNC_COMPACT_ENABLE_OPT_KEY = "hoodie.datasource.compaction.async.enable" diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index 340ac14..3a5b51e 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -36,6 +36,7 @@ import org.apache.hudi.common.util.{CommitUtils, ReflectionUtils} import org.apache.hudi.config.HoodieBootstrapConfig.{BOOTSTRAP_BASE_PATH_PROP, BOOTSTRAP_INDEX_CLASS_PROP, DEFAULT_BOOTSTRAP_INDEX_CLASS} import org.apache.hudi.config.HoodieWriteConfig import org.apache.hudi.exception.HoodieException +import org.apache.hudi.hive.util.ConfigUtils import org.apache.hudi.hive.{HiveSyncConfig, HiveSyncTool} import org.apache.hudi.internal.DataSourceInternalWriterHelper import org.apache.hudi.sync.common.AbstractSyncTool @@ -44,7 +45,10 @@ import org.apache.spark.SPARK_VERSION import org.apache.spark.SparkContext import org.apache.spark.api.java.JavaSparkContext import org.apache.spark.rdd.RDD -import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode} +import 
org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.internal.StaticSQLConf.SCHEMA_STRING_LENGTH_THRESHOLD +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode, SparkSession} import scala.collection.JavaConversions._ import scala.collection.mutable.ListBuffer @@ -220,7 +224,8 @@ private[hudi] object HoodieSparkSqlWriter { // Check for errors and commit the write. val (writeSuccessful, compactionInstant) = -commitAndPerformPostOperations(writeResult, parameters, writeClient, tableConfig, jsc, +commitAndPerformPostOperations(sqlContext.sparkSession, df.schema, + writeResult, parameters, writeClient, tableConfig, jsc, TableInstantInfo(basePath, instantTime, commitActionType, operation)) def unpersistRdd(rdd: RD
[GitHub] [hudi] umehrot2 merged pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table
umehrot2 merged pull request #2283: URL: https://github.com/apache/hudi/pull/2283
[jira] [Commented] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion
[ https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325964#comment-17325964 ] sivabalan narayanan commented on HUDI-1343: --- [~liujinhui] [~vbalaji]: Do you folks think this is still required after this fix [https://github.com/apache/hudi/pull/2765]? It fixes AvroConversionUtils.convertStructTypeToAvroSchema() to ensure null is the first entry in the union and the default value is set to null if a field is nullable in the Spark StructType. We have enabled the post-schema processor by default, so I wanted to double-check whether it's still applicable. > Add standard schema postprocessor which would rewrite the schema using > spark-avro conversion > > > Key: HUDI-1343 > URL: https://issues.apache.org/jira/browse/HUDI-1343 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Balaji Varadarajan >Assignee: liujinhui >Priority: Major > Labels: pull-request-available > Fix For: 0.7.0 > > > When we use Transformer, the final Schema which we use to convert avro record > to bytes is auto-generated by Spark. This could be different (due to the way > Avro treats it) from the target schema that is being used to write (as the > target schema could be coming from Schema Registry). 
> > For example : > Schema generated by spark-avro when converting Row to avro > { > "type" : "record", > "name" : "hoodie_source", > "namespace" : "hoodie.source", > "fields" : [ { > "name" : "_ts_ms", > "type" : [ "long", "null" ] > }, { > "name" : "_op", > "type" : "string" > }, { > "name" : "inc_id", > "type" : "int" > }, { > "name" : "year", > "type" : [ "int", "null" ] > }, { > "name" : "violation_desc", > "type" : [ "string", "null" ] > }, { > "name" : "violation_code", > "type" : [ "string", "null" ] > }, { > "name" : "case_individual_id", > "type" : [ "int", "null" ] > }, { > "name" : "flag", > "type" : [ "string", "null" ] > }, { > "name" : "last_modified_ts", > "type" : "long" > } ] > } > > is not compatible with the Avro Schema: > > { > "type" : "record", > "name" : "formatted_debezium_payload", > "fields" : [ { > "name" : "_ts_ms", > "type" : [ "null", "long" ], > "default" : null > }, { > "name" : "_op", > "type" : "string", > "default" : null > }, { > "name" : "inc_id", > "type" : "int", > "default" : null > }, { > "name" : "year", > "type" : [ "null", "int" ], > "default" : null > }, { > "name" : "violation_desc", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "violation_code", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "case_individual_id", > "type" : [ "null", "int" ], > "default" : null > }, { > "name" : "flag", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "last_modified_ts", > "type" : "long", > "default" : null > } ] > } > > Note that the type order is different for individual fields : > "type" : [ "null", "string" ], vs "type" : [ "string", "null" ] > Unexpectedly, Avro decoding fails when bytes written with first schema is > read using second schema. > > One way to fix is to use configured target schema when generating record > bytes but this is not easy without breaking Record payload constructor API > used by deltastreamer. 
> The other option is to apply a post-processor on the target schema to make it > schema-consistent with Transformer-generated records. > > This ticket is to use the latter approach of creating a standard schema > post-processor and adding it by default when Transformer is used.
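The union-order incompatibility described in the ticket comes down to how Avro's binary format encodes a union value: as the index of the chosen branch in the writer's union, followed by the value. A reader that decodes with only its own schema (i.e. without writer-schema resolution) interprets that index against a differently ordered union. The sketch below is a stdlib-only illustration of the index lookup, not real Avro encoding:

```java
import java.util.Arrays;
import java.util.List;

public class UnionOrderDemo {
    public static void main(String[] args) {
        // The writer's schema declares the field as ["string", "null"]. Avro's
        // binary format records which union branch was used as an index into
        // that list: 0 for the "string" branch here.
        List<String> writerUnion = Arrays.asList("string", "null");
        int encodedBranchIndex = writerUnion.indexOf("string"); // 0

        // A reader decoding with only its own schema ["null", "string"]
        // maps index 0 to the wrong branch.
        List<String> readerUnion = Arrays.asList("null", "string");
        System.out.println(readerUnion.get(encodedBranchIndex)); // "null" -- not the string branch that was written
    }
}
```

This is why either the bytes must be written with the configured target schema, or the target schema must be post-processed to match the union ordering the Transformer produces, as the ticket proposes.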
[jira] [Commented] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes
[ https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325963#comment-17325963 ] Vinoth Chandar commented on HUDI-648: - I see it linked now. I queued the PR up for review. > Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction > writes > > > Key: HUDI-648 > URL: https://issues.apache.org/jira/browse/HUDI-648 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer, Spark Integration, Writer Core >Reporter: Vinoth Chandar >Assignee: liujinhui >Priority: Major > Labels: pull-request-available, sev:normal, user-support-issues > Attachments: image-2021-03-03-11-40-21-083.png > > > We would like a way to hand the erroring records from writing or compaction > back to the users, in a separate table or log. This needs to work generically > across all the different writer paths.
[GitHub] [hudi] satishkotha commented on pull request #2809: [HUDI-1789] Support reading older snapshots
satishkotha commented on pull request #2809: URL: https://github.com/apache/hudi/pull/2809#issuecomment-823443803 @jsbali added a few comments. Can you also check why CI is failing?
[GitHub] [hudi] satishkotha commented on a change in pull request #2809: [HUDI-1789] Support reading older snapshots
satishkotha commented on a change in pull request #2809: URL: https://github.com/apache/hudi/pull/2809#discussion_r616870583 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieInputFormatUtils.java ## @@ -438,11 +437,20 @@ public static HoodieMetadataConfig buildMetadataConfig(Configuration conf) { if (LOG.isDebugEnabled()) { LOG.debug("Hoodie Metadata initialized with completed commit instant as :" + metaClient); } - HoodieTimeline timeline = HoodieHiveUtils.getTableTimeline(metaClient.getTableConfig().getTableName(), job, metaClient); + Review comment: Can we combine this into the getTableTimeline method? HoodieHiveUtils.getTableTimeline already gets all the config, so I think that provides a better abstraction for getting the relevant timeline. ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java ## @@ -137,4 +139,16 @@ public static HoodieTimeline getTableTimeline(final String tableName, final JobC // by default return all completed commits. return metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(); } + + public static Option getSnapshotMaxCommitTime(JobConf job, String tableName) { +String maxCommitTime = job.get(getSnapshotMaxCommitKey(tableName)); +if (maxCommitTime != null) { Review comment: consider using !StringUtils.isNullOrEmpty(), or simply return Option.ofNullable(maxCommitTime) ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java ## @@ -68,6 +69,7 @@ public static final String HOODIE_STOP_AT_COMPACTION_PATTERN = "hoodie.%s.ro.stop.at.compaction"; public static final String INCREMENTAL_SCAN_MODE = "INCREMENTAL"; public static final String SNAPSHOT_SCAN_MODE = "SNAPSHOT"; + public static final String HOODIE_SNAPSHOT_CONSUME_COMMIT_PATTERN = "hoodie.%s.consume.snapshot.time"; Review comment: Do you think we can reuse an existing config, perhaps HOODIE_CONSUME_COMMIT?
## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java ## @@ -137,4 +139,16 @@ public static HoodieTimeline getTableTimeline(final String tableName, final JobC // by default return all completed commits. return metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants(); } + + public static Option getSnapshotMaxCommitTime(JobConf job, String tableName) { Review comment: nit: consider adding javadoc for all public methods (I know we don't follow this consistently, but it would be great to add for all new code)
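The reviewer's Option.ofNullable suggestion can be sketched with java.util.Optional standing in for Hudi's own Option type and a plain Properties standing in for the Hadoop JobConf; both stand-ins are assumptions made here only to keep the example self-contained and runnable:

```java
import java.util.Optional;
import java.util.Properties;

public class SnapshotMaxCommitDemo {
    // Hypothetical stand-in for HoodieHiveUtils.getSnapshotMaxCommitKey(tableName),
    // following the HOODIE_SNAPSHOT_CONSUME_COMMIT_PATTERN from the diff above.
    static String getSnapshotMaxCommitKey(String tableName) {
        return String.format("hoodie.%s.consume.snapshot.time", tableName);
    }

    // The reviewer's suggestion: collapse the explicit null check into one line.
    static Optional<String> getSnapshotMaxCommitTime(Properties job, String tableName) {
        return Optional.ofNullable(job.getProperty(getSnapshotMaxCommitKey(tableName)));
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        System.out.println(getSnapshotMaxCommitTime(job, "trips").isPresent()); // false

        job.setProperty("hoodie.trips.consume.snapshot.time", "20210420093000");
        System.out.println(getSnapshotMaxCommitTime(job, "trips").get()); // 20210420093000
    }
}
```

One caveat worth weighing against the !StringUtils.isNullOrEmpty() variant: ofNullable treats an empty string as present, so an explicitly empty config value would no longer be filtered out.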
[jira] [Commented] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325961#comment-17325961 ] Vinoth Chandar commented on HUDI-251: - Please feel free to take over the RFC as well. I can give you perms > JDBC incremental load to HUDI with DeltaStreamer > > > Key: HUDI-251 > URL: https://issues.apache.org/jira/browse/HUDI-251 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Affects Versions: 0.9.0 >Reporter: Taher Koitawala >Assignee: Sagar Sumit >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, > for such use cases, DeltaStreamer should provide inbuilt support. > DeltaStreamer should accept something like jdbc-source.properties where users > can define the RDBMS connection properties along with a timestamp column and > an interval which allows users to express how frequently HUDI should check > with RDBMS data source for new inserts or updates. > Details are documented in RFC-14 > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller
[jira] [Commented] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325958#comment-17325958 ] Vinoth Chandar commented on HUDI-251: - On 2, I think we have to enforce some sorting when limiting (if you are pulling very incrementally, hopefully it won't be as bad), and given we persist, to derive the checkpoint we will pick the maximum value of the `ckpt` column each time, and we should be okay. >where ckpt > last_ckpt order by ckpt desc limit x yes. we are on the same page. We have to sort and paginate like this. >Can you please elaborate more on the tailing mechanism? What I meant was there could be scenarios where we could still miss data in this JDBC-based approach. We should clearly document these. E.g. as we fetch `ckpt > 10`, there could be a long-running transaction that just committed an earlier `ckpt=8` value. We would just fetch all records from 10 and move on. Let's also think through other issues like this. I think it's okay, since everybody understands JDBC pulling is more for convenience than anything; it works correctly when you don't run into these cases. Does that make sense? > JDBC incremental load to HUDI with DeltaStreamer > > > Key: HUDI-251 > URL: https://issues.apache.org/jira/browse/HUDI-251 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Affects Versions: 0.9.0 >Reporter: Taher Koitawala >Assignee: Purushotham Pushpavanthar >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, > for such use cases, DeltaStreamer should provide inbuilt support. 
> DeltaStreamer should accept something like jdbc-source.properties where users > can define the RDBMS connection properties along with a timestamp column and > an interval which allows users to express how frequently HUDI should check > with RDBMS data source for new inserts or updates. > Details are documented in RFC-14 > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller
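The sort-and-paginate checkpointing scheme discussed in the comment above can be sketched without a live database. The table and column names are illustrative, not the actual DeltaStreamer JDBC source implementation; an ascending sort is used here so a bounded batch never skips rows between two checkpoints, and the next checkpoint is taken as the maximum `ckpt` value seen in the batch, as the comment describes:

```java
import java.util.Arrays;
import java.util.List;

public class JdbcIncrementalPullDemo {
    // Build the incremental pull query; table and column names are illustrative.
    static String buildQuery(String table, String ckptColumn, long lastCkpt, int limit) {
        return String.format("SELECT * FROM %s WHERE %s > %d ORDER BY %s LIMIT %d",
                table, ckptColumn, lastCkpt, ckptColumn, limit);
    }

    // The next checkpoint is the maximum ckpt value seen in the fetched batch;
    // an empty batch leaves the checkpoint unchanged.
    static long nextCheckpoint(List<Long> fetchedCkpts, long lastCkpt) {
        return fetchedCkpts.stream().mapToLong(Long::longValue).max().orElse(lastCkpt);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("source_table", "last_modified_ts", 10L, 1000));
        // A batch fetched with ckpt > 10:
        System.out.println(nextCheckpoint(Arrays.asList(11L, 14L, 12L), 10L)); // 14
        // Empty batch: the checkpoint stays put.
        System.out.println(nextCheckpoint(Arrays.<Long>asList(), 10L)); // 10
    }
}
```

Note the data-miss scenario from the comment is visible here: a transaction that commits `ckpt=8` after the checkpoint has advanced past 10 is never picked up, which is why the thread suggests documenting this limitation rather than trying to close the gap.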
[jira] [Assigned] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar reassigned HUDI-251: --- Assignee: Sagar Sumit (was: Purushotham Pushpavanthar) > JDBC incremental load to HUDI with DeltaStreamer > > > Key: HUDI-251 > URL: https://issues.apache.org/jira/browse/HUDI-251 > Project: Apache Hudi > Issue Type: New Feature > Components: DeltaStreamer >Affects Versions: 0.9.0 >Reporter: Taher Koitawala >Assignee: Sagar Sumit >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, > for such use cases, DeltaStreamer should provide inbuilt support. > DeltaStreamer should accept something like jdbc-source.properties where users > can define the RDBMS connection properties along with a timestamp column and > an interval which allows users to express how frequently HUDI should check > with RDBMS data source for new inserts or updates. > Details are documented in RFC-14 > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller
[GitHub] [hudi] satishkotha merged pull request #2773: [HUDI-1764] Add Hudi-CLI support for clustering
satishkotha merged pull request #2773: URL: https://github.com/apache/hudi/pull/2773
[hudi] branch master updated: [HUDI-1764] Add Hudi-CLI support for clustering (#2773)
This is an automated email from the ASF dual-hosted git repository. satish pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 3253079 [HUDI-1764] Add Hudi-CLI support for clustering (#2773) 3253079 is described below commit 3253079507d74f6d52e78ad7f88b297daf969455 Author: Jintao Guan AuthorDate: Tue Apr 20 09:46:42 2021 -0700 [HUDI-1764] Add Hudi-CLI support for clustering (#2773) * tmp base * update * update unit test * update * update * update CLI parameters * linting * update doSchedule in HoodieClusteringJob * update * update diff according to comments --- .../hudi/cli/commands/ClusteringCommand.java | 107 + .../org/apache/hudi/cli/commands/SparkMain.java| 43 - .../apache/hudi/utilities/HoodieClusteringJob.java | 5 +- 3 files changed, 153 insertions(+), 2 deletions(-) diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ClusteringCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ClusteringCommand.java new file mode 100644 index 000..092f927 --- /dev/null +++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ClusteringCommand.java @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.cli.commands; + +import org.apache.hudi.cli.HoodieCLI; +import org.apache.hudi.cli.commands.SparkMain.SparkCommand; +import org.apache.hudi.cli.utils.InputStreamConsumer; +import org.apache.hudi.cli.utils.SparkUtil; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.utilities.UtilHelpers; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.launcher.SparkLauncher; +import org.apache.spark.util.Utils; +import org.springframework.shell.core.CommandMarker; +import org.springframework.shell.core.annotation.CliCommand; +import org.springframework.shell.core.annotation.CliOption; +import org.springframework.stereotype.Component; +import scala.collection.JavaConverters; + +@Component +public class ClusteringCommand implements CommandMarker { + + private static final Logger LOG = LogManager.getLogger(ClusteringCommand.class); + + @CliCommand(value = "clustering schedule", help = "Schedule Clustering") + public String scheduleClustering( + @CliOption(key = "sparkMemory", help = "Spark executor memory", + unspecifiedDefaultValue = "1G") final String sparkMemory, + @CliOption(key = "propsFilePath", help = "path to properties file on localfs or dfs with configurations for hoodie client for clustering", + unspecifiedDefaultValue = "") final String propsFilePath, + @CliOption(key = "hoodieConfigs", help = "Any configuration that can be set in the properties file can be passed here in the form of an array", + unspecifiedDefaultValue = "") final String[] configs) throws Exception { +HoodieTableMetaClient client = HoodieCLI.getTableMetaClient(); +boolean initialized = HoodieCLI.initConf(); +HoodieCLI.initFS(initialized); + +String sparkPropertiesPath = + 
Utils.getDefaultPropertiesFile(JavaConverters.mapAsScalaMapConverter(System.getenv()).asScala()); +SparkLauncher sparkLauncher = SparkUtil.initLauncher(sparkPropertiesPath); + +// First get a clustering instant time and pass it to spark launcher for scheduling clustering +String clusteringInstantTime = HoodieActiveTimeline.createNewInstantTime(); + +sparkLauncher.addAppArgs(SparkCommand.CLUSTERING_SCHEDULE.toString(), client.getBasePath(), +client.getTableConfig().getTableName(), clusteringInstantTime, sparkMemory, propsFilePath); +UtilHelpers.validateAndAddProperties(configs, sparkLauncher); +Process process = sparkLauncher.launch(); +InputStreamConsumer.captureOutput(process); +int exitCode = process.waitFor(); +if (exitCode != 0) { + return "Failed to schedule clustering for " + clusteringInstantTime; +} +return "Succ
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325954#comment-17325954 ] Vinoth Chandar commented on HUDI-1138: -- [~309637554] Please let me know if you are interested in taking a swing at this. > Re-implement marker files via timeline server > - > > Key: HUDI-1138 > URL: https://issues.apache.org/jira/browse/HUDI-1138 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Priority: Blocker > Fix For: 0.9.0 > > > Even as you can argue that RFC-15/consolidated metadata removes the need for > deleting partial files written due to Spark task failures/stage retries, it > will still leave extra files inside the table (and users will pay for it > every month), and we need the marker mechanism to be able to delete these > partial files. > Here we explore if we can improve the current marker file mechanism, that > creates one marker file per data file written, by > delegating the createMarker() call to the driver/timeline server, and having it > create marker metadata into a single file handle that is flushed for > durability guarantees > > P.S: I was tempted to think the Spark listener mechanism can help us deal with > failed tasks, but it has no guarantees. The writer job could die without > deleting a partial file. I.e. it can improve things, but can't provide > guarantees
[jira] [Comment Edited] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325952#comment-17325952 ] Vinoth Chandar edited comment on HUDI-1138 at 4/20/21, 4:41 PM: yes. basic idea here is to cc [~309637554] 0) Maintain the marker file list in a single file called `markers` under .hoodie/temp// (or whatever path we write this today) 1) Add a new endpoint to the timeline server, `/createMarkerFile`, which returns 200 only if it successfully reads the `markers` file, adds an entry to it, and overwrites the `markers` file on underlying cloud storage. 2) We employ some batching here, such that we can batch all requests that arrive in a 100-500ms window into a single overwrite operation. I think this will work really well (based on similar things I have done before). wdyt? Before this, we should also study how effective the current parallelization is. So hacking up a PoC to see the perf gains would be an interesting first step. was (Author: vc): yes. basic idea here is to 0) Maintain the marker file list in a single file called `markers` under .hoodie/temp// (or whatever path we write this today) 1) Add a new endpoint to the timeline server, `/createMarkerFile`, which returns 200 only if it successfully reads the `markers` file, adds an entry to it, and overwrites the `markers` file on underlying cloud storage. 2) We employ some batching here, such that we can batch all requests that arrive in a 100-500ms window into a single overwrite operation. I think this will work really well (based on similar things I have done before). wdyt? Before this, we should also study how effective the current parallelization is. So hacking up a PoC to see the perf gains would be an interesting first step. 
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325952#comment-17325952 ] Vinoth Chandar commented on HUDI-1138:
Yes, the basic idea here is to:
0) Maintain the marker file list in a single file called `markers` under .hoodie/temp// (or whatever path we write this to today).
1) Add a new endpoint to the timeline server, `/createMarkerFile`, which returns 200 only if it successfully reads the `markers` file, adds an entry to it, and overwrites `markers` on the underlying cloud storage.
2) Employ some batching here, so that all requests arriving within a 100-500 ms window are folded into a single overwrite operation.
I think this will work really well (based on similar things I have done before). wdyt? Before this, we should also study how effective the current parallelization is, so hacking up a PoC to see the perf gains would be an interesting first step.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
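The batching idea sketched in the comment above (coalesce all `/createMarkerFile` requests that arrive within a small time window into a single overwrite of one `markers` file) can be illustrated with a toy sketch. This is a hypothetical illustration of the proposal, not Hudi's actual timeline-server code; the class and method names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed design: instead of one marker file per data file,
// the timeline server appends each request to an in-memory pending list and
// periodically rewrites a single `markers` file for the whole batch.
public class MarkerBatcher {
  private final List<String> markers = new ArrayList<>(); // contents of the single `markers` file
  private final List<String> pending = new ArrayList<>(); // requests waiting for the next flush
  private int overwrites = 0;                             // storage overwrite operations performed

  // Called once per /createMarkerFile request.
  public synchronized void createMarker(String name) {
    pending.add(name);
  }

  // Called once per batching window (e.g. every 100-500 ms): read the marker
  // list, fold in all pending entries, and overwrite the file once.
  public synchronized void flush() {
    if (pending.isEmpty()) {
      return;
    }
    markers.addAll(pending);
    pending.clear();
    overwrites++;
  }

  public static void main(String[] args) {
    MarkerBatcher batcher = new MarkerBatcher();
    for (int i = 0; i < 100; i++) {
      batcher.createMarker("marker-" + i);
    }
    batcher.flush();
    System.out.println(batcher.markers.size() + " markers, " + batcher.overwrites + " overwrite(s)");
  }
}
```

With per-file markers the same 100 requests would cost 100 storage operations; here they collapse into one, which is the perf gain the PoC would measure.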
[jira] [Updated] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing
[ https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1652:
Status: Patch Available (was: In Progress)

> DiskBasedMap: As time goes by, the number of /temp/* file handles held by the executor process is increasing
> Key: HUDI-1652
> URL: https://issues.apache.org/jira/browse/HUDI-1652
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Affects Versions: 0.6.0
> Reporter: wangmeng
> Assignee: Balaji Varadarajan
> Priority: Major
> Labels: sev:critical, user-support-issues
>
> We encountered a problem in the Hudi production environment that is very similar to HUDI-945.
> *Software environment:* Spark 2.4.5, Hudi 0.6
> *Scenario:* consuming Kafka data and writing to Hudi, using Spark Streaming (not Structured Streaming).
> *Problem:* As time goes by, the number of /temp/* file handles held by the executor process keeps increasing:
> /tmp/10ded0f7-1bcc-4316-91e9-9b4d0507e1e0
> /tmp/49251680-0efd-4cc4-a55e-1af2038d3900
> /tmp/cc7dd284-3444-4c17-a5c8-84b3090c17f9
> *Reason analysis:* The HoodieMergeHandle class uses ExternalSpillableMap, which uses DiskBasedMap to flush overflowed data to disk. But the file stream is only closed, and the file deleted, by a shutdown hook when the JVM exits. When the clear method runs, the stream is not closed and the file is not deleted. As a result, more and more file handles are held over time, leading to errors. This error is similar to HUDI-945.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
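The leak pattern described in the issue above, and the eager cleanup that avoids it, can be sketched in a few lines. This is a hypothetical illustration of the general pattern, not Hudi's DiskBasedMap code; the `SpillFile` class is invented for the example:

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// A spill file whose handle would leak if we relied only on a JVM shutdown
// hook (as the issue describes): close the stream and delete the file
// eagerly instead, so clear()-style calls release resources immediately.
public class SpillFile implements Closeable {
  private final Path path;
  private final OutputStream out;

  public SpillFile() throws IOException {
    this.path = Files.createTempFile("hudi-spill-", ".data");
    this.out = Files.newOutputStream(path);
  }

  public void write(byte[] bytes) throws IOException {
    out.write(bytes);
  }

  // Eager cleanup: close the handle and remove the temp file now,
  // rather than waiting for File.deleteOnExit()/shutdown hooks.
  @Override
  public void close() throws IOException {
    out.close();
    Files.deleteIfExists(path);
  }

  public static void main(String[] args) throws IOException {
    SpillFile spill = new SpillFile();
    spill.write("spilled".getBytes());
    Path p = spill.path;
    spill.close();
    System.out.println("exists after close: " + Files.exists(p));
  }
}
```

The point of the fix is simply that the map's clear path should call something like this `close()` per spill file, instead of deferring both the stream close and the delete to JVM exit.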
[jira] [Resolved] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing
[ https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-1652.
Fix Version/s: 0.7.0
Resolution: Fixed
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing
[ https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1652:
Status: In Progress (was: Open)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing
[ https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-1652:
Assignee: Balaji Varadarajan
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing
[ https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1652:
Status: Closed (was: Patch Available)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HUDI-1652) DiskBasedMap:As time goes by, the number of /temp/***** file handles held by the executor process is increasing
[ https://issues.apache.org/jira/browse/HUDI-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reopened HUDI-1652.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 4f86e1d Travis CI build asf-site 4f86e1d is described below commit 4f86e1d7d5f030bee450d5e8f6a760337fa6977b Author: CI AuthorDate: Tue Apr 20 16:32:11 2021 + Travis CI build asf-site --- content/assets/js/lunr/lunr-store.js| 2 +- content/blog/hudi-key-generators/index.html | 21 - 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/content/assets/js/lunr/lunr-store.js b/content/assets/js/lunr/lunr-store.js index d73f822..02f99da 100644 --- a/content/assets/js/lunr/lunr-store.js +++ b/content/assets/js/lunr/lunr-store.js @@ -1680,7 +1680,7 @@ var store = [{ "url": "https://hudi.apache.org/blog/hudi-clustering-intro/";, "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{ "title": "Apache Hudi Key Generators", -"excerpt":"Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of record key and partition path where the record belongs to. Hudi has imposed this constraint so that updates and deletes can be applied to the record of interest. Hudi relies on the partition path field...","categories": ["blog"], +"excerpt":"Every record in Hudi is uniquely identified by a primary key, which is a pair of record key and partition path where the record belongs to. Using primary keys, Hudi can impose a) partition level uniqueness integrity constraint b) enable fast updates and deletes on records. 
One should choose the...","categories": ["blog"], "tags": [], "url": "https://hudi.apache.org/blog/hudi-key-generators/";, "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{ diff --git a/content/blog/hudi-key-generators/index.html b/content/blog/hudi-key-generators/index.html index 88d128c..6f87f06 100644 --- a/content/blog/hudi-key-generators/index.html +++ b/content/blog/hudi-key-generators/index.html @@ -197,18 +197,21 @@ } -Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of record key and partition path where the -record belongs to. Hudi has imposed this constraint so that updates and deletes can be applied to the record of interest. -Hudi relies on the partition path field to partition your dataset and records within a partition have unique record keys. -Since uniqueness is guaranteed only within the partition, there could be records with same record keys across different -partitions. One should choose the partition field wisely as it could be a determining factor for your ingestion and -query latency. +Every record in Hudi is uniquely identified by a primary key, which is a pair of record key and partition path where +the record belongs to. Using primary keys, Hudi can impose a) partition level uniqueness integrity constraint +b) enable fast updates and deletes on records. One should choose the partitioning scheme wisely as it could be a +determining factor for your ingestion and query latency. + +In general, Hudi supports both partitioned and global indexes. For a dataset with partitioned index(which is most +commonly used), each record is uniquely identified by a pair of record key and partition path. But for a dataset with +global index, each record is uniquely identified by just the record key. There won’t be any duplicate record keys across +partitions. Key Generators -Hudi exposes a number of out of the box key generators that customers can use based on their need. 
Or can have their -own implementation for the KeyGenerator. This blog goes over all different types of key generators that are readily -available to use. +Hudi provides several key generators out of the box that users can use based on their need, while having a pluggable +implementation for users to implement and use their own KeyGenerator. This blog goes over all different types of key +generators that are readily available to use. https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java";>Here is the interface for KeyGenerator in Hudi for your reference.
[GitHub] [hudi] nsivabalan commented on a change in pull request #2847: [HUDI-1769]Add download page to the site
nsivabalan commented on a change in pull request #2847: URL: https://github.com/apache/hudi/pull/2847#discussion_r616847950 ## File path: docs/_pages/download.cn.md ## @@ -7,29 +7,29 @@ last_modified_at: 2019-12-30T15:59:57-04:00 --- ## Release 0.8.0 -* Source Release : [Apache Hudi 0.8.0 Source Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512)) +* Source Release : [Apache Hudi 0.8.0 Source Release](https://www.apache.org/dyn/closer.lua/hudi/0.8.0/hudi-0.8.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512)) Review comment: probably it chooses a mirror location closer to your geo location. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #2847: [HUDI-1769]Add download page to the site
nsivabalan commented on a change in pull request #2847: URL: https://github.com/apache/hudi/pull/2847#discussion_r616847043 ## File path: docs/_pages/download.cn.md ## @@ -7,29 +7,29 @@ last_modified_at: 2019-12-30T15:59:57-04:00 --- ## Release 0.8.0 -* Source Release : [Apache Hudi 0.8.0 Source Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512)) +* Source Release : [Apache Hudi 0.8.0 Source Release](https://www.apache.org/dyn/closer.lua/hudi/0.8.0/hudi-0.8.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512)) Review comment: https://www.apache.org/dyn/closer.lua/hudi/ redirects me to https://apache.osuosl.org/hudi/ fyi. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch asf-site updated: [MINOR] Fixing key generators blog content (#2739)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 860abd0 [MINOR] Fixing key generators blog content (#2739) 860abd0 is described below commit 860abd04cbb3e78265ba9a300bb0cd849fff7e44 Author: Sivabalan Narayanan AuthorDate: Tue Apr 20 12:17:30 2021 -0400 [MINOR] Fixing key generators blog content (#2739) --- docs/_posts/2021-02-13-hudi-key-generators.md | 21 - 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/docs/_posts/2021-02-13-hudi-key-generators.md b/docs/_posts/2021-02-13-hudi-key-generators.md index 5076ec6..fc3faa8 100644 --- a/docs/_posts/2021-02-13-hudi-key-generators.md +++ b/docs/_posts/2021-02-13-hudi-key-generators.md @@ -5,18 +5,21 @@ author: shivnarayan category: blog --- -Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of record key and partition path where the -record belongs to. Hudi has imposed this constraint so that updates and deletes can be applied to the record of interest. -Hudi relies on the partition path field to partition your dataset and records within a partition have unique record keys. -Since uniqueness is guaranteed only within the partition, there could be records with same record keys across different -partitions. One should choose the partition field wisely as it could be a determining factor for your ingestion and -query latency. +Every record in Hudi is uniquely identified by a primary key, which is a pair of record key and partition path where +the record belongs to. Using primary keys, Hudi can impose a) partition level uniqueness integrity constraint +b) enable fast updates and deletes on records. One should choose the partitioning scheme wisely as it could be a +determining factor for your ingestion and query latency. + +In general, Hudi supports both partitioned and global indexes. 
For a dataset with partitioned index(which is most +commonly used), each record is uniquely identified by a pair of record key and partition path. But for a dataset with +global index, each record is uniquely identified by just the record key. There won't be any duplicate record keys across +partitions. ## Key Generators -Hudi exposes a number of out of the box key generators that customers can use based on their need. Or can have their -own implementation for the KeyGenerator. This blog goes over all different types of key generators that are readily -available to use. +Hudi provides several key generators out of the box that users can use based on their need, while having a pluggable +implementation for users to implement and use their own KeyGenerator. This blog goes over all different types of key +generators that are readily available to use. [Here](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenerator.java) is the interface for KeyGenerator in Hudi for your reference.
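The distinction the corrected blog text draws between partitioned and global indexes (identity is the record key plus partition path vs. the record key alone) can be shown with a toy example. This is an invented illustration of the uniqueness scopes only, not Hudi's index implementation:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy demonstration: the same record key in two partitions counts as two
// distinct records under a partitioned index, but as one record (a duplicate)
// under a global index.
public class KeyScopes {
  public static void main(String[] args) {
    // Each record is (partitionPath, recordKey).
    List<String[]> records = Arrays.asList(
        new String[] {"2021/04/20", "id-1"},
        new String[] {"2021/04/21", "id-1"}, // same record key, different partition
        new String[] {"2021/04/21", "id-2"});

    Set<String> partitionedIdentity = new HashSet<>();
    Set<String> globalIdentity = new HashSet<>();
    for (String[] r : records) {
      partitionedIdentity.add(r[0] + "|" + r[1]); // pair identity: partition path + record key
      globalIdentity.add(r[1]);                   // record-key-only identity
    }
    System.out.println("partitioned index sees " + partitionedIdentity.size() + " unique records");
    System.out.println("global index sees " + globalIdentity.size() + " unique records");
  }
}
```

This is why, as the blog says, a global index cannot tolerate the same record key appearing in two partitions, while a partitioned index can.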
[GitHub] [hudi] nsivabalan merged pull request #2739: [MINOR] Fixing key generators blog content
nsivabalan merged pull request #2739: URL: https://github.com/apache/hudi/pull/2739 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] PavelPetukhov opened a new issue #2856: [SUPPORT] Metrics Prometheus pushgateway
PavelPetukhov opened a new issue #2856: URL: https://github.com/apache/hudi/issues/2856 I have discovered that you've added prometheus related changes like here https://issues.apache.org/jira/browse/HUDI-210 But unfortunately there is no documentation related to pushing hudi metrics to Prometheus Push Gateway https://hudi.apache.org/docs/metrics.html#hoodiemetrics What parameters should be set in order to do that? **Environment Description** * Hudi version : 0.6.0 * Spark version : 2.4.7 * Hadoop version : 2.7 * Storage (HDFS/S3/GCS..) : hdfs * Running on Docker? (yes/no) : no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
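For the question above, Hudi's metrics subsystem gained a Prometheus Pushgateway reporter type around the HUDI-210 work. The property names below are recalled from the Hudi metrics configuration and should be verified against the docs/source of the exact Hudi version in use; the host/port values are placeholders:

```properties
# Sketch of write-config properties for pushing Hudi metrics to a
# Prometheus Pushgateway (names to be confirmed for your Hudi version).
hoodie.metrics.on=true
hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY
hoodie.metrics.pushgateway.host=my-pushgateway.example.com
hoodie.metrics.pushgateway.port=9091
```

Additional knobs such as a job name and a report period may also exist for the pushgateway reporter, depending on the release.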
[GitHub] [hudi] raphaelauv opened a new issue #2855: [SUPPORT] hudi-utilities documentation
raphaelauv opened a new issue #2855: URL: https://github.com/apache/hudi/issues/2855 **Describe the problem you faced** The hudi-utilities are used in the [Docker Demo](https://hudi.apache.org/docs/docker_demo.html), but there is no documentation on their purpose or whether they can be considered prod-ready jobs. **Expected behavior** A README inside the hudi-utilities folder, or some lines in the documentation, explaining their purpose and whether they can be considered prod-ready jobs. Thank you all -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] garyli1019 commented on pull request #2847: [HUDI-1769]Add download page to the site
garyli1019 commented on pull request #2847: URL: https://github.com/apache/hudi/pull/2847#issuecomment-823322655 > @garyli1019 the download links are pointing to dist.apache.org tar balls?? > > while we are at it, can we also update release cwiki page with updating this for each release? @vinothchandar fixed. Changed the link to the mirror site like other apache projects. And added the updating `download.md` step to the README instructions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-1809) Flink merge on read input split uses wrong base file path for default merge type
[ https://issues.apache.org/jira/browse/HUDI-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-1809.
Resolution: Fixed
d6d52c60636ae6a0c16469fa6761d0080fddf72f

> Flink merge on read input split uses wrong base file path for default merge type
> Key: HUDI-1809
> URL: https://issues.apache.org/jira/browse/HUDI-1809
> Project: Apache Hudi
> Issue Type: Bug
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: Danny Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Should use the base file path instead of the table path.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[hudi] branch master updated: [HUDI-1809] Flink merge on read input split uses wrong base file path for default merge type (#2846)
This is an automated email from the ASF dual-hosted git repository. vinoyang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new d6d52c6 [HUDI-1809] Flink merge on read input split uses wrong base file path for default merge type (#2846) d6d52c6 is described below commit d6d52c60636ae6a0c16469fa6761d0080fddf72f Author: Danny Chan AuthorDate: Tue Apr 20 21:27:09 2021 +0800 [HUDI-1809] Flink merge on read input split uses wrong base file path for default merge type (#2846) --- .../table/format/mor/MergeOnReadInputFormat.java | 38 +- .../org/apache/hudi/util/RowDataProjection.java| 61 ++ .../apache/hudi/table/format/TestInputFormat.java | 1 + 3 files changed, 88 insertions(+), 12 deletions(-) diff --git a/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java b/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java index 12bebdf..1186cff 100644 --- a/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java +++ b/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java @@ -29,6 +29,7 @@ import org.apache.hudi.table.format.FormatUtils; import org.apache.hudi.table.format.cow.ParquetColumnarRowSplitReader; import org.apache.hudi.table.format.cow.ParquetSplitReaderUtil; import org.apache.hudi.util.AvroToRowDataConverters; +import org.apache.hudi.util.RowDataProjection; import org.apache.hudi.util.RowDataToAvroConverters; import org.apache.hudi.util.StreamerUtil; import org.apache.hudi.util.StringToRowDataConverter; @@ -63,6 +64,7 @@ import java.util.stream.IntStream; import static org.apache.flink.table.data.vector.VectorizedColumnBatch.DEFAULT_SIZE; import static org.apache.flink.table.filesystem.RowPartitionComputer.restorePartValueFromType; +import static org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_COMMIT_TIME_COL_POS; 
import static org.apache.hudi.hadoop.utils.HoodieInputFormatUtils.HOODIE_RECORD_KEY_COL_POS; import static org.apache.hudi.table.format.FormatUtils.buildAvroRecordBySchema; @@ -180,7 +182,7 @@ public class MergeOnReadInputFormat new Schema.Parser().parse(this.tableState.getAvroSchema()), new Schema.Parser().parse(this.tableState.getRequiredAvroSchema()), this.requiredPos, - getFullSchemaReader(split.getTablePath())); + getFullSchemaReader(split.getBasePath().get())); } else { throw new HoodieException("Unable to select an Iterator to read the Hoodie MOR File Split for " + "file path: " + split.getBasePath() @@ -337,7 +339,7 @@ public class MergeOnReadInputFormat // efficient. if (split.getInstantRange().isPresent()) { // based on the fact that commit time is always the first field - String commitTime = curAvroRecord.get().get(0).toString(); + String commitTime = curAvroRecord.get().get(HOODIE_COMMIT_TIME_COL_POS).toString(); if (!split.getInstantRange().get().isInRange(commitTime)) { // filter out the records that are not in range return hasNext(); @@ -431,6 +433,11 @@ public class MergeOnReadInputFormat // iterator for log files private final Iterator iterator; +// add the flag because the flink ParquetColumnarRowSplitReader is buggy: +// method #reachedEnd() returns false after it returns true. +// refactor it out once FLINK-22370 is resolved. 
+private boolean readLogs = false; + private RowData currentRecord; SkipMergeIterator(ParquetColumnarRowSplitReader reader, Iterator iterator) { @@ -440,10 +447,11 @@ public class MergeOnReadInputFormat @Override public boolean reachedEnd() throws IOException { - if (!this.reader.reachedEnd()) { + if (!readLogs && !this.reader.reachedEnd()) { currentRecord = this.reader.nextRecord(); return false; } + readLogs = true; if (this.iterator.hasNext()) { currentRecord = this.iterator.next(); return false; @@ -479,6 +487,12 @@ public class MergeOnReadInputFormat private final AvroToRowDataConverters.AvroToRowDataConverter avroToRowDataConverter; private final GenericRecordBuilder recordBuilder; +private final RowDataProjection projection; +// add the flag because the flink ParquetColumnarRowSplitReader is buggy: +// method #reachedEnd() returns false after it returns true. +// refactor it out once FLINK-22370 is resolved. +private boolean readLogs = false; + private Set keyToSkip = new HashSet<>(); private RowData currentRecord; @@ -501,11 +515,12 @@ public class MergeOnReadInputFormat this.recordBuilder = new GenericRecordBuilder(requiredSchema); this.rowDataToAvroConve
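The readLogs guard added in the commit above can be sketched as follows. This is a hypothetical, simplified illustration: the real classes (ParquetColumnarRowSplitReader, RowData) are replaced with stand-ins, and only the flag logic mirrors the patch — once the reader reports end-of-input it is never consulted again, because the buggy reader may return false from reachedEnd() after having returned true (FLINK-22370).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified sketch of the SkipMergeIterator pattern from the commit above.
// SplitReader stands in for ParquetColumnarRowSplitReader; String stands in
// for RowData. Names are illustrative only.
public class SkipMergeIteratorSketch {

  /** Stand-in for the Flink parquet split reader. */
  interface SplitReader {
    boolean reachedEnd();
    String nextRecord();
  }

  static class SkipMergeIterator {
    private final SplitReader reader;    // base (parquet) file records
    private final Iterator<String> logs; // log file records
    // Once the reader is exhausted we never consult it again, because the
    // buggy reader may report reachedEnd() == false after returning true.
    private boolean readLogs = false;
    private String currentRecord;

    SkipMergeIterator(SplitReader reader, Iterator<String> logs) {
      this.reader = reader;
      this.logs = logs;
    }

    boolean reachedEnd() {
      if (!readLogs && !reader.reachedEnd()) {
        currentRecord = reader.nextRecord();
        return false;
      }
      readLogs = true; // permanently switch over to log records
      if (logs.hasNext()) {
        currentRecord = logs.next();
        return false;
      }
      return true;
    }

    String nextRecord() {
      return currentRecord;
    }
  }

  /** Drains base-file records first, then log records, each exactly once. */
  static List<String> collect(List<String> base, List<String> logs) {
    Iterator<String> baseIt = base.iterator();
    SplitReader reader = new SplitReader() {
      @Override public boolean reachedEnd() { return !baseIt.hasNext(); }
      @Override public String nextRecord() { return baseIt.next(); }
    };
    SkipMergeIterator it = new SkipMergeIterator(reader, logs.iterator());
    List<String> out = new ArrayList<>();
    while (!it.reachedEnd()) {
      out.add(it.nextRecord());
    }
    return out;
  }

  public static void main(String[] args) {
    // Base records come out first, then log records.
    System.out.println(collect(Arrays.asList("b1", "b2"), Arrays.asList("l1")));
  }
}
```

Without the flag, a reader that flip-flops after exhaustion would interleave stale base records into the log-record phase; the one-way boolean makes the hand-off irreversible.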
[GitHub] [hudi] yanghua merged pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
yanghua merged pull request #2846: URL: https://github.com/apache/hudi/pull/2846 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #2739: [MINOR] Fixing key generators blog content
nsivabalan commented on a change in pull request #2739: URL: https://github.com/apache/hudi/pull/2739#discussion_r616674663 ## File path: docs/_posts/2021-02-13-hudi-key-generators.md ## @@ -5,18 +5,16 @@ author: shivnarayan category: blog --- -Every record in Hudi is uniquely identified by a HoodieKey, which is a pair of record key and partition path where the -record belongs to. Hudi has imposed this constraint so that updates and deletes can be applied to the record of interest. -Hudi relies on the partition path field to partition your dataset and records within a partition have unique record keys. -Since uniqueness is guaranteed only within the partition, there could be records with same record keys across different -partitions. One should choose the partition field wisely as it could be a determining factor for your ingestion and -query latency. +Every record in Hudi is uniquely identified by a primary key, which is a pair of record key and partition path where +the record belongs to. Using primary keys, Hudi can impose a) partition level uniqueness integrity constraint +b) enable fast updates and deletes on records. One should choose the partitioning scheme wisely as it could be a +determining factor for your ingestion and query latency. ## Key Generators -Hudi exposes a number of out of the box key generators that customers can use based on their need. Or can have their -own implementation for the KeyGenerator. This blog goes over all different types of key generators that are readily -available to use. +Hudi provides several key generators out of the box that customers can use based on their need while having a pluggable Review comment: ok. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #2767: [HUDI-1761] Adding support for Test your own Schema with QuickStart
nsivabalan commented on pull request #2767: URL: https://github.com/apache/hudi/pull/2767#issuecomment-823266697 If not in the source bundle, placing it somewhere in the util packages would help. For new customers looking to try out Hudi, it would be an easy way to sanity check whether their schema works with Hudi end to end; otherwise, they would have to manually generate data or read it from elsewhere and inject it into Hudi. This is just an off-the-shelf option to test any complex schemas. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #2776: [HUDI-1768] spark datasource support schema validate add column
nsivabalan commented on pull request #2776: URL: https://github.com/apache/hudi/pull/2776#issuecomment-823264108 yes, this is still valid. @lw309637554 : ping me here once the PR is ready to be reviewed again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
nsivabalan commented on pull request #2720: URL: https://github.com/apache/hudi/pull/2720#issuecomment-823254679 @xiarixiaoyao : LGTM. ignore the disabled test for now. can you add a UT for the fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #2716: [HUDI-1718] when query incr view of mor table which has Multi level partitions, the query failed
nsivabalan commented on a change in pull request #2716: URL: https://github.com/apache/hudi/pull/2716#discussion_r616621197 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java ## @@ -170,7 +170,7 @@ protected HoodieCombineFileInputFormatShim createInputFormatShim() { if (job.get(hive_metastoreConstants.META_TABLE_PARTITION_COLUMNS, "").isEmpty()) { List<String> partitions = new ArrayList<>(part.getPartSpec().keySet()); if (!partitions.isEmpty()) { -String partitionStr = String.join(",", partitions); Review comment: I am just getting started to understand the query side (and hence am not very conversant with it). I tried looking in the hive repo for [CombineHiveInputFormat](https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java) to find the usage of the delimiter, but couldn't find this piece of code. Would you mind pointing me to the file where I can find this code snippet in the hive repo? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-1747) Deltastreamer incremental read is not working on the MOR table
[ https://issues.apache.org/jira/browse/HUDI-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325773#comment-17325773 ] sivabalan narayanan commented on HUDI-1747: --- awesome, thanks. > Deltastreamer incremental read is not working on the MOR table > -- > > Key: HUDI-1747 > URL: https://issues.apache.org/jira/browse/HUDI-1747 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core >Reporter: Vinoth Govindarajan >Priority: Critical > Labels: sev:critical > > I was trying to read the MOR HUDI table incrementally using delta streamer, > while doing that I ran into this issue where it says: > {code:java} > Found recursive reference in Avro schema, which can not be processed by > Spark:{code} > Spark Version: 2.4 > Hudi Version: 0.7.0-SNAPSHOT or the latest master > > Full Stack Trace: > {code:java} > Found recursive reference in Avro schema, which can not be processed by Spark: > { > "type" : "record", > "name" : "meta", > "fields" : [ { > "name" : "verified", > "type" : [ "null", "boolean" ], > "default" : null > }, { > "name" : "zip", > "type" : [ "null", "string" ], > "default" : null > }, { > "name" : "lname", > "type" : [ "null", "string" ], > "default" : null > }] > } > > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:75) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:95) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105) > at > org.apache.spark.sql
[GitHub] [hudi] tooptoop4 commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?
tooptoop4 commented on issue #2284: URL: https://github.com/apache/hudi/issues/2284#issuecomment-823247153 https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] sbernauer commented on pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution
sbernauer commented on pull request #2012: URL: https://github.com/apache/hudi/pull/2012#issuecomment-823214232 @sathyaprakashg @n3nash and others thanks for your work! I have rebased the commit for the current master and resolved all the conflicts here https://github.com/sbernauer/hudi/commit/b383883742ad63899fa43584ab7a10cd72d533fe @sathyaprakashg this may help you while rebasing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #2720: [HUDI-1719]hive on spark/mr,Incremental query of the mor table, the partition field is incorrect
nsivabalan commented on pull request #2720: URL: https://github.com/apache/hudi/pull/2720#issuecomment-823203510 @xiarixiaoyao : I was asking Raymond (@xushiyan ) as to why this test is disabled. From git, I found that he was the one who disabled the test and wanted to get info from him. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #2845: [HUDI-1723] Fix path selector listing files with the same mod date
nsivabalan commented on a change in pull request #2845: URL: https://github.com/apache/hudi/pull/2845#discussion_r616593119 ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java ## @@ -121,28 +121,30 @@ public static DFSPathSelector createSourceSelector(TypedProperties props, eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); Review comment: don't we need any fix in listEligibleFiles()? That method filters files based on mod time > checkpoint time. I thought the fix was to make this mod time >**=** checkpoint time. ## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java ## @@ -121,28 +121,30 @@ public static DFSPathSelector createSourceSelector(TypedProperties props, eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); // Filter based on checkpoint & input size, if needed long currentBytes = 0; - long maxModificationTime = Long.MIN_VALUE; + long newCheckpointTime = lastCheckpointTime; List<FileStatus> filteredFiles = new ArrayList<>(); for (FileStatus f : eligibleFiles) { -if (currentBytes + f.getLen() >= sourceLimit) { +if (currentBytes + f.getLen() >= sourceLimit && f.getModificationTime() > newCheckpointTime) { Review comment: won't this lead to overflow? In the sense that this could lead to reading ```2*sourceLimit``` or even ```10*sourceLimit```, we never know.
## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java ## @@ -121,28 +121,30 @@ public static DFSPathSelector createSourceSelector(TypedProperties props, eligibleFiles.sort(Comparator.comparingLong(FileStatus::getModificationTime)); // Filter based on checkpoint & input size, if needed long currentBytes = 0; - long maxModificationTime = Long.MIN_VALUE; + long newCheckpointTime = lastCheckpointTime; List<FileStatus> filteredFiles = new ArrayList<>(); for (FileStatus f : eligibleFiles) { -if (currentBytes + f.getLen() >= sourceLimit) { +if (currentBytes + f.getLen() >= sourceLimit && f.getModificationTime() > newCheckpointTime) { Review comment: guess I get the gist now and why we don't need any fix in listEligibleFiles. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
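The selection logic debated in the review above can be sketched as follows. This is a hedged illustration, not Hudi's DFSPathSelector API (FileInfo and select are made-up names): files are sorted by modification time and taken until the byte budget is hit, but every file sharing the boundary modification time is still included, so a "mod time > checkpoint" filter on the next run cannot silently skip files — at the accepted cost of occasionally reading past sourceLimit, the "overflow" raised in the review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hedged sketch of checkpoint-consistent file selection under a size budget.
public class PathSelectorSketch {

  static final class FileInfo {
    final String name;
    final long len;
    final long modTime;

    FileInfo(String name, long len, long modTime) {
      this.name = name;
      this.len = len;
      this.modTime = modTime;
    }
  }

  static List<FileInfo> select(List<FileInfo> eligible, long lastCheckpointTime, long sourceLimit) {
    List<FileInfo> files = new ArrayList<>(eligible);
    files.sort(Comparator.comparingLong(f -> f.modTime));
    long currentBytes = 0;
    long newCheckpointTime = lastCheckpointTime;
    List<FileInfo> selected = new ArrayList<>();
    for (FileInfo f : files) {
      // Stop only once the budget is exceeded AND this file's mod time is
      // strictly newer than the last included one. Files sharing the boundary
      // mod time are kept even past sourceLimit, so the next run's
      // "mod time > checkpoint" filter never drops them.
      if (currentBytes + f.len >= sourceLimit && f.modTime > newCheckpointTime) {
        break;
      }
      newCheckpointTime = f.modTime;
      currentBytes += f.len;
      selected.add(f);
    }
    return selected;
  }

  public static void main(String[] args) {
    List<FileInfo> files = Arrays.asList(
        new FileInfo("a", 100, 1L),
        new FileInfo("b", 100, 2L),
        new FileInfo("c", 100, 2L), // same mod time as "b": must not be split off
        new FileInfo("d", 100, 3L));
    for (FileInfo f : select(files, 0L, 250L)) {
      System.out.println(f.name);
    }
  }
}
```

With a 250-byte budget, "c" is included despite pushing the total to 300 bytes, because it shares a modification time with "b"; "d" waits for the next round and is picked up by the new checkpoint.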
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
codecov-commenter edited a comment on pull request #2846: URL: https://github.com/apache/hudi/pull/2846#issuecomment-822274991

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> :exclamation: No coverage uploaded for pull request base (`master@4e050cc`). [Click here to learn what that means](https://docs.codecov.io/docs/error-reference?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#section-missing-base-commit).
> The diff coverage is `82.14%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2846/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@            Coverage Diff             @@
##             master    #2846   +/-   ##
=========================================
  Coverage          ?   52.99%
  Complexity        ?     3726
=========================================
  Files             ?      486
  Lines             ?    23247
  Branches          ?     2469
=========================================
  Hits              ?    12320
  Misses            ?     9846
  Partials          ?     1081
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `40.29% <ø> (?)` | `215.00 <ø> (?)` | |
| hudiclient | `∅ <ø> (?)` | `0.00 <ø> (?)` | |
| hudicommon | `50.68% <ø> (?)` | `1976.00 <ø> (?)` | |
| hudiflink | `59.00% <82.14%> (?)` | `534.00 <5.00> (?)` | |
| hudihadoopmr | `33.33% <ø> (?)` | `198.00 <ø> (?)` | |
| hudisparkdatasource | `72.11% <ø> (?)` | `237.00 <ø> (?)` | |
| hudisync | `45.70% <ø> (?)` | `131.00 <ø> (?)` | |
| huditimelineservice | `64.36% <ø> (?)` | `62.00 <ø> (?)` | |
| hudiutilities | `69.79% <ø> (?)` | `373.00 <ø> (?)` | |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [.../hudi/table/format/mor/MergeOnReadInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS90YWJsZS9mb3JtYXQvbW9yL01lcmdlT25SZWFkSW5wdXRGb3JtYXQuamF2YQ==) | `75.00% <69.23%> (ø)` | `18.00 <0.00> (?)` | | | [...n/java/org/apache/hudi/util/RowDataProjection.java](https://codecov.io/gh/apache/hudi/pull/2846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS91dGlsL1Jvd0RhdGFQcm9qZWN0aW9uLmphdmE=) | `93.33% <93.33%> (ø)` | `5.00 <5.00> (?)` | | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 closed pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
danny0405 closed pull request #2846: URL: https://github.com/apache/hudi/pull/2846 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
codecov-commenter edited a comment on pull request #2853: URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1e379c9) into [master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (62bb9e1) will **decrease** coverage by `43.21%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@             Coverage Diff              @@
##             master    #2853       +/-   ##
=============================================
- Coverage     52.60%    9.38%    -43.22%
+ Complexity     3709       48      -3661
=============================================
  Files           485       54       -431
  Lines         23224     1993     -21231
  Branches       2465      235      -2230
=============================================
- Hits          12216      187     -12029
+ Misses         9929     1793      -8136
+ Partials       1079       13      -1066
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.38% <ø> (-60.42%)` | `48.00 <ø> (-325.00)` | |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <
[GitHub] [hudi] codecov-commenter commented on pull request #2854: [HUDI-1771] Propagate CDC format for hoodie
codecov-commenter commented on pull request #2854: URL: https://github.com/apache/hudi/pull/2854#issuecomment-823120634

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#2854](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (aeca8e7) into [master](https://codecov.io/gh/apache/hudi/commit/9a288ccbebf1aee3164e7bc472a3e795bb83652b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9a288cc) will **increase** coverage by `17.20%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2854/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@             Coverage Diff              @@
##             master    #2854       +/-   ##
=============================================
+ Coverage     52.58%   69.79%    +17.20%
+ Complexity     3708      373      -3335
=============================================
  Files           485       54       -431
  Lines         23227     1993     -21234
  Branches       2466      235      -2231
=============================================
- Hits          12215     1391     -10824
+ Misses         9934      471      -9463
+ Partials       1078      131       -947
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.79% <ø> (ø)` | `373.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2854?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...ava/org/apache/hudi/cli/commands/UtilsCommand.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1V0aWxzQ29tbWFuZC5qYXZh) | | | | | [...hadoop/realtime/RealtimeCompactedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lQ29tcGFjdGVkUmVjb3JkUmVhZGVyLmphdmE=) | | | | | [...ache/hudi/hadoop/utils/HoodieInputFormatUtils.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZUlucHV0Rm9ybWF0VXRpbHMuamF2YQ==) | | | | | [.../hudi/common/table/timeline/dto/FileStatusDTO.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GaWxlU3RhdHVzRFRPLmphdmE=) | | | | | 
[.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL01lcmdlT25SZWFkU25hcHNob3RSZWxhdGlvbi5zY2FsYQ==) | | | | | [...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_
[GitHub] [hudi] danny0405 commented on a change in pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
danny0405 commented on a change in pull request #2853: URL: https://github.com/apache/hudi/pull/2853#discussion_r616499589 ## File path: hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java ## @@ -80,6 +80,12 @@ private FlinkOptions() { .defaultValue(false) .withDescription("Whether to bootstrap the index state from existing hoodie table, default false"); + public static final ConfigOption<Double> INDEX_STATE_TTL = ConfigOptions + .key("index.state.ttl") + .doubleType() + .defaultValue(1.5D) + .withDescription("index state ttl in days. default is 1.5 day."); + Review comment: Index state ttl in days, default 1.5 day -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
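Since the option above stores the TTL as a fractional number of days while Flink state TTL is ultimately configured in milliseconds, the writer has to convert the value before applying it. A minimal sketch of that conversion; the class and method names here are illustrative, not Hudi's actual code:

```java
// Illustrative sketch: convert the day-based "index.state.ttl" value into
// the millisecond duration that Flink's state TTL machinery expects.
// Class and method names are assumptions for this example only.
public class IndexStateTtl {
    static final double DEFAULT_TTL_DAYS = 1.5D;
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Truncates toward zero; 1.5 days -> 129_600_000 ms.
    static long ttlMillis(double days) {
        return (long) (days * MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.println(ttlMillis(DEFAULT_TTL_DAYS)); // prints 129600000
    }
}
```

Using a double-typed, day-based option is what allows sub-day values such as the 1.5-day default, which the earlier long-typed millisecond option could only express as a raw number.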
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
codecov-commenter edited a comment on pull request #2846: URL: https://github.com/apache/hudi/pull/2846#issuecomment-822274991 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > :exclamation: No coverage uploaded for pull request base (`master@4e050cc`). [Click here to learn what that means](https://docs.codecov.io/docs/error-reference?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#section-missing-base-commit). > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2846/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master #2846 +/- ## Coverage ? 9.38% Complexity ? 48 Files ? 54 Lines ? 1993 Branches ? 235 Hits ? 187 Misses ? 1793 Partials ? 13 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudiutilities | `9.38% <ø> (?)` | `48.00 <ø> (?)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nevgin commented on issue #2832: [SUPPORT] Hive on Spark dont work
nevgin commented on issue #2832: URL: https://github.com/apache/hudi/issues/2832#issuecomment-823117798 Queries run perfectly when issued directly from Spark. For Hive on Spark, I removed the hive*.jar libraries as the documentation instructs; if they are not deleted, Hive does not work with the Spark engine. My guess is that Spark needs to be built against my Hive version, 2.3.8. Is that correct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-commenter commented on pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
codecov-commenter commented on pull request #2853: URL: https://github.com/apache/hudi/pull/2853#issuecomment-823113459 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#2853](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9bd245c) into [master](https://codecov.io/gh/apache/hudi/commit/62bb9e10d9d2f2a9807ee46b0ed094ef2fcc89e5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (62bb9e1) will **decrease** coverage by `43.21%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2853/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master #2853 +/- ## - Coverage 52.60% 9.38% -43.22% + Complexity 3709 48 -3661 Files 485 54 -431 Lines 232241993-21231 Branches 2465 235 -2230 - Hits 12216 187-12029 + Misses 99291793 -8136 + Partials 1079 13 -1066 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `?` | `?` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.38% <ø> (-60.42%)` | `48.00 <ø> (-325.00)` | | Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2853?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2853/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%>
[jira] [Updated] (HUDI-1812) Add explicit index state TTL option for Flink writer
[ https://issues.apache.org/jira/browse/HUDI-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 谢波 updated HUDI-1812: - Description: Add option: {code:java} public static final ConfigOption INDEX_STATE_TTL = ConfigOptions .key("index.state.ttl") .doubleType() .defaultValue(1.5D) .withDescription("index state ttl in days. default is 1.5 day."); {code} If the state expires but there are still updates for old records, the records would be recognized as INSERT instead of UPDATE thus some data duplication. was: Add option: {code:java} public static final ConfigOption INDEX_STATE_TTL = ConfigOptions .key("index.state.ttl") .longType() .defaultValue(24 * 60 * 60 * 1000L) .withDescription("index state ttl in milliseconds. default is 1 day."); {code} If the state expires but there are still updates for old records, the records would be recognized as INSERT instead of UPDATE thus some data duplication. > Add explicit index state TTL option for Flink writer > > > Key: HUDI-1812 > URL: https://issues.apache.org/jira/browse/HUDI-1812 > Project: Apache Hudi > Issue Type: Improvement > Components: Flink Integration >Reporter: Danny Chen >Assignee: 谢波 >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Add option: > {code:java} > public static final ConfigOption INDEX_STATE_TTL = ConfigOptions > .key("index.state.ttl") > .doubleType() > .defaultValue(1.5D) > .withDescription("index state ttl in days. default is 1.5 day."); > {code} > If the state expires but there are still updates for old records, the records > would be recognized as INSERT instead of UPDATE thus some data duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] danny0405 commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
danny0405 commented on a change in pull request #2846: URL: https://github.com/apache/hudi/pull/2846#discussion_r616460746 ## File path: hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java ## @@ -431,6 +433,10 @@ public void close() { // iterator for log files private final Iterator iterator; +// add the flag because the flink ParquetColumnarRowSplitReader is buggy: Review comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1771) Propagate CDC format for hoodie
[ https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1771: - Labels: pull-request-available (was: ) > Propagate CDC format for hoodie > --- > > Key: HUDI-1771 > URL: https://issues.apache.org/jira/browse/HUDI-1771 > Project: Apache Hudi > Issue Type: New Feature > Components: Flink Integration >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Like what we discussed in the dev mailing list: > https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E > Keep the change flags make new use cases possible: using HUDI as the unified > storage format for DWD and DWS layer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1812) Add explicit index state TTL option for Flink writer
[ https://issues.apache.org/jira/browse/HUDI-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1812: - Labels: pull-request-available (was: ) > Add explicit index state TTL option for Flink writer > > > Key: HUDI-1812 > URL: https://issues.apache.org/jira/browse/HUDI-1812 > Project: Apache Hudi > Issue Type: Improvement > Components: Flink Integration >Reporter: Danny Chen >Assignee: 谢波 >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Add option: > {code:java} > public static final ConfigOption INDEX_STATE_TTL = ConfigOptions > .key("index.state.ttl") > .longType() > .defaultValue(24 * 60 * 60 * 1000L) > .withDescription("index state ttl in milliseconds. default is 1 day."); > {code} > If the state expires but there are still updates for old records, the records > would be recognized as INSERT instead of UPDATE thus some data duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1771) Propagate CDC format for hoodie
[ https://issues.apache.org/jira/browse/HUDI-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-1771: - Summary: Propagate CDC format for hoodie (was: Keep the change flags from CDC source for Flink writer) > Propagate CDC format for hoodie > --- > > Key: HUDI-1771 > URL: https://issues.apache.org/jira/browse/HUDI-1771 > Project: Apache Hudi > Issue Type: New Feature > Components: Flink Integration >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 0.9.0 > > > Like what we discussed in the dev mailing list: > https://lists.apache.org/thread.html/r31b2d1404e4e043a5f875b78105ba6f9a801e78f265ad91242ad5eb2%40%3Cdev.hudi.apache.org%3E > Keep the change flags make new use cases possible: using HUDI as the unified > storage format for DWD and DWS layer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] danny0405 opened a new pull request #2854: [HUDI-1771] Propagate CDC format for hoodie
danny0405 opened a new pull request #2854: URL: https://github.com/apache/hudi/pull/2854 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] MyLanPangzi opened a new pull request #2853: [HUDI-1812] Add explicit index state TTL option for Flink writer
MyLanPangzi opened a new pull request #2853: URL: https://github.com/apache/hudi/pull/2853 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Add explicit index state TTL option for Flink writer ## Brief change log - *FlinkOptions add INDEX_STATE_TTL* - org.apache.hudi.sink.partitioner.BucketAssignFunction#indexState enable ttl. ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
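Enabling TTL on the `BucketAssignFunction#indexState` descriptor, as the change log above describes, generally follows Flink's `StateTtlConfig` pattern. A hedged, non-runnable sketch of what that might look like: the descriptor name, the state value type (`HoodieRecordGlobalLocation`), and the hard-coded 1.5-day TTL are placeholders for this example, not Hudi's actual implementation.

```java
// Sketch only: Flink's public StateTtlConfig API applied to a keyed value
// state, as it might appear inside a RichFunction's open() method.
// Time.days(long) cannot express 1.5 days, hence Time.milliseconds here.
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.milliseconds((long) (1.5D * 24 * 60 * 60 * 1000))) // 1.5 days
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();

ValueStateDescriptor<HoodieRecordGlobalLocation> indexStateDesc =
    new ValueStateDescriptor<>("indexState", HoodieRecordGlobalLocation.class);
indexStateDesc.enableTimeToLive(ttlConfig);  // expired entries stop being returned
indexState = getRuntimeContext().getState(indexStateDesc);
```

As the JIRA description notes, the trade-off is that once an entry expires, an update for that old record key is no longer found in the index state and is treated as an INSERT, which can duplicate data.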
[jira] [Updated] (HUDI-1812) Add explicit index state TTL option for Flink writer
[ https://issues.apache.org/jira/browse/HUDI-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 谢波 updated HUDI-1812: - Description: Add option: {code:java} public static final ConfigOption INDEX_STATE_TTL = ConfigOptions .key("index.state.ttl") .longType() .defaultValue(24 * 60 * 60 * 1000L) .withDescription("index state ttl in milliseconds. default is 1 day."); {code} If the state expires but there are still updates for old records, the records would be recognized as INSERT instead of UPDATE thus some data duplication. was: Add option: {code:java} public static final ConfigOption INDEX_STATE_TTL = ConfigOptions .key("index.state.ttl") .doubleType() .defaultValue(1.5D)// default 1.5 days .withDescription("Index state TTL in DAYs, default 1.5 days"); {code} If the state expires but there are still updates for old records, the records would be recognized as INSERT instead of UPDATE thus some data duplication. > Add explicit index state TTL option for Flink writer > > > Key: HUDI-1812 > URL: https://issues.apache.org/jira/browse/HUDI-1812 > Project: Apache Hudi > Issue Type: Improvement > Components: Flink Integration >Reporter: Danny Chen >Assignee: 谢波 >Priority: Major > Fix For: 0.9.0 > > > Add option: > {code:java} > public static final ConfigOption INDEX_STATE_TTL = ConfigOptions > .key("index.state.ttl") > .longType() > .defaultValue(24 * 60 * 60 * 1000L) > .withDescription("index state ttl in milliseconds. default is 1 day."); > {code} > If the state expires but there are still updates for old records, the records > would be recognized as INSERT instead of UPDATE thus some data duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] yanghua commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
yanghua commented on a change in pull request #2846: URL: https://github.com/apache/hudi/pull/2846#discussion_r616439410 ## File path: hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java ## @@ -431,6 +433,10 @@ public void close() { // iterator for log files private final Iterator iterator; +// add the flag because the flink ParquetColumnarRowSplitReader is buggy: Review comment: sounds good, can we add it to the comment of this PR(I mean this file.)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
danny0405 commented on a change in pull request #2846: URL: https://github.com/apache/hudi/pull/2846#discussion_r616437364 ## File path: hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java ## @@ -431,6 +433,10 @@ public void close() { // iterator for log files private final Iterator iterator; +// add the flag because the flink ParquetColumnarRowSplitReader is buggy: Review comment: see https://issues.apache.org/jira/browse/FLINK-22370 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
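The review thread above concerns guarding a split reader that can mis-report end of input (FLINK-22370). As a toy illustration of the guard-flag pattern being discussed, not the actual `MergeOnReadInputFormat` code, which wraps Flink's `ParquetColumnarRowSplitReader`; all names here are invented:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy illustration of an explicit end-of-input guard flag. Once the flag
// latches, reachedEnd() stays true and further reads are rejected, so a
// buggy underlying reader cannot hand out records past the end.
public class GuardedReader {
    private final Iterator<String> records;
    private boolean readerClosed = false; // explicit flag: never read past end

    public GuardedReader(List<String> records) {
        this.records = records.iterator();
    }

    public boolean reachedEnd() {
        if (!readerClosed && !records.hasNext()) {
            readerClosed = true; // latch: once end is seen, stay closed
        }
        return readerClosed;
    }

    public String nextRecord() {
        if (reachedEnd()) {
            throw new IllegalStateException("read past end of input");
        }
        return records.next();
    }

    public static void main(String[] args) {
        GuardedReader r = new GuardedReader(Arrays.asList("a", "b"));
        while (!r.reachedEnd()) {
            System.out.println(r.nextRecord()); // prints "a" then "b"
        }
    }
}
```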
[GitHub] [hudi] yanghua commented on a change in pull request #2846: [HUDI-1809] Flink merge on read input split uses wrong base file path…
yanghua commented on a change in pull request #2846: URL: https://github.com/apache/hudi/pull/2846#discussion_r616415317 ## File path: hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java ## @@ -431,6 +433,10 @@ public void close() { // iterator for log files private final Iterator iterator; +// add the flag because the flink ParquetColumnarRowSplitReader is buggy: Review comment: Can we file a Jira ticket to the Flink community and paste the jira id here so that we can track the progress of Flink. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org