[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
nsivabalan commented on issue #5298: URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100580381
@kasured : before I dive in, a few pointers on the write configs used.
1. I see you have enabled both inline and async compaction. I believe that w/ a streaming sink to Hudi only async compaction is possible, and for a MOR table Hudi does async compaction automatically. So you can probably remove these configs:
```
"hoodie.compact.inline" = "true"
"hoodie.datasource.compaction.async.enable" = "true"
```
2. I also see you have enabled clustering. Can we disable clustering and see if the issue is still reproducible?
With these changes, can you let us know if the problem still persists?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
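The suggestion in this comment can be sketched as a small helper that drops the two compaction flags and any clustering settings from a dictionary of write options before retrying. This is only an illustration: the helper name `strip_compaction_and_clustering` and the sample options are hypothetical, and treating every key under the `hoodie.clustering.` prefix as a clustering setting is an assumption rather than something stated in the comment.

```python
# Keys quoted in the comment above.
COMPACTION_KEYS = {
    "hoodie.compact.inline",
    "hoodie.datasource.compaction.async.enable",
}
# Assumption: clustering settings share this key prefix.
CLUSTERING_PREFIX = "hoodie.clustering."


def strip_compaction_and_clustering(options):
    """Return a copy of `options` without the compaction flags and clustering keys."""
    return {
        key: value
        for key, value in options.items()
        if key not in COMPACTION_KEYS and not key.startswith(CLUSTERING_PREFIX)
    }


# Hypothetical write options for a MOR table; the values are illustrative only.
write_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.clustering.inline": "true",
}

cleaned = strip_compaction_and_clustering(write_options)
print(sorted(cleaned))
```

Filtering a copy keeps the original options available, which makes it easy to compare a run with the flags against a run without them.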
[GitHub] [hudi] puchangchun commented on issue #4825: [SUPPORT] flink hudi some class not found
puchangchun commented on issue #4825: URL: https://github.com/apache/hudi/issues/4825#issuecomment-1100577825 It runs fine locally, but I get this error in the Flink cluster environment, even though my jar already includes HiveConf.class.
[GitHub] [hudi] nsivabalan commented on issue #5326: [SUPPORT] prometheus metrics labels
nsivabalan commented on issue #5326: URL: https://github.com/apache/hudi/issues/5326#issuecomment-1100577811 @zxding : I guess you are asking about adding arbitrary tags to each metric, right?
[GitHub] [hudi] nsivabalan commented on issue #5326: [SUPPORT] prometheus metrics labels
nsivabalan commented on issue #5326: URL: https://github.com/apache/hudi/issues/5326#issuecomment-1100577643 @harsh1231 : Can you chime in here please.
[jira] [Updated] (HUDI-3892) Add HoodieReadClient with java
[ https://issues.apache.org/jira/browse/HUDI-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-3892:
    Description:
We might need a Hoodie read client in Java similar to the one we have for Spark.
[Apache Pulsar|https://github.com/apache/pulsar] is integrating with Hudi, using Hudi as tiered storage to offload cold topic data. When consumers fetch cold data from a topic, the Pulsar broker determines whether the target data is stored in Pulsar or not. If the target data is stored in tiered storage (Hudi), the Pulsar broker fetches it from Hudi via the Java API, packages it into Pulsar format, and dispatches it to the consumer side.
However, we found that the current Hudi implementation doesn't support reading Hudi table records via a Java API, so we couldn't read the target data out of Hudi into the Pulsar broker, which blocks the Pulsar & Hudi integration.
h3. What we need
# We need Hudi to support reading records via a Java API.
# We need Hudi to support reading records out in writer order, or ordered by specific fields.
    was: We might need a hoodie read client in java similar to the one we have for spark.
> Add HoodieReadClient with java
>
> Key: HUDI-3892
> URL: https://issues.apache.org/jira/browse/HUDI-3892
> Project: Apache Hudi
> Issue Type: Task
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Critical
> Fix For: 0.12.0
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] nsivabalan commented on issue #5313: [SUPPORT] Do we have plan to support java reader for Hudi?
nsivabalan commented on issue #5313: URL: https://github.com/apache/hudi/issues/5313#issuecomment-1100577433 @hangc0276 : We can definitely take this up. Excited to see Hudi used as tiered storage :) As @simonsssu showed interest in working on it, I will coordinate w/ him/her and get this going.
[GitHub] [hudi] nsivabalan commented on issue #5313: [SUPPORT] Do we have plan to support java reader for Hudi?
nsivabalan commented on issue #5313: URL: https://github.com/apache/hudi/issues/5313#issuecomment-1100577200 Cool. @simonsssu : I have created a tracking Jira [here](https://issues.apache.org/jira/browse/HUDI-3892). Can you let me know your Jira id? I can assign it to you. Also, this might be time sensitive, since it's blocking the Pulsar integration. Just wanted to send out a gentle reminder. Once you have the patch, do ping me. I can help review it.
[jira] [Updated] (HUDI-3892) Add HoodieReadClient with java
[ https://issues.apache.org/jira/browse/HUDI-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-3892:
    Priority: Critical (was: Major)
> Add HoodieReadClient with java
>
> Key: HUDI-3892
> URL: https://issues.apache.org/jira/browse/HUDI-3892
> Project: Apache Hudi
> Issue Type: Task
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Critical
> Fix For: 0.12.0
>
> We might need a hoodie read client in java similar to the one we have for spark.
[jira] [Updated] (HUDI-3892) Add HoodieReadClient with java
[ https://issues.apache.org/jira/browse/HUDI-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-3892:
    Fix Version/s: 0.12.0
> Add HoodieReadClient with java
>
> Key: HUDI-3892
> URL: https://issues.apache.org/jira/browse/HUDI-3892
> Project: Apache Hudi
> Issue Type: Task
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Major
> Fix For: 0.12.0
>
> We might need a hoodie read client in java similar to the one we have for spark.
[jira] [Created] (HUDI-3892) Add HoodieReadClient with java
sivabalan narayanan created HUDI-3892:
    Summary: Add HoodieReadClient with java
    Key: HUDI-3892
    URL: https://issues.apache.org/jira/browse/HUDI-3892
    Project: Apache Hudi
    Issue Type: Task
    Components: reader-core
    Reporter: sivabalan narayanan
We might need a hoodie read client in java similar to the one we have for spark.
[GitHub] [hudi] nsivabalan commented on issue #5301: [SUPPORT]Support Show Data Files Command Based on Call Procedure Command for Spark SQL
nsivabalan commented on issue #5301: URL: https://github.com/apache/hudi/issues/5301#issuecomment-1100576398 @XuQianJin-Stars : Can you file a tracking Jira and follow up, please? And close out the GitHub issue.
[GitHub] [hudi] nsivabalan commented on issue #5291: [SUPPORT] How to use hudi-defaults.conf with Glue
nsivabalan commented on issue #5291: URL: https://github.com/apache/hudi/issues/5291#issuecomment-1100576020 @zhedoubushishi : can you chime in here please.
[GitHub] [hudi] nsivabalan commented on issue #5281: [SUPPORT] .hoodie/hoodie.properties file can be deleted due to retention settings of cloud providers
nsivabalan commented on issue #5281: URL: https://github.com/apache/hudi/issues/5281#issuecomment-1100574209 Interesting. What's your lifecycle policy, btw? Is it that any object that was not updated in the last X days gets deleted?
[GitHub] [hudi] nsivabalan commented on issue #5262: [SUPPORT] Deltastreamer Error upserting bucketType UPDATE for partition :0
nsivabalan commented on issue #5262: URL: https://github.com/apache/hudi/issues/5262#issuecomment-1100572710 @stym06 : the schema has likely changed. Can you inspect and let us know if that's the case? Related Jira: https://issues.apache.org/jira/browse/HUDI-1711
[GitHub] [hudi] nsivabalan commented on issue #5258: [SUPPORT] Write hudi data throws NoSuchMethodError with spark v2.4.4 and hudi v0.10.1
nsivabalan commented on issue #5258: URL: https://github.com/apache/hudi/issues/5258#issuecomment-1100571599 Can you try w/ the Scala 2.11 bundle (hudi-spark-bundle_2.11-0.10.1.jar) and let us know if it succeeds? For spark-avro, can you try setting it via `--packages org.apache.spark:spark-avro_2.11:2.4.4`?
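The point of the suggestion is that the Hudi bundle and spark-avro should be built for the same Scala binary version. A minimal sketch of deriving both artifact names from one value follows; the helper names are hypothetical, and passing the bundle through `--jars` is an assumption about how the job is launched, not something stated in the comment.

```python
def spark_avro_package(scala_binary, spark_version):
    """Maven coordinate for spark-avro, in the form used with --packages."""
    return f"org.apache.spark:spark-avro_{scala_binary}:{spark_version}"


def hudi_spark_bundle_jar(scala_binary, hudi_version):
    """File name of the Hudi Spark bundle for a given Scala binary version."""
    return f"hudi-spark-bundle_{scala_binary}-{hudi_version}.jar"


# Versions taken from the issue title and the comment above.
scala_binary = "2.11"
package = spark_avro_package(scala_binary, "2.4.4")
bundle = hudi_spark_bundle_jar(scala_binary, "0.10.1")

print("--packages", package)
print("--jars", bundle)  # assumption: the bundle is supplied as a local jar
```

Deriving both names from a single `scala_binary` value avoids mixing `_2.11` and `_2.12` artifacts on the same classpath.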
[GitHub] [hudi] nsivabalan commented on issue #5249: [SUPPORT] Deltastreamer job does not terminate on Kubernetes when hoodie.metrics.on=true
nsivabalan commented on issue #5249: URL: https://github.com/apache/hudi/issues/5249#issuecomment-1100570805 @harsh1231 : Can you take a stab at this please.
[GitHub] [hudi] nsivabalan commented on issue #5248: [QUESION] Should filter prop "hoodie.datasource.write.operation" when use spark sql create table?
nsivabalan commented on issue #5248: URL: https://github.com/apache/hudi/issues/5248#issuecomment-1100570586 @XuQianJin-Stars : Can you file a tracking Jira and follow up on the issue? Seems like we need to fix this.
[GitHub] [hudi] nsivabalan commented on issue #5242: [SUPPORT] Hudi embedded timeline server in 0.9 vs 0.10 with `hoodie.embed.timeline.server.port`
nsivabalan commented on issue #5242: URL: https://github.com/apache/hudi/issues/5242#issuecomment-1100570312 @yihua : this is a timeline server port related issue. Can you chime in here please.
[GitHub] [hudi] nsivabalan commented on issue #5233: [SUPPORT] _hoodie_is_deleted not working for spark Datasource.
nsivabalan commented on issue #5233: URL: https://github.com/apache/hudi/issues/5233#issuecomment-1100569034 Did you set the default value for "_hoodie_is_deleted" to null or false? Can you post the schema for the table?
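To make the question concrete, here is a sketch of what a declared default for that field looks like in an Avro-style record schema, plus a helper that reports the default. Only the field name `_hoodie_is_deleted` comes from the thread; the record name, the `id` field, and the helper `default_of` are hypothetical, and the schema is not the reporter's actual table schema.

```python
import json

# Hypothetical Avro-style record schema, for illustration only.
schema = {
    "type": "record",
    "name": "ExampleRecord",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "_hoodie_is_deleted", "type": "boolean", "default": False},
    ],
}


def default_of(record_schema, field_name):
    """Return the declared default of a field, or '<no default>' if none is declared."""
    for field in record_schema["fields"]:
        if field["name"] == field_name:
            # A missing "default" key is different from an explicit null default.
            return field.get("default", "<no default>")
    raise KeyError(field_name)


print(json.dumps(schema, indent=2))
print(default_of(schema, "_hoodie_is_deleted"))
```

Posting the output of a check like this alongside the schema answers both parts of the question at once.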
[GitHub] [hudi] nsivabalan closed issue #5231: [SUPPORT] Inconsistent query result using GetLatestBaseFiles compared to Snapshot Query
nsivabalan closed issue #5231: [SUPPORT] Inconsistent query result using GetLatestBaseFiles compared to Snapshot Query URL: https://github.com/apache/hudi/issues/5231
[GitHub] [hudi] nsivabalan commented on issue #5231: [SUPPORT] Inconsistent query result using GetLatestBaseFiles compared to Snapshot Query
nsivabalan commented on issue #5231: URL: https://github.com/apache/hudi/issues/5231#issuecomment-1100568785 Thanks @alexeykudinkin for finding the root cause and fixing it.
[GitHub] [hudi] nsivabalan commented on issue #5211: [SUPPORT] Glob pattern to pick specific subfolders not working while reading in Spark
nsivabalan commented on issue #5211: URL: https://github.com/apache/hudi/issues/5211#issuecomment-1100568451 So you want to read multiple hudi tables w/ one spark.read?
[GitHub] [hudi] nsivabalan commented on issue #5198: [SUPPORT] Querying data genereated by TimestampBasedKeyGenerator failed to parse timestamp in EPOCHMILLISECONDS column to date format
nsivabalan commented on issue #5198: URL: https://github.com/apache/hudi/issues/5198#issuecomment-1100568252 @babumahesh-koo : do you have any updates on this?
[GitHub] [hudi] nsivabalan commented on issue #5189: [SUPPORT] Multiple chaining of hudi tables via incremental source results in duplicate partition meta column
nsivabalan commented on issue #5189: URL: https://github.com/apache/hudi/issues/5189#issuecomment-1100568175 @harsh1231 : in the meantime (until @bvaradar responds), can you investigate why we are encountering the duplicate column issue?
[GitHub] [hudi] hudi-bot commented on pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
hudi-bot commented on PR #5337: URL: https://github.com/apache/hudi/pull/5337#issuecomment-1100525225 ## CI report: * 3da31d0812e520a29079c628c7a134bc66f066f1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8085) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
hudi-bot commented on PR #5337: URL: https://github.com/apache/hudi/pull/5337#issuecomment-1100510773 ## CI report: * 3da31d0812e520a29079c628c7a134bc66f066f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8085)
[GitHub] [hudi] danny0405 commented on pull request #5057: [HUDI-3651] optimize the hoodie hive client and ddl executor code wit…
danny0405 commented on PR #5057: URL: https://github.com/apache/hudi/pull/5057#issuecomment-1100510685 @JerryYue-M You may need to rebase the code onto the latest master; I will take a look soon ~
[GitHub] [hudi] danny0405 commented on pull request #5087: [HUDI-3614] [DO_NOT_MERGE]Replace List with HoodieData in HoodieFlink/JavaTable and commit executors
danny0405 commented on PR #5087: URL: https://github.com/apache/hudi/pull/5087#issuecomment-1100510039
> @danny0405 : can you follow up on the patch when you get a chance. guess author is waiting for review follow up from you. a gentle reminder.
I don't see any gains at the current stage of the code besides the reduction in duplicate code, and this patch introduces a performance regression because of unnecessary copying of objects. So I'm not sure we should work in this direction.
[GitHub] [hudi] hudi-bot commented on pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
hudi-bot commented on PR #5337: URL: https://github.com/apache/hudi/pull/5337#issuecomment-1100509989 ## CI report: * 3da31d0812e520a29079c628c7a134bc66f066f1 UNKNOWN
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-3891:
    Labels: pull-request-available (was: )
> Investigate Hudi vs Raw Parquet table discrepancy
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Critical
> Labels: pull-request-available
>
> While benchmarking queries on raw Parquet tables against Hudi tables, I ran the test against the same (Hudi) table:
> # In one query path, I'm reading it as just a raw Parquet table
> # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, the two diverge in the number of files being read:
>
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
>
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!
[GitHub] [hudi] alexeykudinkin opened a new pull request, #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
alexeykudinkin opened a new pull request, #5337: URL: https://github.com/apache/hudi/pull/5337
## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*
## What is the purpose of the pull request
Fixing files partitioning sequence for `BaseFileOnlyRelation` to make sure we efficiently bucket small files. This brings Hudi tables on par w/ raw Parquet tables.
## Brief change log
- Make sure we reverse sort the files before bucketing
## Verify this pull request
This pull request is a trivial rework / code cleanup without any test coverage.
This pull request is already covered by existing tests, such as *(please describe tests)*.
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
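The change log entry in this pull request ("reverse sort the files before bucketing") can be illustrated with a toy packing routine. This is a sketch under assumptions, not the code from the pull request: `pack_sequentially`, the file sizes, and the 100-unit bucket cap are hypothetical, and the greedy rule of closing a bucket when the next file would overflow it is a simplification chosen for illustration.

```python
def pack_sequentially(sizes, max_bucket_size):
    """Greedily fill buckets in the given order; start a new bucket on overflow."""
    buckets, current, current_size = [], [], 0
    for size in sizes:
        # Close the current bucket if adding this file would exceed the cap.
        if current and current_size + size > max_bucket_size:
            buckets.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Hypothetical file sizes: large and small files interleaved.
file_sizes = [95, 10, 95, 10, 95, 10, 95, 10]

as_listed = pack_sequentially(file_sizes, 100)
largest_first = pack_sequentially(sorted(file_sizes, reverse=True), 100)

print(len(as_listed), "buckets without sorting")
print(len(largest_first), "buckets after reverse sorting")
```

With this input the unsorted order yields 8 buckets and the reverse-sorted order yields 5, because the small files end up adjacent and share a single bucket. Other inputs may show a smaller difference or none.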
[GitHub] [hudi] xiarixiaoyao commented on pull request #5064: [HUDI-3654] Initialize hudi metastore module.
xiarixiaoyao commented on PR #5064: URL: https://github.com/apache/hudi/pull/5064#issuecomment-1100508616 @minihippo could you pls rebase the code and run azure again, thanks
[jira] [Created] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
Alexey Kudinkin created HUDI-3891:
    Summary: Investigate Hudi vs Raw Parquet table discrepancy
    Key: HUDI-3891
    URL: https://issues.apache.org/jira/browse/HUDI-3891
    Project: Apache Hudi
    Issue Type: Bug
    Reporter: Alexey Kudinkin
    Assignee: Alexey Kudinkin
While benchmarking queries on raw Parquet tables against Hudi tables, I ran the test against the same (Hudi) table:
# In one query path, I'm reading it as just a raw Parquet table
# In another, I'm reading it as a Hudi RO (read_optimized) table
Surprisingly enough, the two diverge in the number of files being read:
_Raw Parquet_
!https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
_Hudi_
!https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!
[GitHub] [hudi] hudi-bot commented on pull request #5336: [DOCS] Add commit activity, twitter badgers, and Hudi logo in README
hudi-bot commented on PR #5336: URL: https://github.com/apache/hudi/pull/5336#issuecomment-1100475164 ## CI report: * 2d1fc1b7ff81bff43152335b8135a31467c53674 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8084)
[GitHub] [hudi] hudi-bot commented on pull request #5336: [DOCS] Add commit activity, twitter badgers, and Hudi logo in README
hudi-bot commented on PR #5336: URL: https://github.com/apache/hudi/pull/5336#issuecomment-1100448918 ## CI report: * 2d1fc1b7ff81bff43152335b8135a31467c53674 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8084)
[GitHub] [hudi] hudi-bot commented on pull request #5336: [DOCS] Add commit activity, twitter badgers, and Hudi logo in README
hudi-bot commented on PR #5336: URL: https://github.com/apache/hudi/pull/5336#issuecomment-1100447825 ## CI report: * 2d1fc1b7ff81bff43152335b8135a31467c53674 UNKNOWN
[GitHub] [hudi] yihua opened a new pull request, #5336: [DOCS] Add commit activity, twitter badgers, and Hudi logo in README
yihua opened a new pull request, #5336: URL: https://github.com/apache/hudi/pull/5336
## What is the purpose of the pull request
This PR adds commit activity and Twitter badges, and the Hudi logo, to the README. The medium-definition Hudi logo image is added to the Hudi site in #5331.
## Verify this pull request
Only README.md updates.
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-3883) File-sizing issues when writing COW table to S3
[ https://issues.apache.org/jira/browse/HUDI-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Kudinkin updated HUDI-3883:
    Fix Version/s: 0.12.0
> File-sizing issues when writing COW table to S3
>
> Key: HUDI-3883
> URL: https://issues.apache.org/jira/browse/HUDI-3883
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.12.0
> Attachments: Screen Shot 2022-04-14 at 1.08.19 PM.png
>
> Even after HUDI-3709, I still see that when writing a partitioned table, file sizing doesn't seem to be properly respected. In this case I was running an ingestion job with the following configs, which were supposed to yield ~100 MB files:
> {code:java}
> Map(
>   "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024), // 100Mb
>   "hoodie.parquet.max.file.size"    -> String.valueOf(120 * 1024 * 1024)  // 120Mb
> ) {code}
>
> Instead, my table contains a lot of very small (~1 MB) files:
> !Screen Shot 2022-04-14 at 1.08.19 PM.png!
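For reference, the two settings quoted in this ticket work out to 104857600 and 125829120 bytes. The sketch below builds the same option map in Python and counts files that fall under the small-file limit. The helper `undersized` and the observed file sizes are hypothetical and are not taken from the attached screenshot.

```python
MB = 1024 * 1024

# Same two settings as in the ticket, expressed as byte counts in strings.
file_sizing_options = {
    "hoodie.parquet.small.file.limit": str(100 * MB),
    "hoodie.parquet.max.file.size": str(120 * MB),
}


def undersized(file_sizes_bytes, options):
    """Return the sizes that fall below the configured small-file limit."""
    limit = int(options["hoodie.parquet.small.file.limit"])
    return [size for size in file_sizes_bytes if size < limit]


# Hypothetical listing: two ~1 MB files and one file near the target size.
observed = [1 * MB, 1 * MB, 110 * MB]
small = undersized(observed, file_sizing_options)

print(file_sizing_options)
print(f"{len(small)} of {len(observed)} files are below the small-file limit")
```

Running a check like this over a real partition listing gives a quick measure of how far the table is from the intended file sizes.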
[GitHub] [hudi] yihua merged pull request #5334: [MINOR] - updated external article list on Hudi docs
yihua merged PR #5334: URL: https://github.com/apache/hudi/pull/5334
[hudi] branch asf-site updated: [DOCS] Updated external article list on Hudi docs (#5334)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new ca6752b8a1 [DOCS] Updated external article list on Hudi docs (#5334)
ca6752b8a1 is described below

commit ca6752b8a1b51a44916e813ded88c205645fc5e8
Author: Kyle Weller
AuthorDate: Fri Apr 15 15:20:32 2022 -0700

    [DOCS] Updated external article list on Hudi docs (#5334)
---
 website/src/pages/talks-articles.md | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/website/src/pages/talks-articles.md b/website/src/pages/talks-articles.md
index 13024b6d95..dff15f5bd6 100644
--- a/website/src/pages/talks-articles.md
+++ b/website/src/pages/talks-articles.md
@@ -94,6 +94,8 @@ Data Summit Connect, May, 2021
 39. ["Apache Hudi Meetup at Uber with talks from Philips, Moveworks & Uber (including Hudi OSS roadmap 2022)"](https://youtu.be/8Q0kM-emMyo) - By Felix Kizhakkel Jose (Philips), Bhavani Sudha (Moveworks), Prashant Wason (Uber), March 2022
+40. ["Apache Hudi with Vinoth Chandar"](https://softwareengineeringdaily.com/2022/03/08/apache-hudi-with-vinoth-chandar/) By Software Engineering Daily. Mar 5, 2022
+
 ## Articles
 You can check out [our blog pages](https://hudi.apache.org/blog.html) for content written by our committers/contributors.
@@ -135,4 +137,18 @@ You can check out [our blog pages](https://hudi.apache.org/blog.html) for conten
 34. ["https://www.xenonstack.com/insights/what-is-hudi;](https://www.xenonstack.com/insights/what-is-hudi) by Chandan Gaur. Nov 22, 2021
 35. ["https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-0-7-0-and-0-8-0-available-on-amazon-emr/;](https://aws.amazon.com/blogs/big-data/new-features-from-apache-hudi-0-7-0-and-0-8-0-available-on-amazon-emr/) by Udit Mehotra and Gagan Brahmi. Dec 20, 2021
 36. ["Designing the Analytics patterns using a Lake House approach on AWS"](https://dev.to/aws-builders/designing-the-analytics-patterns-using-a-lake-house-approach-on-aws-2hh6) by Adit Modi. Dec 30, 2021
-37. ["The Art of Building Open Data Lakes with Apache Hudi, Kafka, Hive, and Debezium"](https://garystafford.medium.com/the-art-of-building-open-data-lakes-with-apache-hudi-kafka-hive-and-debezium-3d2f71c5981f) by Gary Stafford. Dec 31, 2021
\ No newline at end of file
+37. ["The Art of Building Open Data Lakes with Apache Hudi, Kafka, Hive, and Debezium"](https://garystafford.medium.com/the-art-of-building-open-data-lakes-with-apache-hudi-kafka-hive-and-debezium-3d2f71c5981f) by Gary Stafford. Dec 31, 2021
+38. ["Why and How I Integrated Airbyte and Apache Hudi"](https://selectfrom.dev/why-and-how-i-integrated-airbyte-and-apache-hudi-c18aff3af21a) by Harsha Kanna. Jan 18, 2022
+39. ["Hudi powering data lake efforts at Walmart and Disney+ Hotstar"](https://www.techtarget.com/searchdatamanagement/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar) by Sean Kerner. Jan 20, 2022
+40. ["Cost Efficiency @ Scale in Big Data File Format"](https://eng.uber.com/cost-efficiency-big-data/) by Xinli Shang, Kai Jiang, Zheng Shao, and Mohammad Islam. Jan 25, 2022
+41. ["Onehouse Commitment to Openness"](https://www.onehouse.ai/blog/onehouse-commitment-to-openness) by Vinoth Chandar. Feb 2, 2022
+42. ["Onehouse brings a fully-managed lakehouse to Apache Hudi"](https://venturebeat.com/2022/02/03/onehouse-brings-a-fully-managed-lakehouse-to-apache-hudi/) by Paul Sawers. Feb 3, 2022
+43. ["ACID transformations on Distributed file system"](https://medium.com/walmartglobaltech/acid-transformations-on-distributed-file-system-fdec5301c1b1) by Rajasekhar. Feb 9, 2022
+44. ["Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption"](https://garystafford.medium.com/data-lake-table-formats-interest-and-adoption-rate-40817b87be9e) by Gary Stafford. Feb 12, 2022
+45. ["Fresher Data Lake on AWS S3"](https://robinhood.engineering/author-balaji-varadarajan-e3f496815ebf) by Balaji Varadarajan. Feb 17, 2022
+46. ["Understanding its core concepts from hudi persistence files"](https://programmer.ink/think/understanding-its-core-concepts-from-hudi-persistence-files.html) by QbertsBrother. Feb 20, 2022
+47. ["Create a low-latency source-to-data lake pipeline using Amazon MSK Connect, Apache Flink, and Apache Hudi"](https://aws.amazon.com/blogs/big-data/create-a-low-latency-source-to-data-lake-pipeline-using-amazon-msk-connect-apache-flink-and-apache-hudi/) by Ali Alemi. Mar 1, 2022
+48. ["Build a serverless pipeline to analyze streaming data using AWS Glue, Apache Hudi, and Amazon S3"](https://aws.amazon.com/blogs/big-data/build-a-serverless-pipeline-to-analyze-streaming-data-using-aws-glue-apache-hudi-and-amazon-s3/) by Nikhil Khokhar and Dipta Bhattacharya. Mar 9, 2022
+49. ["Zendesk - Insights for CTOs: Part 3 – Growing
[hudi] branch asf-site updated (d926276036 -> ab49d9bcd8)
This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from d926276036 [MINOR] Fix docs build due to std-env (#5335)
     add ab49d9bcd8 GitHub Actions build asf-site

No new revisions were added by this update.

Summary of changes:
 content/404.html | 4 ++--
 content/404/index.html | 4 ++--
 content/assets/js/{main.3d83d4a7.js => main.21fb549d.js} | 4 ++--
 .../js/{main.3d83d4a7.js.LICENSE.txt => main.21fb549d.js.LICENSE.txt} | 0
 content/blog/2016/12/30/strata-talk-2017/index.html | 4 ++--
 content/blog/2019/01/18/asf-incubation/index.html | 4 ++--
 content/blog/2019/03/07/batch-vs-incremental/index.html | 4 ++--
 content/blog/2019/05/14/registering-dataset-to-hive/index.html | 4 ++--
 content/blog/2019/09/09/ingesting-database-changes/index.html | 4 ++--
 content/blog/2020/01/15/delete-support-in-hudi/index.html | 4 ++--
 content/blog/2020/01/20/change-capture-using-aws/index.html | 4 ++--
 content/blog/2020/03/22/exporting-hudi-datasets/index.html | 4 ++--
 content/blog/2020/04/27/apache-hudi-apache-zepplin/index.html | 4 ++--
 .../blog/2020/05/28/monitoring-hudi-metrics-with-datadog/index.html | 4 ++--
 .../2020/08/18/hudi-incremental-processing-on-data-lakes/index.html | 4 ++--
 .../2020/08/20/efficient-migration-of-large-parquet-tables/index.html | 4 ++--
 content/blog/2020/08/21/async-compaction-deployment-model/index.html | 4 ++--
 content/blog/2020/08/22/ingest-multiple-tables-using-hudi/index.html | 4 ++--
 content/blog/2020/10/06/cdc-solution-using-hudi-by-nclouds/index.html | 4 ++--
 content/blog/2020/10/15/apache-hudi-meets-apache-flink/index.html | 4 ++--
 content/blog/2020/10/19/hudi-meets-aws-emr-and-aws-dms/index.html | 4 ++--
 content/blog/2020/11/11/hudi-indexing-mechanisms/index.html | 4 ++--
 .../12/01/high-perf-data-lake-with-hudi-and-alluxio-t3go/index.html | 4 ++--
 content/blog/2021/01/27/hudi-clustering-intro/index.html | 4 ++--
 content/blog/2021/02/13/hudi-key-generators/index.html | 4 ++--
 content/blog/2021/03/01/hudi-file-sizing/index.html | 4 ++--
 .../06/10/employing-right-configurations-for-hudi-cleaner/index.html | 4 ++--
 content/blog/2021/07/21/streaming-data-lake-platform/index.html | 4 ++--
 content/blog/2021/08/16/kafka-custom-deserializer/index.html | 4 ++--
 content/blog/2021/08/18/improving-marker-mechanism/index.html | 4 ++--
 content/blog/2021/08/18/virtual-keys/index.html | 4 ++--
 content/blog/2021/08/23/async-clustering/index.html | 4 ++--
 content/blog/2021/08/23/s3-events-source/index.html | 4 ++--
 .../01/building-eb-level-data-lake-using-hudi-at-bytedance/index.html | 4 ++--
 .../16/lakehouse-concurrency-control-are-we-too-optimistic/index.html | 4 ++--
 .../12/29/hudi-zorder-and-hilbert-space-filling-curves/index.html | 4 ++--
 content/blog/2022/01/06/apache-hudi-2021-a-year-in-review/index.html | 4 ++--
 .../14/change-data-capture-with-debezium-and-apache-hudi/index.html | 4 ++--
 content/blog/archive/index.html | 4 ++--
 content/blog/index.html | 4 ++--
 content/blog/page/2/index.html | 4 ++--
 content/blog/page/3/index.html | 4 ++--
 content/blog/streaming-data-lake-platform/index.html | 4 ++--
 content/community/get-involved/index.html | 4 ++--
 content/community/syncs/index.html | 4 ++--
 content/community/team/index.html | 4 ++--
 content/contribute/developer-setup/index.html | 4 ++--
 content/contribute/how-to-contribute/index.html | 4 ++--
 content/contribute/report-security-issues/index.html | 4 ++--
 content/contribute/rfc-process/index.html | 4 ++--
 content/docs/0.10.0/azure_hoodie/index.html | 4 ++--
 content/docs/0.10.0/bos_hoodie/index.html | 4 ++--
 content/docs/0.10.0/cli/index.html | 4 ++--
 content/docs/0.10.0/cloud/index.html | 4 ++--
 content/docs/0.10.0/clustering/index.html | 4 ++--
 content/docs/0.10.0/compaction/index.html | 4 ++--
 content/docs/0.10.0/comparison/index.html | 4 ++--
 content/docs/0.10.0/concepts/index.html | 4 ++--
[GitHub] [hudi] yihua merged pull request #5335: [MINOR] Fix docs build due to std-env
yihua merged PR #5335: URL: https://github.com/apache/hudi/pull/5335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch asf-site updated (805b893a71 -> d926276036)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

    from 805b893a71 GitHub Actions build asf-site
     add d926276036 [MINOR] Fix docs build due to std-env (#5335)

No new revisions were added by this update.

Summary of changes:
 website/package.json | 1 +
 1 file changed, 1 insertion(+)
[GitHub] [hudi] yihua commented on pull request #5335: [MINOR] Fix docs build due to std-env
yihua commented on PR #5335: URL: https://github.com/apache/hudi/pull/5335#issuecomment-1100400894 cc @vingov @bhasudha
[GitHub] [hudi] yihua opened a new pull request, #5335: [MINOR] Fix docs build due to std-env
yihua opened a new pull request, #5335: URL: https://github.com/apache/hudi/pull/5335

## What is the purpose of the pull request

This PR fixes the docs build, which broke with the latest std-env 3.1.1 release.

## Brief change log

- Uses version 3.0.1 of the "std-env" module in package.json instead.

## Verify this pull request

The website can be built successfully after the fix.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
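The change described in this PR might look like the fragment below. This is only a sketch: the commit summary says one line was added to `website/package.json`, but the exact placement is not shown in this thread, so the `dependencies` section and the exact-version form (no `^` or `~` range) are assumptions.

```json
{
  "dependencies": {
    "std-env": "3.0.1"
  }
}
```

Writing the version without a range prefix asks the package manager for exactly 3.0.1, so a newer release such as 3.1.1 is not picked up for this direct dependency.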
[hudi] branch master updated: [MINOR] Fix typos in log4j-surefire.properties (#5212)
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212)
b8e465fdfc is described below

commit b8e465fdfcac1961fe05ed44993c8c6139e13b31
Author: 董可伦
AuthorDate: Sat Apr 16 04:33:37 2022 +0800

    [MINOR] Fix typos in log4j-surefire.properties (#5212)
---
 .../hudi-client-common/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-flink-client/src/main/resources/log4j-surefire.properties | 4 ++--
 .../hudi-flink-client/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-java-client/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-spark-client/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-common/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-examples-flink/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-examples-spark/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-flink/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-hadoop-mr/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-integ-test/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-kafka-connect/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-spark/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-spark2/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-spark3/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-datahub-sync/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-sync/hudi-dla-sync/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-sync/hudi-hive-sync/src/test/resources/log4j-surefire.properties | 4 ++--
 .../hudi-sync-common/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-timeline-service/src/test/resources/log4j-surefire.properties | 4 ++--
 hudi-utilities/src/test/resources/log4j-surefire.properties | 4 ++--
 21 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/hudi-client/hudi-client-common/src/test/resources/log4j-surefire.properties b/hudi-client/hudi-client-common/src/test/resources/log4j-surefire.properties
index 32af462093..14bbb08972 100644
--- a/hudi-client/hudi-client-common/src/test/resources/log4j-surefire.properties
+++ b/hudi-client/hudi-client-common/src/test/resources/log4j-surefire.properties
@@ -20,9 +20,9 @@ log4j.logger.org.apache=INFO
 log4j.logger.org.apache.hudi=DEBUG
 log4j.logger.org.apache.hadoop.hbase=ERROR
-# A1 is set to be a ConsoleAppender.
+# CONSOLE is set to be a ConsoleAppender.
 log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
-# A1 uses PatternLayout.
+# CONSOLE uses PatternLayout.
 log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
 log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
 log4j.appender.CONSOLE.filter.a=org.apache.log4j.varia.LevelRangeFilter
diff --git a/hudi-client/hudi-flink-client/src/main/resources/log4j-surefire.properties b/hudi-client/hudi-flink-client/src/main/resources/log4j-surefire.properties
index 32af462093..14bbb08972 100644
--- a/hudi-client/hudi-flink-client/src/main/resources/log4j-surefire.properties
+++ b/hudi-client/hudi-flink-client/src/main/resources/log4j-surefire.properties
@@ -20,9 +20,9 @@ log4j.logger.org.apache=INFO
 log4j.logger.org.apache.hudi=DEBUG
 log4j.logger.org.apache.hadoop.hbase=ERROR
-# A1 is set to be a ConsoleAppender.
+# CONSOLE is set to be a ConsoleAppender.
 log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
-# A1 uses PatternLayout.
+# CONSOLE uses PatternLayout.
 log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
 log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
 log4j.appender.CONSOLE.filter.a=org.apache.log4j.varia.LevelRangeFilter
diff --git a/hudi-client/hudi-flink-client/src/test/resources/log4j-surefire.properties b/hudi-client/hudi-flink-client/src/test/resources/log4j-surefire.properties
index 32af462093..14bbb08972 100644
--- a/hudi-client/hudi-flink-client/src/test/resources/log4j-surefire.properties
+++ b/hudi-client/hudi-flink-client/src/test/resources/log4j-surefire.properties
@@ -20,9 +20,9 @@ log4j.logger.org.apache=INFO
 log4j.logger.org.apache.hudi=DEBUG
 log4j.logger.org.apache.hadoop.hbase=ERROR
-# A1 is set to be a ConsoleAppender.
+# CONSOLE is set to be a ConsoleAppender.
 log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
-# A1 uses PatternLayout.
+# CONSOLE uses PatternLayout.
 log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
 log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
 log4j.appender.CONSOLE.filter.a=org.apache.log4j.varia.LevelRangeFilter
diff --git
[GitHub] [hudi] yihua merged pull request #5212: [MINOR] Fix typos in log4j-surefire.properties
yihua merged PR #5212: URL: https://github.com/apache/hudi/pull/5212
[GitHub] [hudi] nsivabalan commented on pull request #5064: [HUDI-3654] Initialize hudi metastore module.
nsivabalan commented on PR #5064: URL: https://github.com/apache/hudi/pull/5064#issuecomment-1100372635 @xiarixiaoyao : can you review this when you get a chance? I have assigned it to myself as well, so I will try to review it within a week.
[GitHub] [hudi] nsivabalan commented on pull request #5057: [HUDI-3651] optimize the hoodie hive client and ddl executor code wit…
nsivabalan commented on PR #5057: URL: https://github.com/apache/hudi/pull/5057#issuecomment-1100370034 @wangxianghu : can you review the patch when you get a chance?
[GitHub] [hudi] nsivabalan commented on pull request #5057: [HUDI-3651] optimize the hoodie hive client and ddl executor code wit…
nsivabalan commented on PR #5057: URL: https://github.com/apache/hudi/pull/5057#issuecomment-1100369298 @JerryYue-M : can you rebase onto the latest master?
[GitHub] [hudi] nsivabalan commented on pull request #5071: [HUDI-1881]: draft implementation for trigger based on data availability
nsivabalan commented on PR #5071: URL: https://github.com/apache/hudi/pull/5071#issuecomment-1100367520 @pratyakshsharma : once the patch is ready, please ping me here and I can review it.
[GitHub] [hudi] nsivabalan commented on pull request #5087: [HUDI-3614] [DO_NOT_MERGE]Replace List with HoodieData in HoodieFlink/JavaTable and commit executors
nsivabalan commented on PR #5087: URL: https://github.com/apache/hudi/pull/5087#issuecomment-1100350369 @danny0405 : can you follow up on the patch when you get a chance? I guess the author is waiting for a review follow-up from you. A gentle reminder.
[GitHub] [hudi] kywe665 opened a new pull request, #5334: [MINOR] - updated external article list on Hudi docs
kywe665 opened a new pull request, #5334: URL: https://github.com/apache/hudi/pull/5334

## What is the purpose of the pull request

Updated the list of external articles in the Hudi docs.

## Committer checklist

- [X] Has a corresponding JIRA in PR title & commit
- [X] Commit message is descriptive of the change
- [X] CI is green
- [X] Necessary doc changes done or have another open PR
- [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] bhasudha opened a new pull request, #5333: [DOCS] update broken links
bhasudha opened a new pull request, #5333: URL: https://github.com/apache/hudi/pull/5333

## *Tips*

- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request

Update broken links across the website.

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*

- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*

- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] nsivabalan commented on pull request #5111: [HUDI-3695] Add a ORC reader in HoodieBaseRelation
nsivabalan commented on PR #5111: URL: https://github.com/apache/hudi/pull/5111#issuecomment-1100343028 @alexeykudinkin : can you follow up on the review when you get a chance? @miomiocat : can you rebase onto the latest master?
[GitHub] [hudi] nsivabalan commented on pull request #5139: [WIP][HUDI-3579] Add timeline commands in hudi-cli
nsivabalan commented on PR #5139: URL: https://github.com/apache/hudi/pull/5139#issuecomment-1100337466 @yihua : ping me once the patch is ready to be reviewed again
[GitHub] [hudi] nsivabalan commented on pull request #5177: [HUDI-3746][DO_NOT_MERGE] Test CI
nsivabalan commented on PR #5177: URL: https://github.com/apache/hudi/pull/5177#issuecomment-1100334046 Can we close this?
[GitHub] [hudi] nsivabalan closed pull request #5192: [WIP][DO_NOT_MERGE] Enable inline reading
nsivabalan closed pull request #5192: [WIP][DO_NOT_MERGE] Enable inline reading URL: https://github.com/apache/hudi/pull/5192
[jira] [Updated] (HUDI-3779) Add docs regarding caveats for disabling and re-enabling MDT
[ https://issues.apache.org/jira/browse/HUDI-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-3779:
    Status: In Progress (was: Open)

> Add docs regarding caveats for disabling and re-enabling MDT
>
> Key: HUDI-3779
> URL: https://issues.apache.org/jira/browse/HUDI-3779
> Project: Apache Hudi
> Issue Type: Task
> Components: docs
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> After disabling MDT, the user should make sure, after a few commits, that MDT is completely deleted before re-enabling it. The user should not flip the flag off and on frequently; otherwise, there can be correctness issues.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3779) Add docs regarding caveats for disabling and re-enabling MDT
[ https://issues.apache.org/jira/browse/HUDI-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-3779:
    Status: Patch Available (was: In Progress)

> Add docs regarding caveats for disabling and re-enabling MDT
>
> Key: HUDI-3779
> URL: https://issues.apache.org/jira/browse/HUDI-3779
> Project: Apache Hudi
> Issue Type: Task
> Components: docs
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> After disabling MDT, the user should make sure, after a few commits, that MDT is completely deleted before re-enabling it. The user should not flip the flag off and on frequently; otherwise, there can be correctness issues.
[jira] [Updated] (HUDI-3779) Add docs regarding caveats for disabling and re-enabling MDT
[ https://issues.apache.org/jira/browse/HUDI-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-3779:
    Labels: pull-request-available (was: )

> Add docs regarding caveats for disabling and re-enabling MDT
>
> Key: HUDI-3779
> URL: https://issues.apache.org/jira/browse/HUDI-3779
> Project: Apache Hudi
> Issue Type: Task
> Components: docs
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> After disabling MDT, the user should make sure, after a few commits, that MDT is completely deleted before re-enabling it. The user should not flip the flag off and on frequently; otherwise, there can be correctness issues.
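The caveat in HUDI-3779 (confirm the metadata table is gone before turning it back on) could be checked with something like the sketch below. It is not taken from Hudi itself: the class and method names are hypothetical, and it assumes the metadata table lives under `.hoodie/metadata` inside the table base path on a filesystem that `java.nio.file` can reach. A table on HDFS or S3 would need the Hadoop `FileSystem` API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class MdtReenableCheck {
    // Assumption: the metadata table (MDT) is stored under
    // <tableBasePath>/.hoodie/metadata. Returns true only when that
    // directory no longer exists, i.e. the MDT has been fully removed.
    static boolean metadataTableFullyDeleted(Path tableBasePath) {
        Path mdtPath = tableBasePath.resolve(".hoodie").resolve("metadata");
        return Files.notExists(mdtPath);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical local table directory used only for this demo.
        Path table = Files.createTempDirectory("hudi-table");

        // No .hoodie/metadata directory yet: re-enabling would be safe.
        System.out.println(metadataTableFullyDeleted(table));

        // Simulate a metadata table that has not been cleaned up yet.
        Files.createDirectories(table.resolve(".hoodie").resolve("metadata"));
        System.out.println(metadataTableFullyDeleted(table));
    }
}
```

A deployment script could run a check like this as a gate before flipping the metadata flag back on, rather than relying on a fixed number of commits having passed.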
[GitHub] [hudi] yihua opened a new pull request, #5332: [HUDI-3779] Update metadata table docs
yihua opened a new pull request, #5332: URL: https://github.com/apache/hudi/pull/5332

## What is the purpose of the pull request

This PR updates the metadata table docs with more detailed configurations and deployment considerations based on the 0.11.0 release.

## Brief change log

- Revised `metadata.md`

## Verify this pull request

The website and the page can be built and rendered properly.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] nsivabalan commented on pull request #5246: [HUDI-3813] [RFC-33] Schema Evolution Support DDL And DML Concurrency.
nsivabalan commented on PR #5246: URL: https://github.com/apache/hudi/pull/5246#issuecomment-1100299905 @xushiyan : for now, I have assigned the PR to you. Let me know if you can't take this up; I will find someone else or take it up myself.
[GitHub] [hudi] nsivabalan commented on pull request #5264: [HUDI-3818] encode bytes column value when generate HoodieKey
nsivabalan commented on PR #5264: URL: https://github.com/apache/hudi/pull/5264#issuecomment-1100297837 Generally, the record key, partition path, and precombine fields should be comparable, so they are likely to be primitive types. I am wondering what use case demands choosing byte[] as a field for the record key or partition path.
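The concern about `byte[]` keys can be illustrated with plain Java. The sketch below is not the change made in PR #5264 (its diff is not shown in this thread); `encodeKey` is a hypothetical helper, and Base64 is just one possible encoding. It shows that two arrays with the same content are not equal under `equals`, while their encoded strings are.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class ByteKeySketch {
    // Hypothetical helper: turn a byte[] column value into a String, which
    // has content-based equals/hashCode and implements Comparable.
    static String encodeKey(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        byte[] a = "id-1".getBytes(StandardCharsets.UTF_8);
        byte[] b = "id-1".getBytes(StandardCharsets.UTF_8);

        // Arrays inherit Object.equals, so this compares references: false.
        System.out.println(a.equals(b));

        // Content comparison needs a helper call: true.
        System.out.println(Arrays.equals(a, b));

        // Encoded keys compare by content like any other String: true.
        System.out.println(encodeKey(a).equals(encodeKey(b)));
    }
}
```

Because `byte[]` also does not implement `Comparable`, it cannot be ordered or used reliably as a map key without a wrapper or an encoding step like the one above.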
[hudi] branch master updated: [HUDI-3835] Add UT for delete in java client (#5270)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 99dd1cb6e6 [HUDI-3835] Add UT for delete in java client (#5270)
99dd1cb6e6 is described below

commit 99dd1cb6e63600681aa11b3a03bc16d1401d8055
Author: 董可伦
AuthorDate: Sat Apr 16 03:03:48 2022 +0800

    [HUDI-3835] Add UT for delete in java client (#5270)
---
 .../commit/TestJavaCopyOnWriteActionExecutor.java | 86 +-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
index 1bf1b4cccb..518414d614 100644
--- a/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
+++ b/hudi-client/hudi-java-client/src/test/java/org/apache/hudi/table/action/commit/TestJavaCopyOnWriteActionExecutor.java
@@ -318,7 +318,7 @@ public class TestJavaCopyOnWriteActionExecutor extends HoodieJavaClientTestBase
   }
   @Test
-    public void testInsertRecords() throws Exception {
+  public void testInsertRecords() throws Exception {
     HoodieWriteConfig config = makeHoodieClientConfig();
     String instantTime = makeNewCommitTime();
     metaClient = HoodieTableMetaClient.reload(metaClient);
@@ -465,6 +465,90 @@ public class TestJavaCopyOnWriteActionExecutor extends HoodieJavaClientTestBase
     verifyStatusResult(returnedStatuses, generateExpectedPartitionNumRecords(inputRecords));
   }
+  @Test
+  public void testDeleteRecords() throws Exception {
+    // Prepare the AvroParquetIO
+    HoodieWriteConfig config = makeHoodieClientConfig();
+    int startInstant = 1;
+    String firstCommitTime = makeNewCommitTime(startInstant++, "%09d");
+    HoodieJavaWriteClient writeClient = getHoodieWriteClient(config);
+    writeClient.startCommitWithTime(firstCommitTime);
+    metaClient = HoodieTableMetaClient.reload(metaClient);
+    BaseFileUtils fileUtils = BaseFileUtils.getInstance(metaClient);
+
+    String partitionPath = "2022/04/09";
+
+    // Get some records belong to the same partition (2016/01/31)
+    String recordStr1 = "{\"_row_key\":\"8eb5b87a-1feh-4edd-87b4-6ec96dc405a0\","
+        + "\"time\":\"2022-04-09T03:16:41.415Z\",\"number\":1}";
+    String recordStr2 = "{\"_row_key\":\"8eb5b87b-1feu-4edd-87b4-6ec96dc405a0\","
+        + "\"time\":\"2022-04-09T03:20:41.415Z\",\"number\":2}";
+    String recordStr3 = "{\"_row_key\":\"8eb5b87c-1fej-4edd-87b4-6ec96dc405a0\","
+        + "\"time\":\"2022-04-09T03:16:41.415Z\",\"number\":3}";
+
+    List records = new ArrayList<>();
+    RawTripTestPayload rowChange1 = new RawTripTestPayload(recordStr1);
+    records.add(new HoodieAvroRecord(new HoodieKey(rowChange1.getRowKey(), rowChange1.getPartitionPath()), rowChange1));
+    RawTripTestPayload rowChange2 = new RawTripTestPayload(recordStr2);
+    records.add(new HoodieAvroRecord(new HoodieKey(rowChange2.getRowKey(), rowChange2.getPartitionPath()), rowChange2));
+    RawTripTestPayload rowChange3 = new RawTripTestPayload(recordStr3);
+    records.add(new HoodieAvroRecord(new HoodieKey(rowChange3.getRowKey(), rowChange3.getPartitionPath()), rowChange3));
+
+    // Insert new records
+    writeClient.insert(records, firstCommitTime);
+
+    FileStatus[] allFiles = getIncrementalFiles(partitionPath, "0", -1);
+    assertEquals(1, allFiles.length);
+
+    // Read out the bloom filter and make sure filter can answer record exist or not
+    Path filePath = allFiles[0].getPath();
+    BloomFilter filter = fileUtils.readBloomFilterFromMetadata(hadoopConf, filePath);
+    for (HoodieRecord record : records) {
+      assertTrue(filter.mightContain(record.getRecordKey()));
+    }
+
+    // Read the base file, check the record content
+    List fileRecords = fileUtils.readAvroRecords(hadoopConf, filePath);
+    int index = 0;
+    for (GenericRecord record : fileRecords) {
+      assertEquals(records.get(index).getRecordKey(), record.get("_row_key").toString());
+      index++;
+    }
+
+    String newCommitTime = makeNewCommitTime(startInstant++, "%09d");
+    writeClient.startCommitWithTime(newCommitTime);
+
+    // Test delete two records
+    List keysForDelete = new ArrayList(Arrays.asList(records.get(0).getKey(), records.get(2).getKey()));
+    writeClient.delete(keysForDelete, newCommitTime);
+
+    allFiles = getIncrementalFiles(partitionPath, "0", -1);
+    assertEquals(1, allFiles.length);
+
+    filePath = allFiles[0].getPath();
+    // Read the base file, check the record content
+    fileRecords = fileUtils.readAvroRecords(hadoopConf, filePath);
+    // Check that the two records are deleted successfully
+
[GitHub] [hudi] nsivabalan merged pull request #5270: [HUDI-3835] Add UT for delete in java client
nsivabalan merged PR #5270: URL: https://github.com/apache/hudi/pull/5270 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #5292: [WIP] Upgrade to Hadoop 3.x Hive 3.x
nsivabalan commented on PR #5292: URL: https://github.com/apache/hudi/pull/5292#issuecomment-1100295279 Please prefix the title with the right Jira ID. I understand it's still WIP, but a gentle reminder.
[GitHub] [hudi] nsivabalan commented on pull request #5319: [WIP] Adjusting `DeltaStreamer` shutdown sequence to avoid awaiting for 24h
nsivabalan commented on PR #5319: URL: https://github.com/apache/hudi/pull/5319#issuecomment-1100291962 Please create a Jira and tag it.
[GitHub] [hudi] alexeykudinkin commented on pull request #5329: [HUDI-3886] Adding default null for some of the fields in col stats in MDT schema
alexeykudinkin commented on PR #5329: URL: https://github.com/apache/hudi/pull/5329#issuecomment-1100291410 @nsivabalan done
[hudi] branch master updated (57612c5c32 -> e8ab915aff)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 57612c5c32 [HUDI-3848] Fixing restore with cleaned up commits (#5288) add e8ab915aff [MINOR] Removing invalid code to close parquet reader iterator (#5182) No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/hudi/HoodieBaseRelation.scala | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-)
[GitHub] [hudi] nsivabalan commented on pull request #5329: [HUDI-3886] Adding default null for some of the fields in col stats in MDT schema
nsivabalan commented on PR #5329: URL: https://github.com/apache/hudi/pull/5329#issuecomment-1100289976 @alexeykudinkin: can you stamp this?
[GitHub] [hudi] nsivabalan merged pull request #5182: [MINOR] Fixing parquet reader iterator close
nsivabalan merged PR #5182: URL: https://github.com/apache/hudi/pull/5182
[hudi] branch master updated (9e8664f4d2 -> 57612c5c32)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 9e8664f4d2 [HOTFIX] add missing license (#5322) (#5324) add 57612c5c32 [HUDI-3848] Fixing restore with cleaned up commits (#5288) No new revisions were added by this update. Summary of changes: .../rollback/ListingBasedRollbackStrategy.java | 10 ++- .../TestHoodieSparkMergeOnReadTableRollback.java | 88 ++ 2 files changed, 97 insertions(+), 1 deletion(-)
[GitHub] [hudi] nsivabalan merged pull request #5288: [HUDI-3848] Fixing restore with cleaned up commits
nsivabalan merged PR #5288: URL: https://github.com/apache/hudi/pull/5288
[jira] [Commented] (HUDI-3749) Run latest hudi w/ EMR spark and report to aws folks
[ https://issues.apache.org/jira/browse/HUDI-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522920#comment-17522920 ] sivabalan narayanan commented on HUDI-3749: --- Handing it off to [~uditme] to take it from here. [~xushiyan]: I will let Udit drive this, since the AWS folks need to upstream the changes they have internally to OSS anyway. > Run latest hudi w/ EMR spark and report to aws folks > > > Key: HUDI-3749 > URL: https://issues.apache.org/jira/browse/HUDI-3749 > Project: Apache Hudi > Issue Type: Task > Components: tests-ci >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3749) Try out 0.11 hudi w/ EMR spark
[ https://issues.apache.org/jira/browse/HUDI-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3749: -- Summary: Try out 0.11 hudi w/ EMR spark (was: Run latest hudi w/ EMR spark and report to aws folks) > Try out 0.11 hudi w/ EMR spark > --- > > Key: HUDI-3749 > URL: https://issues.apache.org/jira/browse/HUDI-3749 > Project: Apache Hudi > Issue Type: Task > Components: tests-ci >Reporter: sivabalan narayanan >Assignee: Udit Mehrotra >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3749) Run latest hudi w/ EMR spark and report to aws folks
[ https://issues.apache.org/jira/browse/HUDI-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-3749: - Assignee: Udit Mehrotra (was: sivabalan narayanan) > Run latest hudi w/ EMR spark and report to aws folks > > > Key: HUDI-3749 > URL: https://issues.apache.org/jira/browse/HUDI-3749 > Project: Apache Hudi > Issue Type: Task > Components: tests-ci >Reporter: sivabalan narayanan >Assignee: Udit Mehrotra >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-3749) Run latest hudi w/ EMR spark and report to aws folks
[ https://issues.apache.org/jira/browse/HUDI-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522919#comment-17522919 ] sivabalan narayanan commented on HUDI-3749: --- regular hive sync worked out of the box. {code:java} df.write.format("hudi"). option(PRECOMBINE_FIELD_OPT_KEY, "tpep_dropoff_datetime"). option(RECORDKEY_FIELD_OPT_KEY, "tpep_pickup_datetime"). option(PARTITIONPATH_FIELD_OPT_KEY, "date_col"). option(TABLE_NAME, "hudi_tbl1"). option("hoodie.embed.timeline.server","false"). option("hoodie.datasource.hive_sync.enable","true"). option("hoodie.datasource.hive_sync.database","default"). option("hoodie.datasource.hive_sync.table","test_tbl3"). option("hoodie.datasource.hive_sync.mode","hms"). option("hoodie.datasource.hive_sync.partition_fields","_hoodie_partition_path"). mode(Overwrite). save(basePath) {code} via beeline: {code:java} select * from test_tbl3 limit 5;{code} {code:java} ++-+---++-+-+--++--+---+---+-+-+-++--++---+-+--+-+-+-+---+ | test_tbl3._hoodie_commit_time | test_tbl3._hoodie_commit_seqno | test_tbl3._hoodie_record_key | test_tbl3._hoodie_file_name | test_tbl3.vendorid | test_tbl3.tpep_pickup_datetime | test_tbl3.tpep_dropoff_datetime | test_tbl3.passenger_count | test_tbl3.trip_distance | test_tbl3.ratecodeid | test_tbl3.store_and_fwd_flag | test_tbl3.pulocationid | test_tbl3.dolocationid | test_tbl3.payment_type | test_tbl3.fare_amount | test_tbl3.extra | test_tbl3.mta_tax | test_tbl3.tip_amount | test_tbl3.tolls_amount | test_tbl3.improvement_surcharge | test_tbl3.total_amount | test_tbl3.congestion_surcharge | test_tbl3.date_col | test_tbl3._hoodie_partition_path | ++-+---++-+-+--++--+---+---+-+-+-++--++---+-+--+-+-+-+---+ | 20220415180627021 | 20220415180627021_7_1085992 | 2008-12-31 23:02:59 | e78169d4-03a8-40e0-ad11-9ae43a52b565-0_7-155-6608_20220415180627021.parquet | 2 | 2008-12-31 23:02:59 | 2009-01-01 18:22:41 | 1 | 0.99 | 1 | N | 249 | 90 | 2 | 7.0 | 1.0 | 0.5 | 0.0 | 0.0 | 0.3 | 11.3 | 2.5 | 
2008-12-31 | 2008-12-31 | | 20220415180627021 | 20220415180627021_7_1085996 | 2008-12-31 23:07:03 | e78169d4-03a8-40e0-ad11-9ae43a52b565-0_7-155-6608_20220415180627021.parquet | 2 | 2008-12-31 23:07:03 | 2008-12-31 23:19:26 | 1 | 1.39 | 1 | N | 107 | 162 | 2 | 8.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 11.8 | 2.5 | 2008-12-31 | 2008-12-31 | | 20220415180627021 | 20220415180627021_7_1085998 | 2008-12-31 23:43:51 | e78169d4-03a8-40e0-ad11-9ae43a52b565-0_7-155-6608_20220415180627021.parquet | 2 | 2008-12-31 23:43:51 | 2009-01-01 10:32:34 | 1 | 0.79 | 1 | N |
[jira] [Updated] (HUDI-3890) Fix apache rat check to detect all missing license
[ https://issues.apache.org/jira/browse/HUDI-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3890: - Priority: Critical (was: Major) > Fix apache rat check to detect all missing license > -- > > Key: HUDI-3890 > URL: https://issues.apache.org/jira/browse/HUDI-3890 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Priority: Critical > > these 2 files which didn't have license were not reported > ./hudi-utilities/src/test/resources/delta-streamer-config/schema_registry.source_schema_tab.sql > ./hudi-utilities/src/test/resources/delta-streamer-config/schema_registry.target_schema_tab.sql -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3890) Fix apache rat check to detect all missing license
[ https://issues.apache.org/jira/browse/HUDI-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3890: - Fix Version/s: 0.12.0 > Fix apache rat check to detect all missing license > -- > > Key: HUDI-3890 > URL: https://issues.apache.org/jira/browse/HUDI-3890 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Priority: Critical > Fix For: 0.12.0 > > > these 2 files which didn't have license were not reported > ./hudi-utilities/src/test/resources/delta-streamer-config/schema_registry.source_schema_tab.sql > ./hudi-utilities/src/test/resources/delta-streamer-config/schema_registry.target_schema_tab.sql -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3890) Fix apache rat check to detect all missing license
Raymond Xu created HUDI-3890: Summary: Fix apache rat check to detect all missing license Key: HUDI-3890 URL: https://issues.apache.org/jira/browse/HUDI-3890 Project: Apache Hudi Issue Type: Task Reporter: Raymond Xu these 2 files which didn't have license were not reported ./hudi-utilities/src/test/resources/delta-streamer-config/schema_registry.source_schema_tab.sql ./hudi-utilities/src/test/resources/delta-streamer-config/schema_registry.target_schema_tab.sql -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] yihua opened a new pull request, #5331: [MINOR] Add a medium-definition Hudi logo
yihua opened a new pull request, #5331: URL: https://github.com/apache/hudi/pull/5331

## What is the purpose of the pull request

As above.

## Brief change log

- Adds `website/static/assets/images/hudi-logo-medium.png`.

## Verify this pull request

The website is built locally and the new image can be accessed by `http://localhost:3000/assets/images/hudi-logo-medium.png`.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Created] (HUDI-3889) Do not validate table config if save mode is set to Overwrite
sivabalan narayanan created HUDI-3889:

Summary: Do not validate table config if save mode is set to Overwrite
Key: HUDI-3889
URL: https://issues.apache.org/jira/browse/HUDI-3889
Project: Apache Hudi
Issue Type: Task
Components: spark
Reporter: sivabalan narayanan

With a Spark datasource write, if Overwrite is set as the save mode, we should not do table config validation.

{code:java}
scala> df.write.format("hudi").
     | option(PRECOMBINE_FIELD_OPT_KEY, "tpep_dropoff_datetime").
     | option(RECORDKEY_FIELD_OPT_KEY, "tpep_pickup_datetime").
     | option(PARTITIONPATH_FIELD_OPT_KEY, "date_col").
     | option(TABLE_NAME, "hudi_tbl1").
     | option("hoodie.embed.timeline.server","false").
     | mode(Overwrite).
     | save(basePath)
warning: one deprecation; for details, enable `:setting -deprecation' or `:replay -deprecation'
org.apache.hudi.exception.HoodieException: Config conflict(key    current value    existing value):
RecordKey:        tpep_pickup_datetime     id
PreCombineKey:    tpep_dropoff_datetime    created_at
  at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:161)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:87)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:161)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
{code}
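The behaviour requested in HUDI-3889 can be sketched roughly as follows. This is a toy model only: `SaveModeValidationSketch`, its `Mode` enum and `conflicts` method are hypothetical names invented for illustration and are not Hudi or Spark APIs. It assumes validation means comparing incoming write options against the configs stored with the existing table, which is what the error message in the ticket suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the proposed behaviour; none of these names are Hudi APIs. */
final class SaveModeValidationSketch {

    /** Hypothetical stand-in for Spark's save mode, trimmed to what the sketch needs. */
    enum Mode { APPEND, OVERWRITE }

    /**
     * Returns conflicting keys mapped to "current vs existing".
     * With OVERWRITE the existing table is replaced, so nothing is compared.
     */
    static Map<String, String> conflicts(Mode mode,
                                         Map<String, String> incoming,
                                         Map<String, String> existing) {
        Map<String, String> result = new LinkedHashMap<>();
        if (mode == Mode.OVERWRITE) {
            return result; // proposed change: skip table config validation entirely
        }
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            String old = existing.get(e.getKey());
            if (old != null && !old.equals(e.getValue())) {
                result.put(e.getKey(), e.getValue() + " vs " + old);
            }
        }
        return result;
    }
}
```

With an incoming `RecordKey` of `tpep_pickup_datetime` and a different value stored for the existing table, `conflicts` reports one entry under `APPEND` and none under `OVERWRITE`.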
[jira] [Updated] (HUDI-3889) Do not validate table config if save mode is set to Overwrite
[ https://issues.apache.org/jira/browse/HUDI-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3889: -- Priority: Critical (was: Major) > Do not validate table config if save mode is set to Overwrite > - > > Key: HUDI-3889 > URL: https://issues.apache.org/jira/browse/HUDI-3889 > Project: Apache Hudi > Issue Type: Task > Components: spark >Reporter: sivabalan narayanan >Priority: Critical > > with spark datasource write, if Overwrite is set as save mode, we should not > do table config validation > > {code:java} > scala> df.write.format("hudi"). > | option(PRECOMBINE_FIELD_OPT_KEY, "tpep_dropoff_datetime"). > | option(RECORDKEY_FIELD_OPT_KEY, "tpep_pickup_datetime"). > | option(PARTITIONPATH_FIELD_OPT_KEY, "date_col"). > | option(TABLE_NAME, "hudi_tbl1"). > | option("hoodie.embed.timeline.server","false"). > | mode(Overwrite). > | save(basePath) > warning: one deprecation; for details, enable `:setting -deprecation' or > `:replay -deprecation' > org.apache.hudi.exception.HoodieException: Config conflict(keycurrent > value existing value): > RecordKey:tpep_pickup_datetimeid > PreCombineKey:tpep_dropoff_datetime created_at > at > org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:161) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:87) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:161) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3889) Do not validate table config if save mode is set to Overwrite
[ https://issues.apache.org/jira/browse/HUDI-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3889: -- Fix Version/s: 0.12.0 > Do not validate table config if save mode is set to Overwrite > - > > Key: HUDI-3889 > URL: https://issues.apache.org/jira/browse/HUDI-3889 > Project: Apache Hudi > Issue Type: Task > Components: spark >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.12.0 > > > with spark datasource write, if Overwrite is set as save mode, we should not > do table config validation > > {code:java} > scala> df.write.format("hudi"). > | option(PRECOMBINE_FIELD_OPT_KEY, "tpep_dropoff_datetime"). > | option(RECORDKEY_FIELD_OPT_KEY, "tpep_pickup_datetime"). > | option(PARTITIONPATH_FIELD_OPT_KEY, "date_col"). > | option(TABLE_NAME, "hudi_tbl1"). > | option("hoodie.embed.timeline.server","false"). > | mode(Overwrite). > | save(basePath) > warning: one deprecation; for details, enable `:setting -deprecation' or > `:replay -deprecation' > org.apache.hudi.exception.HoodieException: Config conflict(keycurrent > value existing value): > RecordKey:tpep_pickup_datetimeid > PreCombineKey:tpep_dropoff_datetime created_at > at > org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:161) > at > org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:87) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:161) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path
hudi-bot commented on PR #5328: URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100234074 ## CI report: * 6812e0065e1411107d7d53ad2997d02e7ce34d06 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] nsivabalan commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path
nsivabalan commented on PR #5328: URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100196310 High-level comment: I would prefer to introduce a new sort mode instead of fixing NONE, and to add documentation on when to use which sort mode, so that users are aware of the different sort modes and their implications.
[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path
hudi-bot commented on PR #5328: URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100192710 ## CI report: * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077) * 6812e0065e1411107d7d53ad2997d02e7ce34d06 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8079) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #5328: [WIP] Fix Bulk Insert to repartition the dataset based on Partition Path
hudi-bot commented on PR #5328: URL: https://github.com/apache/hudi/pull/5328#issuecomment-1100190821 ## CI report: * 96b33942edf6a1d6d89361d2e056ed1c3a8d326b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8077) * 6812e0065e1411107d7d53ad2997d02e7ce34d06 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-3826) Make truncate partition use delete_partition operation
[ https://issues.apache.org/jira/browse/HUDI-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3826:

Reviewers: Alexey Kudinkin, Raymond Xu, sivabalan narayanan (was: Alexey Kudinkin, sivabalan narayanan)

> Make truncate partition use delete_partition operation
>
> Key: HUDI-3826
> URL: https://issues.apache.org/jira/browse/HUDI-3826
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Forward Xu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Currently, `TruncateHoodieTableCommand` as well as `AlterHoodieTableDropPartitionCommand` delete partitions from a Hudi table by simply removing the corresponding partition folders without committing any changes (and so without updating the MT, for example).
> Instead, they should go through WriteClient's `deletePartitions` API, similar to what Spark DS does when it gets Hudi's DELETE command.
> You can see this when enabling the Column Stats Index by default and running our CI (setting "hoodie.metadata.index.column.stats.enable" and "hoodie.metadata.enable" to true):
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=7926=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0=746585d8-b50a-55c3-26c5-517d93af9934
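To illustrate why HUDI-3826 asks for a committed `delete_partition` rather than a bare folder removal, here is a deliberately small toy model. Every name in it (`TruncatePartitionSketch`, `dropFolderOnly`, `deletePartitionWithCommit`) is hypothetical and none of it is Hudi code; the sets simply stand in for storage, a metadata index and the timeline.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy table: partition folders plus a metadata index that readers trust. */
final class TruncatePartitionSketch {
    final Set<String> folders = new HashSet<>();       // what physically exists on storage
    final Set<String> metadataIndex = new HashSet<>(); // stand-in for the metadata table
    final List<String> timeline = new ArrayList<>();   // committed actions, in order

    void addPartition(String partition) {
        folders.add(partition);
        metadataIndex.add(partition);
        timeline.add("commit " + partition);
    }

    /** Behaviour the ticket describes as current: remove the folder, record nothing. */
    void dropFolderOnly(String partition) {
        folders.remove(partition);
    }

    /** Behaviour the ticket asks for: one committed action that updates everything. */
    void deletePartitionWithCommit(String partition) {
        folders.remove(partition);
        metadataIndex.remove(partition);
        timeline.add("delete_partition " + partition);
    }

    /** True when the index lists exactly the partitions present on storage. */
    boolean isConsistent() {
        return folders.equals(metadataIndex);
    }
}
```

After `dropFolderOnly`, the index still lists a partition that no longer exists, so `isConsistent()` is false; after `deletePartitionWithCommit`, storage, index and timeline agree.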
[jira] [Created] (HUDI-3888) Triage drop partition col with CI
Raymond Xu created HUDI-3888: Summary: Triage drop partition col with CI Key: HUDI-3888 URL: https://issues.apache.org/jira/browse/HUDI-3888 Project: Apache Hudi Issue Type: Task Components: tests-ci Reporter: Raymond Xu Assignee: Ethan Guo Fix For: 0.11.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3888) Triage drop partition col with CI
[ https://issues.apache.org/jira/browse/HUDI-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3888: - Sprint: Hudi-Sprint-Apr-12 > Triage drop partition col with CI > - > > Key: HUDI-3888 > URL: https://issues.apache.org/jira/browse/HUDI-3888 > Project: Apache Hudi > Issue Type: Task > Components: tests-ci >Reporter: Raymond Xu >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3888) Triage drop partition col with CI
[ https://issues.apache.org/jira/browse/HUDI-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3888: - Status: In Progress (was: Open) > Triage drop partition col with CI > - > > Key: HUDI-3888 > URL: https://issues.apache.org/jira/browse/HUDI-3888 > Project: Apache Hudi > Issue Type: Task > Components: tests-ci >Reporter: Raymond Xu >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3707) Fix deltastreamer test with schema provider and transformer enabled
[ https://issues.apache.org/jira/browse/HUDI-3707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-3707: - Assignee: sivabalan narayanan (was: Sagar Sumit) > Fix deltastreamer test with schema provider and transformer enabled > --- > > Key: HUDI-3707 > URL: https://issues.apache.org/jira/browse/HUDI-3707 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: Raymond Xu >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.11.0, 0.12.0 > > > Fix cases like this > @Disabled("To investigate problem with schema provider and transformer") > in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3707) Fix deltastreamer test with schema provider and transformer enabled
[ https://issues.apache.org/jira/browse/HUDI-3707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3707: -- Status: In Progress (was: Open) > Fix deltastreamer test with schema provider and transformer enabled > --- > > Key: HUDI-3707 > URL: https://issues.apache.org/jira/browse/HUDI-3707 > Project: Apache Hudi > Issue Type: Test > Components: tests-ci >Reporter: Raymond Xu >Assignee: sivabalan narayanan >Priority: Blocker > Fix For: 0.11.0, 0.12.0 > > > Fix cases like this > @Disabled("To investigate problem with schema provider and transformer") > in org.apache.hudi.utilities.functional.TestHoodieDeltaStreamer -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3867) Disable Data Skipping by default in 0.11
[ https://issues.apache.org/jira/browse/HUDI-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3867. Resolution: Fixed > Disable Data Skipping by default in 0.11 > > > Key: HUDI-3867 > URL: https://issues.apache.org/jira/browse/HUDI-3867 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Since it now relies on MT's Column Stats Index, which is off by default in 0.11. > > We should re-enable it right after the release.
[GitHub] [hudi] Guanpx commented on issue #5330: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
Guanpx commented on issue #5330: URL: https://github.com/apache/hudi/issues/5330#issuecomment-1100019832 cc @danny0405
[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
kasured commented on issue #5298: URL: https://github.com/apache/hudi/issues/5298#issuecomment-104068

Upon further investigation, and after enabling additional logs on EMR, the deletion of the file during compaction turns out to happen in `org.apache.hudi.table.HoodieTable#reconcileAgainstMarkers`:

```
if (!invalidDataPaths.isEmpty()) {
  LOG.info("Removing duplicate data files created due to spark retries before committing. Paths=" + invalidDataPaths);
```

However, later in the logs this file is written and committed in the instant:

```
INFO SparkRDDWriteClient: Committing Compaction 20220414232316. Finished with result HoodieCommitMetadata{partitionToWriteStats={cluster=96/shard=14377=[HoodieWriteStat{fileId='9d9f72e9-9381-40d0-af0c-cb48c25bd78d-0', path='cluster=96/shard=14377/9d9f72e9-9381-40d0-af0c-cb48c25bd78d-0_0-617-7132_20220414232316.parquet', prevCommit='20220414225217', numWrites=122886, numDeletes=0, numUpdateWrites=121939, totalWriteBytes=23331178, totalWriteErrors=0, tempPath='null', partitionPath='cluster=96/shard=14377', totalLogRecords=341027, totalLogFilesCompacted=3, totalLogSizeCompacted=285373803, totalUpdatedRecordsCompacted=121939, totalLogBlocks=9, totalCorruptLogBlock=0, totalRollbackBlocks=0}]}, compacted=true,
```

So it leaves the system in an inconsistent state. It looks like a concurrency issue to me. I will try to submit multiple StreamingQuery instances in different threads by leveraging Spark scheduling pools, and will update on the status.
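The reconcile step named in the comment above can be pictured as a set difference. The sketch below is a simplification under that assumption, based only on the quoted log message; `MarkerReconcileSketch` and `invalidDataPaths` as written here are illustrative names, not the actual Hudi implementation.

```java
import java.util.Set;
import java.util.TreeSet;

/** Simplified picture of reconciling marker files against commit metadata. */
final class MarkerReconcileSketch {

    /**
     * Paths that have a marker (some task attempt started writing them) but are
     * absent from the commit metadata are treated as leftovers to delete.
     */
    static Set<String> invalidDataPaths(Set<String> markedPaths, Set<String> committedPaths) {
        Set<String> invalid = new TreeSet<>(markedPaths);
        invalid.removeAll(committedPaths);
        return invalid;
    }
}
```

Under this model the reported symptom would correspond to a path being classified as invalid (and deleted) by one reconcile pass while another writer's commit metadata still references it, which is consistent with the concurrency suspicion in the comment.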
[GitHub] [hudi] Guanpx opened a new issue, #5330: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
Guanpx opened a new issue, #5330: URL: https://github.com/apache/hudi/issues/5330

**Describe the problem you faced**

Using Flink 1.13, bucket index, COW table, Hudi 0.11.0 (not the latest).

**To Reproduce**

Steps to reproduce the behavior:

1. Start the Flink job.
2. Cancel the Flink job.
3. Repeat steps 1-2 several times.
4. Start the job again; the exception below then occurs.

**Environment Description**

* Hudi version : 0.11.0
* Flink version : 1.13.2
* Hadoop version : 3.0.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Additional context**

![image](https://user-images.githubusercontent.com/29246713/163552259-4e5f0215-e696-4b2a-a11c-4b555a2aa220.png)

**Stacktrace**

```
java.lang.RuntimeException: Duplicate fileID 0007----40bee2bd5a70 from bucket 7 of partition found during the BucketStreamWriteFunction index bootstrap.
  at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.lambda$bootstrapIndexIfNeed$1(BucketStreamWriteFunction.java:179)
  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
  at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
  at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
  at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.bootstrapIndexIfNeed(BucketStreamWriteFunction.java:173)
  at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.processElement(BucketStreamWriteFunction.java:123)
  at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
  at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:205)
  at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
  at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
  at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
  at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
  at java.lang.Thread.run(Thread.java:748)
```
[GitHub] [hudi] XuQianJin-Stars commented on issue #5327: [SUPPORT]Mor table hive synchronization supports more flexible configuration
XuQianJin-Stars commented on issue #5327: URL: https://github.com/apache/hudi/issues/5327#issuecomment-1099967428 > > Here we need to add some configuration of synchronization rules. > > Is there some solution design for synchronization rules now? In addition to the two points mentioned above, are there other optimizations? Because the above two points have been optimized in our practice, I don't know if we can contribute. Sure, you are welcome to contribute.