[GitHub] [hudi] hudi-bot commented on pull request #7241: [HUDI-5241] Optimize HoodieDefaultTimeline API

2022-11-17 Thread GitBox


hudi-bot commented on PR #7241:
URL: https://github.com/apache/hudi/pull/7241#issuecomment-1319652082

   
   ## CI report:
   
   * 3045f14ac99e049be4b40d14906b8aef0f3ed34d UNKNOWN
   * e9344436bb6ece2731b3a97ce13d3764686609ed UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7220: [HUDI-5230] Lazy init secondaryView in PriorityBasedFileSystemView

2022-11-17 Thread GitBox


hudi-bot commented on PR #7220:
URL: https://github.com/apache/hudi/pull/7220#issuecomment-1319651958

   
   ## CI report:
   
   * b5bb91f69dffcbb35b2ae69925e8aa2354832925 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13053)
   * 163886f2adf52086b859d0a6fb7c4cfe34d8aec2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7241: [HUDI-5241] Optimize HoodieDefaultTimeline API

2022-11-17 Thread GitBox


hudi-bot commented on PR #7241:
URL: https://github.com/apache/hudi/pull/7241#issuecomment-1319648240

   
   ## CI report:
   
   * 3045f14ac99e049be4b40d14906b8aef0f3ed34d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7240: [HUDI-5239] support HoodieJavaWriteClient compact

2022-11-17 Thread GitBox


hudi-bot commented on PR #7240:
URL: https://github.com/apache/hudi/pull/7240#issuecomment-1319648171

   
   ## CI report:
   
   * 94ecac2f8d8bd18d080a7d6b03fa498f812705f7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13104)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


hudi-bot commented on PR #7138:
URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319647857

   
   ## CI report:
   
   * 1c66c4283d9daf64548806289c4ccb0467976d21 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13097)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #7212: [HUDI-5179] Optimized release guide document

2022-11-17 Thread GitBox


nsivabalan commented on code in PR #7212:
URL: https://github.com/apache/hudi/pull/7212#discussion_r1026100732


##
release/release_guide.md:
##
@@ -0,0 +1,678 @@
+
+
+# Introduction
+
+This release process document is based on [Apache Beam Release 
Guide](https://beam.apache.org/contribute/release-guide/)
+and [Apache Flink Release 
Guide](https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release).
+
+The Apache Hudi project periodically declares and publishes releases. A 
release is one or more packages of the project
+artifact(s) that are approved for general public distribution and use. They 
may come with various degrees of caveat
+regarding their perceived quality and potential for change, such as “alpha”, 
“beta”, “stable”, etc.
+
+Hudi community treats releases with great importance. They are a public face 
of the project and most users interact with
+the project only through the releases. Releases are signed off by the entire 
Hudi community in a public vote.
+
+Each release is executed by a Release Manager, who is selected among the Hudi 
PMC members. This document describes the
+process that the Release Manager follows to perform a release. Any changes to 
this process should be discussed and
+adopted on the dev@ mailing list.
+
+Please remember that publishing software has legal consequences. This guide 
complements the
+foundation-wide [Product Release 
Policy](http://www.apache.org/dev/release.html)
+and [Release Distribution 
Policy](http://www.apache.org/dev/release-distribution).
+
+# Overview
+
+![](release_guide_overview.jpg)
+
+The release process consists of several steps:
+
+1. Decide to release
+2. Prepare for the release
+3. Build a release candidate
+4. Vote on the release candidate
+5. During vote process, run validation tests
+6. If necessary, fix any issues and go back to step 3.
+7. Finalize the release
+8. Promote the release
+
+# Decide to release
+
+Deciding to release and selecting a Release Manager is the first step of the 
release process. This is a consensus-based
+decision of the entire community.
+
+Anybody can propose a release on the dev@ mailing list, giving a solid 
argument and nominating a committer as the
+Release Manager (including themselves). There’s no formal process, no vote 
requirements, and no timing requirements. Any
+objections should be resolved by consensus before starting the release.
+
+In general, the community prefers to have a rotating set of 3-5 Release 
Managers. Keeping a small core set of managers
+allows enough people to build expertise in this area and improve processes 
over time, without Release Managers needing
+to re-learn the processes for each release. That said, if you are a committer 
interested in serving the community in
+this way, please reach out to the community on the dev@ mailing list.
+
+## Checklist to proceed to the next step
+
+1. Community agrees to release
+2. Community selects a Release Manager
+
+# Prepare for the release
+
+As a release manager, you should create a private Slack channel, named 
`hudi-<version>_release_work` (e.g.
+hudi-0_12_0_release_work) in Apache Hudi Slack for coordination. Invite all 
committers to the channel.
+
+Before your first release, you should perform one-time configuration steps. 
This will set up your security keys for
+signing the release and access to various release repositories.
+
+To prepare for each release, you should audit the project status in the JIRA 
issue tracker, and do the necessary
+bookkeeping. Finally, you should create a release branch from which individual 
release candidates will be built.
+
+**NOTE**: If you are
+using [GitHub two-factor 
authentication](https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/)
+and haven’t configured HTTPS access, please
+follow [the 
guide](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)
 to configure
+command line access.
+
+## One-time Setup Instructions
+
+You need to have a GPG key to sign the release artifacts. Please be aware of 
the
+ASF-wide [release signing 
guidelines](https://www.apache.org/dev/release-signing.html). If you don’t have 
a GPG key
+associated with your Apache account, please follow the section below.
+
+### For Linux users
+
+There are two ways to configure your GPG key for release: either using the release 
automation script (which is recommended), or
+running all commands manually. If using a Mac, please see below to handle known 
issues.
+
+#### Use preparation_before_release.sh to set up GPG
+
+- Script: preparation_before_release.sh
+- Usage: ./hudi/scripts/release/preparation_before_release.sh
+- Tasks included:
+1. Help you create a new GPG key if you want.
+2. Configure git user.signingkey with the chosen pubkey.
+3. Add the chosen pubkey into dev KEYS and release KEYS. **NOTE**: Only PMC 
can write into the release repo.
+4. Start GPG agents.
+
+#### Run all commands manually
+
+- Get more entropy for 

[jira] [Updated] (HUDI-5241) Optimize HoodieDefaultTimeline API

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5241:
-
Labels: pull-request-available  (was: )

> Optimize HoodieDefaultTimeline API
> --
>
> Key: HUDI-5241
> URL: https://issues.apache.org/jira/browse/HUDI-5241
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] YannByron opened a new pull request, #7241: [HUDI-5241] Optimize HoodieDefaultTimeline API

2022-11-17 Thread GitBox


YannByron opened a new pull request, #7241:
URL: https://github.com/apache/hudi/pull/7241

   ### Change Logs
   
   - rename the original `getInstants` to `getInstantsAsStream`.
   - add a new `getInstants` that returns a list.
   - ensure that `this.instants` is only accessed through the `getInstants` interface (see the sketch below).
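
   A hedged sketch of the resulting API (editorial illustration; the actual `HoodieDefaultTimeline` signatures in this PR may differ in detail):

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

import org.apache.hudi.common.table.timeline.HoodieInstant;

// Sketch of the rename described above: the old streaming accessor keeps its
// behavior under a new name, while getInstants() now materializes a list.
abstract class TimelineApiSketch {
  protected List<HoodieInstant> instants;

  // Formerly getInstants(): a lazy stream view over the timeline's instants.
  public Stream<HoodieInstant> getInstantsAsStream() {
    return instants.stream();
  }

  // New getInstants(): a materialized list for callers that iterate repeatedly.
  public List<HoodieInstant> getInstants() {
    return Collections.unmodifiableList(instants);
  }
}
```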
   
   
   ### Impact
   
   LOW
   
   ### Risk level (write none, low medium or high below)
   
   LOW
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5239) support HoodieJavaWriteClient compact

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5239:
-
Labels: pull-request-available  (was: )

> support HoodieJavaWriteClient compact
> -
>
> Key: HUDI-5239
> URL: https://issues.apache.org/jira/browse/HUDI-5239
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zhaoyangming
>Priority: Major
>  Labels: pull-request-available
>
> support HoodieJavaWriteClient compact



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7240: [HUDI-5239] support HoodieJavaWriteClient compact

2022-11-17 Thread GitBox


hudi-bot commented on PR #7240:
URL: https://github.com/apache/hudi/pull/7240#issuecomment-1319643797

   
   ## CI report:
   
   * 94ecac2f8d8bd18d080a7d6b03fa498f812705f7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7239: [HUDI-4442] add in field sanitization and use of aliases

2022-11-17 Thread GitBox


hudi-bot commented on PR #7239:
URL: https://github.com/apache/hudi/pull/7239#issuecomment-1319643757

   
   ## CI report:
   
   * 43b69a7a0fa1a6ca57f651d61bfca3113ffcf47d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13103)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5241) Optimize HoodieDefaultTimeline API

2022-11-17 Thread Yann Byron (Jira)
Yann Byron created HUDI-5241:


 Summary: Optimize HoodieDefaultTimeline API
 Key: HUDI-5241
 URL: https://issues.apache.org/jira/browse/HUDI-5241
 Project: Apache Hudi
  Issue Type: Improvement
  Components: core
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4442) Converting from json to avro does not sanitize field names

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4442:
-
Labels: pull-request-available  (was: )

> Converting from json to avro does not sanitize field names
> --
>
> Key: HUDI-4442
> URL: https://issues.apache.org/jira/browse/HUDI-4442
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> There are cases where a source of json data will have `$` and other illegal 
> characters in the field name. If the user provides a valid schema with those 
> chars sanitized in the field name, the MercifulJsonConverter should be able 
> to translate the json into those sanitized field names. 
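
A hedged sketch of the sanitization described above (assumed naming; the actual MercifulJsonConverter change may differ):

```java
import java.util.regex.Pattern;

// Avro field names must match [A-Za-z_][A-Za-z0-9_]*, so illegal characters
// such as `$` or `.` are replaced before matching against the provided schema.
final class FieldNameSanitizerSketch {
  private static final Pattern ILLEGAL = Pattern.compile("[^A-Za-z0-9_]");

  static String sanitize(String name) {
    String cleaned = ILLEGAL.matcher(name).replaceAll("_");
    // Avro names cannot start with a digit; prefix an underscore if needed.
    return (!cleaned.isEmpty() && Character.isDigit(cleaned.charAt(0)))
        ? "_" + cleaned : cleaned;
  }

  public static void main(String[] args) {
    System.out.println(sanitize("$amount"));   // _amount
    System.out.println(sanitize("user.name")); // user_name
  }
}
```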



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7239: [HUDI-4442] add in field sanitization and use of aliases

2022-11-17 Thread GitBox


hudi-bot commented on PR #7239:
URL: https://github.com/apache/hudi/pull/7239#issuecomment-1319639798

   
   ## CI report:
   
   * 43b69a7a0fa1a6ca57f651d61bfca3113ffcf47d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive

2022-11-17 Thread GitBox


hudi-bot commented on PR #6725:
URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319638954

   
   ## CI report:
   
   * ce84f60bac968a89090d4091845c4dd15ea70ee4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13096)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] Zouxxyy commented on pull request #7140: [HUDI-5163] Fixing failure handling with spark datasource write

2022-11-17 Thread GitBox


Zouxxyy commented on PR #7140:
URL: https://github.com/apache/hudi/pull/7140#issuecomment-1319635657

   @nsivabalan I made a [fix](https://github.com/nsivabalan/hudi/pull/12) based 
on the comments; hope that helps, but sorry, I couldn't find a concrete test case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5240) Clean content when recursive Invocation inflate

2022-11-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j updated HUDI-5240:
---
 Attachment: image-2022-11-18-14-57-06-393.png
Description: !image-2022-11-18-14-57-06-393.png!

> Clean content when recursive Invocation inflate
> ---
>
> Key: HUDI-5240
> URL: https://issues.apache.org/jira/browse/HUDI-5240
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
> Attachments: image-2022-11-18-14-57-06-393.png
>
>
> !image-2022-11-18-14-57-06-393.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-5240) Clean content when recursive Invocation inflate

2022-11-17 Thread loukey_j (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

loukey_j reassigned HUDI-5240:
--

Assignee: loukey_j

> Clean content when recursive Invocation inflate
> ---
>
> Key: HUDI-5240
> URL: https://issues.apache.org/jira/browse/HUDI-5240
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: loukey_j
>Assignee: loukey_j
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5240) Clean content when recursive Invocation inflate

2022-11-17 Thread loukey_j (Jira)
loukey_j created HUDI-5240:
--

 Summary: Clean content when recursive Invocation inflate
 Key: HUDI-5240
 URL: https://issues.apache.org/jira/browse/HUDI-5240
 Project: Apache Hudi
  Issue Type: Bug
Reporter: loukey_j






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] zhangyue19921010 commented on pull request #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls

2022-11-17 Thread GitBox


zhangyue19921010 commented on PR #7238:
URL: https://github.com/apache/hudi/pull/7238#issuecomment-1319611318

   Ack. Will finish my review this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-5239) support HoodieJavaWriteClient compact

2022-11-17 Thread zhaoyangming (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635684#comment-17635684
 ] 

zhaoyangming commented on HUDI-5239:


https://github.com/apache/hudi/pull/7240

> support HoodieJavaWriteClient compact
> -
>
> Key: HUDI-5239
> URL: https://issues.apache.org/jira/browse/HUDI-5239
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zhaoyangming
>Priority: Major
>
> support HoodieJavaWriteClient compact



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


leesf commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1026069202


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
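
An editorial aside (not part of the quoted RFC text): a minimal sketch of the min/max column-stats pruning described above, assuming simplified stats records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A file whose [min, max] range for `age` cannot intersect the predicate
// `age < 20` is skipped without being scanned.
final class ColumnStatsPruningSketch {
  static final class FileStats {
    final String fileName;
    final long min;
    final long max;

    FileStats(String fileName, long min, long max) {
      this.fileName = fileName;
      this.min = min;
      this.max = max;
    }
  }

  // Keep only files whose minimum `age` lies below the predicate's upper bound.
  static List<String> candidateFiles(List<FileStats> stats, long upperBound) {
    List<String> kept = new ArrayList<>();
    for (FileStats f : stats) {
      if (f.min < upperBound) { // min >= 20 means no row can satisfy age < 20
        kept.add(f.fileName);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<FileStats> stats = Arrays.asList(
        new FileStats("file_a.parquet", 30, 40),  // pruned
        new FileStats("file_b.parquet", 10, 25)); // kept
    System.out.println(candidateFiles(stats, 20)); // [file_b.parquet]
  }
}
```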
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data
+
+Next, we explain these use cases in detail.
+
+### Use Case 1: Pruning files based on functions defined on a column
+
+Let's consider a non-partitioned table containing the events with a 
`timestamp` column. The events with naturally
+increasing time are ingested into the table with bulk inserts every hour. In 
this case, assume that each file should
+contain rows for a particular hour:
+
+| File Name            | Min of `timestamp` | Max of `timestamp` | Note               |
+|----------------------|--------------------|--------------------|--------------------|
+| base_file_1.parquet  | 1664582400         | 1664586000         | 2022-10-01 12-1 AM |
+| base_file_2.parquet  | 1664586000         | 1664589600         | 2022-10-01 1-2 AM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_13.parquet | 1664625600         | 1664629200         | 2022-10-01 12-1 PM |
+| base_file_14.parquet | 1664629200         | 1664632800         | 2022-10-01 1-2 PM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_37.parquet | 1664712000         | 1664715600         | 2022-10-02 12-1 PM |
+| base_file_38.parquet | 1664715600         | 1664719200         | 2022-10-02 1-2 PM  |
+
+For a query to get the number of events between 12PM and 2PM each day in a 
month for time-of-day analysis, the
+predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' 
and '2022-10-31'`
+and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is in a 
good layout as above, we only need to scan
+two files (instead of 24 files) for each day of data, e.g., 
`base_file_13.parquet` and `base_file_14.parquet` containing
+the data for 2022-10-01 12-2 PM.
+
+Currently, such a fine-grained file pruning based on a function on a column 
cannot be achieved in Hudi, because
+transforming the `timestamp` to the hour of day is not order-preserving, thus 
the file pruning cannot directly leverage
+the file-level column stats of the original column of `timestamp`. In this 
case, Hudi has to scan all the files for a
+day and push the predicate down when reading parquet files, increasing the 
amount of data to be scanned.
+
+### Use Case 2: Efficient file pruning for files without physical partitioning
+
+Let's consider the same non-partitioned table as in the Use Case 1, containing 
the events with a 

[GitHub] [hudi] shengchiqu commented on issue #7229: [SUPPORT] flink connector sink Update the partition value, the old data is still there

2022-11-17 Thread GitBox


shengchiqu commented on issue #7229:
URL: https://github.com/apache/hudi/issues/7229#issuecomment-1319602176

   I tried setting changelog.enabled=false and the problem was solved.
   Is this because the changelog mode does not support global indexes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] ymZhao1001 opened a new pull request, #7240: support HoodieJavaWriteClient compact

2022-11-17 Thread GitBox


ymZhao1001 opened a new pull request, #7240:
URL: https://github.com/apache/hudi/pull/7240

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5239) support HoodieJavaWriteClient compact

2022-11-17 Thread zhaoyangming (Jira)
zhaoyangming created HUDI-5239:
--

 Summary: support HoodieJavaWriteClient compact
 Key: HUDI-5239
 URL: https://issues.apache.org/jira/browse/HUDI-5239
 Project: Apache Hudi
  Issue Type: Bug
Reporter: zhaoyangming


support HoodieJavaWriteClient compact
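
A hypothetical usage sketch of the requested capability (method names mirror the existing BaseHoodieWriteClient compaction API; the exact HoodieJavaWriteClient surface added by the PR may differ):

```java
import org.apache.hudi.client.HoodieJavaWriteClient;
import org.apache.hudi.common.util.Option;

final class JavaClientCompactionSketch {
  static void compactIfEligible(HoodieJavaWriteClient<?> client) {
    // Schedule a compaction instant on the MOR table's timeline, if eligible.
    Option<String> instantTime = client.scheduleCompaction(Option.empty());
    // Run the scheduled compaction, merging log files into new base files.
    instantTime.ifPresent(client::compact);
  }
}
```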



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5239) support HoodieJavaWriteClient compact

2022-11-17 Thread zhaoyangming (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyangming updated HUDI-5239:
---
Issue Type: Improvement  (was: Bug)

> support HoodieJavaWriteClient compact
> -
>
> Key: HUDI-5239
> URL: https://issues.apache.org/jira/browse/HUDI-5239
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zhaoyangming
>Priority: Major
>
> support HoodieJavaWriteClient compact



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] the-other-tim-brown opened a new pull request, #7239: add in field sanitization and use of aliases

2022-11-17 Thread GitBox


the-other-tim-brown opened a new pull request, #7239:
URL: https://github.com/apache/hudi/pull/7239

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the 
risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the 
default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. 
Please create a Jira ticket, attach the
 ticket number here and follow the 
[instruction](https://hudi.apache.org/contribute/developer-setup#website) to 
make
 changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


leesf commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1026062508


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data
+
+Next, we explain these use cases in detail.
+
+### Use Case 1: Pruning files based on functions defined on a column
+
+Let's consider a non-partitioned table containing the events with a 
`timestamp` column. The events with naturally
+increasing time are ingested into the table with bulk inserts every hour. In 
this case, assume that each file should
+contain rows for a particular hour:
+
+| File Name            | Min of `timestamp` | Max of `timestamp` | Note               |
+|----------------------|--------------------|--------------------|--------------------|
+| base_file_1.parquet  | 1664582400         | 1664586000         | 2022-10-01 12-1 AM |
+| base_file_2.parquet  | 1664586000         | 1664589600         | 2022-10-01 1-2 AM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_13.parquet | 1664625600         | 1664629200         | 2022-10-01 12-1 PM |
+| base_file_14.parquet | 1664629200         | 1664632800         | 2022-10-01 1-2 PM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_37.parquet | 1664712000         | 1664715600         | 2022-10-02 12-1 PM |
+| base_file_38.parquet | 1664715600         | 1664719200         | 2022-10-02 1-2 PM  |
+
+For a query to get the number of events between 12PM and 2PM each day in a 
month for time-of-day analysis, the
+predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' 
and '2022-10-31'`
+and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is in a 
good layout as above, we only need to scan
+two files (instead of 24 files) for each day of data, e.g., 
`base_file_13.parquet` and `base_file_14.parquet` containing
+the data for 2022-10-01 12-2 PM.
+
+Currently, such a fine-grained file pruning based on a function on a column 
cannot be achieved in Hudi, because
+transforming the `timestamp` to the hour of day is not order-preserving, thus 
the file pruning cannot directly leverage

Review Comment:
   so here we will use Spark-defined transformers first?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7021: [Minor] fix multi deser avro payload

2022-11-17 Thread GitBox


hudi-bot commented on PR #7021:
URL: https://github.com/apache/hudi/pull/7021#issuecomment-1319595538

   
   ## CI report:
   
   * 06cbb491c812065b5078d4fcc02415af561928e2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13034) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13043) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13050) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13056)
   * f634430fecf9464d734dc6b5abfec8461ec59866 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13102)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


leesf commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1026061223


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data

Review Comment:
   partition evolution here means changing the partition column, or something else?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] leesf commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


leesf commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1026060609


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column

Review Comment:
   which functions are we going to support? years/months/days/hours as defined in 
Spark?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7174: [HUDI-5190] Consuming records from Iterator directly instead of using inner message queue

2022-11-17 Thread GitBox


alexeykudinkin commented on code in PR #7174:
URL: https://github.com/apache/hudi/pull/7174#discussion_r1026059829


##
hudi-common/src/main/java/org/apache/hudi/common/util/queue/SimpleHoodieExecutor.java:
##
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util.queue;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.concurrent.CompletableFuture;
+import java.util.function.Function;
+
+/**
+ * Single Writer and Single Reader mode. This SimpleHoodieExecutor has no 
inner message queue and no inner lock,
+ * consuming and writing records from the iterator directly.
+ *
+ * Compared with the queue-based executors:
+ * Advantages: no additional memory or CPU overhead from locking 
or multithreading.
+ * Disadvantages: loses some benefits such as rate limiting, and may have 
lower throughput.
+ */
+public class SimpleHoodieExecutor<I, O, E> extends HoodieExecutorBase<I, O, E> {

Review Comment:
   Let's actually simplify this even further and just inherit from 
`HoodieExecutor`
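
   A minimal editorial sketch (simplified, not the actual Hudi class) of the direct-iterator pattern the quoted javadoc describes:

```java
import java.util.Iterator;
import java.util.function.Consumer;
import java.util.function.Function;

// One thread pulls records straight off the source iterator and hands each
// transformed record to the consumer: no bounded queue, no producer/consumer
// threads, no locking.
final class DirectIteratorExecutorSketch {
  static <I, O> void execute(Iterator<I> source, Function<I, O> transform, Consumer<O> consumer) {
    while (source.hasNext()) {
      // Transform and write in the caller's thread; nothing is buffered.
      consumer.accept(transform.apply(source.next()));
    }
  }
}
```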



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.

2022-11-17 Thread GitBox


hudi-bot commented on PR #7236:
URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319592122

   
   ## CI report:
   
   * 80ffbed9a906d526cdf712942cb2cd52309e1f17 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13098)
   * 29bb43cc6562348d81a80210d65e49a81e03a2e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13101)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7230: core flow tests working, but issues still to tackle and documentation…

2022-11-17 Thread GitBox


hudi-bot commented on PR #7230:
URL: https://github.com/apache/hudi/pull/7230#issuecomment-1319592054

   
   ## CI report:
   
   * b52fb7392f7257ab5ef1d6dd35f6dbdfffc0a4f1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13093)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7021: [Minor] fix multi deser avro payload

2022-11-17 Thread GitBox


hudi-bot commented on PR #7021:
URL: https://github.com/apache/hudi/pull/7021#issuecomment-1319591609

   
   ## CI report:
   
   * 06cbb491c812065b5078d4fcc02415af561928e2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13034) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13043) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13050) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13056)
   * f634430fecf9464d734dc6b5abfec8461ec59866 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6782: [HUDI-4911][HUDI-3301] Fixing `HoodieMetadataLogRecordReader` to avoid flushing cache for every lookup

2022-11-17 Thread GitBox


alexeykudinkin commented on code in PR #6782:
URL: https://github.com/apache/hudi/pull/6782#discussion_r1026051823


##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -18,37 +18,32 @@
 
 package org.apache.hudi.common.table.log;
 
-import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.avro.Schema;

Review Comment:
   It's done automatically by the IDEA whenever it cleans up dead imports.
   Let me see if i can adjust it to respect checkstyle instead.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -188,40 +179,41 @@ protected AbstractHoodieLogRecordReader(FileSystem fs, 
String basePath, List close() {
 throw new HoodieUpsertException("Failed to close UpdateHandle", e);
   }
 }
+
 newRecordKeysSorted.clear();

Review Comment:
   It's final unfortunately, and there's not a lot of value in setting it null 
(since handles goes out of scope anyway)



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -734,15 +696,22 @@ private void 
processQueuedBlocksForInstant(Deque<HoodieLogBlock> logBlocks, int
 progress = (numLogFilesSeen - 1) / logFilePaths.size();

Review Comment:
   We wouldn't get to this method if it would be empty



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -106,30 +109,85 @@ protected HoodieMergedLogRecordScanner(FileSystem fs, 
String basePath, List keys) {
+if (forceFullScan) {
+  return; // no-op
+}
+
+List<String> missingKeys = keys.stream()
+.filter(key -> !records.containsKey(key))
+.collect(Collectors.toList());
+
+if (missingKeys.isEmpty()) {
+  // All the required records are already fetched, no-op
+  return;
+}
+
+scanInternal(Option.of(KeySpec.fullKeySpec(missingKeys)), false);
+  }
+
+  /**
+   * Provides incremental scanning capability where only keys matching 
provided key-prefixes
+   * will be looked up in the delta-log files, scanned and subsequently 
materialized into
+   * the internal cache
+   *
+   * @param keyPrefixes to be looked up
+   */
+  public void scanByKeyPrefixes(List<String> keyPrefixes) {
+// TODO add caching for queried prefixes

Review Comment:
   I think i'll actually address it in this PR
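
   An editorial sketch of the prefix caching the TODO mentions (assumed design, not the actual PR change): remember which prefixes were already scanned so repeated lookups can skip the delta-log scan.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class PrefixCacheSketch {
  private final Set<String> scannedPrefixes = new HashSet<>();

  // Only prefixes not seen before require another scanInternal() pass.
  List<String> prefixesNeedingScan(List<String> keyPrefixes) {
    return keyPrefixes.stream()
        .filter(p -> !scannedPrefixes.contains(p))
        .collect(Collectors.toList());
  }

  void markScanned(List<String> keyPrefixes) {
    scannedPrefixes.addAll(keyPrefixes);
  }
}
```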



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -330,6 +390,16 @@ public Builder withUseScanV2(boolean useScanV2) {
   return this;
 }
 
+public Builder withKeyFiledOverride(String keyFieldOverride) {

Review Comment:
   How did you make such a suggested change? It's pretty cool.



##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##
@@ -240,38 +239,35 @@ public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecord
 return result;
   }
 
-  private Map<String, Option<HoodieRecord<HoodieMetadataPayload>>> 
readLogRecords(HoodieMetadataMergedLogRecordReader logRecordScanner,
+  private Map<String, Option<HoodieRecord<HoodieMetadataPayload>>> 
readLogRecords(HoodieMetadataLogRecordReader logRecordReader,

   List keys,

   boolean fullKey,

   List timings) {
 HoodieTimer timer = HoodieTimer.start();
 
-if (logRecordScanner == null) {
+if (logRecordReader == null) {
   timings.add(timer.endTimer());
   return Collections.emptyMap();
 }
 
-String partitionName = logRecordScanner.getPartitionName().get();
+Map<String, Option<HoodieRecord<HoodieMetadataPayload>>> logRecords = new 
HashMap<>(keys.size());
 
-Map<String, Option<HoodieRecord<HoodieMetadataPayload>>> logRecords = new 
HashMap<>();
-if (isFullScanAllowedForPartition(partitionName)) {
-  checkArgument(fullKey, "If full-scan is required, only full keys could 
be used!");
-  // Path which does full scan of log files
-  for (String key : keys) {
-logRecords.put(key, 
logRecordScanner.getRecordByKey(key).get(0).getValue());
-  }
-} else {
-  // This path will do seeks pertaining to the keys passed in
-  List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> 
logRecordsList =
-  fullKey ? logRecordScanner.getRecordsByKeys(keys)
-  : logRecordScanner.getRecordsByKeyPrefixes(keys)
-  .stream()
-  .map(record -> Pair.of(record.getRecordKey(), 
Option.of(record)))
-  .collect(Collectors.toList());
-
-  for (Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>> entry : 
logRecordsList) {
-logRecords.put(entry.getKey(), entry.getValue());
-  }
+// First, fetch the keys being looked up
+List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> 
logRecordsList =

Review Comment:
   So this PR makes sure that we're not flushing the records cache w/in the 
Scanner whenever we do `getRecord*` (previously batch APIs, were always 
flushing it). As such, there's now essentially no difference b/w these 2 
branches.




[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.

2022-11-17 Thread GitBox


hudi-bot commented on PR #7236:
URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319587735

   
   ## CI report:
   
   * 80ffbed9a906d526cdf712942cb2cd52309e1f17 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13098)
 
   * 29bb43cc6562348d81a80210d65e49a81e03a2e0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider

2022-11-17 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown updated HUDI-4904:

Status: In Progress  (was: Open)

> Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
> ---
>
> Key: HUDI-4904
> URL: https://issues.apache.org/jira/browse/HUDI-4904
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> In proto we can have a schema that is recursive. We should limit the 
> "unraveling" of a schema to N levels and let the user specify that amount of 
> levels as a config. After hitting depth N in the recursion, we will create a 
> Record with a byte array and string. The remaining data for that branch of 
> the recursion will be written out as a proto byte array and we record the 
> descriptor string for context of what is in the byte array.
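
For illustration only, a rough sketch of such depth-limited "unraveling" 
(hypothetical, simplified code, not the actual ProtoClassBasedSchemaProvider; 
scalar fields are reduced to strings and record naming assumes each message 
type appears once per depth, just to keep the sketch short):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import com.google.protobuf.Descriptors;

    final class RecursionLimitSketch {
      // Convert a proto descriptor to Avro, cutting recursion off at maxDepth.
      static Schema toAvro(Descriptors.Descriptor descriptor, int depth, int maxDepth) {
        if (depth > maxDepth) {
          // The remaining branch is kept as serialized proto bytes plus the
          // descriptor string, so readers know what the bytes contain.
          return SchemaBuilder.record(descriptor.getName() + "_truncated_d" + depth).fields()
              .requiredBytes("proto_bytes")
              .requiredString("descriptor_string")
              .endRecord();
        }
        SchemaBuilder.FieldAssembler<Schema> fields =
            SchemaBuilder.record(descriptor.getName() + "_d" + depth).fields();
        for (Descriptors.FieldDescriptor field : descriptor.getFields()) {
          if (field.getJavaType() == Descriptors.FieldDescriptor.JavaType.MESSAGE) {
            // Recurse one level deeper for nested (possibly recursive) messages.
            fields = fields.name(field.getName())
                .type(toAvro(field.getMessageType(), depth + 1, maxDepth)).noDefault();
          } else {
            // Simplification for this sketch: all scalar fields become strings.
            fields = fields.requiredString(field.getName());
          }
        }
        return fields.endRecord();
      }
    }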



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider

2022-11-17 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown resolved HUDI-4904.
-

> Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
> ---
>
> Key: HUDI-4904
> URL: https://issues.apache.org/jira/browse/HUDI-4904
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> In proto we can have a schema that is recursive. We should limit the 
> "unraveling" of a schema to N levels and let the user specify that amount of 
> levels as a config. After hitting depth N in the recursion, we will create a 
> Record with a byte array and string. The remaining data for that branch of 
> the recursion will be written out as a proto byte array and we record the 
> descriptor string for context of what is in the byte array.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-4905) Protobuf type handling improvements

2022-11-17 Thread Timothy Brown (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Brown resolved HUDI-4905.
-

> Protobuf type handling improvements
> ---
>
> Key: HUDI-4905
> URL: https://issues.apache.org/jira/browse/HUDI-4905
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Timothy Brown
>Assignee: Timothy Brown
>Priority: Major
>  Labels: pull-request-available
>
> Three improvements have come out of discussions with others trying to use 
> protobuf and Hudi.
>  
>  # We can support uint64 as a decimal without losing precision and 
> representing the value in the lake as a positive value
>  # Proto Timestamps can be converted to long with LogicalType timestamp-micros
>  # Treat elements within a `oneof` as nullable
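
A minimal sketch of the first two conversions (my own illustration, not the 
merged code):

    import java.math.BigDecimal;
    import java.math.BigInteger;

    final class ProtoTypeSketch {
      // A proto uint64 arrives in Java as a raw signed long; reinterpreting
      // the 64 bits as unsigned keeps values above Long.MAX_VALUE positive.
      static BigDecimal uint64ToDecimal(long rawUint64) {
        return new BigDecimal(new BigInteger(Long.toUnsignedString(rawUint64)));
      }

      // google.protobuf.Timestamp -> epoch microseconds (timestamp-micros).
      static long timestampToMicros(com.google.protobuf.Timestamp ts) {
        return ts.getSeconds() * 1_000_000L + ts.getNanos() / 1_000L;
      }
    }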



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Sprint: 2022/11/15

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root-cause:
> Basically, the reason it’s failing is the following: # GCS uses 
> PipeInputStream/PipeOutputStream comprising reading/writing ends of the 
> “pipe” it’s using for unidirectional comm b/w Threads
>  # PipeInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) 
> for reading and _writing_ (it’s only used in HoodieMergeHandle, and in 
> bulk-insert)
>  # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* 
> BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s 
> failing
>  
> Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken

2022-11-17 Thread GitBox


alexeykudinkin commented on issue #7234:
URL: https://github.com/apache/hudi/issues/7234#issuecomment-1319578145

   Created HUDI-5238


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin closed issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken

2022-11-17 Thread GitBox


alexeykudinkin closed issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or 
above fails w/ Pipe broken
URL: https://github.com/apache/hudi/issues/7234


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-5238:
-

Assignee: Alexey Kudinkin

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root-cause:
> Basically, the reason it’s failing is the following: # GCS uses 
> PipeInputStream/PipeOutputStream comprising reading/writing ends of the 
> “pipe” it’s using for unidirectional comm b/w Threads
>  # PipeInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) 
> for reading and _writing_ (it’s only used in HoodieMergeHandle, and in 
> bulk-insert)
>  # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* 
> BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s 
> failing
>  
> Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Story Points: 4

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root-cause:
> Basically, the reason it’s failing is the following: # GCS uses 
> PipeInputStream/PipeOutputStream comprising reading/writing ends of the 
> “pipe” it’s using for unidirectional comm b/w Threads
>  # PipeInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) 
> for reading and _writing_ (it’s only used in HoodieMergeHandle, and in 
> bulk-insert)
>  # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* 
> BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s 
> failing
>  
> Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Description: 
Originally reported at [https://github.com/apache/hudi/issues/7234]

---

 

Root-cause:
Basically, the reason it’s failing is the following:
 # GCS uses PipeInputStream/PipeOutputStream comprising the reading/writing 
ends of the “pipe” it’s using for unidirectional comm b/w threads
 # PipeInputStream (for whatever reason) remembers the thread that actually 
wrote into the pipe
 # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) 
for reading and _writing_ (it’s only used in HoodieMergeHandle, and in 
bulk-insert)
 # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* 
the BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s 
failing

 

Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]
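
A minimal, self-contained illustration of this failure mode with plain 
java.io pipes (not Hudi code; the GCS connector builds on the same 
primitives):

    import java.io.IOException;
    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    public final class PipeBrokenDemo {
      public static void main(String[] args) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);

        Thread writer = new Thread(() -> {
          try {
            out.write(42); // PipedInputStream remembers this thread as its writer
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        });
        writer.start();
        writer.join(); // the writer thread dies, like the BIMQ threads on shutdown

        in.read(); // ok: the byte is already buffered
        in.read(); // throws java.io.IOException: Pipe broken
      }
    }

The demo mirrors step 4 above: once the thread that wrote into the pipe is 
gone, any further read on the pipe fails.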

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Originally reported at [https://github.com/apache/hudi/issues/7234]
> ---
>  
> Root-cause:
> Basically, the reason it’s failing is the following: # GCS uses 
> PipeInputStream/PipeOutputStream comprising reading/writing ends of the 
> “pipe” it’s using for unidirectional comm b/w Threads
>  # PipeInputStream (for whatever reason) remembers the thread that actually 
> wrote into the pipe
>  # In BoundedInMemoryQueue we’re bootstrapping new executors (read, threads) 
> for reading and _writing_ (it’s only used in HoodieMergeHandle, and in 
> bulk-insert)
>  # When we’re done writing in HoodieMergeHelper, we’re shutting down *first* 
> BIMQ, then the HoodieMergeHandle, and that’s exactly the reason why it’s 
> failing
>  
> Issue has been introduced at [https://github.com/apache/hudi/pull/4264/files]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-5238) Hudi throws PipeBroken

2022-11-17 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-5238:
-

 Summary: Hudi throws PipeBroken 
 Key: HUDI-5238
 URL: https://issues.apache.org/jira/browse/HUDI-5238
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Fix Version/s: 0.12.1

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Priority: Blocker  (was: Major)

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Summary: Hudi throwing "PipeBroken" exception during Merging on GCS  (was: 
Hudi throws PipeBroken )

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5238) Hudi throwing "PipeBroken" exception during Merging on GCS

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5238:
--
Affects Version/s: 0.12.1

> Hudi throwing "PipeBroken" exception during Merging on GCS
> --
>
> Key: HUDI-5238
> URL: https://issues.apache.org/jira/browse/HUDI-5238
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.1
>Reporter: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7021: [Minor] fix multi deser avro payload

2022-11-17 Thread GitBox


alexeykudinkin commented on code in PR #7021:
URL: https://github.com/apache/hudi/pull/7021#discussion_r1025888585


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -364,7 +364,8 @@ private void processAppendResult(AppendResult result, 
List recordL
   updateWriteStatus(stat, result);
 }
 
-if (config.isMetadataColumnStatsIndexEnabled()) {
+// TODO MetadataColumnStatsIndex for spark record
+if (config.isMetadataColumnStatsIndexEnabled() && 
recordMerger.getRecordType() == HoodieRecordType.AVRO) {

Review Comment:
   Let's create a ticket for this. We need to fix this before 0.13



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -215,18 +216,16 @@ private Option 
prepareRecord(HoodieRecord hoodieRecord) {
   // If the format can not record the operation field, nullify the DELETE 
payload manually.
   boolean nullifyPayload = 
HoodieOperation.isDelete(hoodieRecord.getOperation()) && 
!config.allowOperationMetadataField();
   
recordProperties.put(HoodiePayloadProps.PAYLOAD_IS_UPDATE_RECORD_FOR_MOR, 
String.valueOf(isUpdateRecord));
-  Option finalRecord = Option.empty();
-  if (!nullifyPayload && !hoodieRecord.isDelete(tableSchema, 
recordProperties)) {
-if (hoodieRecord.shouldIgnore(tableSchema, recordProperties)) {
-  return Option.of(hoodieRecord);
+  Option finalRecord = nullifyPayload ? Option.empty() : 
Option.of(hoodieRecord.deserialization(tableSchema, recordProperties));
+  // Check for delete
+  if (finalRecord.isPresent() && !finalRecord.get().isDelete(tableSchema, 
recordProperties)) {
+// Check for ignore ExpressionPayload
+if (finalRecord.get().shouldIgnore(tableSchema, recordProperties)) {
+  return finalRecord;

Review Comment:
   This is actually incorrect -- this will delete the record



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #7021: [Minor] fix multi deser avro payload

2022-11-17 Thread GitBox


alexeykudinkin commented on PR #7021:
URL: https://github.com/apache/hudi/pull/7021#issuecomment-1319572883

   @wzx140 I pushed some changes to handle deleted/ignored records w/o the need 
to deserialize the payload.
   
   With these changes we don't actually need a separate materialization step of 
`deserialization`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #6782: [HUDI-4911][HUDI-3301] Fixing `HoodieMetadataLogRecordReader` to avoid flushing cache for every lookup

2022-11-17 Thread GitBox


codope commented on code in PR #6782:
URL: https://github.com/apache/hudi/pull/6782#discussion_r1025952971


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java:
##
@@ -127,8 +127,8 @@ public List close() {
 throw new HoodieUpsertException("Failed to close UpdateHandle", e);
   }
 }
+
 newRecordKeysSorted.clear();

Review Comment:
   Should we set it to null as the collection is already empty at this point?



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -124,7 +123,7 @@ public abstract class AbstractHoodieLogRecordReader {
   // Total log files read - for metrics
   private AtomicLong totalLogFiles = new AtomicLong(0);
   // Internal schema, used to support full schema evolution.
-  private InternalSchema internalSchema;
+  private final InternalSchema internalSchema;
   // Hoodie table path.
   private final String path;

Review Comment:
   Looks like `path` is not used anywhere.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -18,37 +18,32 @@
 
 package org.apache.hudi.common.table.log;
 
-import org.apache.hudi.common.model.DeleteRecord;
+import org.apache.avro.Schema;

Review Comment:
   Let's try to remove unrelated changes like reordering of imports. Our 
checkstyle puts hudi imports above others.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -188,40 +179,41 @@ protected AbstractHoodieLogRecordReader(FileSystem fs, 
String basePath, List keys) {
+if (forceFullScan) {
+  return; // no-op
+}
+
+List missingKeys = keys.stream()
+.filter(key -> !records.containsKey(key))
+.collect(Collectors.toList());
+
+if (missingKeys.isEmpty()) {
+  // All the required records are already fetched, no-op
+  return;
+}
+
+scanInternal(Option.of(KeySpec.fullKeySpec(missingKeys)), false);
+  }
+
+  /**
+   * Provides incremental scanning capability where only keys matching 
provided key-prefixes
+   * will be looked up in the delta-log files, scanned and subsequently 
materialized into
+   * the internal cache
+   *
+   * @param keyPrefixes to be looked up
+   */
+  public void scanByKeyPrefixes(List keyPrefixes) {
+// TODO add caching for queried prefixes

Review Comment:
   Let's add a JIRA to track this.



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -734,15 +696,22 @@ private void 
processQueuedBlocksForInstant(Deque logBlocks, int
 progress = (numLogFilesSeen - 1) / logFilePaths.size();

Review Comment:
   Should we guard against the empty `logFilePaths` list?
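   
   One possible guard, as an illustrative sketch only:
   
       // Avoid a divide-by-zero when no log files were supplied.
       progress = logFilePaths.isEmpty()
           ? 1.0f
           : (float) (numLogFilesSeen - 1) / logFilePaths.size();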



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java:
##
@@ -188,40 +179,41 @@ protected AbstractHoodieLogRecordReader(FileSystem fs, 
String basePath, List>>> getRecord
 return result;
   }
 
-  private Map>> 
readLogRecords(HoodieMetadataMergedLogRecordReader logRecordScanner,
+  private Map>> 
readLogRecords(HoodieMetadataLogRecordReader logRecordReader,

   List keys,

   boolean fullKey,

   List timings) {
 HoodieTimer timer = HoodieTimer.start();
 
-if (logRecordScanner == null) {
+if (logRecordReader == null) {
   timings.add(timer.endTimer());
   return Collections.emptyMap();
 }
 
-String partitionName = logRecordScanner.getPartitionName().get();
+Map>> logRecords = new 
HashMap<>(keys.size());
 
-Map>> logRecords = new 
HashMap<>();
-if (isFullScanAllowedForPartition(partitionName)) {
-  checkArgument(fullKey, "If full-scan is required, only full keys could 
be used!");
-  // Path which does full scan of log files
-  for (String key : keys) {
-logRecords.put(key, 
logRecordScanner.getRecordByKey(key).get(0).getValue());
-  }
-} else {
-  // This path will do seeks pertaining to the keys passed in
-  List>>> 
logRecordsList =
-  fullKey ? logRecordScanner.getRecordsByKeys(keys)
-  : logRecordScanner.getRecordsByKeyPrefixes(keys)
-  .stream()
-  .map(record -> Pair.of(record.getRecordKey(), 
Option.of(record)))
-  .collect(Collectors.toList());
-
-  for (Pair>> entry : 
logRecordsList) {
-logRecords.put(entry.getKey(), entry.getValue());
-  }
+// First, fetch the keys being looked up
+List>>> 
logRecordsList =

Review Comment:
   Shouldn't we still check for `if 
(isFullScanAllowedForPartition(partitionName))`? What if full scan is not 
enabled 

[GitHub] [hudi] hudi-bot commented on pull request #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls

2022-11-17 Thread GitBox


hudi-bot commented on PR #7238:
URL: https://github.com/apache/hudi/pull/7238#issuecomment-1319547843

   
   ## CI report:
   
   * 6241a829bc4fbcbff3b3cbcf2a8efddcdb667344 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13100)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] cravib4u commented on issue #7234: [SUPPORT] Upsert to Hudi table w/ 0.11.1 or above fails w/ Pipe broken

2022-11-17 Thread GitBox


cravib4u commented on issue #7234:
URL: https://github.com/apache/hudi/issues/7234#issuecomment-1319546216

   We tried Hudi 0.12.1 with Spark 2.4.8 as well and we see the same issue 
while writing data.
   We tried both the SQL Merge and the DataSource write upsert operations, 
with the same issue.
   But the DataSource insert_overwrite operation works fine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls

2022-11-17 Thread GitBox


hudi-bot commented on PR #7238:
URL: https://github.com/apache/hudi/pull/7238#issuecomment-1319545281

   
   ## CI report:
   
   * 6241a829bc4fbcbff3b3cbcf2a8efddcdb667344 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.

2022-11-17 Thread GitBox


hudi-bot commented on PR #7236:
URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319545214

   
   ## CI report:
   
   * 80ffbed9a906d526cdf712942cb2cd52309e1f17 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13098)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7237: [HUDI-5237] Support for HoodieUnMergedLogRecordScanner with InternalSchema

2022-11-17 Thread GitBox


hudi-bot commented on PR #7237:
URL: https://github.com/apache/hudi/pull/7237#issuecomment-1319545247

   
   ## CI report:
   
   * 5c08745b59494bfafa9a8591576290aaed317059 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13099)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


hudi-bot commented on PR #7138:
URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319545082

   
   ## CI report:
   
   * c3171ba5115240fd5e00a0d47cf4f9b0c182b8e6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13074)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13094)
 
   * 1c66c4283d9daf64548806289c4ccb0467976d21 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13097)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive

2022-11-17 Thread GitBox


hudi-bot commented on PR #6725:
URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319544395

   
   ## CI report:
   
   * f3b7a61c226d136cd232a062a0f28e085f060035 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13095)
 
   * ce84f60bac968a89090d4091845c4dd15ea70ee4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13096)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


xiarixiaoyao commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1025986523


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data
+
+Next, we explain these use cases in detail.
+
+### Use Case 1: Pruning files based on functions defined on a column
+
+Let's consider a non-partitioned table containing the events with a 
`timestamp` column. The events with naturally
+increasing time are ingested into the table with bulk inserts every hour. In 
this case, assume that each file should
+contain rows for a particular hour:
+
+| File Name            | Min of `timestamp` | Max of `timestamp` | Note               |
+|----------------------|--------------------|--------------------|--------------------|
+| base_file_1.parquet  | 1664582400         | 1664586000         | 2022-10-01 12-1 AM |
+| base_file_2.parquet  | 1664586000         | 1664589600         | 2022-10-01 1-2 AM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_13.parquet | 1664625600         | 1664629200         | 2022-10-01 12-1 PM |
+| base_file_14.parquet | 1664629200         | 1664632800         | 2022-10-01 1-2 PM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_37.parquet | 1664712000         | 1664715600         | 2022-10-02 12-1 PM |
+| base_file_38.parquet | 1664715600         | 1664719200         | 2022-10-02 1-2 PM  |
+
+For a query to get the number of events between 12PM and 2PM each day in a 
month for time-of-day analysis, the
+predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' 
and '2022-10-31'`
+and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is in a 
good layout as above, we only need to scan
+two files (instead of 24 files) for each day of data, e.g., 
`base_file_13.parquet` and `base_file_14.parquet` containing
+the data for 2022-10-01 12-2 PM.
+
+Currently, such a fine-grained file pruning based on a function on a column 
cannot be achieved in Hudi, because
+transforming the `timestamp` to the hour of day is not order-preserving, thus 
the file pruning cannot directly leverage
+the file-level column stats of the original column of `timestamp`. In this 
case, Hudi has to scan all the files for a
+day and push the predicate down when reading parquet files, increasing the 
amount of data to be scanned.
+
+### Use Case 2: Efficient file pruning for files without physical partitioning
+
+Let's consider the same non-partitioned table as in the Use Case 1, containing 
the events with a 

[GitHub] [hudi] hudi-bot commented on pull request #7236: [MINOR] Fix the npe caused by alter table add column.

2022-11-17 Thread GitBox


hudi-bot commented on PR #7236:
URL: https://github.com/apache/hudi/pull/7236#issuecomment-1319538703

   
   ## CI report:
   
   * 80ffbed9a906d526cdf712942cb2cd52309e1f17 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7237: [HUDI-5237] Support for HoodieUnMergedLogRecordScanner with InternalSchema

2022-11-17 Thread GitBox


hudi-bot commented on PR #7237:
URL: https://github.com/apache/hudi/pull/7237#issuecomment-1319538717

   
   ## CI report:
   
   * 5c08745b59494bfafa9a8591576290aaed317059 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


hudi-bot commented on PR #7138:
URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319538553

   
   ## CI report:
   
   * c3171ba5115240fd5e00a0d47cf4f9b0c182b8e6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13074)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13094)
 
   * 1c66c4283d9daf64548806289c4ccb0467976d21 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive

2022-11-17 Thread GitBox


hudi-bot commented on PR #6725:
URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319538301

   
   ## CI report:
   
   * 82ab5bfa3d2246beae8835178671840e4e1b77b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13073)
 
   * f3b7a61c226d136cd232a062a0f28e085f060035 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13095)
 
   * ce84f60bac968a89090d4091845c4dd15ea70ee4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin opened a new pull request, #7238: [HUDI-3963] Cleaning up `QueueBasedExecutor` impls

2022-11-17 Thread GitBox


alexeykudinkin opened a new pull request, #7238:
URL: https://github.com/apache/hudi/pull/7238

   ### Change Logs
   
   This is a follow-up PR after https://github.com/apache/hudi/pull/5416, 
further cleaning up some of the historically inherited artifacts.
   
   ### Impact
   
   No impact
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


xiarixiaoyao commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1025981630


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data
+
+Next, we explain these use cases in detail.
+
+### Use Case 1: Pruning files based on functions defined on a column
+
+Let's consider a non-partitioned table containing the events with a 
`timestamp` column. The events with naturally
+increasing time are ingested into the table with bulk inserts every hour. In 
this case, assume that each file should
+contain rows for a particular hour:
+
+| File Name            | Min of `timestamp` | Max of `timestamp` | Note               |
+|----------------------|--------------------|--------------------|--------------------|
+| base_file_1.parquet  | 1664582400         | 1664586000         | 2022-10-01 12-1 AM |
+| base_file_2.parquet  | 1664586000         | 1664589600         | 2022-10-01 1-2 AM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_13.parquet | 1664625600         | 1664629200         | 2022-10-01 12-1 PM |
+| base_file_14.parquet | 1664629200         | 1664632800         | 2022-10-01 1-2 PM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_37.parquet | 1664712000         | 1664715600         | 2022-10-02 12-1 PM |
+| base_file_38.parquet | 1664715600         | 1664719200         | 2022-10-02 1-2 PM  |
+
+For a query to get the number of events between 12PM and 2PM each day in a 
month for time-of-day analysis, the
+predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' 
and '2022-10-31'`
+and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is in a 
good layout as above, we only need to scan
+two files (instead of 24 files) for each day of data, e.g., 
`base_file_13.parquet` and `base_file_14.parquet` containing
+the data for 2022-10-01 12-2 PM.
+
+Currently, such a fine-grained file pruning based on a function on a column 
cannot be achieved in Hudi, because

Review Comment:
   Looking forward to fine-grained pruning 
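   
   As an illustrative sketch of the data skipping this enables (not RFC code): 
with column stats kept on the transformed value, e.g. hour(`timestamp`), a 
file can be pruned for a predicate like `hour between 12 and 13` by a simple 
range-overlap test:
   
       // Skip the file when [minHour, maxHour] cannot overlap [predLow, predHigh].
       static boolean canSkipFile(int minHour, int maxHour, int predLow, int predHigh) {
         return maxHour < predLow || minHour > predHigh;
       }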



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #7235: [HUDI-5148][RFC-63] RFC for Index Function

2022-11-17 Thread GitBox


xiarixiaoyao commented on code in PR #7235:
URL: https://github.com/apache/hudi/pull/7235#discussion_r1025980898


##
rfc/rfc-63/rfc-63.md:
##
@@ -0,0 +1,370 @@
+
+
+# RFC-63: Index Function for Optimizing Query Performance
+
+## Proposers
+
+- @yihua
+- @alexeykudinkin
+
+## Approvers
+
+- @vinothchandar
+- @xushiyan
+- @nsivabalan
+
+## Status
+
+JIRA: [HUDI-512](https://issues.apache.org/jira/browse/HUDI-512)
+
+## Abstract
+
+In this RFC, we address the problem of accelerating queries containing 
predicates based on functions defined on a
+column, by introducing **Index Function**, a new indexing capability for 
efficient file pruning.
+
+## Background
+
+To make the queries finish faster, one major optimization technique is to scan 
less data by pruning rows that are not
+needed by the query. This is usually done in two ways:
+
+- **Partition pruning**:  The partition pruning relies on a table with 
physical partitioning, such as Hive partitioning.
+  A partitioned table uses a chosen column such as the date of `timestamp` and 
stores the rows with the same date to the
+  files under the same folder or physical partition, such as 
`date=2022-10-01/`. When the predicate in a query
+  references the partition column of the physical partitioning, the files in 
the partitions not matching the predicate
+  are filtered out, without scanning. For example, for the predicate `date 
between '2022-10-01' and '2022-10-02'`, the
+  partition pruning only returns the files from two partitions, `2022-10-01` 
and `2022-10-02`, for further processing.
+  The granularity of the pruning is at the partition level.
+
+
+- **File pruning**:  The file pruning carries out the pruning of the data at 
the file level, with the help of file-level
+  or record-level index. For example, with column stats index containing 
minimum and maximum values of a column for each
+  file, the files falling out of the range of the values compared to the 
predicate can be pruned. For a predicate
+  with `age < 20`, the file pruning filters out a file with columns stats of 
`[30, 40]` as the minimum and maximum
+  values of the column `age`.
+
+While Apache Hudi already supports partition pruning and file pruning with 
data skipping for different query engines, we
+recognize that the following use cases need better query performance and 
usability:
+
+- File pruning based on functions defined on a column
+- Efficient file pruning for files without physical partitioning
+- Effective file pruning after partition evolution, without rewriting data
+
+Next, we explain these use cases in detail.
+
+### Use Case 1: Pruning files based on functions defined on a column
+
+Let's consider a non-partitioned table containing the events with a 
`timestamp` column. The events with naturally
+increasing time are ingested into the table with bulk inserts every hour. In 
this case, assume that each file should
+contain rows for a particular hour:
+
+| File Name            | Min of `timestamp` | Max of `timestamp` | Note               |
+|----------------------|--------------------|--------------------|--------------------|
+| base_file_1.parquet  | 1664582400         | 1664586000         | 2022-10-01 12-1 AM |
+| base_file_2.parquet  | 1664586000         | 1664589600         | 2022-10-01 1-2 AM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_13.parquet | 1664625600         | 1664629200         | 2022-10-01 12-1 PM |
+| base_file_14.parquet | 1664629200         | 1664632800         | 2022-10-01 1-2 PM  |
+| ...                  | ...                | ...                | ...                |
+| base_file_37.parquet | 1664712000         | 1664715600         | 2022-10-02 12-1 PM |
+| base_file_38.parquet | 1664715600         | 1664719200         | 2022-10-02 1-2 PM  |
+
+For a query to get the number of events between 12PM and 2PM each day in a 
month for time-of-day analysis, the
+predicates look like `DATE_FORMAT(timestamp, '%Y-%m-%d') between '2022-10-01' 
and '2022-10-31'`
+and `DATE_FORMAT(timestamp, '%H') between '12' and '13'`. If the data is in a 
good layout as above, we only need to scan
+two files (instead of 24 files) for each day of data, e.g., 
`base_file_13.parquet` and `base_file_14.parquet` containing
+the data for 2022-10-01 12-2 PM.
+
+Currently, such a fine-grained file pruning based on a function on a column 
cannot be achieved in Hudi, because
+transforming the `timestamp` to the hour of day is not order-preserving, thus 
the file pruning cannot directly leverage
+the file-level column stats of the original column of `timestamp`. In this 
case, Hudi has to scan all the files for a
+day and push the predicate down when reading parquet files, increasing the 
amount of data to be scanned.
+
+### Use Case 2: Efficient file pruning for files without physical partitioning
+
+Let's consider the same non-partitioned table as in the Use Case 1, containing 
the events with a 

[jira] [Updated] (HUDI-5237) Support for HoodieUnMergedLogRecordScanner with InternalSchema

2022-11-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-5237:
-
Labels: pull-request-available  (was: )

> Support for HoodieUnMergedLogRecordScanner with InternalSchema
> --
>
> Key: HUDI-5237
> URL: https://issues.apache.org/jira/browse/HUDI-5237
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Alexander Trushev
>Assignee: Alexander Trushev
>Priority: Major
>  Labels: pull-request-available
>
> Currently, only HoodieMergedLogRecordScanner supports InternalSchema. 
> Implementing schema evolution in Flink requires support for 
> HoodieUnMergedLogRecordScanner with InternalSchema as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] trushev opened a new pull request, #7237: [HUDI-5237] Support for HoodieUnMergedLogRecordScanner with InternalSchema

2022-11-17 Thread GitBox


trushev opened a new pull request, #7237:
URL: https://github.com/apache/hudi/pull/7237

   ### Change Logs
   
   Currently, only `HoodieMergedLogRecordScanner` supports InternalSchema.
   Implementing schema evolution in Flink 
(https://github.com/apache/hudi/pull/5830) requires support for 
`HoodieUnMergedLogRecordScanner` with InternalSchema as well.
   
   ### Impact
   
   Support for HoodieUnMergedLogRecordScanner with InternalSchema.
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-5237) Support for HoodieUnMergedLogRecordScanner with InternalSchema

2022-11-17 Thread Alexander Trushev (Jira)
Alexander Trushev created HUDI-5237:
---

 Summary: Support for HoodieUnMergedLogRecordScanner with 
InternalSchema
 Key: HUDI-5237
 URL: https://issues.apache.org/jira/browse/HUDI-5237
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Alexander Trushev
Assignee: Alexander Trushev


Currently, only HoodieMergedLogRecordScanner supports InternalSchema. 
Implementing schema evolution in Flink requires support for 
HoodieUnMergedLogRecordScanner with InternalSchema as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #7231: [HUDI-5234] streaming read skip clustering

2022-11-17 Thread GitBox


danny0405 commented on code in PR #7231:
URL: https://github.com/apache/hudi/pull/7231#discussion_r1025961537


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/IncrementalInputSplits.java:
##
@@ -482,13 +487,12 @@ private List filterInstantsWithRange(
   final String endCommit = this.conf.get(FlinkOptions.READ_END_COMMIT);
   instantStream = instantStream.filter(s -> 
HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN_OR_EQUALS, 
endCommit));
 }
-return maySkipCompaction(instantStream).collect(Collectors.toList());
+return 
maySkipOverwriteInstants(instantStream).collect(Collectors.toList());
   }
 
-  private Stream maySkipCompaction(Stream 
instants) {
-return this.skipCompaction
-? instants.filter(instant -> 
!instant.getAction().equals(HoodieTimeline.COMMIT_ACTION))
-: instants;
+  private Stream maySkipOverwriteInstants(Stream 
instants) {
+return instants.filter(instant -> !this.skipCompaction || 
!instant.getAction().equals(HoodieTimeline.COMPACTION_ACTION))
+.filter(instant -> !this.skipClustering|| 
!instant.getAction().equals(HoodieTimeline.REPLACE_COMMIT_ACTION));
   }

Review Comment:
   `!this.skipClustering||` -> `!this.skipClustering ||`
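   
   An illustrative alternative shape (not the PR's code): collect the actions 
to skip once, then filter with a single predicate:
   
       private Stream<HoodieInstant> maySkipOverwriteInstants(Stream<HoodieInstant> instants) {
         Set<String> skippedActions = new HashSet<>();
         if (this.skipCompaction) {
           skippedActions.add(HoodieTimeline.COMPACTION_ACTION);
         }
         if (this.skipClustering) {
           skippedActions.add(HoodieTimeline.REPLACE_COMMIT_ACTION);
         }
         return instants.filter(i -> !skippedActions.contains(i.getAction()));
       }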



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] scxwhite opened a new pull request, #7236: [MINOR] Fix the npe caused by alter table add column.

2022-11-17 Thread GitBox


scxwhite opened a new pull request, #7236:
URL: https://github.com/apache/hudi/pull/7236

   ### Change Logs
   We put the public configuration in the file 
/etc/hudi/conf/hudi-defaults.conf. When the optimistic lock is enabled, the 
`alter table add column` command will cause an NPE.
   
   
![image](https://user-images.githubusercontent.com/23207189/202611981-784680e5-1b33-42a3-9afb-74a13807b222.png)
   
   Then I found that because BaseHoodieWriteClient#preWrite was not executed 
in the AlterHoodieTableAddColumnsCommand class, 
pendingInflightAndRequestedInstants was not initialized. So I submit this PR to 
fix it.
   ![image](https://user-images.githubusercontent.com/23207189/202612457-cb94c86a-4ae0-4a89-a124-0faae0dccb1a.png)
   
   ### Impact
   
   alter table add/change column when the optimistic lock is enabled.
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-5233) Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns

2022-11-17 Thread Alexander Trushev (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Trushev closed HUDI-5233.
---
Resolution: Fixed

Fixed via master branch: e4e28836c235f96edf4c38a75dd6e95beeaecb27

> Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns
> ---
>
> Key: HUDI-5233
> URL: https://issues.apache.org/jira/browse/HUDI-5233
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Alexander Trushev
>Assignee: Alexander Trushev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> InternalSchemaUtils.collectTypeChangedCols returns all columns instead of 
> changed ones



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] danny0405 commented on a diff in pull request #7232: [HUDI-5235] clustering target size should larger than small file limit

2022-11-17 Thread GitBox


danny0405 commented on code in PR #7232:
URL: https://github.com/apache/hudi/pull/7232#discussion_r1025957414


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringOperator.java:
##
@@ -136,7 +136,13 @@ public ClusteringOperator(Configuration conf, RowType rowType) {
 
     // override max parquet file size in conf
     this.conf.setLong(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key(),
-        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES));
+        Integer.MAX_VALUE);
+
+    // target size should larger than small file limit
+    this.conf.setLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT.key(),
+        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES)
+            > this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT) ?
+        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT) :

Review Comment:
   Seems reasonable.
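   
   For readability, the clamp quoted above collapses to a `min()` -- a sketch 
under the assumption that the truncated else-branch of the hunk falls back to 
the target size (the config options come from the hunk itself):
   
   ```java
   import org.apache.flink.configuration.Configuration;
   import org.apache.hudi.configuration.FlinkOptions;
   
   // Sketch only: assumes the elided else-branch falls back to the target size,
   // in which case the whole ternary is equivalent to a min().
   public class ClusteringConfClampSketch {
     static void clampSmallFileLimit(Configuration conf) {
       long target = conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES);
       long smallFileLimit = conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT);
       // keep the small-file limit no larger than the clustering target size
       conf.setLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_SMALL_FILE_LIMIT.key(),
           Math.min(target, smallFileLimit));
     }
   }
   ```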



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #7232: [HUDI-5235] clustering target size should larger than small file limit

2022-11-17 Thread GitBox


danny0405 commented on code in PR #7232:
URL: https://github.com/apache/hudi/pull/7232#discussion_r1025957285


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/clustering/ClusteringOperator.java:
##
@@ -136,7 +136,13 @@ public ClusteringOperator(Configuration conf, RowType rowType) {
 
     // override max parquet file size in conf
     this.conf.setLong(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key(),
-        this.conf.getLong(FlinkOptions.CLUSTERING_PLAN_STRATEGY_TARGET_FILE_MAX_BYTES));
+        Integer.MAX_VALUE);

Review Comment:
   Don't think it is right yet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5233) Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns

2022-11-17 Thread Alexander Trushev (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Trushev updated HUDI-5233:

Fix Version/s: 0.13.0

> Fix bug when InternalSchemaUtils.collectTypeChangedCols returns all columns
> ---
>
> Key: HUDI-5233
> URL: https://issues.apache.org/jira/browse/HUDI-5233
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Alexander Trushev
>Assignee: Alexander Trushev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> InternalSchemaUtils.collectTypeChangedCols returns all columns instead of 
> changed ones



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7222: [MINOR] fixed Flink's DataStream does not support creating managed table

2022-11-17 Thread GitBox


hudi-bot commented on PR #7222:
URL: https://github.com/apache/hudi/pull/7222#issuecomment-1319501614

   
   ## CI report:
   
   * 17aef066b20b39a90f6d22f243fd3cbb58004e68 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13078)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13081)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13079)
 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13091)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive

2022-11-17 Thread GitBox


hudi-bot commented on PR #6725:
URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319498561

   
   ## CI report:
   
   * 82ab5bfa3d2246beae8835178671840e4e1b77b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13073)
 
   * f3b7a61c226d136cd232a062a0f28e085f060035 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13095)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on pull request #7159: [HUDI-5173]Skip if there is only one file in clusteringGroup

2022-11-17 Thread GitBox


danny0405 commented on PR #7159:
URL: https://github.com/apache/hudi/pull/7159#issuecomment-1319497037

   Thanks for the contribution. I have reviewed it and created a patch here:
   [5173.zip](https://github.com/apache/hudi/files/10037492/5173.zip)
   
   You can apply the patch with the cmd: `git apply xxx.patch`.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive

2022-11-17 Thread GitBox


hudi-bot commented on PR #6725:
URL: https://github.com/apache/hudi/pull/6725#issuecomment-1319495871

   
   ## CI report:
   
   * 82ab5bfa3d2246beae8835178671840e4e1b77b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13073)
 
   * f3b7a61c226d136cd232a062a0f28e085f060035 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


YannByron commented on PR #7138:
URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319486636

   Nice work. Looks good; I just left two comments to resolve.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


YannByron commented on code in PR #7138:
URL: https://github.com/apache/hudi/pull/7138#discussion_r1025940652


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##
@@ -18,7 +18,6 @@
 package org.apache.hudi
 
 import org.apache.hadoop.fs.Path
-

Review Comment:
   please keep the import code style that separates different packages with a 
blank line.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] liaooo commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync

2022-11-17 Thread GitBox


liaooo commented on PR #3771:
URL: https://github.com/apache/hudi/pull/3771#issuecomment-1319480279

   So does HiveSync now support Kerberos or not?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5023) Evaluate removing Queueing in the write path

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-5023:
--
Sprint: 2022/11/15  (was: 2022/11/29)

> Evaluate removing Queueing in the write path
> 
>
> Key: HUDI-5023
> URL: https://issues.apache.org/jira/browse/HUDI-5023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should evaluate removing _any queueing_ (BoundedInMemoryQueue, 
> DisruptorQueue) on the write path for multiple reasons:
> *It breaks up the vertical chain of transformations applied to the data*
> Spark (like other engines) relies on the notion of _Iteration_ to vertically 
> compose all transformations applied to a single record, allowing for effective 
> _stream_ processing, where all transformations are applied to an _Iterator, 
> yielding records_ from the source. That way:
>  # The chain of transformations* is applied to every record one by one, allowing 
> us to effectively limit the amount of memory used to the number of records being 
> read and processed simultaneously (if the reading is not batched, it'd be 
> just a single record), which in turn allows us
>  # To limit the # of memory allocations required to process a single record. 
> Consider the opposite: if we did it breadth-wise, applying the first 
> transformation to _all_ of the records, we would have to store all of the 
> transformed records in memory, which is costly from both GC overhead and 
> pure object-churn perspectives.
>  
> Enqueueing essentially violates both of these invariants, breaking up the 
> {_}stream{_}-like processing model and forcing records to be kept in memory 
> for no good reason.
>  
> * This chain is broken up at shuffling points (the collections of tasks executed 
> b/w these shuffling points are called stages in Spark)
>  
> *It requires data to be allocated on the heap*
> As was called out in the previous paragraph, enqueueing raw data read from 
> the source breaks up the _stream_-processing paradigm and forces records to be 
> retained on the heap.
> Consider the following example: the plain ParquetReader from Spark actually uses 
> a *mutable* `ColumnarBatchRow`, providing a Row-based view into the batch of 
> data being read from the file.
> Now, since it's a mutable object we can use it to _iterate_ over all of the 
> records (while doing stream-processing), ultimately producing some "output" 
> (either writing into another file, a shuffle block, etc), but we +can't keep a 
> reference to it+ (for ex, by +enqueueing+ it) precisely because the object is 
> mutable. Instead we are forced to make a *copy* of it, which will obviously 
> require us to allocate it on the heap.
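
An illustrative sketch (plain Java, not Hudi code) of the iterator-chaining 
model the description relies on: the transformation is composed lazily and 
applied record by record, so only one record is live at a time.

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: a lazily composed iterator applies the transformation
// per record, so memory use is bounded by the records in flight.
public class IteratorChainSketch {
  public static void main(String[] args) {
    Iterator<Integer> source = List.of(1, 2, 3).iterator();
    Iterator<Integer> transformed = new Iterator<Integer>() {
      @Override public boolean hasNext() { return source.hasNext(); }
      @Override public Integer next() { return source.next() * 10; } // applied record by record
    };
    // enqueueing would instead force a copy of every record onto the heap
    transformed.forEachRemaining(System.out::println); // prints 10, 20, 30
  }
}
```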



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4937) Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT readers

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4937:
--
Sprint: 2022/10/04, 2022/10/18, 2022/11/01, 2022/11/15  (was: 2022/10/04, 
2022/10/18, 2022/11/01, 2022/11/29)

> Fix HoodieTable injecting HoodieBackedTableMetadata not reusing underlying MT 
> readers
> -
>
> Key: HUDI-4937
> URL: https://issues.apache.org/jira/browse/HUDI-4937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently, `HoodieTable` holds a `HoodieBackedTableMetadata` that is set up 
> not to reuse the actual LogScanner and HFileReader used to read the MT itself.
> This has proven wasteful on a number of occasions already, including 
> (not an exhaustive list):
> https://github.com/apache/hudi/issues/6373



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4911) Make sure LogRecordReader doesn't flush the cache before each lookup

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4911:
--
Sprint: 2022/11/15  (was: 2022/11/29)

> Make sure LogRecordReader doesn't flush the cache before each lookup
> 
>
> Key: HUDI-4911
> URL: https://issues.apache.org/jira/browse/HUDI-4911
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> Currently {{HoodieMetadataMergedLogRecordReader}} will flush its internal record 
> cache before each lookup, which makes every lookup essentially re-process 
> the whole stack of log blocks again.
> We should avoid that and only do the re-parsing incrementally (for the keys 
> that aren't already cached)
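
A hypothetical sketch (not the actual reader code) of what incremental 
re-parsing could look like: only keys missing from the cache trigger a 
log-block scan, and previously cached keys are served as-is.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: re-parse only the cache misses instead of flushing
// the whole cache before every lookup.
public class IncrementalLookupSketch {
  private final Map<String, String> cache = new HashMap<>();

  Map<String, String> lookup(Set<String> keys) {
    keys.stream()
        .filter(k -> !cache.containsKey(k))               // only the cache misses...
        .forEach(k -> cache.put(k, scanLogBlocksFor(k))); // ...are re-parsed
    return keys.stream().collect(Collectors.toMap(k -> k, cache::get));
  }

  private String scanLogBlocksFor(String key) {
    return "record-for-" + key; // stand-in for scanning the log-block stack
  }
}
```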



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] trushev commented on a diff in pull request #6358: [HUDI-4588][HUDI-4472] Addressing schema handling issues in the write path

2022-11-17 Thread GitBox


trushev commented on code in PR #6358:
URL: https://github.com/apache/hudi/pull/6358#discussion_r1025925085


##
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/HoodieFlinkCopyOnWriteTable.java:
##
@@ -378,7 +377,7 @@ protected Iterator> handleUpdateInternal(HoodieMergeHandle

Review Comment:
   https://github.com/apache/hudi/pull/5830
   The day will come when one of us will have to solve a merge conflict with the 
master branch.
   I'd prefer to do it ASAP because no `FlinkMergeHelper` means no conflict :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-4919) Sql MERGE INTO incurs too much memory overhead

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-4919:
-

Assignee: sivabalan narayanan  (was: Alexey Kudinkin)

> Sql MERGE INTO incurs too much memory overhead
> --
>
> Key: HUDI-4919
> URL: https://issues.apache.org/jira/browse/HUDI-4919
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.13.0
>
>
> When using spark-sql MERGE INTO, the memory requirement shoots up: merging 
> new incoming data into a 120MB parquet file requires > 10GB of memory.
>  
> From the user:
> We are trying to process 5 GB of input data (Parquet, snappy compression), 
> which inserts/updates the Hudi table for 4 days (day is the 
> partition).
> The data size in the Hudi target table for each partition is around 3.5GB to 
> 10GB. The process keeps failing with 
> OOM (java.lang.OutOfMemoryError: GC overhead limit exceeded).
> We have tried with 32GB and 64GB of executor memory as well, with 3 cores.
> The process runs fine when we have fewer updates and more inserts.
>  
>  
> A summary of the issue: 
> It's a partitioned dataset, and each partition holds roughly 3.5 to 10GB of 
> data. The max parquet file size is the default, so 120MB files max.
> The input batch is spread across the last 3 to 4 partitions. 
> Incoming data is 5GB parquet-compressed.
> The user tried giving close to 20GB per task (64 GB executor w/ 3 cores) and 
> still hit memory issues and failed.
> If the incoming batch has fewer updates, it works; else it fails w/ OOM. 
> Tried w/ both BLOOM and SIMPLE index, but neither worked. 
> Similar incremental ingestion works w/o any issues w/ spark-ds writes. The 
> issue is only w/ MERGE INTO w/ spark-sql. 
>  
> Specifics about the table schema:
> The table has around 50 columns and there are no nested fields.
> All data types are generic ones like String, Timestamp, Decimal.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/29  (was: 2022/08/22, 
2022/09/05, 2022/10/18)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do a bulk-insert w/ sort-mode "NONE" for a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: every logical partition 
> handled by Spark (let's say we have N of these, equal to the shuffling 
> parallelism in Hudi) will likely have a record from every physical partition 
> on disk (let's say we have M of these), since no re-partitioning is done to 
> align with the actual partition column. B/c of that, every logical partition 
> will be writing into every physical one.
> This will eventually produce 
>  # M * N files in the table
>  # For every file in the table, while writing, Hudi will keep a "handle" in 
> memory, which in turn will hold a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!
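
For illustration only, with hypothetical numbers: N = 500 logical partitions 
each writing into M = 1,000 physical partitions yields M * N = 500,000 files 
and as many in-memory write handles; at even a few MB of buffered Parquet data 
per handle, the unflushed buffers alone would need terabytes of heap.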



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05, 2022/10/18  (was: 2022/08/22, 2022/09/05, 
2022/10/18, 2022/11/15)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting, I've stumbled upon an OOM failure when 
> you do a bulk-insert w/ sort-mode "NONE" for a table w/ a large number of 
> partitions (> 1000).
>  
> This happens for the same reasons as HUDI-3883: every logical partition 
> handled by Spark (let's say we have N of these, equal to the shuffling 
> parallelism in Hudi) will likely have a record from every physical partition 
> on disk (let's say we have M of these), since no re-partitioning is done to 
> align with the actual partition column. B/c of that, every logical partition 
> will be writing into every physical one.
> This will eventually produce 
>  # M * N files in the table
>  # For every file in the table, while writing, Hudi will keep a "handle" in 
> memory, which in turn will hold a full buffer's worth of Parquet data (until 
> flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] YannByron commented on a diff in pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


YannByron commented on code in PR #7138:
URL: https://github.com/apache/hudi/pull/7138#discussion_r1025925016


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/streaming/HoodieStreamSource.scala:
##
@@ -72,57 +68,21 @@ class HoodieStreamSource(
 
     parameters.get(DataSourceReadOptions.QUERY_TYPE.key).contains(DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL) &&
       parameters.get(DataSourceReadOptions.INCREMENTAL_FORMAT.key).contains(DataSourceReadOptions.INCREMENTAL_FORMAT_CDC_VAL)
 
-  @transient private var lastOffset: HoodieSourceOffset = _
-
   @transient private lazy val initialOffsets = {
-    val metadataLog =
-      new HDFSMetadataLog[HoodieSourceOffset](sqlContext.sparkSession, metadataPath) {
-        override def serialize(metadata: HoodieSourceOffset, out: OutputStream): Unit = {
-          val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
-          writer.write("v" + VERSION + "\n")
-          writer.write(metadata.json)
-          writer.flush()
-        }
-
-        /**
-          * Deserialize the init offset from the metadata file.
-          * The format in the metadata file is like this:
-          * --
-          * v1         -- The version info in the first line
-          * offsetJson -- The json string of HoodieSourceOffset in the rest of the file
-          * ---
-          * @param in
-          * @return
-          */
-        override def deserialize(in: InputStream): HoodieSourceOffset = {
-          val content = FileIOUtils.readAsUTFString(in)
-          // Get version from the first line
-          val firstLineEnd = content.indexOf("\n")
-          if (firstLineEnd > 0) {
-            val version = getVersion(content.substring(0, firstLineEnd))
-            if (version > VERSION) {
-              throw new IllegalStateException(s"UnSupportVersion: max support version is: $VERSION" +
-                s" current version is: $version")
-            }
-            // Get offset from the rest line in the file
-            HoodieSourceOffset.fromJson(content.substring(firstLineEnd + 1))
-          } else {
-            throw new IllegalStateException(s"Bad metadata format, failed to find the version line.")
-          }
-        }
-      }
+    val metadataLog = new HoodieMetadataLog(sqlContext.sparkSession, metadataPath)
     metadataLog.get(0).getOrElse {
-      metadataLog.add(0, INIT_OFFSET)
-      INIT_OFFSET
-    }
-  }
-
-  private def getVersion(versionLine: String): Int = {
-    if (versionLine.startsWith("v")) {
-      versionLine.substring(1).toInt
-    } else {
-      throw new IllegalStateException(s"Illegal version line: $versionLine " +
-        s"in the streaming metadata path")
+      val offset = offsetRangeLimit match {
+        case HoodieEarliestOffsetRangeLimit =>
+          INIT_OFFSET
+        case HoodieLatestOffsetRangeLimit =>
+          getLatestOffset.getOrElse(throw new HoodieException("Cannot fetch 
latest offset from table, " +

Review Comment:
   Can we use INIT_OFFSET when `getLatestOffset` is empty? I.e., 
`getLatestOffset.getOrElse(INIT_OFFSET)`.
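   
   For illustration, the suggested fallback in (hypothetical) Java terms, with 
`Optional` standing in for Scala's `Option`:
   
   ```java
   import java.util.Optional;
   
   // Sketch of the reviewer's suggestion: fall back to the initial offset
   // when the latest offset cannot be determined, instead of throwing.
   public class OffsetFallbackSketch {
     static final String INIT_OFFSET = "INIT_OFFSET";
   
     static String resolveStartOffset(Optional<String> latestOffset) {
       return latestOffset.orElse(INIT_OFFSET); // getLatestOffset.getOrElse(INIT_OFFSET)
     }
   
     public static void main(String[] args) {
       System.out.println(resolveStartOffset(Optional.empty()));        // INIT_OFFSET
       System.out.println(resolveStartOffset(Optional.of("20221117"))); // 20221117
     }
   }
   ```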



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-5148) Write RFC for index function

2022-11-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5148:

Status: Patch Available  (was: In Progress)

> Write RFC for index function
> 
>
> Key: HUDI-5148
> URL: https://issues.apache.org/jira/browse/HUDI-5148
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-5210) End-to-end PoC of index function

2022-11-17 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-5210:

Status: In Progress  (was: Open)

> End-to-end PoC of index function
> 
>
> Key: HUDI-5210
> URL: https://issues.apache.org/jira/browse/HUDI-5210
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-4812) Lazy partition listing and file groups fetching in Spark Query

2022-11-17 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4812.

Resolution: Done

> Lazy partition listing and file groups fetching in Spark Query
> --
>
> Key: HUDI-4812
> URL: https://issues.apache.org/jira/browse/HUDI-4812
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> In the current spark query implementation, the FileIndex will refresh and load 
> all file groups into its cache in order to serve subsequent queries.
>  
> For a large table with many partitions, this may introduce much overhead during 
> initialization. Meanwhile, the query itself may come with a partition filter, 
> so loading all file groups may be unnecessary.
>  
> So to optimize, the whole refresh logic becomes lazy, where the actual work 
> is carried out only after the partition filter is applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


hudi-bot commented on PR #7138:
URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319460803

   
   ## CI report:
   
   * c3171ba5115240fd5e00a0d47cf4f9b0c182b8e6 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13074)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13094)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on pull request #7138: [HUDI-5162] Allow user specified start offset for streaming query

2022-11-17 Thread GitBox


boneanxs commented on PR #7138:
URL: https://github.com/apache/hudi/pull/7138#issuecomment-1319460578

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


