Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
pvillard31 closed pull request #8631: NIFI-13027 - Warn users for small files processing in PutIceberg URL: https://github.com/apache/nifi/pull/8631 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
pvillard31 commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2104982035 Hey @joewitt & @exceptionfactory - will close for now. I won't have the opportunity to get back to this before quite some time and I believe @mark-bathori may be able to come up with a PR that follows your recommendations. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
exceptionfactory commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2102943183 Just following up on this, glad to take another look after the rebase, but I still disagree with adding a property to enable warning, since the warning is not really actionable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
joewitt commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2102930541 @pvillard31 Can you please rebase and advise whether you are good removing the configurability of the size? I think the warning is fair game but the configurable size just seems unnecessary. I'm fine leaving it so the only thing really needing done is a rebase - but I'm asking if you'd consider dropping that. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
pvillard31 commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050159532 My concern with changing the routing approach is that it'd be a significant breaking change for existing flows. Having a logic that is not managed through properties also means that if a user does want small files (no idea why...) that would not be an option. But yeah I guess having a built-in rate limiter is probably the best option if we don't want to add properties. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
exceptionfactory commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050124715 I agree that just logging a warning from any source is not ideal, since it requires some amount of user involvement to see and respond to the warning. Changing the behavior to require a minimum size and then taking some action, like routing to failure or perhaps yielding, in addition to logging, seems like it could be helpful. My concern was logging without any additional behavior change, but introducing some behavior change with logging seems worth considering. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
joewitt commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050125108 @pvillard31 I was meaning to say it is good to warn the user about usage not making sense with that processor. But it does not need to be configurable what the size is - we already have too many dials/knobs. Instead the processor should keep some simple state about the rate/avg size of files seen and if over say a 1-5 minute interval the data is on average 'too small' then log a warn/bulletin. The bulletin could certainly advise the user that data to this processor should be batched together and that a component such as MergeContent or MergeRecord can be helpful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
pvillard31 commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050119287 From discussions with Iceberg experts, a good size is 10MB. Problem is that we've seen users using PutIceberg without any kind of Merge operation before (imagine ConsumeKafka (not record based) to PutIceberg) creating thousands of small (hundreds of bytes) files and associated snapshots. This has basically killed the performances and made the query engine on top of the Iceberg tables pretty much unusable. While I do agree that the approach with a specific property sets a precedent that is not great, I don't really see another good approach to make sure the flow designer is aware of this as soon as the designer tries to run the processor to make some tests. I definitely don't think we should go in the direction of adding the Merge capabilities into the processor itself (that is going to be a mess and extremely complicated). The rules engine would be great but you can't be sure that the rule would be set and enforced by users before someone starts designing a flow where Iceberg is a destination. The reason for this change in this case is really because creating so many snapshot files has basically killed the downstream system. Another approach that I can think of: if the processor is processing a flow file that is below the agreed size, then we would yield the processor, and the default yield duration would be set to something like 1 minute. But we would still want to emit a bulletin to explain the behavior or something... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
joewitt commented on PR #8631: URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050092291 @exceptionfactory The problem with doing that in the flow with logmessage is that you're putting the onus on the user to do that. And the point is they might not know. @pvillard31 I think this makes sense but to my response to exceptionfactory the whole idea here is to convey information that might not be obvious to the end user. This seems like it could be more thoughtfully done by skipping new properties altogether. Is there a somewhat established "datasets on average going to Iceberg should be X bytes" big? If there is something like that then just have the processor keep a bit of state about this and after a bit of time and set of files go by periodically kick out such guidance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]
pvillard31 opened a new pull request, #8631: URL: https://github.com/apache/nifi/pull/8631 # Summary [NIFI-13027](https://issues.apache.org/jira/browse/NIFI-13027) - Warn users for small files processing in PutIceberg # Tracking Please complete the following tracking steps prior to pull request creation. ### Issue Tracking - [ ] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue created ### Pull Request Tracking - [ ] Pull Request title starts with Apache NiFi Jira issue number, such as `NIFI-0` - [ ] Pull Request commit message starts with Apache NiFi Jira issue number, as such `NIFI-0` ### Pull Request Formatting - [ ] Pull Request based on current revision of the `main` branch - [ ] Pull Request refers to a feature branch with one commit containing changes # Verification Please indicate the verification steps performed prior to pull request creation. ### Build - [ ] Build completed using `mvn clean install -P contrib-check` - [ ] JDK 21 ### Licensing - [ ] New dependencies are compatible with the [Apache License 2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License Policy](https://www.apache.org/legal/resolved.html) - [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` files ### Documentation - [ ] Documentation formatting appears as expected in rendered files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org