Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-05-10 Thread via GitHub


pvillard31 closed pull request #8631: NIFI-13027 - Warn users for small files 
processing in PutIceberg
URL: https://github.com/apache/nifi/pull/8631


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-05-10 Thread via GitHub


pvillard31 commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2104982035

   Hey @joewitt & @exceptionfactory - will close for now. I won't have the 
opportunity to get back to this before quite some time and I believe 
@mark-bathori may be able to come up with a PR that follows your 
recommendations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-05-09 Thread via GitHub


exceptionfactory commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2102943183

   Just following up on this, glad to take another look after the rebase, but I 
still disagree with adding a property to enable warning, since the warning is 
not really actionable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-05-09 Thread via GitHub


joewitt commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2102930541

   @pvillard31 Can you please rebase and advise whether you are good removing 
the configurability of the size?
   
   I think the warning is fair game but the configurable size just seems 
unnecessary.  I'm fine leaving it so the only thing really needing done is a 
rebase - but I'm asking if you'd consider dropping that.
   
   Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-04-11 Thread via GitHub


pvillard31 commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050159532

   My concern with changing the routing approach is that it'd be a significant 
breaking change for existing flows. 
   
   Having a logic that is not managed through properties also means that if a 
user does want small files (no idea why...) that would not be an option. But 
yeah I guess having a built-in rate limiter is probably the best option if we 
don't want to add properties.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-04-11 Thread via GitHub


exceptionfactory commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050124715

   I agree that just logging a warning from any source is not ideal, since it 
requires some amount of user involvement to see and respond to the warning.
   
   Changing the behavior to require a minimum size and then taking some action, 
like routing to failure or perhaps yielding, in addition to logging, seems like 
it could be helpful. My concern was logging without any additional behavior 
change, but introducing some behavior change with logging seems worth 
considering.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-04-11 Thread via GitHub


joewitt commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050125108

   @pvillard31 I was meaning to say it is good to warn the user about usage not 
making sense with that processor.  But it does not need to be configurable what 
the size is - we already have too many dials/knobs.  Instead the processor 
should keep some simple state about the rate/avg size of files seen and if over 
say a 1-5 minute interval the data is on average 'too small' then log a 
warn/bulletin.  The bulletin could certainly advise the user that data to this 
processor should be batched together and that a component such as MergeContent 
or MergeRecord can be helpful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-04-11 Thread via GitHub


pvillard31 commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050119287

   From discussions with Iceberg experts, a good size is 10MB. Problem is that 
we've seen users using PutIceberg without any kind of Merge operation before 
(imagine ConsumeKafka (not record based) to PutIceberg) creating thousands of 
small (hundreds of bytes) files and associated snapshots. This has basically 
killed the performances and made the query engine on top of the Iceberg tables 
pretty much unusable. While I do agree that the approach with a specific 
property sets a precedent that is not great, I don't really see another good 
approach to make sure the flow designer is aware of this as soon as the 
designer tries to run the processor to make some tests. I definitely don't 
think we should go in the direction of adding the Merge capabilities into the 
processor itself (that is going to be a mess and extremely complicated). The 
rules engine would be great but you can't be sure that the rule would be set 
and enforced by users before someone starts designing a flow where Iceberg is a
  destination. The reason for this change in this case is really because 
creating so many snapshot files has basically killed the downstream system.
   
   Another approach that I can think of: if the processor is processing a flow 
file that is below the agreed size, then we would yield the processor, and the 
default yield duration would be set to something like 1 minute. But we would 
still want to emit a bulletin to explain the behavior or something...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-04-11 Thread via GitHub


joewitt commented on PR #8631:
URL: https://github.com/apache/nifi/pull/8631#issuecomment-2050092291

   @exceptionfactory The problem with doing that in the flow with logmessage is 
that you're putting the onus on the user to do that.  And the point is they 
might not know.  
   
   @pvillard31 I think this makes sense but to my response to exceptionfactory 
the whole idea here is to convey information that might not be obvious to the 
end user.  This seems like it could be more thoughtfully done by skipping new 
properties altogether.  Is there a somewhat established "datasets on average 
going to Iceberg should be X bytes" big?  If there is something like that then 
just have the processor keep a bit of state about this and after a bit of time 
and set of files go by periodically kick out such guidance.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] NIFI-13027 - Warn users for small files processing in PutIceberg [nifi]

2024-04-11 Thread via GitHub


pvillard31 opened a new pull request, #8631:
URL: https://github.com/apache/nifi/pull/8631

   # Summary
   
   [NIFI-13027](https://issues.apache.org/jira/browse/NIFI-13027) - Warn users 
for small files processing in PutIceberg
   
   # Tracking
   
   Please complete the following tracking steps prior to pull request creation.
   
   ### Issue Tracking
   
   - [ ] [Apache NiFi Jira](https://issues.apache.org/jira/browse/NIFI) issue 
created
   
   ### Pull Request Tracking
   
   - [ ] Pull Request title starts with Apache NiFi Jira issue number, such as 
`NIFI-0`
   - [ ] Pull Request commit message starts with Apache NiFi Jira issue number, 
as such `NIFI-0`
   
   ### Pull Request Formatting
   
   - [ ] Pull Request based on current revision of the `main` branch
   - [ ] Pull Request refers to a feature branch with one commit containing 
changes
   
   # Verification
   
   Please indicate the verification steps performed prior to pull request 
creation.
   
   ### Build
   
   - [ ] Build completed using `mvn clean install -P contrib-check`
 - [ ] JDK 21
   
   ### Licensing
   
   - [ ] New dependencies are compatible with the [Apache License 
2.0](https://apache.org/licenses/LICENSE-2.0) according to the [License 
Policy](https://www.apache.org/legal/resolved.html)
   - [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` 
files
   
   ### Documentation
   
   - [ ] Documentation formatting appears as expected in rendered files
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org