[ 
https://issues.apache.org/jira/browse/NIFI-15669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18066186#comment-18066186
 ] 

ASF subversion and git services commented on NIFI-15669:
--------------------------------------------------------

Commit c0bb66fa5bbfef13ef928145be340d3abe579d41 in nifi's branch 
refs/heads/main from Mark Payne
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=c0bb66fa5bb ]

NIFI-15669 Redesigned ConsumeKinesis without Kinesis Client Library (#10964)

- Refactored ConsumeKinesis removing dependency on Kinesis Client Library, 
providing much faster startup and drastically reducing heap utilization with 
Enhanced Fan-Out
- Simplified checkpointing by eliminating subsequences: all sub-records are 
always included within a single ProcessSession, so partial sequences never 
need to be checkpointed
- Exclude apache-client library in favor of apache5-client for synchronous HTTP

Signed-off-by: David Handermann <[email protected]>

> Refactor ConsumeKinesis to avoid dependency on KCL
> --------------------------------------------------
>
>                 Key: NIFI-15669
>                 URL: https://issues.apache.org/jira/browse/NIFI-15669
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The existing ConsumeKinesis Processor depends on the Kinesis Client Library, 
> which is a reasonable choice, in general. However, the Kinesis Client Library 
> is very heavy-weight, brings in a lot of transitive dependencies and was 
> designed with some assumptions in mind that do not align well with use in 
> NiFi.
> Specifically, startup time is very long - typically 10-15 minutes for the 
> first iteration as it waits for all consumers to join before allowing any 
> consumer to pull data. While this makes sense for a big, long-running job 
> that you might run from Spark or Flink, it is too long to be reasonable for 
> NiFi and often leads to users thinking it's not working.
> More importantly, the KCL, when using an Enhanced Fan-Out consumer, buffers 
> up to 11 Kinesis Events per shard. This limit of 11 events is hard-coded 
> and cannot be configured. An Event is effectively a batch of records, which 
> typically amounts to 1-2 MB. For a few shards and/or where NiFi is keeping 
> up without issue, this works well. However, if backpressure is applied, we 
> end up buffering roughly 2 MB * 11 = 22 MB per shard. Many users 
> have Kinesis Streams with hundreds of shards, some even with 1,000+ shards. 
> With 1,000 shards that's 22 GB of data buffered in heap. And this is 
> considering only the 'payload'. There is additional overhead from metadata 
> and Java objects.
> We need a faster, more stable solution. The new solution can still make use 
> of DynamoDB for checkpointing and support more or less the same options and 
> configuration in a backward compatible manner.
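
As a rough check of the heap arithmetic in the description above, the sketch below multiplies the shard count by the 11-event buffer limit and the upper 2 MB per-event estimate. The function and constant names are hypothetical and are not part of NiFi or the KCL; only the numbers come from the issue text, and the result covers payload only, not metadata or Java object overhead.

```python
# Hypothetical helper for illustration; not NiFi or KCL code.
EVENTS_PER_SHARD = 11  # hard-coded KCL buffer limit cited in the issue
EVENT_SIZE_MB = 2      # upper end of the 1-2 MB per-event estimate


def buffered_payload_mb(shards, events_per_shard=EVENTS_PER_SHARD,
                        event_size_mb=EVENT_SIZE_MB):
    """Worst-case buffered payload in MB when backpressure is applied."""
    return shards * events_per_shard * event_size_mb


if __name__ == "__main__":
    for shards in (1, 100, 1000):
        mb = buffered_payload_mb(shards)
        # Uses 1 GB = 1000 MB, matching the rounding in the issue text.
        print(f"{shards} shards: {mb} MB (about {mb / 1000:g} GB)")
```

With 1,000 shards this gives 22,000 MB, which matches the 22 GB figure quoted in the description.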



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
