Bharath Kumarasubramanian created SAMZA-2789:
------------------------------------------------
Summary: Remove cap on intermediate stream partition count for
stream mode
Key: SAMZA-2789
URL: https://issues.apache.org/jira/browse/SAMZA-2789
Project: Samza
Issue Type: Improvement
Reporter: Bharath Kumarasubramanian
Assignee: Bharath Kumarasubramanian
{*}Problem{*}: Intermediate stream partition count inference logic caps the
partition size to 256 resulting in imbalances in work assignments to tasks
{*}Description{*}: As part of the intermediate partition size inference logic,
we currently employ the following algorithm.
* partitionCount = Math.max(maxPartitionSize(inputStreams),
maxPartitionSize(outputStreams))
* cap the partitionCount to MAX_INFERRED_PARTITIONS defined in the
`IntermediateStreamManager` which is 256
* apply the inferred partition count to intermediate streams whose partition
count is uninitialized
The logic above always caps the partition size of intermediate streams to 256
for all auto-created intermediate streams. This can prevent the job from
scaling up uniformly as the intermediate partition assignment is capped to 256
tasks thereby rendering other tasks imbalanced in case of number tasks > 256.
{*}Changes{*}:
* Apply the cap only for batch mode as 256 limit was introduced for batch mode
where number of files (partition) could be large
* Add unit tests for `IntermediateStreamManager`
{*}Tests{*}: Added unit tests for the code changes
{*}API Changes{*}: None
{*}Upgrade Instructions{*}:
* Jobs that are temporarily worked around this constraint by setting
`job.intermediate.stream.partitions` should remove the configuration in order
for samza to infer and apply the partition count as described above
* Jobs that don't use `job.intermediate.stream.partitions` need no changes.
{*}Usage Instructions{*}: Refer to upgrade instruction.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)