Bharath Kumarasubramanian created SAMZA-2789:
------------------------------------------------

             Summary: Remove cap on intermediate stream partition count for 
stream mode
                 Key: SAMZA-2789
                 URL: https://issues.apache.org/jira/browse/SAMZA-2789
             Project: Samza
          Issue Type: Improvement
            Reporter: Bharath Kumarasubramanian
            Assignee: Bharath Kumarasubramanian


{*}Problem{*}: Intermediate stream partition count inference logic caps the 
partition size to 256 resulting in imbalances in work assignments to tasks

{*}Description{*}: As part of the intermediate partition size inference logic, 
we currently employ the following algorithm.
 * partitionCount = Math.max(maxPartitionSize(inputStreams), 
maxPartitionSize(outputStreams))
 * cap the partitionCount to MAX_INFERRED_PARTITIONS defined in the 
`IntermediateStreamManager` which is 256
 * apply the inferred partition count to intermediate streams whose partition 
count is uninitialized

The logic above always caps the partition size of intermediate streams to 256 
for all auto-created intermediate streams. This can prevent the job from 
scaling up uniformly as the intermediate partition assignment is capped to 256 
tasks thereby rendering other tasks imbalanced in case of number tasks > 256.

{*}Changes{*}:
 * Apply the cap only for batch mode as 256 limit was introduced for batch mode 
where number of files (partition) could be large
 * Add unit tests for `IntermediateStreamManager`

{*}Tests{*}: Added unit tests for the code changes

{*}API Changes{*}: None

{*}Upgrade Instructions{*}: 
 * Jobs that are temporarily worked around this constraint by setting 
`job.intermediate.stream.partitions` should remove the configuration in order 
for samza to infer and apply the partition count as described above
 * Jobs that don't use `job.intermediate.stream.partitions` need no changes.

{*}Usage Instructions{*}: Refer to upgrade instruction.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to