gemini-code-assist[bot] commented on code in PR #37565:
URL: https://github.com/apache/beam/pull/37565#discussion_r2850327059


##########
sdks/python/apache_beam/transforms/util_test.py:
##########
@@ -1025,6 +1026,236 @@ def test_stateful_grows_to_max_batch(self):
           | beam.Map(len))
       assert_that(res, equal_to([1, 1, 2, 4, 8, 16, 32, 50, 50]))
 
+  def test_length_bucket_assignment(self):
+    """WithLengthBucketKey assigns correct bucket indices."""
+    boundaries = [10, 50, 100]
+    dofn = util.WithLengthBucketKey(length_fn=len, 
bucket_boundaries=boundaries)
+    # bisect_left: length < 10 -> bucket 0, 10 <= length < 50 -> bucket 1, etc.
+    self.assertEqual(dofn._get_bucket(5), 0)
+    self.assertEqual(dofn._get_bucket(10), 0)
+    self.assertEqual(dofn._get_bucket(11), 1)
+    self.assertEqual(dofn._get_bucket(50), 1)
+    self.assertEqual(dofn._get_bucket(51), 2)
+    self.assertEqual(dofn._get_bucket(100), 2)
+    self.assertEqual(dofn._get_bucket(101), 3)
+    self.assertEqual(dofn._get_bucket(999), 3)

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   The comment here seems to describe the behavior of `bisect.bisect_right`, 
but the assertions match `bisect.bisect_left`, which is confusing.
   
   If `_get_bucket` is changed to use `bisect.bisect_right` as suggested in my 
other comment, this test should be updated to reflect the new behavior. The 
boundaries would be inclusive on the lower end, which is more conventional.
   
   Here is a suggested update for the test to align with `bisect.bisect_right`:
   
   ```suggestion
       # bisect_right creates buckets where the boundary is the lower-inclusive 
bound.
       # e.g., for boundaries [10, 50], buckets are (-inf, 10), [10, 50), [50, 
inf)
       self.assertEqual(dofn._get_bucket(5), 0)
       self.assertEqual(dofn._get_bucket(10), 1)
       self.assertEqual(dofn._get_bucket(11), 1)
       self.assertEqual(dofn._get_bucket(50), 2)
       self.assertEqual(dofn._get_bucket(51), 2)
       self.assertEqual(dofn._get_bucket(100), 3)
       self.assertEqual(dofn._get_bucket(999), 3)
   ```



##########
sdks/python/apache_beam/transforms/util.py:
##########
@@ -1208,6 +1209,28 @@ def process(self, element):
     yield (self.key, element)
 
 
+class WithLengthBucketKey(DoFn):
+  """Keys elements with (worker_uuid, length_bucket) for length-aware
+  stateful batching. Elements of similar length are routed to the same
+  state partition, reducing padding waste."""
+  def __init__(self, length_fn, bucket_boundaries):
+    self.shared_handle = shared.Shared()
+    self._length_fn = length_fn
+    self._bucket_boundaries = bucket_boundaries
+
+  def setup(self):
+    self.key = self.shared_handle.acquire(
+        load_shared_key, "WithLengthBucketKey").key
+
+  def _get_bucket(self, length):
+    return bisect.bisect_left(self._bucket_boundaries, length)

Review Comment:
   ![medium](https://www.gstatic.com/codereviewagent/medium-priority.svg)
   
   Using `bisect.bisect_right` would be more intuitive for defining bucket 
boundaries. Typically, boundaries are inclusive on the lower end and exclusive 
on the upper end (e.g., a bucket for lengths in `[10, 50)`). 
`bisect.bisect_right` achieves this behavior, which seems to be what was 
intended based on comments and logic in the tests.
   
   With the current `bisect.bisect_left`, a length of 10 falls into bucket 0 
(for lengths `<= 10`), which can be surprising. With `bisect.bisect_right`, a 
length of 10 would fall into bucket 1 (for lengths in `[10, 50)`).
   
   If you make this change, please also update the assertions in 
`test_length_bucket_assignment` in `util_test.py` to match.
   
   ```suggestion
       return bisect.bisect_right(self._bucket_boundaries, length)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to