[ https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Bradshaw resolved BEAM-4858. ----------------------------------- Resolution: Fixed Fix Version/s: 2.8.0 > Clean up _BatchSizeEstimator in element-batching transform. > ----------------------------------------------------------- > > Key: BEAM-4858 > URL: https://issues.apache.org/jira/browse/BEAM-4858 > Project: Beam > Issue Type: Bug > Components: sdk-py-core > Reporter: Valentyn Tymofieiev > Assignee: Robert Bradshaw > Priority: Minor > Fix For: 2.8.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] > non-trivial performance-sensitive logic in element-batching transform. Let's > take a look at > [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]. > > Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on > the type of the keys - whether they are integers or floats. > The keys of key-value pairs contained in {{self._data}} are added as integers > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260], > however, when we 'thin' the collected entries > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279], > the keys will become floats. Surprisingly, using either integer or float > division consistently [in the > comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271] > negatively affects the performance of a custom pipeline I was using to > benchmark these changes. The performance impact likely comes from changes in > the logic that depends on how division is evaluated, not from the > performance of division operation itself. > In terms of Python 3 conversion the best course of action that avoids > regression seems to be to preserve the existing Python 2 behavior using > {{old_div}} from {{past.utils.division}}, in the medium term we should clean > up the logic. We may want to add a targeted microbenchmark to evaluate > performance of this code, and maybe cythonize the code, since it seems to be > performance-sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)