[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=152877&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-152877
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 09/Oct/18 19:17
Start Date: 09/Oct/18 19:17
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r223829525
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,62 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1)  # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data)
+    # As it's used below to alternate jitter.
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    # Least squares fit for y = a + bx over all points.
+    n = float(len(xs))
+    xbar = sum(xs) / n
+    ybar = sum(ys) / n
+    b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+         / sum([(x - xbar)**2 for x in xs]))
+    a = ybar - b * xbar
+    return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    from numpy import sum
+    xs = np.asarray(xs, dtype=float)
+    ys = np.asarray(ys, dtype=float)
+
+    # First do a simple least squares fit for y = a + bx over all points.
+    b, a = np.polyfit(xs, ys, 1)
+
+    n = len(xs)
+    if n < 10:
+      return a, b
+    else:
+      # Refine this by throwing out outliers, according to Cook's distance.
+      # https://en.wikipedia.org/wiki/Cook%27s_distance
+      sum_x = sum(xs)
+      sum_x2 = sum(xs**2)
+      errs = a + b * xs - ys
+      s2 = sum(errs**2) / (n - 2)
+      if s2 == 0:
+        # It's an exact fit!
+        return a, b
+      h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+      cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
+
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 0.5, and weighting by the inverse of x to give a more
+      # stable y-intercept.
+      weight = (cook_ds <= 0.5) / xs
 
 Review comment:
   I see, thanks for taking time to explain on PR & comments.
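   A quick way to see what the pure-Python fallback quoted above computes: the sketch below is a standalone re-implementation of the same least-squares formula (not the PR's code), checked on points that lie exactly on a known line.

```python
# Standalone sketch of the quoted no-numpy least squares fit for y = a + b*x.
def linear_regression_no_numpy(xs, ys):
    n = float(len(xs))
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Slope: covariance of (x, y) over variance of x.
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

# Points exactly on y = 3 + 2x: the fit recovers the coefficients.
xs = [1.0, 2.0, 5.0, 10.0, 20.0]
ys = [3.0 + 2.0 * x for x in xs]
a, b = linear_regression_no_numpy(xs, ys)
assert abs(a - 3.0) < 1e-9 and abs(b - 2.0) < 1e-9
```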


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 152877)
Time Spent: 5h 40m  (was: 5.5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=152625&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-152625
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 09/Oct/18 10:54
Start Date: 09/Oct/18 10:54
Worklog Time Spent: 10m 
  Work Description: robertwb closed pull request #6375: [BEAM-4858] Clean 
up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/python/apache_beam/transforms/util.py b/sdks/python/apache_beam/transforms/util.py
index 8a999691f03..067d4f74aaa 100644
--- a/sdks/python/apache_beam/transforms/util.py
+++ b/sdks/python/apache_beam/transforms/util.py
@@ -30,7 +30,6 @@
 from builtins import zip
 
 from future.utils import itervalues
-from past.utils import old_div
 
 from apache_beam import typehints
 from apache_beam.metrics import Metrics
@@ -213,6 +212,7 @@ def __init__(self,
                max_batch_size=1000,
                target_batch_overhead=.1,
                target_batch_duration_secs=1,
+               variance=0.25,
                clock=time.time):
     if min_batch_size > max_batch_size:
       raise ValueError("Minimum (%s) must not be greater than maximum (%s)" % (
@@ -230,6 +230,7 @@ def __init__(self,
     self._max_batch_size = max_batch_size
     self._target_batch_overhead = target_batch_overhead
     self._target_batch_duration_secs = target_batch_duration_secs
+    self._variance = variance
     self._clock = clock
     self._data = []
     self._ignore_next_timing = False
@@ -269,23 +270,63 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1)  # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data)
+    # As it's used below to alternate jitter.
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    # Least squares fit for y = a + bx over all points.
+    n = float(len(xs))
+    xbar = sum(xs) / n
+    ybar = sum(ys) / n
+    b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+         / sum([(x - xbar)**2 for x in xs]))
+    a = ybar - b * xbar
+    return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    from numpy import sum
+    xs = np.asarray(xs, dtype=float)
+    ys = np.asarray(ys, dtype=float)
+
+    # First do a simple least squares fit for y = a + bx over all points.
+    b, a = np.polyfit(xs, ys, 1)
+
+    n = len(xs)
+    if n < 10:
+      return a, b
+    else:
+      # Refine this by throwing out outliers, according to Cook's distance.
+      # https://en.wikipedia.org/wiki/Cook%27s_distance
+      sum_x = sum(xs)
+      sum_x2 = sum(xs**2)
+      errs = a + b * xs - ys
+      s2 = sum(errs**2) / (n - 2)
+      if s2 == 0:
+        # It's an exact fit!
+        return a, b
+      h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+      cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
+
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 0.5, and weighting by the inverse of x to give a more
+      # stable y-intercept (as small batches have relatively more information
+      # about the fixed overhead).
+      weight = (cook_ds <= 0.5) / xs
+      b, a = np.polyfit(xs, ys, 1, w=weight)
+      return a, b
+
+  try:
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    linear_regression = linear_regression_numpy
+  except ImportError:
+    linear_regression = linear_regression_no_numpy
 
   def next_batch_size(self):
     if self._min_batch_size == self._max_batch_size:
@@ -300,14 +341,14 @@ def next_batch_size(self):
           self._min_batch_size * self._MAX_GROWTH_FACTOR),
       self._min_bat
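For the record, the closed-form leverage `h` used in the patch above can be checked against the textbook hat value for simple linear regression, h_i = 1/n + (x_i - x̄)²/Σ(x_j - x̄)². The following is an illustrative pure-Python sketch (not part of the PR):

```python
# Check that the patch's closed-form leverage,
#   h_i = (sum_x2 - 2*sum_x*x_i + n*x_i**2) / (n*sum_x2 - sum_x**2),
# equals the textbook hat value h_i = 1/n + (x_i - xbar)**2 / Sxx.
def leverage_closed_form(xs):
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return [(sum_x2 - 2 * sum_x * x + n * x * x) / (n * sum_x2 - sum_x ** 2)
            for x in xs]

def leverage_textbook(xs):
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
for h1, h2 in zip(leverage_closed_form(xs), leverage_textbook(xs)):
    assert abs(h1 - h2) < 1e-12
```

The identity follows by multiplying the textbook form through by n·Sxx; it also implies the leverages sum to 2, the number of fitted parameters.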

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=152569&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-152569
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 09/Oct/18 07:25
Start Date: 09/Oct/18 07:25
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r223584542
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 0.5, and weighting by the inverse of x to give a more
+      # stable y-intercept.
+      weight = (cook_ds <= 0.5) / xs
 
 Review comment:
   Smaller batches are more accurate predictors of the fixed cost. I'll update 
the comment. 
   
   One way to think about this is that there is a fair amount of variance in 
processing elements. When one has y = (a+err_a) + (b+err_b)*x, a small err_b 
can greatly influence the prediction for a when x is large. 
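   A small numeric sketch of this effect (pure Python, hypothetical data — not from the PR): one noisy observation at large x barely changes the slope but can swing the fitted intercept a, and weighting the fit by 1/x keeps a close to the true fixed cost.

```python
# Weighted least squares for y = a + b*x; ws are per-point weights.
def weighted_fit(xs, ys, ws):
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    b = (sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs)))
    a = ybar - b * xbar
    return a, b

# True model: y = 10 + 2x, with one noisy measurement on the largest batch.
xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
ys = [10.0 + 2.0 * x for x in xs]
ys[-1] += 64.0

a_plain, _ = weighted_fit(xs, ys, [1.0] * len(xs))
a_inv_x, _ = weighted_fit(xs, ys, [1.0 / x for x in xs])
# The 1/x-weighted intercept stays much closer to the true fixed cost of 10.
assert abs(a_inv_x - 10.0) < abs(a_plain - 10.0)
```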




Issue Time Tracking
---

Worklog Id: (was: 152569)
Time Spent: 5h 20m  (was: 5h 10m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-09 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=152566&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-152566
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 09/Oct/18 07:23
Start Date: 09/Oct/18 07:23
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r223584542
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 0.5, and weighting by the inverse of x to give a more
+      # stable y-intercept.
+      weight = (cook_ds <= 0.5) / xs
 
 Review comment:
   Smaller batches are more accurate predictors of the fixed cost. I'll update 
the comment. 




Issue Time Tracking
---

Worklog Id: (was: 152566)
Time Spent: 5h 10m  (was: 5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-08 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=152452&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-152452
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 08/Oct/18 22:47
Start Date: 08/Oct/18 22:47
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r223515305
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 0.5, and weighting by the inverse of x to give a more
+      # stable y-intercept.
+      weight = (cook_ds <= 0.5) / xs
 
 Review comment:
   What is the intuition behind weighting by the inverse of x? Do we assume that 
we have to process smaller batches more often than larger batches, and that 
data points from smaller batches should therefore receive more consideration 
than those from larger batches? Why is the instability caused by larger xs, and 
not by smaller? 




Issue Time Tracking
---

Worklog Id: (was: 152452)
Time Spent: 5h  (was: 4h 50m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 5h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=151077&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-151077
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 04/Oct/18 08:34
Start Date: 04/Oct/18 08:34
Worklog Time Spent: 10m 
  Work Description: robertwb commented on issue #6375: [BEAM-4858] Clean up 
division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-426932337
 
 
   Yes, this is ready for another round of review. I also cleaned up the commit 
history (with an expanded and updated commit comment).




Issue Time Tracking
---

Worklog Id: (was: 151077)
Time Spent: 4h 50m  (was: 4h 40m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-03 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=150910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-150910
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 03/Oct/18 19:38
Start Date: 03/Oct/18 19:38
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #6375: [BEAM-4858] Clean 
up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-426771378
 
 
   Thanks. Is this ready for review?




Issue Time Tracking
---

Worklog Id: (was: 150910)
Time Spent: 4h 40m  (was: 4.5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>





[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148775&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148775
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 15:01
Start Date: 27/Sep/18 15:01
Worklog Time Spent: 10m 
  Work Description: robertwb commented on issue #6375: [BEAM-4858] Clean up 
division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-425126489
 
 
   You're right, the a and b were switched in computing the error term when I 
copied this to the PR. This meant that significantly more points were 
considered outliers (but enough were retained to typically give a reasonable 
regression). Unfortunately, this fix means that it's still pretty sensitive to 
multiple outliers...
   
   I'm trying a simpler approach: just assume the top quantile is outliers. We 
have enough data to make this pretty robust. Running experiments now. 
   
   (As for computing h, I used sagemath.)
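   The "top quantile is outliers" idea can be sketched as follows (hypothetical helper names, pure Python; the PR's eventual implementation may differ): fit once, drop the fixed fraction of points with the largest residuals, and refit on the rest.

```python
# Ordinary least squares for y = a + b*x.
def fit(xs, ys):
    n = float(len(xs))
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b

def fit_trimmed(xs, ys, keep=0.75):
    # First-pass fit, then refit on the `keep` fraction of points
    # with the smallest absolute residuals.
    a, b = fit(xs, ys)
    order = sorted(range(len(xs)), key=lambda i: abs(a + b * xs[i] - ys[i]))
    kept = order[:max(2, int(keep * len(xs)))]
    return fit([xs[i] for i in kept], [ys[i] for i in kept])

# True model y = 5 + 3x with one gross outlier; trimming recovers the line.
xs = [float(x) for x in range(1, 13)]
ys = [5.0 + 3.0 * x for x in xs]
ys[3] += 100.0
a, b = fit_trimmed(xs, ys)
assert abs(a - 5.0) < 1e-6 and abs(b - 3.0) < 1e-6
```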




Issue Time Tracking
---

Worklog Id: (was: 148775)
Time Spent: 4.5h  (was: 4h 20m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} depends on 
> the type of the keys - whether they are integers or floats. 
> The keys of the key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260];
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>  negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on how division is evaluated, not from the 
> performance of the division operation itself.
> In terms of Python 3 conversion, the best course of action that avoids a 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils}}; in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate the 
> performance of this code, and maybe cythonize it, since it seems to be 
> performance-sensitive.
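
To make the division-semantics point concrete, here is a minimal illustration (with made-up (batch_size, time) pairs, not Beam code): floor division produces ties that true division breaks, so the comparator's sorted order of pairs - and therefore which pairs get averaged versus kept - can differ.

```python
# Hypothetical (batch_size, time) pairs shaped like _BatchSizeEstimator._data.
p1 = ((2, 0.1), (5, 0.2))
p2 = ((2, 0.1), (4, 0.2))
pairs = [p1, p2]

def floor_div_key(kv1_kv2):
    # Python 2 behavior when both keys are ints: 5 // 2 == 4 // 2 == 2.
    (x1, _), (x2, _) = kv1_kv2
    return x2 // x1

def true_div_key(kv1_kv2):
    # Python 3 behavior (or float keys): 5 / 2 == 2.5, 4 / 2 == 2.0.
    (x1, _), (x2, _) = kv1_kv2
    return x2 / x1

by_floor = sorted(pairs, key=floor_div_key)  # keys tie -> stable input order
by_true = sorted(pairs, key=true_div_key)    # 2.0 sorts before 2.5
```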



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148770&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148770
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 14:52
Start Date: 27/Sep/18 14:52
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #6375: [BEAM-4858] Clean 
up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-425123012
 
 
   Thanks. The code LGTM; PTAL at the test failures, and please do a performance 
check with the updated code before merging.




Issue Time Tracking
---

Worklog Id: (was: 148770)
Time Spent: 4h 20m  (was: 4h 10m)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148691&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148691
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 11:16
Start Date: 27/Sep/18 11:16
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220884037
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,60 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1) # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data)
+    # As it's used below to alternate jitter.
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    # Least squares fit for y = a*x + b over all points.
+    n = float(len(xs))
+    xbar = sum(xs) / n
+    ybar = sum(ys) / n
+    b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+         / sum([(x - xbar)**2 for x in xs]))
+    a = ybar - b * xbar
+    return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    from numpy import sum
+    xs = np.asarray(xs, dtype=float)
+    ys = np.asarray(ys, dtype=float)
+
+    # First do a simple least squares fit for y = a*x + b over all points.
+    b, a = np.polyfit(xs, ys, 1)
+
+    n = len(xs)
+    if n < 10:
+      return a, b
+    else:
+      # Refine this by throwing out outliers, according to Cook's distance.
+      # https://en.wikipedia.org/wiki/Cook%27s_distance
+      sum_x = sum(xs)
+      sum_x2 = sum(xs**2)
+      errs = a * xs + b - ys
+      s2 = sum(errs**2) / (n - 2)
+      if s2 == 0:
+        # It's an exact fit!
+        return a, b
+      h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+      cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
+
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 1.
+      b, a = np.polyfit(xs, ys, 1, w=cook_ds < 1)
 
 Review comment:
   Reconciled. 
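
For readers following the thread, the whole refinement step can be sketched without numpy; everything below (data included) is illustrative, not the Beam implementation:

```python
# Fit y = a + b*x, score each point by Cook's distance, then refit
# without the points whose distance exceeds 1.
def fit(xs, ys):
    n = float(len(xs))
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    return ybar - b * xbar, b  # intercept, slope

xs = [float(x) for x in range(1, 11)]
ys = [2 * x + 1 for x in xs]
ys[9] = 100.0  # inject a high-leverage outlier at x = 10

a, b = fit(xs, ys)
n = len(xs)
errs = [a + b * x - y for x, y in zip(xs, ys)]  # prediction minus actual
s2 = sum(e * e for e in errs) / (n - 2)
sum_x = sum(xs)
sum_x2 = sum(x * x for x in xs)
h = [(sum_x2 - 2 * sum_x * x + n * x * x) / (n * sum_x2 - sum_x ** 2)
     for x in xs]
cook_ds = [0.5 / s2 * e * e * (hi / (1 - hi) ** 2) for e, hi in zip(errs, h)]

keep = [i for i, d in enumerate(cook_ds) if d < 1]  # drops only the outlier
a2, b2 = fit([xs[i] for i in keep], [ys[i] for i in keep])
```

With the outlier excluded, the refit recovers the underlying line y = 1 + 2x almost exactly.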




Issue Time Tracking
---

Worklog Id: (was: 148691)
Time Spent: 4h 10m  (was: 4h)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148690&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148690
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 11:16
Start Date: 27/Sep/18 11:16
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220883853
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,60 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1) # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data)
+    # As it's used below to alternate jitter.
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    # Least squares fit for y = a*x + b over all points.
+    n = float(len(xs))
+    xbar = sum(xs) / n
+    ybar = sum(ys) / n
+    b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+         / sum([(x - xbar)**2 for x in xs]))
+    a = ybar - b * xbar
+    return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    from numpy import sum
+    xs = np.asarray(xs, dtype=float)
+    ys = np.asarray(ys, dtype=float)
+
+    # First do a simple least squares fit for y = a*x + b over all points.
+    b, a = np.polyfit(xs, ys, 1)
+
+    n = len(xs)
+    if n < 10:
+      return a, b
+    else:
+      # Refine this by throwing out outliers, according to Cook's distance.
+      # https://en.wikipedia.org/wiki/Cook%27s_distance
+      sum_x = sum(xs)
+      sum_x2 = sum(xs**2)
+      errs = a * xs + b - ys
+      s2 = sum(errs**2) / (n - 2)
 
 Review comment:
   The typical definition takes into account the number of degrees of freedom. 
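
For reference, the definition being alluded to, for a simple regression with p = 2 fitted parameters (intercept and slope), is usually written as:

```latex
D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2},
\qquad
s^2 = \frac{\sum_{j=1}^{n} e_j^2}{n - p}
```

so the `0.5 / s2` factor in the diff is `1 / (p * s2)` with p = 2, and dividing by n - 2 when computing s2 is exactly the degrees-of-freedom correction.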




Issue Time Tracking
---

Worklog Id: (was: 148690)
Time Spent: 4h  (was: 3h 50m)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148689&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148689
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 11:14
Start Date: 27/Sep/18 11:14
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220883501
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
     last_batch_size = self._data[-1][0]
     cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+    target = self._max_batch_size
+
     if self._target_batch_duration_secs:
       # Solution to a + b*x = self._target_batch_duration_secs.
-      cap = min(cap, (self._target_batch_duration_secs - a) / b)
+      target = min(target, (self._target_batch_duration_secs - a) / b)
 
     if self._target_batch_overhead:
       # Solution to a / (a + b*x) = self._target_batch_overhead.
-      cap = min(cap, (a / b) * (1 / self._target_batch_overhead - 1))
+      target = min(target, (a / b) * (1 / self._target_batch_overhead - 1))
 
-    # Avoid getting stuck at min_batch_size.
+    # Avoid getting stuck.
     jitter = len(self._data) % 2
-    return int(max(self._min_batch_size + jitter, cap))
+    if len(self._data) > 10:
+      target += int(target * self._variance * 2 * (random.random() - .5))
+
+    return int(max(self._min_batch_size + jitter, min(target, cap)))
 
 Review comment:
   Clarified. The choice of approximation isn't important here (and I don't 
want to give the impression that it's a non-linear polynomial fit).
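
The two closed-form caps mentioned in the code comments come straight from the linear cost model time(x) = a + b*x; a small sketch with made-up parameter values:

```python
# Hypothetical fitted costs: a = fixed per-batch overhead, b = per-element cost.
a, b = 0.01, 0.001            # seconds
target_duration = 0.05        # aim for ~50 ms batches
target_overhead = 0.1         # fixed cost should be at most 10% of batch time

# a + b*x = target_duration  =>  x = (target_duration - a) / b
x_duration = (target_duration - a) / b

# a / (a + b*x) = target_overhead  =>  x = (a / b) * (1 / target_overhead - 1)
x_overhead = (a / b) * (1 / target_overhead - 1)

# The batch-size target honors the stricter of the two constraints.
target = min(x_duration, x_overhead)
```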




Issue Time Tracking
---

Worklog Id: (was: 148689)
Time Spent: 3h 50m  (was: 3h 40m)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148687
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 11:11
Start Date: 27/Sep/18 11:11
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220882847
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1) # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data).
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    n = float(len(xs))
 
 Review comment:
   Right. Done.
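
As a quick sanity check of the pure-Python fit under the y = a + b*x convention discussed in the thread (the test data here is invented), points on a known line recover the intercept and slope:

```python
def linear_regression_no_numpy(xs, ys):
    # Least squares fit for y = a + b*x over all points (a = intercept).
    n = float(len(xs))
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

# Points lying exactly on y = 1 + 2*x.
a, b = linear_regression_no_numpy([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```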




Issue Time Tracking
---

Worklog Id: (was: 148687)
Time Spent: 3h 40m  (was: 3.5h)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148625&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148625
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220826243
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1) # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data).
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    n = float(len(xs))
+    xbar = sum(xs) / n
+    ybar = sum(ys) / n
+    b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+         / sum([(x - xbar)**2 for x in xs]))
+    a = ybar - b * xbar
+    return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    from numpy import sum
+    xs = np.asarray(xs, dtype=float)
+    ys = np.asarray(ys, dtype=float)
+
+    # First do a simple least squares fit over all points.
+    n = len(xs)
+    sum_x = sum(xs)
+    sum_y = sum(ys)
+    xbar = sum_x / n
+    ybar = sum_y / n
+    b = sum((xs - xbar) * (ys - ybar)) / sum((xs - xbar)**2)
+    a = ybar - b * xbar
+
+    if n < 10:
+      return a, b
+    else:
+      # Refine this by throwing out outliers, according to Cook's distance.
+      sum_x2 = sum(xs**2)
+      errs = a * xs + b - ys
+      s2 = sum(errs**2) / (n - 2)
+      if s2 == 0:
+        # It's an exact fit!
+        return a, b
+      h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+      cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
 
 Review comment:
   Thanks, cross-checked :-). Did you also compute h by pen and paper, or is 
there a faster way? 




Issue Time Tracking
---

Worklog Id: (was: 148625)
Time Spent: 3h 10m  (was: 3h)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148627
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220795308
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,60 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1) # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data)
+    # As it's used below to alternate jitter.
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    # Least squares fit for y = a*x + b over all points.
+    n = float(len(xs))
+    xbar = sum(xs) / n
+    ybar = sum(ys) / n
+    b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+         / sum([(x - xbar)**2 for x in xs]))
+    a = ybar - b * xbar
+    return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+    # pylint: disable=wrong-import-order, wrong-import-position
+    import numpy as np
+    from numpy import sum
+    xs = np.asarray(xs, dtype=float)
+    ys = np.asarray(ys, dtype=float)
+
+    # First do a simple least squares fit for y = a*x + b over all points.
+    b, a = np.polyfit(xs, ys, 1)
+
+    n = len(xs)
+    if n < 10:
+      return a, b
+    else:
+      # Refine this by throwing out outliers, according to Cook's distance.
+      # https://en.wikipedia.org/wiki/Cook%27s_distance
+      sum_x = sum(xs)
+      sum_x2 = sum(xs**2)
+      errs = a * xs + b - ys
+      s2 = sum(errs**2) / (n - 2)
+      if s2 == 0:
+        # It's an exact fit!
+        return a, b
+      h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+      cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
+
+      # Re-compute the regression, excluding those points with Cook's distance
+      # greater than 1.
+      b, a = np.polyfit(xs, ys, 1, w=cook_ds < 1)
 
 Review comment:
   Nit: Excluding those with D _greater_ than one, so w should be cook_ds <= 1?




Issue Time Tracking
---

Worklog Id: (was: 148627)
Time Spent: 3.5h  (was: 3h 20m)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148622
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220659176
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
     self._thin_data()
 
   def _thin_data(self):
-    sorted_data = sorted(self._data)
-    odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-    # Sort the pairs by how different they are.
-
-    def div_keys(kv1_kv2):
-      (x1, _), (x2, _) = kv1_kv2
-      return old_div(x2, x1) # TODO(BEAM-4858)
-
-    pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-                   key=div_keys)
-    # Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-    threshold = 2 * len(pairs) // 3
-    self._data = (
-        list(sum(pairs[threshold:], ()))
-        + [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-           for (x1, t1), (x2, t2) in pairs[:threshold]]
-        + odd_one_out)
+    # Make sure we don't change the parity of len(self._data).
+    self._data.pop(random.randrange(len(self._data) // 4))
+    self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+    n = float(len(xs))
 
 Review comment:
  Thanks. I think the comment should say `y = a + bx` here and on line 297 
below, to match the current naming?
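For context, a runnable sketch of the pure-Python fit under discussion (assumption: this mirrors the PR's `linear_regression_no_numpy`, with `a` the intercept and `b` the slope of `y = a + b*x`; it is not copied verbatim from the PR):

```python
def linear_regression_no_numpy(xs, ys):
    # Closed-form least squares for y = a + b*x:
    # b = cov(x, y) / var(x), a = ybar - b * xbar.
    n = float(len(xs))
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return a, b

# Points on y = 2 + 3*x: recovers a close to 2.0 and b close to 3.0.
a, b = linear_regression_no_numpy([0, 1, 2, 3], [2, 5, 8, 11])
```

With this naming, `a` is the constant term, so a comment reading `y = a + bx` matches the returned `(a, b)` order.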


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 148622)
Time Spent: 2h 50m  (was: 2h 40m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>  negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on how division is evaluated, not from the 
> performance of the division operation itself.
> For the Python 3 conversion, the best course of action that avoids a 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}; in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate the 
> performance of this code, and perhaps cythonize it, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148623
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220792493
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
 last_batch_size = self._data[-1][0]
 cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+target = self._max_batch_size
+
 if self._target_batch_duration_secs:
   # Solution to a + b*x = self._target_batch_duration_secs.
-  cap = min(cap, (self._target_batch_duration_secs - a) / b)
+  target = min(target, (self._target_batch_duration_secs - a) / b)
 
 if self._target_batch_overhead:
   # Solution to a / (a + b*x) = self._target_batch_overhead.
-  cap = min(cap, (a / b) * (1 / self._target_batch_overhead - 1))
+  target = min(target, (a / b) * (1 / self._target_batch_overhead - 1))
 
-# Avoid getting stuck at min_batch_size.
+# Avoid getting stuck.
 jitter = len(self._data) % 2
-return int(max(self._min_batch_size + jitter, cap))
+if len(self._data) > 10:
+  target += int(target * self._variance * 2 * (random.random() - .5))
+
+return int(max(self._min_batch_size + jitter, min(target, cap)))
 
 Review comment:
   Thanks, that helps. Consider the wording: ```...which would not allow us to 
do the approximation with a polynomial.```
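To make the target computation in the diff above concrete, here is a hedged sketch (hypothetical helper name `solve_target`; the PR keeps this logic inline in `next_batch_size`) of solving the two constraints for the batch size `x`, where `a` is the fixed per-batch cost and `b` the per-element cost:

```python
def solve_target(a, b, max_batch_size,
                 target_batch_duration_secs=None,
                 target_batch_overhead=None):
    # Hypothetical helper; the PR computes this inline.
    target = max_batch_size
    if target_batch_duration_secs:
        # Solve a + b*x = target_batch_duration_secs for x.
        target = min(target, (target_batch_duration_secs - a) / b)
    if target_batch_overhead:
        # Solve a / (a + b*x) = target_batch_overhead for x.
        target = min(target, (a / b) * (1 / target_batch_overhead - 1))
    return target

# With a 10 ms fixed cost and 1 ms per element, a 100 ms duration
# target solves to a batch size of about 90 elements.
x = solve_target(a=0.01, b=0.001, max_batch_size=10000,
                 target_batch_duration_secs=0.1)
```

The PR then clamps this `target` between `min_batch_size + jitter` and `cap` before returning it.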




Issue Time Tracking
---

Worklog Id: (was: 148623)
Time Spent: 3h  (was: 2h 50m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 3h
>  Remaining Estimate: 0h
>





[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148621
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220659438
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
 last_batch_size = self._data[-1][0]
 cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+target = self._max_batch_size
+
 if self._target_batch_duration_secs:
   # Solution to a + b*x = self._target_batch_duration_secs.
-  cap = min(cap, (self._target_batch_duration_secs - a) / b)
+  target = min(target, (self._target_batch_duration_secs - a) / b)
 
 if self._target_batch_overhead:
   # Solution to a / (a + b*x) = self._target_batch_overhead.
 
 Review comment:
   Thanks, I missed it.




Issue Time Tracking
---

Worklog Id: (was: 148621)
Time Spent: 2h 50m  (was: 2h 40m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>





[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148624&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148624
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220791614
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,60 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data)
+# As it's used below to alternate jitter.
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+# Least squares fit for y = a*x + b over all points.
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit for y = a*x + b over all points.
+b, a = np.polyfit(xs, ys, 1)
+
+n = len(xs)
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  # https://en.wikipedia.org/wiki/Cook%27s_distance
+  sum_x = sum(xs)
+  sum_x2 = sum(xs**2)
+  errs = a * xs + b - ys
 
 Review comment:
  Shouldn't this be `a + b * xs - ys`, since we have `b, a = np.polyfit(xs, 
ys, 1)` and `polyfit` returns the coefficient for the highest power first? The 
test starts to fail with this change, though...
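A quick check of the ordering question (assumption: this is standard NumPy behavior, not anything specific to the PR): `np.polyfit(xs, ys, 1)` returns coefficients from the highest power down, i.e. `[slope, intercept]`, so after `b, a = np.polyfit(xs, ys, 1)` the residuals of `y = a + b*x` would indeed be `a + b*xs - ys`:

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 + 3.0 * xs  # exact line with intercept 2 and slope 3

# polyfit returns coefficients from highest power down: [slope, intercept].
slope, intercept = np.polyfit(xs, ys, 1)

# With the PR's unpacking `b, a = np.polyfit(xs, ys, 1)`, b is the
# slope and a the intercept, so residuals are a + b * xs - ys.
residuals = intercept + slope * xs - ys
```

On exact data all residuals are (numerically) zero either way, which may explain why a test that only checks the fit itself would not distinguish the two expressions.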




Issue Time Tracking
---

Worklog Id: (was: 148624)
Time Spent: 3h  (was: 2h 50m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 3h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-27 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148626
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 27/Sep/18 07:59
Start Date: 27/Sep/18 07:59
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220792049
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,60 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data)
+# As it's used below to alternate jitter.
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+# Least squares fit for y = a*x + b over all points.
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit for y = a*x + b over all points.
+b, a = np.polyfit(xs, ys, 1)
+
+n = len(xs)
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  # https://en.wikipedia.org/wiki/Cook%27s_distance
+  sum_x = sum(xs)
+  sum_x2 = sum(xs**2)
+  errs = a * xs + b - ys
+  s2 = sum(errs**2) / (n - 2)
 
 Review comment:
  `s2` is the mean squared error, right? Why do we divide by `n - 2` as 
opposed to `n`?
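On the `n - 2` question (assumption: this is the standard ordinary-least-squares degrees-of-freedom argument, not text from the PR): the residual sum of squares is divided by `n - 2` because the fitted line consumes two degrees of freedom, one each for the intercept and the slope, so dividing by `n` would bias the variance estimate low. A small sketch:

```python
def residual_variance(xs, ys):
    # Fit y = a + b*x by least squares, then estimate the residual
    # variance. Requires n > 2: with exactly two points the fit is
    # exact and no degrees of freedom remain to estimate noise.
    n = float(len(xs))
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    rss = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
    # Unbiased estimate: divide by n - 2, not n, because the two
    # fitted parameters (a, b) each absorb one degree of freedom.
    return rss / (n - 2)

# Example: four points almost on a line; the residual variance is 0.15.
s2 = residual_variance([0, 1, 2, 3], [0, 1, 2, 4])
```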




Issue Time Tracking
---

Worklog Id: (was: 148626)
Time Spent: 3h 20m  (was: 3h 10m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148103&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148103
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 12:28
Start Date: 26/Sep/18 12:28
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220539072
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
 last_batch_size = self._data[-1][0]
 cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+target = self._max_batch_size
+
 if self._target_batch_duration_secs:
   # Solution to a + b*x = self._target_batch_duration_secs.
-  cap = min(cap, (self._target_batch_duration_secs - a) / b)
+  target = min(target, (self._target_batch_duration_secs - a) / b)
 
 if self._target_batch_overhead:
   # Solution to a / (a + b*x) = self._target_batch_overhead.
-  cap = min(cap, (a / b) * (1 / self._target_batch_overhead - 1))
+  target = min(target, (a / b) * (1 / self._target_batch_overhead - 1))
 
-# Avoid getting stuck at min_batch_size.
+# Avoid getting stuck.
 jitter = len(self._data) % 2
-return int(max(self._min_batch_size + jitter, cap))
+if len(self._data) > 10:
+  target += int(target * self._variance * 2 * (random.random() - .5))
+
+return int(max(self._min_batch_size + jitter, min(target, cap)))
 
 Review comment:
   Clarified. 
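For intuition on the parity concern raised in this thread, a sketch (assumption: this mirrors the PR's `_thin_data`, extracted as a free function for illustration) showing that popping one entry from the oldest quarter and one from the oldest half removes exactly two entries, preserving the parity of `len(self._data)` that `jitter = len(self._data) % 2` relies on:

```python
import random

def thin_data(data):
    # Remove one entry from the first quarter and one from the first
    # half of the list. Two removals keep len(data)'s parity
    # unchanged, so jitter = len(data) % 2 keeps alternating.
    data.pop(random.randrange(len(data) // 4))
    data.pop(random.randrange(len(data) // 2))

data = [(i, float(i)) for i in range(100)]
thin_data(data)  # len(data) is now 98, same parity as 100
```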




Issue Time Tracking
---

Worklog Id: (was: 148103)
Time Spent: 2h 10m  (was: 2h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>





[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148107&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148107
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 12:28
Start Date: 26/Sep/18 12:28
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220537728
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
 last_batch_size = self._data[-1][0]
 cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+target = self._max_batch_size
+
 if self._target_batch_duration_secs:
   # Solution to a + b*x = self._target_batch_duration_secs.
-  cap = min(cap, (self._target_batch_duration_secs - a) / b)
+  target = min(target, (self._target_batch_duration_secs - a) / b)
 
 if self._target_batch_overhead:
   # Solution to a / (a + b*x) = self._target_batch_overhead.
 
 Review comment:
   These are described in the public-facing docstring below. I plumbed variance 
through there as well now. 




Issue Time Tracking
---

Worklog Id: (was: 148107)
Time Spent: 2h 40m  (was: 2.5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>





[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148106&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148106
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 12:28
Start Date: 26/Sep/18 12:28
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220533414
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
 
 Review comment:
   Huh, I've never heard it called that before. Added an unambiguous comment. 




Issue Time Tracking
---

Worklog Id: (was: 148106)
Time Spent: 2.5h  (was: 2h 20m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148104&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148104
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 12:28
Start Date: 26/Sep/18 12:28
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220533548
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
 
 Review comment:
   Oh, that's a good one. It even supports weights (needed below). Done.
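The suggestion being adopted here appears to be np.polyfit, which performs the least-squares fit directly and accepts per-point weights via its `w` parameter (an assumption about the API under discussion; the data below is made up):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 1 + 2x
w = np.ones_like(xs)                  # uniform weights

# Degree-1 polyfit returns coefficients highest power first:
# [slope, intercept].
b, a = np.polyfit(xs, ys, 1, w=w)
```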




Issue Time Tracking
---

Worklog Id: (was: 148104)
Time Spent: 2h 20m  (was: 2h 10m)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148102&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148102
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 12:28
Start Date: 26/Sep/18 12:28
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220535187
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
+sum_x = sum(xs)
+sum_y = sum(ys)
+xbar = sum_x / n
+ybar = sum_y / n
+b = sum((xs - xbar) * (ys - ybar)) / sum((xs - xbar)**2)
+a = ybar - b * xbar
+
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  sum_x2 = sum(xs**2)
+  errs = a + b * xs - ys
+  s2 = sum(errs**2) / (n - 2)
+  if s2 == 0:
+# It's an exact fit!
+return a, b
+  h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+  cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
 
 Review comment:
   Added a link.




Issue Time Tracking
---

Worklog Id: (was: 148102)
Time Spent: 2h 10m  (was: 2h)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148105&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148105
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 12:28
Start Date: 26/Sep/18 12:28
Worklog Time Spent: 10m 
  Work Description: robertwb commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220535149
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
+sum_x = sum(xs)
+sum_y = sum(ys)
+xbar = sum_x / n
+ybar = sum_y / n
+b = sum((xs - xbar) * (ys - ybar)) / sum((xs - xbar)**2)
+a = ybar - b * xbar
+
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  sum_x2 = sum(xs**2)
+  errs = a + b * xs - ys
+  s2 = sum(errs**2) / (n - 2)
+  if s2 == 0:
 
 Review comment:
  This is an absolute, not relative, measure of error, so there's no default 
epsilon that makes sense. I simply want to divide by this below to normalize 
the error of each point against this "average" error, which is fine if it's 
tiny.




Issue Time Tracking
---

Worklog Id: (was: 148105)
Time Spent: 2h 20m  (was: 2h 10m)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=148010&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-148010
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:28
Start Date: 26/Sep/18 07:28
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220451868
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util_test.py
 ##
 @@ -125,6 +126,75 @@ def test_target_overhead(self):
 clock.sleep(batch_duration(actual_sizes[-1]))
 self.assertEqual(expected_sizes, actual_sizes)
 
+  def test_variance(self):
+clock = FakeClock()
 
 Review comment:
   Actually, it shouldn't matter.




Issue Time Tracking
---

Worklog Id: (was: 148010)
Time Spent: 2h  (was: 1h 50m)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147997&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147997
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220443132
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
 last_batch_size = self._data[-1][0]
 cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+target = self._max_batch_size
+
 if self._target_batch_duration_secs:
   # Solution to a + b*x = self._target_batch_duration_secs.
-  cap = min(cap, (self._target_batch_duration_secs - a) / b)
+  target = min(target, (self._target_batch_duration_secs - a) / b)
 
 if self._target_batch_overhead:
   # Solution to a / (a + b*x) = self._target_batch_overhead.
-  cap = min(cap, (a / b) * (1 / self._target_batch_overhead - 1))
+  target = min(target, (a / b) * (1 / self._target_batch_overhead - 1))
 
-# Avoid getting stuck at min_batch_size.
+# Avoid getting stuck.
 jitter = len(self._data) % 2
-return int(max(self._min_batch_size + jitter, cap))
+if len(self._data) > 10:
+  target += int(target * self._variance * 2 * (random.random() - .5))
+
+return int(max(self._min_batch_size + jitter, min(target, cap)))
 
 Review comment:
  Curious, what's the reasoning behind adding `jitter` (derived from the 
length of the historical data points) to the minimum batch size when computing 
the recommended batch size?




Issue Time Tracking
---

Worklog Id: (was: 147997)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147991
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220429071
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
 
 Review comment:
  We could mention in a comment that we are computing a 'line of best fit', 
which seems to be a common name for a simple least squares fit.




Issue Time Tracking
---

Worklog Id: (was: 147991)
Time Spent: 1h 10m  (was: 1h)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147998&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147998
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220435226
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
+sum_x = sum(xs)
+sum_y = sum(ys)
+xbar = sum_x / n
+ybar = sum_y / n
+b = sum((xs - xbar) * (ys - ybar)) / sum((xs - xbar)**2)
+a = ybar - b * xbar
+
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  sum_x2 = sum(xs**2)
+  errs = a + b * xs - ys
+  s2 = sum(errs**2) / (n - 2)
+  if s2 == 0:
+# It's an exact fit!
+return a, b
+  h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+  cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
 
 Review comment:
  Do you have a suggestion for a reference to cross-check the formulas for 
computing Cook's distance that we are using here? It would also be helpful to 
leave it as a comment for anyone who reads this later.
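As a cross-check of those formulas, here is a sketch (with made-up points) verifying that the closed-form leverage in the diff equals the textbook 1/n + (x - xbar)^2 / Sxx, and that an obvious outlier receives the largest Cook's distance. Note the residuals are computed as a + b*x - y, since a is the intercept in the fit above:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([1.1, 1.9, 3.2, 3.8, 10.0])   # last point is an outlier

n = len(xs)
sum_x, sum_x2 = xs.sum(), (xs ** 2).sum()
xbar, ybar = xs.mean(), ys.mean()
b = ((xs - xbar) * (ys - ybar)).sum() / ((xs - xbar) ** 2).sum()
a = ybar - b * xbar

errs = a + b * xs - ys                      # residuals (a is the intercept)
s2 = (errs ** 2).sum() / (n - 2)            # residual variance, p = 2 params
# Closed-form leverage from the diff; equals 1/n + (x - xbar)^2 / Sxx.
h = (sum_x2 - 2 * sum_x * xs + n * xs ** 2) / (n * sum_x2 - sum_x ** 2)
# Cook's distance for p = 2: D_i = e_i^2 / (2 * s2) * h_i / (1 - h_i)^2.
cook_ds = 0.5 / s2 * errs ** 2 * (h / (1 - h) ** 2)
```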




Issue Time Tracking
---

Worklog Id: (was: 147998)
Time Spent: 1h 40m  (was: 1.5h)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147992&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147992
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220447914
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util_test.py
 ##
 @@ -125,6 +126,75 @@ def test_target_overhead(self):
 clock.sleep(batch_duration(actual_sizes[-1]))
 self.assertEqual(expected_sizes, actual_sizes)
 
+  def test_variance(self):
+clock = FakeClock()
 
 Review comment:
   How about we fix the initial start time of FakeClock() to reduce potential 
for flakiness?
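A fixed-start-time fake clock along the lines of this suggestion might look like the sketch below (a hypothetical implementation for illustration, not Beam's actual `FakeClock`):

```python
class FakeClock(object):
  """Deterministic test clock sketch; callable like time.time()."""

  def __init__(self, now=0.0):
    # Fixing the initial time (instead of seeding from e.g. time.time())
    # makes every test run start from the same instant, removing one
    # source of flakiness.
    self._now = now

  def __call__(self):
    return self._now

  def sleep(self, duration):
    # Advance virtual time without actually sleeping.
    self._now += duration

clock = FakeClock(now=100.0)
clock.sleep(2.5)
assert clock() == 102.5
```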


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 147992)
Time Spent: 1h 20m  (was: 1h 10m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147995&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147995
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220439390
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -316,17 +357,22 @@ def next_batch_size(self):
 last_batch_size = self._data[-1][0]
 cap = min(last_batch_size * self._MAX_GROWTH_FACTOR, self._max_batch_size)
 
+target = self._max_batch_size
+
 if self._target_batch_duration_secs:
   # Solution to a + b*x = self._target_batch_duration_secs.
-  cap = min(cap, (self._target_batch_duration_secs - a) / b)
+  target = min(target, (self._target_batch_duration_secs - a) / b)
 
 if self._target_batch_overhead:
   # Solution to a / (a + b*x) = self._target_batch_overhead.
 
 Review comment:
   What is the interpretation of target_batch_overhead? Trying to understand 
where this equation comes from.
   
   Also, it would be helpful to explain the constructor parameters in the 
docstring. The meanings of `target_batch_overhead` and `variance` are not 
obvious without diving into the code, and a short description would give 
readers a high-level understanding of how to tune these if necessary. 
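One reading of `target_batch_overhead`, consistent with the comment in the diff, is the maximum allowed fraction of batch time spent on the fixed per-batch cost `a`, with `b` the per-element cost. Solving a / (a + b*x) = overhead for the batch size x gives the cap. A sketch (hypothetical helper name and illustrative numbers, not Beam code):

```python
def solve_batch_size_for_overhead(a, b, target_overhead):
    """Solve a / (a + b*x) = target_overhead for x.

    a: fixed per-batch cost (seconds), b: per-element cost (seconds).
    Rearranging: a = target_overhead * (a + b*x)
              => x = a * (1 - target_overhead) / (target_overhead * b)
    """
    return a * (1 - target_overhead) / (target_overhead * b)

# Example: 1s fixed cost, 10ms per element, cap overhead at 10%.
x = solve_batch_size_for_overhead(1.0, 0.01, 0.10)
print(x)  # 900.0 elements per batch
```

With these numbers a 900-element batch takes 1 + 0.01*900 = 10s, of which the fixed 1s is exactly 10%.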




Issue Time Tracking
---

Worklog Id: (was: 147995)
Time Spent: 1h 40m  (was: 1.5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147996
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220428808
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
 
 Review comment:
   You could consider using something like:
   ```a, b = np.polyfit(xs, ys, 1)```
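As a sanity check on this suggestion (a sketch with made-up data, not Beam code): the manual least-squares formulas from `linear_regression_no_numpy` and `np.polyfit` agree, with the caveat that `polyfit` returns coefficients highest-degree first, i.e. `(b, a)` in this thread's `y = a + b*x` convention:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = 2.0 + 3.0 * xs  # exact line y = a + b*x with a=2, b=3

# Manual least-squares fit, as in linear_regression_no_numpy.
xbar, ybar = xs.mean(), ys.mean()
b = ((xs - xbar) * (ys - ybar)).sum() / ((xs - xbar) ** 2).sum()
a = ybar - b * xbar

# np.polyfit returns coefficients highest-degree first: slope, then intercept.
b2, a2 = np.polyfit(xs, ys, 1)

assert np.allclose([a, b], [a2, b2])  # both recover a=2, b=3
```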




Issue Time Tracking
---

Worklog Id: (was: 147996)
Time Spent: 1h 40m  (was: 1.5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147994&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147994
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220430306
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
+sum_x = sum(xs)
+sum_y = sum(ys)
+xbar = sum_x / n
+ybar = sum_y / n
+b = sum((xs - xbar) * (ys - ybar)) / sum((xs - xbar)**2)
+a = ybar - b * xbar
+
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  sum_x2 = sum(xs**2)
+  errs = a * xs + b - ys
+  s2 = sum(errs**2) / (n - 2)
+  if s2 == 0:
 
 Review comment:
   Perhaps we should say `s2 < eps` for some value of `eps`?
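A sketch of the difference (illustrative values; the `eps` choice is an assumption, not from the PR): a near-exact fit can leave a residual variance that is tiny but nonzero due to floating-point rounding, which an exact `== 0` check misses.

```python
# Hypothetical residual variance from a near-exact fit: tiny but nonzero
# because of floating-point rounding.
s2 = 1e-18

# Exact comparison does not treat this as an exact fit:
print(s2 == 0)   # False

# An epsilon threshold does; picking eps is a tuning decision.
eps = 1e-10
print(s2 < eps)  # True
```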




Issue Time Tracking
---

Worklog Id: (was: 147994)
Time Spent: 1.5h  (was: 1h 20m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147993&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147993
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220445091
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
 
 Review comment:
   Why is this important?
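One answer, suggested by the longer comment in the head of this thread ("it's used below to alternate jitter"): popping exactly two elements changes the list length by two, so the even/odd parity of `len(self._data)` is preserved, which matters if later logic branches on that parity. A standalone sketch on a generic list (not the estimator itself):

```python
import random

data = list(range(12))
parity_before = len(data) % 2

# Pop one element from the first quarter and one from the first half,
# as in the new _thin_data. Removing exactly two items shifts the
# length by 2, leaving its even/odd parity unchanged.
data.pop(random.randrange(len(data) // 4))
data.pop(random.randrange(len(data) // 2))

assert len(data) == 10
assert len(data) % 2 == parity_before
```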




Issue Time Tracking
---

Worklog Id: (was: 147993)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-26 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147999&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147999
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 26/Sep/18 07:14
Start Date: 26/Sep/18 07:14
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on a change in pull request #6375: 
[BEAM-4858] Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#discussion_r220437474
 
 

 ##
 File path: sdks/python/apache_beam/transforms/util.py
 ##
 @@ -269,23 +270,68 @@ def record_time(self, batch_size):
 self._thin_data()
 
   def _thin_data(self):
-sorted_data = sorted(self._data)
-odd_one_out = [sorted_data[-1]] if len(sorted_data) % 2 == 1 else []
-# Sort the pairs by how different they are.
-
-def div_keys(kv1_kv2):
-  (x1, _), (x2, _) = kv1_kv2
-  return old_div(x2, x1) # TODO(BEAM-4858)
-
-pairs = sorted(zip(sorted_data[::2], sorted_data[1::2]),
-   key=div_keys)
-# Keep the top 1/3 most different pairs, average the top 2/3 most similar.
-threshold = 2 * len(pairs) // 3
-self._data = (
-list(sum(pairs[threshold:], ()))
-+ [((x1 + x2) / 2.0, (t1 + t2) / 2.0)
-   for (x1, t1), (x2, t2) in pairs[:threshold]]
-+ odd_one_out)
+# Make sure we don't change the parity of len(self._data).
+self._data.pop(random.randrange(len(self._data) // 4))
+self._data.pop(random.randrange(len(self._data) // 2))
+
+  @staticmethod
+  def linear_regression_no_numpy(xs, ys):
+n = float(len(xs))
+xbar = sum(xs) / n
+ybar = sum(ys) / n
+b = (sum([(x - xbar) * (y - ybar) for x, y in zip(xs, ys)])
+ / sum([(x - xbar)**2 for x in xs]))
+a = ybar - b * xbar
+return a, b
+
+  @staticmethod
+  def linear_regression_numpy(xs, ys):
+# pylint: disable=wrong-import-order, wrong-import-position
+import numpy as np
+from numpy import sum
+xs = np.asarray(xs, dtype=float)
+ys = np.asarray(ys, dtype=float)
+
+# First do a simple least squares fit over all points.
+n = len(xs)
+sum_x = sum(xs)
+sum_y = sum(ys)
+xbar = sum_x / n
+ybar = sum_y / n
+b = sum((xs - xbar) * (ys - ybar)) / sum((xs - xbar)**2)
+a = ybar - b * xbar
+
+if n < 10:
+  return a, b
+else:
+  # Refine this by throwing out outliers, according to Cook's distance.
+  sum_x2 = sum(xs**2)
+  errs = a * xs + b - ys
+  s2 = sum(errs**2) / (n - 2)
+  if s2 == 0:
+# It's an exact fit!
+return a, b
+  h = (sum_x2 - 2 * sum_x * xs + n * xs**2) / (n * sum_x2 - sum_x**2)
+  cook_ds = 0.5 / s2 * errs**2 * (h / (1 - h)**2)
+
+  # Exclude those with Cook's distance greater than 1, and re-compute
+  # the regression.
+  weight = cook_ds < 1
 
 Review comment:
   We could use `np.polyfit` here as well.
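The weighted re-fit could indeed reuse `np.polyfit`: its `w` parameter scales each point's contribution, so a zero weight drops the point entirely. A sketch with synthetic data, where the `weight` mask stands in for `cook_ds < 1` (illustration only, not the PR's actual re-fit):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
ys = 2.0 + 3.0 * xs     # exact line y = 2 + 3x ...
ys[-1] += 500.0         # ... with the last point made an outlier

# Zero weight excludes the outlier from the weighted least-squares fit;
# here the mask is hand-picked, standing in for `cook_ds < 1`.
weight = np.array([1.0, 1.0, 1.0, 1.0, 0.0])

# polyfit returns highest-degree coefficient first: slope b, intercept a.
b, a = np.polyfit(xs, ys, 1, w=weight)

assert np.allclose([a, b], [2.0, 3.0])  # recovers the clean line
```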




Issue Time Tracking
---

Worklog Id: (was: 147999)
Time Spent: 1h 50m  (was: 1h 40m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>  negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-25 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=147627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147627
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 25/Sep/18 15:51
Start Date: 25/Sep/18 15:51
Worklog Time Spent: 10m 
  Work Description: robertwb commented on issue #6375: [BEAM-4858] Clean up 
division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-424398681
 
 
   Tests passing, PTAL.




Issue Time Tracking
---

Worklog Id: (was: 147627)
Time Spent: 1h  (was: 50m)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=146647&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-146647
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 22/Sep/18 00:50
Start Date: 22/Sep/18 00:50
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #6375: [BEAM-4858] Clean 
up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-423704721
 
 
   Looks like the Precommit failure is specific to this PR.




Issue Time Tracking
---

Worklog Id: (was: 146647)
Time Spent: 40m  (was: 0.5h)

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=146648&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-146648
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 22/Sep/18 00:50
Start Date: 22/Sep/18 00:50
Worklog Time Spent: 10m 
  Work Description: tvalentyn edited a comment on issue #6375: [BEAM-4858] 
Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-423704721
 
 
   Looks like the Precommit failure is specific to this PR now that tests on 
the master branch are passing again.




Issue Time Tracking
---

Worklog Id: (was: 146648)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-21 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=146620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-146620
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 21/Sep/18 22:35
Start Date: 21/Sep/18 22:35
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #6375: [BEAM-4858] Clean 
up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-423688655
 
 
   retest this please




Issue Time Tracking
---

Worklog Id: (was: 146620)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-20 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=146098&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-146098
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 20/Sep/18 18:36
Start Date: 20/Sep/18 18:36
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #6375: [BEAM-4858] Clean 
up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375#issuecomment-423288734
 
 
   retest this please




Issue Time Tracking
---

Worklog Id: (was: 146098)
Time Spent: 20m  (was: 10m)



[jira] [Work logged] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-12 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?focusedWorklogId=143437&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-143437
 ]

ASF GitHub Bot logged work on BEAM-4858:


Author: ASF GitHub Bot
Created on: 12/Sep/18 09:50
Start Date: 12/Sep/18 09:50
Worklog Time Spent: 10m 
  Work Description: robertwb opened a new pull request #6375: [BEAM-4858] 
Clean up division in batch size estimator.
URL: https://github.com/apache/beam/pull/6375
 
 
   R: @tvalentyn 
   
   
   
   Follow this checklist to help us incorporate your contribution quickly and 
easily:
   
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   It will help us expedite review of your Pull Request if you tag someone 
(e.g. `@username`) to look at it.
   




Issue Time Tracking
---

Worklog Id: (was: 143437)
Time Spent: 10m
Remaining Estimate: 0h
