[jira] [Created] (BEAM-5720) Default coder breaks with large ints on Python 3

2018-10-11 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5720:
-

 Summary: Default coder breaks with large ints on Python 3
 Key: BEAM-5720
 URL: https://issues.apache.org/jira/browse/BEAM-5720
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


The type check for `int` also matches values wider than 64 bits, which causes an 
overflow error later in the code. We should use that coding scheme only for 
machine-sized ints. 
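
A minimal sketch of the kind of guard this implies, assuming the fast path packs the value into a signed 64-bit field. The names `fits_in_int64` and `encode_int` and the one-byte tags are hypothetical and not part of Beam; `struct` is used only to make the 64-bit limit concrete.

```python
import struct

INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def fits_in_int64(value):
    """Return True only for plain ints that fit in a signed 64-bit field."""
    # type(...) is int deliberately excludes bool, which subclasses int.
    return type(value) is int and INT64_MIN <= value <= INT64_MAX


def encode_int(value):
    """Hypothetical encoder for int inputs: fixed-width when the value is
    machine-sized, decimal text otherwise."""
    if fits_in_int64(value):
        return b"i" + struct.pack(">q", value)
    # Fallback for arbitrarily large Python 3 ints, which would overflow ">q".
    return b"s" + str(value).encode("ascii")
```

On Python 3 every integer is an `int`, so a bare type check no longer separates machine-sized values from arbitrary-precision ones; the range check does that explicitly.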



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-5705) PRs 6557 and 6589 break internal Dataflow tests

2018-10-11 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646167#comment-16646167
 ] 

Robert Bradshaw commented on BEAM-5705:
---

I think this is fixed by https://github.com/apache/beam/pull/6600

>  PRs 6557 and 6589 break internal Dataflow tests
> 
>
> Key: BEAM-5705
> URL: https://issues.apache.org/jira/browse/BEAM-5705
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.8.0
>Reporter: Charles Chen
>Assignee: Ahmet Altay
>Priority: Blocker
> Fix For: 2.8.0
>
>
> The PRs [https://github.com/apache/beam/pull/6557] and 
> [https://github.com/apache/beam/pull/6589] (which fixed postcommits for 6557 
> with a temporary workaround) break internal Dataflow tests.  The Dataflow 
> team needs to investigate this.





[jira] [Resolved] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-10 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-4858.
---
   Resolution: Fixed
Fix Version/s: 2.8.0

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
> Fix For: 2.8.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion, the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}; in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate the 
> performance of this code, and maybe cythonize it, since it seems to be 
> performance-sensitive.
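
To make the division semantics in the description concrete, here is a small sketch. `py2_style_div` is a hypothetical stand-in written for illustration; it is not the `old_div` helper itself, only an approximation of the Python 2 `/` behavior for ints and floats.

```python
def py2_style_div(a, b):
    """Mimic Python 2 '/' semantics (hypothetical stand-in for old_div)."""
    if isinstance(a, int) and isinstance(b, int):
        return a // b  # both ints: floor division, as in Python 2
    return a / b       # any float operand: true division


# Integer keys and float keys give different ratios under Python 2 rules.
ratio_int_keys = py2_style_div(7, 2)
ratio_float_keys = py2_style_div(7.0, 2)
# Python 3's '/' always performs true division.
ratio_py3 = 7 / 2
```

This is why the comparator's result depended on whether the keys had been "thinned" into floats.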





[jira] [Commented] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-10-10 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644983#comment-16644983
 ] 

Robert Bradshaw commented on BEAM-4858:
---

Submitted pr/6375.
 * Simplifies the aging out of old data. Not only was the old formula hard to
understand, but it meant that bad data could stick around and poison the
estimates forever.
 * Adds a variance parameter allowing the batch size to vary over a fixed
range giving a broader base for the linear regression.
 * Uses numpy when available to do the regression. This is both much more
efficient and allows for less error-prone expression of more complicated
analysis.

The algorithm was also changed to:
 * Eliminate outliers, both using Cook's distance and by simply throwing out the
top (often high-variance and high-influence) 20% when there is sufficient
data.
 * Weight by the inverse of batch size, to provide a more stable fixed size
estimate (which the default "overhead" target is sensitive to).

These changes were tested against a large TFT pipeline and found to produce
more uniform batch sizes and similar, possibly slightly improved, total
runtimes and total costs.
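
The regression described above can be sketched roughly as follows. This is not the code from the PR: it is a pure-Python illustration that assumes "the top 20%" means the largest batch sizes, omits the Cook's distance step entirely, and uses hypothetical names (`weighted_linear_fit`, `estimate_costs`, `min_samples_for_trim`).

```python
def weighted_linear_fit(xs, ys, weights):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    total_w = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total_w
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    if var_x == 0:
        return mean_y, 0.0  # all batch sizes identical: no slope information
    cov_xy = sum(w * (x - mean_x) * (y - mean_y)
                 for w, x, y in zip(weights, xs, ys))
    b = cov_xy / var_x
    a = mean_y - b * mean_x
    return a, b


def estimate_costs(samples, min_samples_for_trim=10):
    """samples: list of (batch_size, seconds) with positive batch sizes.

    Returns (fixed_cost, per_element_cost).
    """
    data = sorted(samples)
    if len(data) >= min_samples_for_trim:
        # Assumption: drop the 20% of samples with the largest batch sizes.
        data = data[:int(0.8 * len(data))]
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    weights = [1.0 / x for x in xs]  # weight by the inverse of batch size
    return weighted_linear_fit(xs, ys, weights)
```

Weighting by `1 / batch_size` lets the small batches, where the fixed overhead dominates, pull harder on the intercept.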

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
> Fix For: 2.8.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion, the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}; in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate the 
> performance of this code, and maybe cythonize it, since it seems to be 
> performance-sensitive.





[jira] [Created] (BEAM-5702) Avoid reshuffle for zero and one element creates

2018-10-10 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5702:
-

 Summary: Avoid reshuffle for zero and one element creates
 Key: BEAM-5702
 URL: https://issues.apache.org/jira/browse/BEAM-5702
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


These are commonly used (e.g. for Writes or CombineGlobally with Default) and 
can be implemented directly on top of Impulse rather than going through a reshuffle. 





[jira] [Created] (BEAM-5664) A canceled pipeline should not return a done status in the jobserver.

2018-10-05 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5664:
-

 Summary: A canceled pipeline should not return a done status in 
the jobserver.
 Key: BEAM-5664
 URL: https://issues.apache.org/jira/browse/BEAM-5664
 Project: Beam
  Issue Type: Improvement
  Components: runner-core
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles








[jira] [Updated] (BEAM-5664) A canceled pipeline should not return a done status in the jobserver.

2018-10-05 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-5664:
--
Labels: portability-flink  (was: )

> A canceled pipeline should not return a done status in the jobserver.
> -
>
> Key: BEAM-5664
> URL: https://issues.apache.org/jira/browse/BEAM-5664
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Robert Bradshaw
>Assignee: Kenneth Knowles
>Priority: Major
>  Labels: portability-flink
>






[jira] [Commented] (BEAM-5500) Portable python sdk worker leaks memory in streaming mode

2018-10-03 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636638#comment-16636638
 ] 

Robert Bradshaw commented on BEAM-5500:
---

In light of [https://github.com/apache/beam/pull/6517] can we call this closed?

> Portable python sdk worker leaks memory in streaming mode
> -
>
> Key: BEAM-5500
> URL: https://issues.apache.org/jira/browse/BEAM-5500
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Micah Wylde
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability-flink
> Attachments: chart.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When using the portable python sdk with flink in streaming mode, we see that 
> the python worker processes steadily increase memory usage until they are OOM 
> killed. This behavior is consistent across various kinds of streaming 
> pipelines, including those with fixed windows and global windows.
> A simple wordcount-like pipeline demonstrates the issue for us (note this is 
> run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides 
> access to kinesis as a portable streaming source):
> {code:python}
> counts = (p
>     | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream')
>     | 'decode' >> beam.FlatMap(decode)  # parses from JSON into Python objects
>     | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1))
>     | 'window' >> beam.WindowInto(window.GlobalWindows(),
>                                   trigger=AfterProcessingTime(15 * 1000),
>                                   accumulation_mode=AccumulationMode.DISCARDING)
>     | 'group' >> beam.GroupByKey()
>     | 'count' >> beam.Map(count_ones)
>     | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x))
> {code}
> When run, we see a steady increase in memory usage in the sdk_worker process. 
> Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory 
> usage over time and found that it's largely dicts and strings (see attached 
> chart).
>  
>  
>  





[jira] [Created] (BEAM-5606) Automatically infer gcp project from environment.

2018-10-02 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5606:
-

 Summary: Automatically infer gcp project from environment.
 Key: BEAM-5606
 URL: https://issues.apache.org/jira/browse/BEAM-5606
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay








[jira] [Created] (BEAM-5521) Cache execution trees in SDK worker

2018-09-28 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5521:
-

 Summary: Cache execution trees in SDK worker
 Key: BEAM-5521
 URL: https://issues.apache.org/jira/browse/BEAM-5521
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-harness
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw


Currently they are re-constructed from the protos for every bundle, which is 
expensive (especially for 1-element bundles in streaming flink). 

Care should be taken to ensure the objects can be reused. 
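
One way to picture the caching, as a sketch only: keep idle processors keyed by descriptor id and reset their per-bundle state before handing them out again. Every name here (`BundleProcessorCache`, `acquire`, `release`, `reset`, `CountingProcessor`) is hypothetical and chosen for illustration; none is taken from the SDK worker.

```python
class BundleProcessorCache:
    """Hypothetical cache of per-descriptor processors, reset before reuse."""

    def __init__(self, build_fn):
        self._build_fn = build_fn  # expensive: builds a processor from a descriptor
        self._idle = {}            # descriptor id -> list of idle processors

    def acquire(self, descriptor_id, descriptor):
        idle = self._idle.get(descriptor_id)
        if idle:
            processor = idle.pop()
            processor.reset()      # clear per-bundle state before reuse
            return processor
        return self._build_fn(descriptor)

    def release(self, descriptor_id, processor):
        self._idle.setdefault(descriptor_id, []).append(processor)


class CountingProcessor:
    """Toy processor used only to exercise the cache."""
    built = 0

    def __init__(self, descriptor):
        CountingProcessor.built += 1
        self.descriptor = descriptor
        self.seen = []

    def reset(self):
        self.seen = []

    def process(self, element):
        self.seen.append(element)
```

The explicit `reset()` is the part that corresponds to the caution above: anything a processor accumulates during one bundle must not leak into the next.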





[jira] [Updated] (BEAM-5521) Cache execution trees in SDK worker

2018-09-28 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-5521:
--
Labels: portability-flink  (was: )

> Cache execution trees in SDK worker
> ---
>
> Key: BEAM-5521
> URL: https://issues.apache.org/jira/browse/BEAM-5521
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-harness
>Reporter: Robert Bradshaw
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability-flink
>
> Currently they are re-constructed from the protos for every bundle, which is 
> expensive (especially for 1-element bundles in streaming flink). 
> Care should be taken to ensure the objects can be reused. 





[jira] [Updated] (BEAM-5500) Portable python sdk worker leaks memory in streaming mode

2018-09-27 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-5500:
--
Labels: portability-flink  (was: )

> Portable python sdk worker leaks memory in streaming mode
> -
>
> Key: BEAM-5500
> URL: https://issues.apache.org/jira/browse/BEAM-5500
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Micah Wylde
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability-flink
> Attachments: chart.png
>
>
> When using the portable python sdk with flink in streaming mode, we see that 
> the python worker processes steadily increase memory usage until they are OOM 
> killed. This behavior is consistent across various kinds of streaming 
> pipelines, including those with fixed windows and global windows.
> A simple wordcount-like pipeline demonstrates the issue for us (note this is 
> run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides 
> access to kinesis as a portable streaming source):
> {code:python}
> counts = (p
>     | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream')
>     | 'decode' >> beam.FlatMap(decode)  # parses from JSON into Python objects
>     | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1))
>     | 'window' >> beam.WindowInto(window.GlobalWindows(),
>                                   trigger=AfterProcessingTime(15 * 1000),
>                                   accumulation_mode=AccumulationMode.DISCARDING)
>     | 'group' >> beam.GroupByKey()
>     | 'count' >> beam.Map(count_ones)
>     | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x))
> {code}
> When run, we see a steady increase in memory usage in the sdk_worker process. 
> Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory 
> usage over time and found that it's largely dicts and strings (see attached 
> chart).
>  
>  
>  





[jira] [Commented] (BEAM-5509) Python pipeline_options doesn't handle int type

2018-09-27 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630428#comment-16630428
 ] 

Robert Bradshaw commented on BEAM-5509:
---

I don't think we should be representing integral values as floating point in 
the pipeline options representation (though perhaps we'd have to use strings, 
given that JSON has no distinct integer type).
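
A small illustration of the trade-off, under the assumption that the transport stores every number as a double. The helper names are hypothetical, and Python's own `json` module is used only to simulate such a transport.

```python
import json


def roundtrip_as_number(options):
    """Simulate a transport that stores every number as a double."""
    as_doubles = {k: float(v) if isinstance(v, int) else v
                  for k, v in options.items()}
    return json.loads(json.dumps(as_doubles))


def roundtrip_as_string(options):
    """Store values as strings so the receiver chooses the type."""
    return json.loads(json.dumps({k: str(v) for k, v in options.items()}))


lossy = roundtrip_as_number({"parallelism": 4})   # value comes back as a float
exact = roundtrip_as_string({"parallelism": 4})   # value comes back as "4"
restored = int(exact["parallelism"])              # receiver parses it as int
```

With the string form, the receiving parser can apply the declared option type instead of guessing whether `4.0` was meant to be an integer.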

> Python pipeline_options doesn't handle int type
> ---
>
> Key: BEAM-5509
> URL: https://issues.apache.org/jira/browse/BEAM-5509
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Thomas Weise
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability-flink
>
> The int option supplied at the command line is turned into a decimal during 
> serialization and then the parser in SDK harness fails to restore it as int.





[jira] [Updated] (BEAM-5509) Python pipeline_options doesn't handle int type

2018-09-27 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-5509:
--
Labels: portability-flink  (was: )

> Python pipeline_options doesn't handle int type
> ---
>
> Key: BEAM-5509
> URL: https://issues.apache.org/jira/browse/BEAM-5509
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Thomas Weise
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability-flink
>
> The int option supplied at the command line is turned into a decimal during 
> serialization and then the parser in SDK harness fails to restore it as int.





[jira] [Commented] (BEAM-5500) Portable python sdk worker leaks memory in streaming mode

2018-09-27 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630296#comment-16630296
 ] 

Robert Bradshaw commented on BEAM-5500:
---

I've tried running this repeatedly in the direct runner, and am only finding 
the aforementioned leak of ~300 bytes per stage per bundle (which, yes, needs 
to be resolved). How many bundles/second are you processing? 

> Portable python sdk worker leaks memory in streaming mode
> -
>
> Key: BEAM-5500
> URL: https://issues.apache.org/jira/browse/BEAM-5500
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Micah Wylde
>Assignee: Robert Bradshaw
>Priority: Major
> Attachments: chart.png
>
>
> When using the portable python sdk with flink in streaming mode, we see that 
> the python worker processes steadily increase memory usage until they are OOM 
> killed. This behavior is consistent across various kinds of streaming 
> pipelines, including those with fixed windows and global windows.
> A simple wordcount-like pipeline demonstrates the issue for us (note this is 
> run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides 
> access to kinesis as a portable streaming source):
> {code:python}
> counts = (p
>     | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream')
>     | 'decode' >> beam.FlatMap(decode)  # parses from JSON into Python objects
>     | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1))
>     | 'window' >> beam.WindowInto(window.GlobalWindows(),
>                                   trigger=AfterProcessingTime(15 * 1000),
>                                   accumulation_mode=AccumulationMode.DISCARDING)
>     | 'group' >> beam.GroupByKey()
>     | 'count' >> beam.Map(count_ones)
>     | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x))
> {code}
> When run, we see a steady increase in memory usage in the sdk_worker process. 
> Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory 
> usage over time and found that it's largely dicts and strings (see attached 
> chart).
>  
>  
>  





[jira] [Commented] (BEAM-5468) Allow runner to set worker log level in Python SDK harness.

2018-09-25 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626973#comment-16626973
 ] 

Robert Bradshaw commented on BEAM-5468:
---

Namespaced logging in the Python SDK would be convenient (and could be done 
automatically based on modules, see [https://gist.github.com/bdarnell/3118509]) 
but this is somewhat orthogonal. 

I agree the default should be whatever the runner is using, with an option to 
set it explicitly (analogous to how Java does it), coordinating such that the 
harness never asks for logs it's just going to drop. 
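
As a rough sketch of that precedence (explicit option first, then the runner's level, then a default), using only the standard `logging` module. The function name `configure_log_level` and its parameters are hypothetical and not an existing SDK harness API.

```python
import logging

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_log_level(logger, runner_level=None, explicit_level=None):
    """Explicit option wins; otherwise follow the runner; otherwise INFO."""
    name = explicit_level or runner_level or "INFO"
    logger.setLevel(_LEVELS[name.upper()])
    return logger


# Namespaced logger: one per module, named after the module.
log = configure_log_level(logging.getLogger(__name__), runner_level="WARNING")
```

Once the level is set on the worker side, records below it are filtered before they are ever shipped, which is the coordination the comment asks for.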

> Allow runner to set worker log level in Python SDK harness.
> ---
>
> Key: BEAM-5468
> URL: https://issues.apache.org/jira/browse/BEAM-5468
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-harness
>Reporter: Robert Bradshaw
>Assignee: Robert Bradshaw
>Priority: Major
>






[jira] [Updated] (BEAM-3622) DirectRunner memory issue with Python SDK

2018-09-25 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3622:
--
Component/s: (was: sdk-py-harness)

> DirectRunner memory issue with Python SDK
> -
>
> Key: BEAM-3622
> URL: https://issues.apache.org/jira/browse/BEAM-3622
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: yuri krnr
>Assignee: Charles Chen
>Priority: Major
>
> After running pipeline for a while in a streaming mode (reading from Pub/Sub 
> and writing to BigQuery, Datastore and another Pub/Sub) I noticed drastic 
> memory usage of a process. Using guppy as a profiler I got the following 
> results:
> start
> {noformat}
>  INFO *** MemoryReport Heap:
>  Partition of a set of 240208 objects. Total size = 34988840 bytes.
>  Index  Count   % Size   % Cumulative  % Kind (class / dict of class)
>  0  88289  37  8696984  25   8696984  25 str
>  1  5  22  4897352  14  13594336  39 tuple
>  2   5083   2  2790664   8  16385000  47 dict (no owner)
>  3   1939   1  1749656   5  18134656  52 type
>  4699   0  1723272   5  19857928  57 dict of module
>  5  12337   5  1579136   5  21437064  61 types.CodeType
>  6  12403   5  1488360   4  22925424  66 function
>  7   1939   1  1452616   4  24378040  70 dict of type
>  8677   0   709496   2  25087536  72 dict of 0x1e4d880
>  9  25603  11   614472   2  25702008  73 int
> <1103 more rows. Type e.g. '_.more' to view.>
> {noformat}
> after several hours of running
> {noformat}
> INFO *** MemoryReport Heap:
>  Partition of a set of 1255662 objects. Total size = 315029632 bytes.
>  Index  Count   % Size   % Cumulative  % Kind (class / dict of class)
>  0  95554   8 99755056  32  99755056  32 dict of apache_beam.runners.direct.bundle_factory._Bundle
>  1 117943   9 54193192  17 153948248  49 dict (no owner)
>  2 161068  13 27169296   9 181117544  57 unicode
>  3  94571   8 26479880   8 207597424  66 dict of apache_beam.pvalue.PBegin
>  4 126461  10 12715336   4 220312760  70 str
>  5  44374   4 12424720   4 232737480  74 dict of apitools.base.protorpclite.messages.FieldList
>  6  44374   4  6348624   2 239086104  76 apitools.base.protorpclite.messages.FieldList
>  7  95556   8  6115584   2 245201688  78 apache_beam.runners.direct.bundle_factory._Bundle
>  8  94571   8  6052544   2 251254232  80 apache_beam.pvalue.PBegin
>  9  57371   5  5218424   2 256472656  81 tuple
> <1187 more rows. Type e.g. '_.more' to view.>
> {noformat}
>  
> I see that every bundle still sits in memory, along with all its data. Why aren't 
> they gc-ed?
> What is the gc policy for the Dataflow processes?





[jira] [Created] (BEAM-5468) Allow runner to set worker log level in Python SDK harness.

2018-09-24 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5468:
-

 Summary: Allow runner to set worker log level in Python SDK 
harness.
 Key: BEAM-5468
 URL: https://issues.apache.org/jira/browse/BEAM-5468
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-harness
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw








[jira] [Commented] (BEAM-5442) PortableRunner swallows custom options for Runner

2018-09-21 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623368#comment-16623368
 ] 

Robert Bradshaw commented on BEAM-5442:
---

It seems like we should validate the options we know about, but pass all 
options through. (Maybe warn about ones we don't know about?)
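
A sketch of that policy using the standard library's `argparse`: known options are validated, unknown ones are returned unchanged for the runner, and each unknown flag is logged. The option `--job_name` and the function name `parse_options` are hypothetical examples.

```python
import argparse
import logging


def parse_options(argv):
    """Validate known options; forward unknown ones unchanged, with a warning."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--job_name", default="job")  # hypothetical known option
    known, unknown = parser.parse_known_args(argv)
    for arg in unknown:
        if arg.startswith("--"):
            logging.warning("Unrecognized option passed through: %s", arg)
    return known, unknown


known, passthrough = parse_options(["--job_name=wc", "--parallelism=4"])
```

Here `--parallelism=4` survives in `passthrough` instead of being silently dropped.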

> PortableRunner swallows custom options for Runner
> -
>
> Key: BEAM-5442
> URL: https://issues.apache.org/jira/browse/BEAM-5442
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, sdk-py-core
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Major
>  Labels: portability, portability-flink
> Fix For: 2.8.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The PortableRunner doesn't pass custom PipelineOptions to the executing 
> Runner.
> Example: {{--parallelism=4}} won't be forwarded to the FlinkRunner.
> (The option is just removed during proto translation without any warning)
> We should allow some form of customization through the options, even for the 
> PortableRunner. 





[jira] [Created] (BEAM-5428) Implement cross-bundle state caching.

2018-09-19 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5428:
-

 Summary: Implement cross-bundle state caching.
 Key: BEAM-5428
 URL: https://issues.apache.org/jira/browse/BEAM-5428
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-harness
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw








[jira] [Resolved] (BEAM-5395) BeamPython data plane streams data

2018-09-18 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-5395.
---
   Resolution: Fixed
Fix Version/s: 2.8.0

> BeamPython data plane streams data
> --
>
> Key: BEAM-5395
> URL: https://issues.apache.org/jira/browse/BEAM-5395
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Robert Bradshaw
>Assignee: Robert Bradshaw
>Priority: Major
> Fix For: 2.8.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently the default implementation is to buffer all data for the bundle. 
> Experiments were made splitting at arbitrary byte boundaries, but it appears 
> that Java requires messages to be split on element boundaries. For now we 
> should implement that in Python (even if this means not being able to split 
> up large elements among multiple messages). 





[jira] [Created] (BEAM-5395) BeamPython data plane streams data

2018-09-17 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5395:
-

 Summary: BeamPython data plane streams data
 Key: BEAM-5395
 URL: https://issues.apache.org/jira/browse/BEAM-5395
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-harness
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw


Currently the default implementation is to buffer all data for the bundle. 

Experiments were made splitting at arbitrary byte boundaries, but it appears 
that Java requires messages to be split on element boundaries. For now we 
should implement that in Python (even if this means not being able to split up 
large elements among multiple messages). 
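
Splitting on element boundaries can be sketched as below. This is an illustration rather than the data plane code; `chunk_on_element_boundaries` and `max_bytes` are hypothetical names, and the elements are assumed to be already encoded as `bytes`.

```python
def chunk_on_element_boundaries(encoded_elements, max_bytes):
    """Yield byte messages, each holding a whole number of encoded elements."""
    buffer, size = [], 0
    for element in encoded_elements:
        # Flush before adding if this element would push the message over the limit.
        if buffer and size + len(element) > max_bytes:
            yield b"".join(buffer)
            buffer, size = [], 0
        buffer.append(element)
        size += len(element)
    if buffer:
        yield b"".join(buffer)


messages = list(chunk_on_element_boundaries([b"aa", b"bbb", b"c", b"dddddd"], 4))
```

An element larger than `max_bytes` is sent alone in an oversized message, which matches the stated limitation that large elements are not split across messages.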





[jira] [Created] (BEAM-5394) BeamPython on Flink runs at scale

2018-09-17 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5394:
-

 Summary: BeamPython on Flink runs at scale
 Key: BEAM-5394
 URL: https://issues.apache.org/jira/browse/BEAM-5394
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-harness
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw


This is an umbrella bug for verifying that BeamPython can run production sized 
workloads.





[jira] [Updated] (BEAM-5392) GroupByKey on Spark: All values for a single key need to fit in-memory at once

2018-09-17 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-5392:
--
Summary: GroupByKey on Spark: All values for a single key need to fit 
in-memory at once  (was: GroupByKey: All values for a single key need to fit 
in-memory at once)

> GroupByKey on Spark: All values for a single key need to fit in-memory at once
> --
>
> Key: BEAM-5392
> URL: https://issues.apache.org/jira/browse/BEAM-5392
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Affects Versions: 2.6.0
>Reporter: David Moravek
>Assignee: David Moravek
>Priority: Major
>  Labels: performance
>
> Currently, when using GroupByKey, all values for a single key need to fit 
> in-memory at once.
>  
> The following issues need to be addressed:
> a) We cannot use Spark's _groupByKey_, because it requires all values for a 
> single key to fit in memory (it is implemented as a "list combiner").
> b) _ReduceFnRunner_ iterates over the values multiple times in order to also 
> group by window.
>  
> Solution:
>  
> In Dataflow Worker code, there are optimized versions of ReduceFnRunner that 
> can take advantage of having elements for a single key sorted by timestamp.
>  
> We can use Spark's {{repartitionAndSortWithinPartitions}} in order to meet 
> this constraint.
>  
> For non-merging windows, we can put the window itself into the key, resulting in 
> smaller groupings.
>  
> This approach was already tested at ~100TB input scale on Spark 2.3.x 
> (custom Spark runner).
>  
> I'll submit a patch once the Dataflow Worker code donation is complete.
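
The grouping idea in the description can be illustrated in plain Python, with an in-memory `sorted` standing in for the shuffle-side sort that Spark would perform. The record layout `(key, window, timestamp, value)` and the function name are assumptions made for this sketch.

```python
from itertools import groupby


def group_by_key_and_window(records):
    """records: iterable of (key, window, timestamp, value) tuples.

    Yields ((key, window), iterator_of_values) with values in timestamp order.
    """
    # Composite (key, window) grouping key, with timestamp as the secondary sort.
    ordered = sorted(records, key=lambda r: ((r[0], r[1]), r[2]))
    for composite_key, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        # Values are produced lazily; each group must be consumed before the next.
        yield composite_key, (value for _, _, _, value in group)


records = [
    ("a", (0, 10), 5, "x"),
    ("a", (10, 20), 12, "y"),
    ("a", (0, 10), 1, "w"),
    ("b", (0, 10), 3, "z"),
]
grouped = [(k, list(vs)) for k, vs in group_by_key_and_window(records)]
```

Because the window is part of the key, each group is smaller than a per-key group, and its values can be streamed once in timestamp order rather than held in a list.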





[jira] [Commented] (BEAM-3106) Consider not pinning all python dependencies, or moving them to requirements.txt

2018-09-13 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614093#comment-16614093
 ] 

Robert Bradshaw commented on BEAM-3106:
---

Our main requirements now specify version ranges (generally guided by semantic 
versioning); we should unpin our gcp requirements as well wherever possible. 

> Consider not pinning all python dependencies, or moving them to 
> requirements.txt
> 
>
> Key: BEAM-3106
> URL: https://issues.apache.org/jira/browse/BEAM-3106
> Project: Beam
>  Issue Type: Wish
>  Components: build-system
>Affects Versions: 2.1.0
> Environment: python
>Reporter: Maximilian Roos
>Priority: Major
>
> Currently all python dependencies are [pinned or 
> capped|https://github.com/apache/beam/blob/master/sdks/python/setup.py#L97].
> While there's a good argument for supplying a `requirements.txt` with well 
> tested dependencies, having them specified in `setup.py` forces them to an 
> exact state on each install of Beam. This makes using Beam in any environment 
> with other libraries nigh on impossible. 
> This is particularly severe for the `gcp` dependencies, where we have 
> libraries that won't work with an older version (but Beam _does_ work with a 
> newer version). We have to do a bunch of gymnastics to get the correct 
> versions installed because of this. Unfortunately, airflow repeats this 
> practice and conflicts on a number of dependencies, adding further 
> complication (but, again, there is no real conflict).
> I haven't seen this practice outside of the Apache & Google ecosystem - for 
> example no libraries in numerical python do this. Here's a [discussion on 
> SO|https://stackoverflow.com/questions/28509481/should-i-pin-my-python-dependencies-versions]





[jira] [Commented] (BEAM-5354) Side Inputs seems to be non-working in the sdk-go

2018-09-13 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613379#comment-16613379
 ] 

Robert Bradshaw commented on BEAM-5354:
---

There are some worker harness bugs in handling of side inputs on dataflow that 
I've been fixing. I'm in the process of pushing out new containers. 

> Side Inputs seems to be non-working in the sdk-go
> -
>
> Key: BEAM-5354
> URL: https://issues.apache.org/jira/browse/BEAM-5354
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-go
>Reporter: Tomas Roos
>Priority: Major
>
> Running the contains example fails with
>  
> {code:java}
> Output i0 for step was not found.
> {code}
> This is because of the call to debug.Head (which internally uses SideInput)
> Removing the following line 
> [https://github.com/apache/beam/blob/master/sdks/go/examples/contains/contains.go#L50]
>  
> The pipeline executes well.
>  
> Executed on id's
>  
> go-job-1-1536664417610678545 
> vs
> go-job-1-1536664934354466938
>  





[jira] [Commented] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.

2018-09-12 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611858#comment-16611858
 ] 

Robert Bradshaw commented on BEAM-4858:
---

I created [https://github.com/apache/beam/pull/6375] which cleans this up. 
Floating point division is always the correct thing to do here; I'd like to 
understand under what circumstances this would be made worse. (Note that for a 
real pipeline, the actual timings can be noisy.)

 

Note that the code in question should not be (that) performance sensitive (with 
the exception of adding to the batch and checking the batch size), as the 
complicated logic is called once per batch (which should be reasonably 
expensive, and is taken into account).

> Clean up _BatchSizeEstimator in element-batching transform.
> ---
>
> Key: BEAM-4858
> URL: https://issues.apache.org/jira/browse/BEAM-4858
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Robert Bradshaw
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] 
> non-trivial performance-sensitive logic in element-batching transform. Let's 
> take a look at 
> [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271].
>  
> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on 
> the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260],
>  however, when we 'thin' the collected entries 
> [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279],
>  the keys will become floats. Surprisingly, using either integer or float 
> division consistently [in the 
> comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]
>   negatively affects the performance of a custom pipeline I was using to 
> benchmark these changes. The performance impact likely comes from changes in 
> the logic that depends on  how division is evaluated, not from the 
> performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids 
> regression seems to be to preserve the existing Python 2 behavior using 
> {{old_div}} from {{past.utils.division}}, in the medium term we should clean 
> up the logic. We may want to add a targeted microbenchmark to evaluate 
> performance of this code, and maybe cythonize the code, since it seems to be 
> performance-sensitive.
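
The type-dependent behavior described in the issue can be sketched in a few lines. This helper is written for illustration only: it imitates Python 2's `/` operator and is not taken from Beam or from the `past` package.

```python
import numbers


def py2_style_div(a, b):
    """Divide the way Python 2's `/` did: floor for two ints, true division otherwise."""
    if isinstance(a, numbers.Integral) and isinstance(b, numbers.Integral):
        return a // b
    return a / b
```

With integer keys the quotient is floored; as soon as either key is a float the result is a true quotient, which is why 'thinning' the entries can change what the comparator sees.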





[jira] [Commented] (BEAM-5214) Update Java quickstart to use maven

2018-09-03 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601920#comment-16601920
 ] 

Robert Bradshaw commented on BEAM-5214:
---

Yes, though the initial command to generate the example ({{mvn 
archetype:generate}}) should probably be changed to something that still works.

> Update Java quickstart to use maven
> ---
>
> Key: BEAM-5214
> URL: https://issues.apache.org/jira/browse/BEAM-5214
> Project: Beam
>  Issue Type: Bug
>  Components: examples-java, website
>Reporter: Robert Bradshaw
>Assignee: Reuven Lax
>Priority: Major
>
> The existing quickstart still uses mvn commands. 
> https://beam.apache.org/get-started/quickstart-java/





[jira] [Created] (BEAM-5255) Fix over-aggressive division futurization.

2018-08-29 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5255:
-

 Summary: Fix over-aggressive division futurization.
 Key: BEAM-5255
 URL: https://issues.apache.org/jira/browse/BEAM-5255
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Affects Versions: 2.6.0
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


When converting from Python 2 to Python 3, `a / b` should become `a // b` only 
for ints; it is incorrect to make this substitution for floating point division. 

 

I noticed this change in the microbenchmarks, but we should do an audit to make 
sure we haven't broken things elsewhere. 
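
A quick way to see the problem: in Python 3, `//` floors its result even when the operands are floats, so a blanket rewrite of `/` to `//` silently changes floating point results. The function names below are illustrative only.

```python
def true_ratio(a, b):
    # Python 3 `/` is always true division.
    return a / b


def floored_ratio(a, b):
    # `//` floors the quotient, for floats as well as for ints.
    return a // b
```

For int operands `floored_ratio` reproduces the old Python 2 result, but for float operands only `true_ratio` does.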





[jira] [Created] (BEAM-5214) Update Java quickstart to use maven

2018-08-24 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-5214:
-

 Summary: Update Java quickstart to use maven
 Key: BEAM-5214
 URL: https://issues.apache.org/jira/browse/BEAM-5214
 Project: Beam
  Issue Type: Bug
  Components: examples-java, website
Reporter: Robert Bradshaw
Assignee: Reuven Lax


The existing quickstart still uses mvn commands. 

https://beam.apache.org/get-started/quickstart-java/





[jira] [Created] (BEAM-4782) Enforce KV coders for MultiMap side inputs

2018-07-12 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4782:
-

 Summary: Enforce KV coders for MultiMap side inputs
 Key: BEAM-4782
 URL: https://issues.apache.org/jira/browse/BEAM-4782
 Project: Beam
  Issue Type: Task
  Components: sdk-py-harness
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw


Non-Python runners don't understand that the (default) FastPrimitivesCoder may 
be a KV coder.





[jira] [Created] (BEAM-4781) PTransforms that simply return their input cause portable Flink to crash.

2018-07-12 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4781:
-

 Summary: PTransforms that simply return their input cause portable 
Flink to crash.
 Key: BEAM-4781
 URL: https://issues.apache.org/jira/browse/BEAM-4781
 Project: Beam
  Issue Type: Task
  Components: runner-flink
Reporter: Robert Bradshaw
Assignee: Aljoscha Krettek


E.g.

 
{code:python}
import logging

import apache_beam as beam

class MaybePrint(beam.PTransform):
  def expand(self, pcoll):
    if some_flag:
      pcoll | beam.Map(logging.info)
    return pcoll
{code}
 





[jira] [Created] (BEAM-4729) Conditionally propagate local GCS credentials to locally spawned docker images.

2018-07-03 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4729:
-

 Summary: Conditionally propagate local GCS credentials to locally 
spawned docker images.
 Key: BEAM-4729
 URL: https://issues.apache.org/jira/browse/BEAM-4729
 Project: Beam
  Issue Type: Task
  Components: sdk-java-harness
Reporter: Robert Bradshaw
Assignee: Luke Cwik








[jira] [Reopened] (BEAM-3883) Python SDK stages artifacts when talking to job server

2018-06-21 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reopened BEAM-3883:
---

Staging code is written, but not yet hooked up.

> Python SDK stages artifacts when talking to job server
> --
>
> Key: BEAM-3883
> URL: https://issues.apache.org/jira/browse/BEAM-3883
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Ben Sidhom
>Assignee: Ankur Goenka
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 19h 10m
>  Remaining Estimate: 0h
>
> The Python SDK does not currently stage its user-defined functions or 
> dependencies when talking to the job API. Artifacts that need to be staged 
> include the user code itself, any SDK components not included in the 
> container image, and the list of Python packages that must be installed at 
> runtime.
>  
> Artifacts that are currently expected can be found in the harness boot code: 
> [https://github.com/apache/beam/blob/58e3b06bee7378d2d8db1c8dd534b415864f63e1/sdks/python/container/boot.go#L52.]





[jira] [Commented] (BEAM-4605) Make GroupByKey the primitive in the Python SDK

2018-06-20 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518749#comment-16518749
 ] 

Robert Bradshaw commented on BEAM-4605:
---

At least the old direct runner relies on the composite structure; an inventory 
should be taken to see if there are others.

> Make GroupByKey the primitive in the Python SDK
> ---
>
> Key: BEAM-4605
> URL: https://issues.apache.org/jira/browse/BEAM-4605
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py-core
>Reporter: Robert Bradshaw
>Assignee: Ahmet Altay
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently it is still a composite, with GBKOnly as a primitive. 





[jira] [Created] (BEAM-4605) Make GroupByKey the primitive in the Python SDK

2018-06-20 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4605:
-

 Summary: Make GroupByKey the primitive in the Python SDK
 Key: BEAM-4605
 URL: https://issues.apache.org/jira/browse/BEAM-4605
 Project: Beam
  Issue Type: Task
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Currently it is still a composite, with GBKOnly as a primitive. 





[jira] [Assigned] (BEAM-3884) Python SDK supports Impulse as a primitive transform

2018-06-19 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reassigned BEAM-3884:
-

Assignee: Robert Bradshaw  (was: Ahmet Altay)

> Python SDK supports Impulse as a primitive transform
> 
>
> Key: BEAM-3884
> URL: https://issues.apache.org/jira/browse/BEAM-3884
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Ben Sidhom
>Assignee: Robert Bradshaw
>Priority: Major
>
> Portable runners require Impulse to be the only root nodes of pipelines. The 
> Python SDK should provide this for pipeline construction.





[jira] [Created] (BEAM-4588) Use wheels for numpy

2018-06-19 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4588:
-

 Summary: Use wheels for numpy
 Key: BEAM-4588
 URL: https://issues.apache.org/jira/browse/BEAM-4588
 Project: Beam
  Issue Type: Task
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


When building the container, one has the line:

Skipping bdist_wheel for numpy, due to binaries being disabled for it.

This is arguably one of the most expensive packages to build from scratch. 





[jira] [Created] (BEAM-4578) Cleanup paths due to obsolete worker/sdk upgrades.

2018-06-18 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4578:
-

 Summary: Cleanup paths due to obsolete worker/sdk upgrades.
 Key: BEAM-4578
 URL: https://issues.apache.org/jira/browse/BEAM-4578
 Project: Beam
  Issue Type: Task
  Components: runner-dataflow, sdk-py-core
Reporter: Robert Bradshaw
Assignee: Thomas Groh








[jira] [Created] (BEAM-4565) Hot key fanout should not distribute keys to all shards.

2018-06-14 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4565:
-

 Summary: Hot key fanout should not distribute keys to all shards.
 Key: BEAM-4565
 URL: https://issues.apache.org/jira/browse/BEAM-4565
 Project: Beam
  Issue Type: Task
  Components: sdk-java-core, sdk-py-core
Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0, 2.0.0, 2.5.0
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles


The goal is to reduce the number of values sent to a single post-GBK worker. If 
combiner lifting happens, each bundle will send a single value per sub-key, 
causing an N-fold blowup in shuffle data and N reducers with the same amount of 
data to consume as the single reducer in the non-fanout case. 
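
As a back-of-the-envelope model of the blowup, the sketch below counts how many pre-combined values one hot key contributes to shuffle. It assumes combiner lifting emits exactly one value per (bundle, sub-key) pair and that sub-keys are assigned round-robin; the function and its parameters are hypothetical, not SDK code.

```python
def shuffled_values(num_bundles, fanout, elements_per_bundle, spread_within_bundle):
    """Count values sent to shuffle for one hot key under a simplified model."""
    total = 0
    for bundle in range(num_bundles):
        if spread_within_bundle:
            # Each element picks its own sub-key, so a bundle touches many sub-keys.
            subkeys = {i % fanout for i in range(elements_per_bundle)}
        else:
            # The whole bundle sticks to a single sub-key.
            subkeys = {bundle % fanout}
        # After lifting, one combined value is emitted per sub-key seen.
        total += len(subkeys)
    return total
```

In this model, 10 bundles of 100 elements with a fanout of 4 produce 40 shuffled values when elements are spread within each bundle, versus 10 when each bundle uses one sub-key.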





[jira] [Commented] (BEAM-4494) Migrate website source code to apache/beam [website-migration] branch

2018-06-14 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512692#comment-16512692
 ] 

Robert Bradshaw commented on BEAM-4494:
---

Yes, we could just ask people to make changes elsewhere if we get PRs there. 
Also, the script above can be used multiple times, i.e. even if changes are 
done in both places it's just a git merge to get master that has everything. 

> Migrate website source code to apache/beam [website-migration] branch
> -
>
> Key: BEAM-4494
> URL: https://issues.apache.org/jira/browse/BEAM-4494
> Project: Beam
>  Issue Type: Sub-task
>  Components: website
>Reporter: Scott Wegner
>Assignee: Scott Wegner
>Priority: Major
>  Labels: beam-site-automation-reliability
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (BEAM-4546) Implement with hot key fanout for combiners

2018-06-13 Thread Robert Bradshaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reassigned BEAM-4546:
-

Assignee: Robert Bradshaw

> Implement with hot key fanout for combiners
> ---
>
> Key: BEAM-4546
> URL: https://issues.apache.org/jira/browse/BEAM-4546
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Robert Bradshaw
>Priority: Major
>






[jira] [Commented] (BEAM-4494) Migrate website source code to apache/beam [website-migration] branch

2018-06-12 Thread Robert Bradshaw (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510373#comment-16510373
 ] 

Robert Bradshaw commented on BEAM-4494:
---

Is there any advantage to having a separate branch? There are lots of 
downsides...

> Migrate website source code to apache/beam [website-migration] branch
> -
>
> Key: BEAM-4494
> URL: https://issues.apache.org/jira/browse/BEAM-4494
> Project: Beam
>  Issue Type: Sub-task
>  Components: website
>Reporter: Scott Wegner
>Assignee: Scott Wegner
>Priority: Major
>  Labels: beam-site-automation-reliability
>






[jira] [Created] (BEAM-4524) We should not be using md5 to validate artifact integrity.

2018-06-08 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4524:
-

 Summary: We should not be using md5 to validate artifact integrity.
 Key: BEAM-4524
 URL: https://issues.apache.org/jira/browse/BEAM-4524
 Project: Beam
  Issue Type: Task
  Components: beam-model
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles


https://github.com/apache/beam/blob/6f239498e676f471427e17abc4bc5cffba9887c5/model/job-management/src/main/proto/beam_artifact_api.proto#L63

Something like sha256 would probably be sufficient. 

https://en.wikipedia.org/wiki/MD5#Overview_of_security_issues
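
For reference, computing a SHA-256 digest incrementally is straightforward with the Python standard library. This is only a sketch of the hashing step; the helper name is made up, and how the artifact API would carry the digest is not addressed here.

```python
import hashlib


def sha256_hex(chunks):
    """Return the hex SHA-256 digest of an artifact supplied as byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        # Feeding chunks one at a time avoids holding the whole artifact in memory.
        digest.update(chunk)
    return digest.hexdigest()
```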





[jira] [Commented] (BEAM-2980) BagState.isEmpty needs a tighter spec

2018-05-22 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484480#comment-16484480
 ] 

Robert Bradshaw commented on BEAM-2980:
---

This seems a bit more subtle. Given 

BagState myBag = // empty
ReadableState isMyBagEmpty = myBag.isEmpty();
readResultBefore = myBag.read();
myBag.add(bizzle);
readResultAfter = myBag.read();
bool createBeforeReadAfter = isMyBagEmpty.read()

I think it makes a lot of sense for readResultAfter to contain bizzle, but the 
question is whether createBeforeReadAfter is True or False. 
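
The two candidate behaviors can be made concrete with a toy model. This is plain Python written for this comment, not the Beam state API: `is_empty_snapshot` fixes the answer when the readable is created, while `is_empty_lazy` evaluates it at read time.

```python
class ToyBag(object):
    """Minimal stand-in for a bag state, used only to contrast two semantics."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def read(self):
        return list(self._items)

    def is_empty_snapshot(self):
        # Answer is captured now; later adds do not affect it.
        empty_now = not self._items
        return lambda: empty_now

    def is_empty_lazy(self):
        # Answer is computed when the returned callable is invoked.
        return lambda: not self._items
```

Creating both readables on an empty bag and then adding an element gives True from the snapshot variant and False from the lazy one, which is exactly the ambiguity the spec needs to settle.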

> BagState.isEmpty needs a tighter spec
> -
>
> Key: BEAM-2980
> URL: https://issues.apache.org/jira/browse/BEAM-2980
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Kenneth Knowles
>Assignee: Daniel Mills
>Priority: Major
>
> Consider the following:
> {code}
> BagState myBag = // empty
> ReadableState isMyBagEmpty = myBag.isEmpty();
> myBag.add(bizzle);
> bool empty = isMyBagEmpty.read();
> {code}
> Should {{empty}} be true or false? We need a consistent answer, across all 
> kinds of state, when snapshots are required.





[jira] [Created] (BEAM-4150) Standardize use of PCollection coder proto attribute

2018-04-20 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4150:
-

 Summary: Standardize use of PCollection coder proto attribute
 Key: BEAM-4150
 URL: https://issues.apache.org/jira/browse/BEAM-4150
 Project: Beam
  Issue Type: Task
  Components: beam-model
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles


In some places it's expected to be a WindowedCoder, in others the raw 
ElementCoder. We should use the same convention (decided in discussion to be 
the raw ElementCoder) everywhere. The WindowCoder can be pulled out of the 
attached windowing strategy, and the input/output ports should specify the 
encoding directly rather than read the adjacent PCollection coder fields. 





[jira] [Assigned] (BEAM-4097) Python SDK should set the environment in the job submission protos

2018-04-19 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reassigned BEAM-4097:
-

Assignee: Robert Bradshaw  (was: Ahmet Altay)

> Python SDK should set the environment in the job submission protos
> --
>
> Key: BEAM-4097
> URL: https://issues.apache.org/jira/browse/BEAM-4097
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py-core
>Reporter: Robert Bradshaw
>Assignee: Robert Bradshaw
>Priority: Major
>






[jira] [Resolved] (BEAM-3792) Python submits portable pipelines to the Flink-served endpoint.

2018-04-19 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3792.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

> Python submits portable pipelines to the Flink-served endpoint.
> ---
>
> Key: BEAM-3792
> URL: https://issues.apache.org/jira/browse/BEAM-3792
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-flink
>Reporter: Robert Bradshaw
>Assignee: Robert Bradshaw
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (BEAM-3911) Do we want to encourage release pinning?

2018-04-17 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441725#comment-16441725
 ] 

Robert Bradshaw commented on BEAM-3911:
---

No, we don't want to encourage this. 

> Do we want to encourage release pinning?
> 
>
> Key: BEAM-3911
> URL: https://issues.apache.org/jira/browse/BEAM-3911
> Project: Beam
>  Issue Type: Bug
>  Components: website
>Reporter: Robert Bradshaw
>Assignee: Melissa Pashniak
>Priority: Major
>
> The Python instructions recommend adding apache-beam==2.3.0 to one's 
> setup.py. Instead, we should probably recommend apache-beam<3. It may also be 
> worth calling out apache-beam can be installed with pip install apache-beam. 
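
For example, a `setup.py` could declare the dependency along these lines (a fragment only; the variable name is arbitrary and the rest of the `setup()` call is omitted):

```python
# Accept any release below 3.0 instead of pinning one exact version.
REQUIRED_PACKAGES = [
    'apache-beam<3',
]

# Typically passed on as: setuptools.setup(..., install_requires=REQUIRED_PACKAGES)
```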





[jira] [Commented] (BEAM-4105) Python Postcommit suite does not start on Jenkins - proto generation failed

2018-04-17 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441669#comment-16441669
 ] 

Robert Bradshaw commented on BEAM-4105:
---

Perhaps the pip is too old? 

> Python Postcommit suite does not start on Jenkins - proto generation failed
> ---
>
> Key: BEAM-4105
> URL: https://issues.apache.org/jira/browse/BEAM-4105
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Valentyn Tymofieiev
>Assignee: Jason Kuster
>Priority: Major
>
> The suite has been steadily failing for last 3 days. Sample logs:  
> # Tox runs unit tests in a virtual environment
> ${LOCAL_PATH}/tox -e ALL -c sdks/python/tox.ini
> GLOB sdist-make: 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/setup.py
> ERROR: invocation failed (exit code 1), logfile: 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/log/tox-0.log
> ERROR: actionid: tox
> msg: packaging
> cmdargs: ['/usr/bin/python', 
> local('/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/setup.py'),
>  'sdist', '--formats=zip', '--dist-dir', 
> local('/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/dist')]
> /usr/local/lib/python2.7/dist-packages/setuptools/dist.py:397: UserWarning: 
> Normalizing '2.5.0.dev' to '2.5.0.dev0'
>   normalized_version,
> Regenerating common_urns module.
> running sdist
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/gen_protos.py:51:
>  UserWarning: Installing grpcio-tools is recommended for development.
>   warnings.warn('Installing grpcio-tools is recommended for development.')
> WARNING:root:Installing grpcio-tools into 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/.eggs/grpcio-wheels
> Process Process-1:
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in 
> _bootstrap
> self.run()
>   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
> self._target(*self._args, **self._kwargs)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/gen_protos.py",
>  line 149, in _install_grpcio_tools_and_generate_proto_files
> shutil.rmtree(build_path)
>   File "/usr/lib/python2.7/shutil.py", line 239, in rmtree
> onerror(os.listdir, path, sys.exc_info())
>   File "/usr/lib/python2.7/shutil.py", line 237, in rmtree
> names = os.listdir(path)
> OSError: [Errno 2] No such file or directory: 
> '/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/.eggs/grpcio-wheels-build'
> Traceback (most recent call last):
>   File "setup.py", line 235, in 
> 'test': generate_protos_first(test),
>   File "/usr/local/lib/python2.7/dist-packages/setuptools/__init__.py", line 
> 129, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
> dist.run_commands()
>   File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
> self.run_command(cmd)
>   File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
> cmd_obj.run()
>   File "setup.py", line 141, in run
> gen_protos.generate_proto_files()
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/gen_protos.py",
>  line 97, in generate_proto_files
> raise ValueError("Proto generation failed (see log for details).")
> ValueError: Proto generation failed (see log for details).
> ERROR: FAIL could not package project - v = InvocationError('/usr/bin/python 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/setup.py
>  sdist --formats=zip --dist-dir 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/dist
>  (see 
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/log/tox-0.log)',
>  1)
> Build step 'Execute shell' marked build as failure
> Sending e-mails to: commits@beam.apache.org
> Finished: FAILURE
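
Independently of the root cause, the `OSError` in the log comes from calling `shutil.rmtree` on a directory that does not exist. A cleanup step that tolerates this could look like the sketch below; the function name is hypothetical and this is not the code in `gen_protos.py`. It would only hide the symptom, not explain why the build directory was never created.

```python
import os
import shutil


def remove_tree_if_present(path):
    """Delete a directory tree; return False without raising if it is absent."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False
```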





[jira] [Created] (BEAM-4098) Handle WindowInto in the Java SDK Harness

2018-04-16 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4098:
-

 Summary: Handle WindowInto in the Java SDK Harness
 Key: BEAM-4098
 URL: https://issues.apache.org/jira/browse/BEAM-4098
 Project: Beam
  Issue Type: Task
  Components: sdk-java-core
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles








[jira] [Created] (BEAM-4097) Python SDK should set the environment in the job submission protos

2018-04-16 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4097:
-

 Summary: Python SDK should set the environment in the job 
submission protos
 Key: BEAM-4097
 URL: https://issues.apache.org/jira/browse/BEAM-4097
 Project: Beam
  Issue Type: Task
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay








[jira] [Created] (BEAM-4074) Cleanup metrics/counters code

2018-04-13 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4074:
-

 Summary: Cleanup metrics/counters code
 Key: BEAM-4074
 URL: https://issues.apache.org/jira/browse/BEAM-4074
 Project: Beam
  Issue Type: Task
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


E.g. right now we have

metricbase.Distribution
metric.DelegatingDistribution
metric.cells.DistributionCell
metric.cells.DistributionResult
metric.cells.DistributionData
metric.cells.DistributionAggregator
transforms.cy_combiners.DataflowDistributionCounter
transforms.cy_combiners.DataflowDistributionCounterFn
transforms.cy_dataflow_distribution_counter.DataflowDistributionCounter

plus some code under util/counters. This is true for "ordinary" sum and max 
counters as well. We should consolidate/simplify this code. 

 





[jira] [Commented] (BEAM-3407) Post-commit test fails with "Backing channel 'beam8' is disconnected"

2018-04-10 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433219#comment-16433219
 ] 

Robert Bradshaw commented on BEAM-3407:
---

This is still happening, e.g.

[https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/31/console]

https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/34/console

> Post-commit test fails with "Backing channel 'beam8' is disconnected"
> -
>
> Key: BEAM-3407
> URL: https://issues.apache.org/jira/browse/BEAM-3407
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Henning Rohde
>Assignee: Jason Kuster
>Priority: Major
>
> Example failure below.
> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/3809/console
> ...
> 2018-01-02T22:46:30.423 [INFO] Excluding io.grpc:grpc-protobuf-lite:jar:1.2.0 
> from the shaded jar.
> 2018-01-02T22:46:30.423 [INFO] Excluding io.grpc:grpc-stub:jar:1.2.0 from the 
> shaded jar.
> ERROR: Failed to parse POMs
> java.io.IOException: Backing channel 'beam8' is disconnected.
>   at 
> hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:212)
>   at 
> hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281)
>   at com.sun.proxy.$Proxy131.isAlive(Unknown Source)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1138)
>   at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:166)
>   at 
> hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:879)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
>   at hudson.model.Run.execute(Run.java:1724)
>   at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
>   at hudson.model.ResourceController.execute(ResourceController.java:97)
>   at hudson.model.Executor.run(Executor.java:421)
> Caused by: java.io.IOException: Unexpected termination of the channel
>   at 
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
> Caused by: java.io.EOFException
>   at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2675)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3150)
>   at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:859)
>   at java.io.ObjectInputStream.(ObjectInputStream.java:355)
>   at 
> hudson.remoting.ObjectInputStreamEx.(ObjectInputStreamEx.java:48)
>   at 
> hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
>   at 
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
> ERROR: Step ‘E-mail Notification’ failed: no workspace for 
> beam_PostCommit_Java_ValidatesRunner_Spark #3809
> ERROR: beam8 is offline; cannot locate JDK 1.8 (latest)
> ERROR: beam8 is offline; cannot locate Maven 3.3.3
> Finished: FAILURE





[jira] [Created] (BEAM-4030) Add CombineFn.compact, similar to Java

2018-04-06 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4030:
-

 Summary: Add CombineFn.compact, similar to Java
 Key: BEAM-4030
 URL: https://issues.apache.org/jira/browse/BEAM-4030
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Some CombineFns buffer elements in their add_inputs because the cost of a 
combining operation can be effectively amortized across many elements. However, 
this introduces the extra (possibly higher) cost of serializing potentially 
more expensive buffers through shuffle. We should add a 
CombineFn.compact(self, accumulator) method (defaulting to the identity), 
similar to what the Java SDK provides, which is called when flushing an element 
from the PGBKCV table. 
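
A rough sketch of what such a buffering combiner might look like follows. It is a standalone class that mirrors the CombineFn method names so it runs without Beam installed; `compact` is the proposed hook, and the class name, `max_buffer` parameter, and accumulator layout are assumptions made for illustration.

```python
class BufferedSum(object):
    """Sums numbers, buffering inputs so that additions happen in bulk."""

    def __init__(self, max_buffer=1000):
        self._max_buffer = max_buffer

    def create_accumulator(self):
        return (0, [])  # (running total, pending inputs)

    def add_input(self, accumulator, element):
        total, pending = accumulator
        pending.append(element)
        if len(pending) >= self._max_buffer:
            # Buffer is full: fold it into the running total.
            return self.compact((total, pending))
        return (total, pending)

    def merge_accumulators(self, accumulators):
        total = 0
        for accumulator in accumulators:
            compacted_total, _ = self.compact(accumulator)
            total += compacted_total
        return (total, [])

    def compact(self, accumulator):
        # Proposed hook: shrink the accumulator before it is serialized.
        total, pending = accumulator
        return (total + sum(pending), [])

    def extract_output(self, accumulator):
        return self.compact(accumulator)[0]
```

Without `compact`, an accumulator flushed mid-buffer would carry its whole pending list through shuffle; with it, only the running total and an empty list are serialized.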





[jira] [Commented] (BEAM-3737) Key-aware batching function

2018-04-04 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426269#comment-16426269
 ] 

Robert Bradshaw commented on BEAM-3737:
---

It seems that GroupByKey() would already give you values batched per key, 
right? Or are you looking for something you can place before the GBK that 
enables combiner lifting? 

> Key-aware batching function
> ---
>
> Key: BEAM-3737
> URL: https://issues.apache.org/jira/browse/BEAM-3737
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chuan Yu Foo
>Priority: Major
>
> I have a CombineFn for which add_input has very large overhead. I would like 
> to batch the incoming elements into a large batch before each call to 
> add_input to reduce this overhead. In other words, I would like to do 
> something like: 
> {{elements | GroupByKey() | BatchElements() | CombineValues(MyCombineFn())}}
> Unfortunately, BatchElements is not key-aware, and can't be used after a 
> GroupByKey to batch elements per key. I'm working around this by doing the 
> batching within CombineValues, which makes the CombineFn rather messy. It 
> would be nice if there were a key-aware BatchElements transform which could 
> be used in this context.
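
For illustration, a minimal sketch of what per-key batching could look like, in plain Python with no Beam dependency. The function name batch_values_per_key and the fixed batch_size are assumptions made for this example (BatchElements chooses its batch sizes adaptively, which is not modelled here).

```python
from itertools import islice


def batch_values_per_key(keyed_values, batch_size):
    """Yields (key, batch) pairs with at most batch_size values per batch.

    keyed_values is an iterable of (key, iterable_of_values) pairs, i.e. the
    shape of data one would see after grouping by key.
    """
    for key, values in keyed_values:
        iterator = iter(values)
        while True:
            # Pull the next chunk lazily so large groups are never
            # materialized in full.
            batch = list(islice(iterator, batch_size))
            if not batch:
                break
            yield key, batch
```

For example, batching [('a', [1, 2, 3]), ('b', [4])] with batch_size=2 yields ('a', [1, 2]), ('a', [3]) and ('b', [4]); batches never mix keys.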





[jira] [Updated] (BEAM-4012) Suppress tracebacks on failure-catching tests

2018-04-04 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-4012:
--
Labels: beginner  (was: )

> Suppress tracebacks on failure-catching tests
> -
>
> Key: BEAM-4012
> URL: https://issues.apache.org/jira/browse/BEAM-4012
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Robert Bradshaw
>Assignee: Ahmet Altay
>Priority: Major
>  Labels: beginner
>
> Our tests of assert_that (e.g. in 
> apache_beam.runners.portability.fn_api_runner_test) dump the expected "error" 
> to stdout making our tests noisy (and they look like failures). It'd be good 
> to suppress these in tests (while making sure things are still properly 
> logged on workers). 
> There are probably other tests of similar nature. 





[jira] [Created] (BEAM-4012) Suppress tracebacks on failure-catching tests

2018-04-04 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-4012:
-

 Summary: Suppress tracebacks on failure-catching tests
 Key: BEAM-4012
 URL: https://issues.apache.org/jira/browse/BEAM-4012
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Our tests of assert_that (e.g. in 
apache_beam.runners.portability.fn_api_runner_test) dump the expected "error" 
to stdout making our tests noisy (and they look like failures). It'd be good to 
suppress these in tests (while making sure things are still properly logged on 
workers). 

There are probably other tests of similar nature. 
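
One possible shape for this, as a rough sketch assuming Python 3 and only the standard library: a context manager that silences stdout, stderr, and a logger for the duration of an expected failure. The name quiet_expected_failure is hypothetical; nothing here is an existing Beam test utility.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def quiet_expected_failure(logger_name=None):
    """Silences output while a test exercises an expected failure."""
    logger = logging.getLogger(logger_name)
    previous_level = logger.level
    # Raise the threshold above CRITICAL so nothing is emitted by this logger.
    logger.setLevel(logging.CRITICAL + 1)
    try:
        with contextlib.redirect_stdout(io.StringIO()), \
                contextlib.redirect_stderr(io.StringIO()):
            yield
    finally:
        # Always restore the level so later tests (and workers) log normally.
        logger.setLevel(previous_level)
```

A test would wrap only the failing call, e.g. inside an assertRaises block, so that genuine failures elsewhere in the test still print their tracebacks.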





[jira] [Created] (BEAM-3911) Do we want to encourage release pinning?

2018-03-22 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3911:
-

 Summary: Do we want to encourage release pinning?
 Key: BEAM-3911
 URL: https://issues.apache.org/jira/browse/BEAM-3911
 Project: Beam
  Issue Type: Bug
  Components: website
Reporter: Robert Bradshaw
Assignee: Melissa Pashniak


The Python instructions recommend adding apache-beam==2.3.0 to one's setup.py. 
Instead, we should probably recommend apache-beam<3. It may also be worth 
calling out that apache-beam can be installed with `pip install apache-beam`. 
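
As a small illustration of the difference, here is a hypothetical fragment of a user's setup.py dependency list. The variable name INSTALL_REQUIRES is an assumption for this sketch; the two specifiers are the ones mentioned above.

```python
# Hypothetical dependency list that a user's setup.py would pass to
# setuptools as install_requires.
INSTALL_REQUIRES = [
    # Exact pin, as the instructions currently recommend; it rules out
    # every later release, including bug-fix releases:
    # 'apache-beam==2.3.0',

    # Upper bound only; any 2.x release satisfies it:
    'apache-beam<3',
]
```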





[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner

2018-03-20 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407244#comment-16407244
 ] 

Robert Bradshaw commented on BEAM-1442:
---

As of today, you can get this from pip. Note for avro in particular that 
https://issues.apache.org/jira/browse/BEAM-2810 is still open. 

> Performance improvement of the Python DirectRunner
> --
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Pablo Estrada
>Assignee: Charles Chen
>Priority: Major
>  Labels: gsoc2017, mentor, python
> Fix For: 2.4.0
>
>
> The DirectRunners for Python and Java are intended to act as policy enforcers 
> and correctness checkers for Beam pipelines, but there are users who run 
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.





[jira] [Created] (BEAM-3890) Add withHotKeyFanout to Python

2018-03-20 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3890:
-

 Summary: Add withHotKeyFanout to Python
 Key: BEAM-3890
 URL: https://issues.apache.org/jira/browse/BEAM-3890
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay








[jira] [Resolved] (BEAM-1442) Performance improvement of the Python DirectRunner

2018-03-20 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-1442.
---
Resolution: Fixed

While one is rarely "done" with possible performance improvements, I'm going to 
close this bug as done for 2.4.0 due to the significant improvements that make 
the Python runner at least execute at reasonable speed now. 

> Performance improvement of the Python DirectRunner
> --
>
> Key: BEAM-1442
> URL: https://issues.apache.org/jira/browse/BEAM-1442
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Pablo Estrada
>Assignee: Charles Chen
>Priority: Major
>  Labels: gsoc2017, mentor, python
> Fix For: 2.4.0
>
>
> The DirectRunners for Python and Java are intended to act as policy enforcers 
> and correctness checkers for Beam pipelines, but there are users who run 
> data processing tasks in them.
> Currently, the Python Direct Runner has less-than-great performance, although 
> some work has gone into improving it. There are more opportunities for 
> improvement.
> Skills for this project:
> * Python
> * Cython (nice to have)
> * Working through the Beam getting started materials (nice to have)
> To start figuring out this problem, it is advisable to run a simple pipeline, 
> and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions 
> directly on JIRA.





[jira] [Updated] (BEAM-3851) Support element timestamps while publishing to Kafka.

2018-03-20 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3851:
--
Fix Version/s: (was: 2.4.0)
   2.5.0

> Support element timestamps while publishing to Kafka.
> -
>
> Key: BEAM-3851
> URL: https://issues.apache.org/jira/browse/BEAM-3851
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-kafka
>Affects Versions: 2.3.0
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The KafkaIO sink should support using the input element timestamp for the 
> message published to Kafka. Otherwise there is no way for the user to 
> influence the timestamp of the messages in the Kafka sink.





[jira] [Resolved] (BEAM-3815) 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or beam-2.4.0

2018-03-20 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3815.
---
Resolution: Fixed

> 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or 
> beam-2.4.0
> --
>
> Key: BEAM-3815
> URL: https://issues.apache.org/jira/browse/BEAM-3815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.4.0
>Reporter: Alan Myrvold
>Assignee: Robert Bradshaw
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Running dataflow quickstart with the 2.4.0 RC2 uses a container named 
> beam-master-20180228, should be 2.4.0 or beam-2.4.0, as in previous releases.
> For 2.3.0, this version was set on the release branch with 
> [https://github.com/apache/beam/commit/245cb4cbec9c1dd7d4f9ff8eb3ab38667c5472e9#diff-0032670e9006c6eba52569423c64e6d2]
> Perhaps the release guide needs to be updated?
>  wget https://repository.apache.org/content/repositories/orgapachebeam-1030/org/apache/beam/beam-runners-google-cloud-dataflow-java/2.4.0/beam-runners-google-cloud-dataflow-java-2.4.0.jar
>  jar xf beam-runners-google-cloud-dataflow-java-2.4.0.jar org/apache/beam/runners/dataflow/dataflow.properties
>  grep container.version org/apache/beam/runners/dataflow/dataflow.properties





[jira] [Updated] (BEAM-3868) Move state and time logic out of the ParDoEvaluator

2018-03-20 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3868:
--
Fix Version/s: (was: 2.4.0)

> Move state and time logic out of the ParDoEvaluator
> ---
>
> Key: BEAM-3868
> URL: https://issues.apache.org/jira/browse/BEAM-3868
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-direct
>Reporter: Batkhuyag Batsaikhan
>Assignee: Batkhuyag Batsaikhan
>Priority: Minor
>
> ParDoEvaluator has State and Timer logic that ideally belongs to 
> StatefulParDoEvaluator. If possible, move this logic to its appropriate 
> place.





[jira] [Created] (BEAM-3871) Ensure all non-root transforms have a Parent

2018-03-17 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3871:
-

 Summary: Ensure all non-root transforms have a Parent
 Key: BEAM-3871
 URL: https://issues.apache.org/jira/browse/BEAM-3871
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Sometimes the parent is set to None, but even top-level transforms should have 
it set to the root transform.





[jira] [Created] (BEAM-3865) Incorrect timestamp on merging window outputs.

2018-03-16 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3865:
-

 Summary: Incorrect timestamp on merging window outputs.
 Key: BEAM-3865
 URL: https://issues.apache.org/jira/browse/BEAM-3865
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Affects Versions: 2.3.0, 2.2.0
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Looks like we're setting multiple watermark holds with one arbitrarily being 
held. 





[jira] [Updated] (BEAM-3815) 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or beam-2.4.0

2018-03-08 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3815:
--
Fix Version/s: 2.4.0

> 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or 
> beam-2.4.0
> --
>
> Key: BEAM-3815
> URL: https://issues.apache.org/jira/browse/BEAM-3815
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.4.0
>Reporter: Alan Myrvold
>Assignee: Robert Bradshaw
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Running dataflow quickstart with the 2.4.0 RC2 uses a container named 
> beam-master-20180228, should be 2.4.0 or beam-2.4.0, as in previous releases.
> For 2.3.0, this version was set on the release branch with 
> [https://github.com/apache/beam/commit/245cb4cbec9c1dd7d4f9ff8eb3ab38667c5472e9#diff-0032670e9006c6eba52569423c64e6d2]
> Perhaps the release guide needs to be updated?
>  wget https://repository.apache.org/content/repositories/orgapachebeam-1030/org/apache/beam/beam-runners-google-cloud-dataflow-java/2.4.0/beam-runners-google-cloud-dataflow-java-2.4.0.jar
>  jar xf beam-runners-google-cloud-dataflow-java-2.4.0.jar org/apache/beam/runners/dataflow/dataflow.properties
>  grep container.version org/apache/beam/runners/dataflow/dataflow.properties





[jira] [Created] (BEAM-3812) Avoid pickling PTransforms in proto representation

2018-03-08 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3812:
-

 Summary: Avoid pickling PTransforms in proto representation
 Key: BEAM-3812
 URL: https://issues.apache.org/jira/browse/BEAM-3812
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Any transform that requires passing information through the runner protos 
should have an explicit urn and payload. 
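
A rough sketch of the pattern, in plain Python. Everything named here is hypothetical (the registry, the decorator, the FixedAdd class, the example urn string, and the to_urn_and_payload / from_payload helpers); it only illustrates a transform that describes itself with an explicit urn and a small payload instead of being pickled whole, and it does not mirror Beam's actual signatures.

```python
# Hypothetical registry mapping urns to the classes that can rebuild
# themselves from a payload.
_TRANSFORM_CONSTRUCTORS = {}


def register_urn(urn):
    """Class decorator that records which class owns a given urn."""
    def decorator(cls):
        _TRANSFORM_CONSTRUCTORS[urn] = cls
        cls.URN = urn
        return cls
    return decorator


@register_urn('example:transform:fixed_add:v1')
class FixedAdd(object):
    """Toy transform whose whole configuration is a single integer."""

    def __init__(self, amount):
        self.amount = amount

    def to_urn_and_payload(self):
        # Only the explicit configuration is serialized, not the object.
        return self.URN, str(self.amount).encode('utf-8')

    @classmethod
    def from_payload(cls, payload):
        return cls(int(payload.decode('utf-8')))


def from_urn_and_payload(urn, payload):
    """Rebuilds a transform from its urn and payload via the registry."""
    return _TRANSFORM_CONSTRUCTORS[urn].from_payload(payload)
```

Unlike a pickle, the (urn, payload) pair can be understood by a runner or another SDK that knows the urn, without importing the Python class.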





[jira] [Commented] (BEAM-3799) Nexmark Query 10 breaks with direct runner

2018-03-07 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389910#comment-16389910
 ] 

Robert Bradshaw commented on BEAM-3799:
---

The direct runner does extra checking. Looks like the benchmark is incorrectly 
written. 
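
To illustrate the kind of extra checking involved, here is a minimal sketch in Python of detecting mutation by comparing encodings: snapshot the encoded form of a value when it is output, then re-encode and compare later. The class name EncodedSnapshot and the use of pickle as the encoder are assumptions for this example; this is not the direct runner's implementation.

```python
import pickle


class EncodedSnapshot(object):
    """Remembers a value's encoding at output time and checks it later."""

    def __init__(self, value, encode=pickle.dumps):
        self._value = value
        self._encode = encode
        # Encoding captured at the moment the value is output.
        self._encoded_at_output = encode(value)

    def verify_unmodified(self):
        # Any difference in the re-encoded bytes means the value changed
        # after it was output.
        if self._encode(self._value) != self._encoded_at_output:
            raise ValueError(
                'value mutated after being output: %r' % (self._value,))
```

A check of this kind is only as reliable as the encoding: if encoding the same logical value twice can produce different bytes, it will report a mutation even though user code never touched the value.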

> Nexmark Query 10 breaks with direct runner
> --
>
> Key: BEAM-3799
> URL: https://issues.apache.org/jira/browse/BEAM-3799
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Ismaël Mejía
>Assignee: Thomas Groh
>Priority: Major
> Fix For: 2.4.0
>
>
> While running query 10 with the direct runner like this:
> {quote}mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main 
> -Pdirect-runner -Dexec.args="--runner=DirectRunner --query=10 
> --streaming=false --manageResources=false --monitorJobs=true 
> --enforceEncodability=true --enforceImmutability=true" -pl 'sdks/java/nexmark'
> {quote}
> I found that it breaks with the direct runner with the following exception (it 
> works ok with the other runners):
> {quote}[WARNING] 
> java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke (Method.java:498)
>     at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:294)
>     at java.lang.Thread.run (Thread.java:748)
> Caused by: org.apache.beam.sdk.util.IllegalMutationException: PTransform 
> Query10/Query10.UploadEvents/ParMultiDo(Anonymous) mutated value KV{null, 
> 2015-07-15T00:00:09.999Z shard-3-of-00025 0 ON_TIME null
> } after it was output (new value was KV{null, 2015-07-15T00:00:09.999Z 
> shard-3-of-00025 0 ON_TIME null
> }). Values must not be mutated in any way after being output.
>     at 
> org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.commit
>  (ImmutabilityCheckingBundleFactory.java:134)
>     at org.apache.beam.runners.direct.EvaluationContext.commitBundles 
> (EvaluationContext.java:212)
>     at org.apache.beam.runners.direct.EvaluationContext.handleResult 
> (EvaluationContext.java:152)
>     at 
> org.apache.beam.runners.direct.QuiescenceDriver$TimerIterableCompletionCallback.handleResult
>  (QuiescenceDriver.java:258)
>     at org.apache.beam.runners.direct.DirectTransformExecutor.finishBundle 
> (DirectTransformExecutor.java:190)
>     at org.apache.beam.runners.direct.DirectTransformExecutor.run 
> (DirectTransformExecutor.java:127)
>     at java.util.concurrent.Executors$RunnableAdapter.call 
> (Executors.java:511)
>     at java.util.concurrent.FutureTask.run (FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker 
> (ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run 
> (ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run (Thread.java:748)
> Caused by: org.apache.beam.sdk.util.IllegalMutationException: Value KV{null, 
> 2015-07-15T00:00:09.999Z shard-3-of-00025 0 ON_TIME null
> } mutated illegally, new value was KV{null, 2015-07-15T00:00:09.999Z 
> shard-3-of-00025 0 ON_TIME null
> }. Encoding was 
> rO0ABXNyADZvcmcuYXBhY2hlLmJlYW0uc2RrLm5leG1hcmsucXVlcmllcy5RdWVyeTEwJE91dHB1dEZpbGUWUg9rZM1SvgIABUoABWluZGV4TAAIZmlsZW5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAAxtYXhUaW1lc3RhbXB0ABdMb3JnL2pvZGEvdGltZS9JbnN0YW50O0wABXNoYXJkcQB-AAFMAAZ0aW1pbmd0ADpMb3JnL2FwYWNoZS9iZWFtL3Nkay90cmFuc2Zvcm1zL3dpbmRvd2luZy9QYW5lSW5mbyRUaW1pbmc7eHAAAHBzcgAVb3JnLmpvZGEudGltZS5JbnN0YW50Lci-0MYOnM0CAAFKAAdpTWlsbGlzeHFOjwLrD3QAFHNoYXJkLTAwMDAzLW9mLTAwMDI1fnIAOG9yZy5hcGFjaGUuYmVhbS5zZGsudHJhbnNmb3Jtcy53aW5kb3dpbmcuUGFuZUluZm8kVGltaW5nAAASAAB4cgAOamF2YS5sYW5nLkVudW0AABIAAHhwdAAHT05fVElNRQ,
>  now 
> rO0ABXNyADZvcmcuYXBhY2hlLmJlYW0uc2RrLm5leG1hcmsucXVlcmllcy5RdWVyeTEwJE91dHB1dEZpbGUWUg9rZM1SvgIABUoABWluZGV4TAAIZmlsZW5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAAxtYXhUaW1lc3RhbXB0ABdMb3JnL2pvZGEvdGltZS9JbnN0YW50O0wABXNoYXJkcQB-AAFMAAZ0aW1pbmd0ADpMb3JnL2FwYWNoZS9iZWFtL3Nkay90cmFuc2Zvcm1zL3dpbmRvd2luZy9QYW5lSW5mbyRUaW1pbmc7eHAAAHBzcgAVb3JnLmpvZGEudGltZS5JbnN0YW50Lci-0MYOnM0CAAFKAAdpTWlsbGlzeHFOjwLrD3QAFHNoYXJkLTAwMDAzLW9mLTAwMDI1fnIAOG9yZy5hcGFjaGUuYmVhbS5zZGsudHJhbnNmb3Jtcy53aW5kb3dpbmcuUGFuZUluZm8kVGltaW5nAAASAAB4cgAOamF2YS5sYW5nLkVudW0AABIAAHhwdAAHT05fVElNRQ.
>     at 
> org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.illegalMutation
>  (MutationDetectors.java:144)
>     at 
> org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.verifyUnmodifiedThrowingCheckedExceptions
>  

[jira] [Updated] (BEAM-3409) Unexpected behavior of DoFn teardown method running in unit tests

2018-03-07 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3409:
--
Fix Version/s: (was: 2.4.0)
   2.5.0

> Unexpected behavior of DoFn teardown method running in unit tests 
> --
>
> Key: BEAM-3409
> URL: https://issues.apache.org/jira/browse/BEAM-3409
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.3.0
>Reporter: Alexey Romanenko
>Assignee: Romain Manni-Bucau
>Priority: Blocker
>  Labels: test
> Fix For: 2.5.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> While writing a unit test, I found strange behaviour in the Teardown method of 
> a DoFn implementation when I run this method in unit tests using TestPipeline.
> To be more precise, it doesn’t wait until the teardown() method has finished; 
> it just exits from this method after about 1 sec (on my machine) even if it 
> should take longer (a very simple example: running an infinite loop inside this 
> method or putting the thread to sleep). At the same time, when I run the same 
> code from main() with an ordinary Pipeline and the direct runner, it works as 
> expected: the teardown() method is performed completely regardless of 
> how much time it takes.
> I created two test cases to reproduce this issue: the first one to run with 
> main() and the second one to run with junit. They use the same implementation 
> of DoFn (class LongTearDownFn) and expect that the teardown method will be 
> running for at least SLEEP_TIME ms. When running as a junit test this is not 
> the case (see output log).
> - run with main()
> https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/main/java/TearDown.java
> - run with junit
> https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/test/java/TearDownTest.java





[jira] [Created] (BEAM-3792) Python submits portable pipelines to the Flink-served endpoint.

2018-03-06 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3792:
-

 Summary: Python submits portable pipelines to the Flink-served 
endpoint.
 Key: BEAM-3792
 URL: https://issues.apache.org/jira/browse/BEAM-3792
 Project: Beam
  Issue Type: Sub-task
  Components: runner-flink
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw








[jira] [Commented] (BEAM-3780) Add a utility to instantiate a partially unknown coder

2018-03-05 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386905#comment-16386905
 ] 

Robert Bradshaw commented on BEAM-3780:
---

Isn't this capability already present as part of LengthPrefixUnknownCoders? Or 
is this a proposal to refactor/modify how that works? 

> Add a utility to instantiate a partially unknown coder
> --
>
> Key: BEAM-3780
> URL: https://issues.apache.org/jira/browse/BEAM-3780
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Reporter: Thomas Groh
>Assignee: Thomas Groh
>Priority: Major
>
> Coders must be understood by the SDK harness that is encoding or decoding the 
> associated elements. However, the pipeline runner is capable of constructing 
> partial coders, where an unknown coder is replaced with a ByteArrayCoder. It 
> then can decompose the portions of elements it is aware of, without having to 
> understand the custom element encodings.
>  
> This should go in CoderTranslation, as an alternative to the full-fidelity 
> rehydration of a coder.
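
A minimal sketch of the substitution described above, in plain Python. The representation of a coder as a (urn, components) tuple, the urn strings, and the function name replace_unknown_coders are all hypothetical; the only idea taken from the description is that coders the runner does not understand are replaced by a byte-array coder while the known structure is preserved.

```python
# Hypothetical set of coder urns that the runner knows how to decompose.
KNOWN_URNS = frozenset(['kv', 'iterable', 'bytes', 'varint'])


def replace_unknown_coders(coder, known_urns=KNOWN_URNS):
    """Returns a copy of the coder tree with unknown coders made opaque.

    A coder is modelled as a (urn, components) tuple, where components is a
    tuple of nested coders.
    """
    urn, components = coder
    if urn not in known_urns:
        # The runner cannot interpret this coder (or anything nested inside
        # it), so it treats the whole subtree as opaque bytes.
        return ('bytes', ())
    return (urn, tuple(
        replace_unknown_coders(component, known_urns)
        for component in components))
```

For example, a 'kv' coder whose value is an 'iterable' of some custom coder becomes a 'kv' of 'varint' and 'iterable' of 'bytes', so the runner can still split keys from values and iterate over the elements.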





[jira] [Resolved] (BEAM-3409) Unexpected behavior of DoFn teardown method running in unit tests

2018-03-01 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3409.
---
Resolution: Fixed

> Unexpected behavior of DoFn teardown method running in unit tests 
> --
>
> Key: BEAM-3409
> URL: https://issues.apache.org/jira/browse/BEAM-3409
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.3.0
>Reporter: Alexey Romanenko
>Assignee: Romain Manni-Bucau
>Priority: Blocker
>  Labels: test
> Fix For: 2.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> While writing a unit test, I found strange behaviour in the Teardown method of 
> a DoFn implementation when I run this method in unit tests using TestPipeline.
> To be more precise, it doesn’t wait until the teardown() method has finished; 
> it just exits from this method after about 1 sec (on my machine) even if it 
> should take longer (a very simple example: running an infinite loop inside this 
> method or putting the thread to sleep). At the same time, when I run the same 
> code from main() with an ordinary Pipeline and the direct runner, it works as 
> expected: the teardown() method is performed completely regardless of 
> how much time it takes.
> I created two test cases to reproduce this issue: the first one to run with 
> main() and the second one to run with junit. They use the same implementation 
> of DoFn (class LongTearDownFn) and expect that the teardown method will be 
> running for at least SLEEP_TIME ms. When running as a junit test this is not 
> the case (see output log).
> - run with main()
> https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/main/java/TearDown.java
> - run with junit
> https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/test/java/TearDownTest.java





[jira] [Created] (BEAM-3768) Compile error for Flink translation

2018-03-01 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3768:
-

 Summary: Compile error for Flink translation
 Key: BEAM-3768
 URL: https://issues.apache.org/jira/browse/BEAM-3768
 Project: Beam
  Issue Type: Test
  Components: runner-flink
Reporter: Robert Bradshaw
Assignee: Aljoscha Krettek
 Fix For: 2.4.0


2018-03-01T21:22:58.234 [INFO] --- maven-compiler-plugin:3.7.0:compile 
(default-compile) @ beam-runners-flink_2.11 ---
2018-03-01T21:22:58.258 [INFO] Changes detected - recompiling the module!
2018-03-01T21:22:58.259 [INFO] Compiling 75 source files to 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/target/classes
2018-03-01T21:22:59.555 [INFO] 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/DedupingOperator.java:
 Some input files use or override a deprecated API.
2018-03-01T21:22:59.556 [INFO] 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/DedupingOperator.java:
 Recompile with -Xlint:deprecation for details.
2018-03-01T21:22:59.556 [INFO] 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/SideInputInitializer.java:
 Some input files use unchecked or unsafe operations.
2018-03-01T21:22:59.556 [INFO] 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/SideInputInitializer.java:
 Recompile with -Xlint:unchecked for details.
2018-03-01T21:22:59.556 [INFO] 
-
2018-03-01T21:22:59.556 [ERROR] COMPILATION ERROR : 
2018-03-01T21:22:59.556 [INFO] 
-
2018-03-01T21:22:59.557 [ERROR] 
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java:[359,11]
 cannot infer type arguments for org.apache.beam.runners.core.ProcessFnRunner<>
  reason: cannot infer type-variable(s) InputT,OutputT,RestrictionT
(argument mismatch; org.apache.beam.runners.core.DoFnRunner cannot be 
converted to 
org.apache.beam.runners.core.SimpleDoFnRunner>,OutputT>)





[jira] [Commented] (BEAM-3479) Add a regression test for the DoFn classloader selection

2018-03-01 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382865#comment-16382865
 ] 

Robert Bradshaw commented on BEAM-3479:
---

Bumping to the next version, as this is non-blocking (per comments on the PR).

> Add a regression test for the DoFn classloader selection
> 
>
> Key: BEAM-3479
> URL: https://issues.apache.org/jira/browse/BEAM-3479
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-core
>Reporter: Romain Manni-Bucau
>Assignee: Romain Manni-Bucau
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Follow up task after https://github.com/apache/beam/pull/4235 merge. This 
> task is about ensuring we test that.





[jira] [Updated] (BEAM-3479) Add a regression test for the DoFn classloader selection

2018-03-01 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3479:
--
Fix Version/s: (was: 2.4.0)
   2.5.0

> Add a regression test for the DoFn classloader selection
> 
>
> Key: BEAM-3479
> URL: https://issues.apache.org/jira/browse/BEAM-3479
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-core
>Reporter: Romain Manni-Bucau
>Assignee: Romain Manni-Bucau
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Follow up task after https://github.com/apache/beam/pull/4235 merge. This 
> task is about ensuring we test that.





[jira] [Resolved] (BEAM-3611) Split KafkaIO.java into smaller files

2018-03-01 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3611.
---
Resolution: Fixed

Closing this as resolved, based on an inspection of the PR. 

> Split KafkaIO.java into smaller files
> -
>
> Key: BEAM-3611
> URL: https://issues.apache.org/jira/browse/BEAM-3611
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-kafka
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Minor
> Fix For: 2.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> KafkaIO.java has grown too big and includes both source and sink 
> implementations. Better to move these to their own files. 





[jira] [Updated] (BEAM-3700) PipelineOptionsFactory leaks memory

2018-02-28 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3700:
--
Fix Version/s: (was: 2.4.0)
   2.5.0

> PipelineOptionsFactory leaks memory
> ---
>
> Key: BEAM-3700
> URL: https://issues.apache.org/jira/browse/BEAM-3700
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-core
>Reporter: Romain Manni-Bucau
>Assignee: Romain Manni-Bucau
>Priority: Major
> Fix For: 2.5.0
>
>
> PipelineOptionsFactory has a lot of caches but no way to reset them. This task 
> is about adding a public method to be able to control them in integrations 
> (most likely runners).





[jira] [Closed] (BEAM-3689) Direct runner leak a reader for every 10 input records

2018-02-28 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw closed BEAM-3689.
-
Resolution: Fixed

Based on my reading of the PR, this is fixed. 

> Direct runner leak a reader for every 10 input records
> --
>
> Key: BEAM-3689
> URL: https://issues.apache.org/jira/browse/BEAM-3689
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-direct
>Reporter: Raghu Angadi
>Assignee: Raghu Angadi
>Priority: Major
> Fix For: 2.4.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The direct runner reads 10 records at a time from a reader. I think the 
> intention is to reuse the reader, but it reuses it only if the reader is idle 
> initially, not when the source has messages available.
> When I was testing KafkaIO with the direct runner, it kept opening a new reader 
> for every 10 records and soon ran out of file descriptors.





[jira] [Updated] (BEAM-3702) Support system properties source for pipeline options

2018-02-28 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3702:
--
Fix Version/s: (was: 2.4.0)
   2.5.0

> Support system properties source for pipeline options
> -
>
> Key: BEAM-3702
> URL: https://issues.apache.org/jira/browse/BEAM-3702
> Project: Beam
>  Issue Type: Task
>  Components: sdk-java-core
>Reporter: Romain Manni-Bucau
>Assignee: Romain Manni-Bucau
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-3749) support customized trigger/accumulationMode in BeamSql

2018-02-28 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3749:
--
Fix Version/s: (was: 2.4.0)
   2.5.0

> support customized trigger/accumulationMode in BeamSql
> --
>
> Key: BEAM-3749
> URL: https://issues.apache.org/jira/browse/BEAM-3749
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently BeamSql uses {{DefaultTrigger}} for aggregation operations. 
> By adding two options {{withTrigger(Trigger)}} and 
> {{withAccumulationMode(AccumulationMode)}}, developers can specify their own 
> aggregation strategies with BeamSql.
> [~xumingming] [~kedin] [~kenn] for any comments.





[jira] [Commented] (BEAM-3749) support customized trigger/accumulationMode in BeamSql

2018-02-28 Thread Robert Bradshaw (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381530#comment-16381530
 ] 

Robert Bradshaw commented on BEAM-3749:
---

I wonder if sink-based triggers would be a better fit here than adding 
triggering to a SQL statement. 

> support customized trigger/accumulationMode in BeamSql
> --
>
> Key: BEAM-3749
> URL: https://issues.apache.org/jira/browse/BEAM-3749
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Xu Mingmin
>Priority: Major
> Fix For: 2.4.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently BeamSql uses {{DefaultTrigger}} for aggregation operations. 
> By adding two options {{withTrigger(Trigger)}} and 
> {{withAccumulationMode(AccumulationMode)}}, developers can specify their own 
> aggregation strategies with BeamSql.
> [~xumingming] [~kedin] [~kenn] for any comments.





[jira] [Created] (BEAM-3713) Consider moving away from nose to nose2 or pytest.

2018-02-15 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3713:
-

 Summary: Consider moving away from nose to nose2 or pytest.
 Key: BEAM-3713
 URL: https://issues.apache.org/jira/browse/BEAM-3713
 Project: Beam
  Issue Type: Test
  Components: sdk-py-core, testing
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


Per https://nose.readthedocs.io/en/latest/, nose is in maintenance mode.





[jira] [Closed] (BEAM-3205) Publicly document known coder wire formats and their URNs

2018-02-12 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw closed BEAM-3205.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Publicly document known coder wire formats and their URNs
> -
>
> Key: BEAM-3205
> URL: https://issues.apache.org/jira/browse/BEAM-3205
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model
>Reporter: Kenneth Knowles
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability
> Fix For: 2.4.0
>
>
> Overarching issue: We need to get our Google Docs, markdown, and email 
> threads that sketch the Beam model as it is developed into a centralized 
> place with clear information architecture / navigation, and draw the line 
> that "if it isn't reachable from here in an obvious way it isn't the spec". 
> [1]
> Specific issue: Which coders are required for a runner and SDK to understand? 
> Which coders are otherwise considered standardized? What is the abstract 
> specification for their wire format?
> Today we have 
> https://github.com/apache/beam/blob/master/model/fn-execution/src/test/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>  which is the beginning of a compliance test suite for standardized coders.
> This would really benefit from:
>  - narrative descriptions of the formats, including _abstract_ specification 
> (not examples) and perhaps motivation
>  - specification of which are required and which are merely "well known"
>  - ties into BEAM-3203 in terms of which coders are required to decode to a 
> compatible value in every SDK
>  - once we have an abstract spec and some examples, and one language has 
> robust coders that pass the examples, we could turn it around and treat that 
> implementation as a reference impl for fuzz testing
> Any sort of fancy hacking that blends the tests with the narrative is fine, 
> though mostly I think they'll end up covering disjoint topics.
> [1] I filed BEAM-2567 and BEAM-2568 and ported 
> https://beam.apache.org/contribute/runner-guide/, and [~herohde] put together 
> https://beam.apache.org/contribute/portability/ and 
> https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md
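
As an illustration of what an abstract wire-format description has to pin down, here is a sketch of the common base-128 variable-length integer encoding for non-negative values. It is a generic example, not text taken from the Beam specification. The function names are made up for this sketch, and whether a given standard coder uses exactly this layout should be checked against standard_coders.yaml.

```python
def encode_varint(value):
    """Encode a non-negative int, 7 bits per byte, least-significant group first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        bits = value & 0x7F
        value >>= 7
        if value:
            out.append(bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(bits)  # high bit clear: this is the last byte
            return bytes(out)


def decode_varint(data):
    """Inverse of encode_varint; returns the decoded int."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            return result
    raise ValueError("truncated varint")
```

A narrative spec would state these rules (group size, byte order, continuation bit) in words, and the YAML examples would then serve as test vectors for any implementation.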





[jira] [Resolved] (BEAM-3074) Propagate pipeline protos through Dataflow API from Python

2018-02-12 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3074.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Propagate pipeline protos through Dataflow API from Python
> --
>
> Key: BEAM-3074
> URL: https://issues.apache.org/jira/browse/BEAM-3074
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Kenneth Knowles
>Assignee: Robert Bradshaw
>Priority: Major
>  Labels: portability
> Fix For: 2.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-3644) Speed up Python DirectRunner execution by using the FnApiRunner when possible

2018-02-07 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3644:
--
Description: 
Local execution of Beam pipelines on the Python DirectRunner currently 
suffers from performance issues, which makes it hard for pipeline authors to 
iterate, especially on medium to large datasets. We would like to optimize 
this and make it a better experience for Beam users.

The FnApiRunner was written as a way of leveraging the portability framework 
execution code path for local execution, to aid portability development. We've 
found it also offers great speedups in batch execution, so we propose switching 
to this runner for batch pipelines. For example, WordCount on the 
Shakespeare dataset with a single CPU core now takes 50 seconds to run, 
compared to 12 minutes before, a roughly 15x performance improvement that 
users get for free, with no pipeline changes.

  was:
Local execution of Beam pipelines on the current Python DirectRunner currently 
suffers from performance issues, which makes it hard for pipeline authors to 
iterate, especially on medium to large size datasets. We would like to optimize 
and make this a better experience for Beam users.

In the past few months, Robert implemented the FnApiRunner as a way of 
leveraging the portability framework execution code path for local execution. 
We've found great speedups in batch execution, so we propose to switch to use 
this runner in batch pipelines. For example, WordCount on the Shakespeare 
dataset with a single CPU core now takes 50 seconds to run, compared to 12 
minutes before, a 15x performance improvement that users can get for free, with 
no pipeline changes.


> Speed up Python DirectRunner execution by using the FnApiRunner when possible
> -
>
> Key: BEAM-3644
> URL: https://issues.apache.org/jira/browse/BEAM-3644
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Charles Chen
>Priority: Major
>
> Local execution of Beam pipelines on the Python DirectRunner currently 
> suffers from performance issues, which makes it hard for pipeline authors 
> to iterate, especially on medium to large datasets. We would like to 
> optimize this and make it a better experience for Beam users.
> The FnApiRunner was written as a way of leveraging the portability framework 
> execution code path for local execution, to aid portability development. 
> We've found it also offers great speedups in batch execution, so we propose 
> switching to this runner for batch pipelines. For example, WordCount on the 
> Shakespeare dataset with a single CPU core now takes 50 seconds to run, 
> compared to 12 minutes before, a roughly 15x performance improvement that 
> users get for free, with no pipeline changes.
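
The quoted speedup can be checked with a line of arithmetic. The two timings are the ones given in the description; the variable names are only for this sketch.

```python
before_seconds = 12 * 60  # 12 minutes on the previous DirectRunner path
after_seconds = 50        # 50 seconds with the FnApiRunner

speedup = before_seconds / after_seconds
print(f"{speedup:.1f}x")  # 14.4x, which the description rounds to 15x
```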





[jira] [Created] (BEAM-3625) DoFn.XxxParam does not work for Map and FlatMap

2018-02-06 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3625:
-

 Summary: DoFn.XxxParam does not work for Map and FlatMap
 Key: BEAM-3625
 URL: https://issues.apache.org/jira/browse/BEAM-3625
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw








[jira] [Created] (BEAM-3595) Normalize URNs across SDKs and runners.

2018-02-01 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3595:
-

 Summary: Normalize URNs across SDKs and runners.
 Key: BEAM-3595
 URL: https://issues.apache.org/jira/browse/BEAM-3595
 Project: Beam
  Issue Type: Bug
  Components: beam-model
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles








[jira] [Resolved] (BEAM-2804) support TIMESTAMP in Sort

2018-01-23 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-2804.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> support TIMESTAMP in Sort
> -
>
> Key: BEAM-2804
> URL: https://issues.apache.org/jira/browse/BEAM-2804
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Xu Mingmin
>Assignee: Shayang Zang
>Priority: Minor
>  Labels: beginner
> Fix For: 2.3.0
>
>






[jira] [Updated] (BEAM-3510) Python Runner API serialization for CombinePerKey drops arguments to CombineFn

2018-01-22 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw updated BEAM-3510:
--
Fix Version/s: 2.3.0

> Python Runner API serialization for CombinePerKey drops arguments to CombineFn
> --
>
> Key: BEAM-3510
> URL: https://issues.apache.org/jira/browse/BEAM-3510
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.2.0
>Reporter: Charles Chen
>Assignee: Robert Bradshaw
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the Runner API serialization for CombinePerKey drops the args and 
> kwargs that are passed at runtime to the CombineFn.  We should fix this issue 
> so that the Runner API round-trip preserves semantics.
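
A rough sketch of the property at stake, using a JSON payload rather than Beam's Runner API protos. All names below (top_n_sum, COMBINE_FNS, to_payload, from_payload) are hypothetical. The point is only that the serialized form must carry the extra args and kwargs alongside the function reference.

```python
import json


def top_n_sum(values, n, scale=1):
    """Example combine function that takes extra runtime arguments."""
    return scale * sum(sorted(values, reverse=True)[:n])


# Registry so that a function can be referenced by name in the payload.
COMBINE_FNS = {"top_n_sum": top_n_sum}


def to_payload(fn_name, args, kwargs):
    # Serialize the function reference together with its runtime arguments.
    # Leaving out "args"/"kwargs" here would reproduce the bug described above.
    return json.dumps({"fn": fn_name, "args": list(args), "kwargs": kwargs})


def from_payload(payload):
    spec = json.loads(payload)
    fn = COMBINE_FNS[spec["fn"]]
    return lambda values: fn(values, *spec["args"], **spec["kwargs"])


combine = from_payload(to_payload("top_n_sum", (2,), {"scale": 10}))
```

A round trip preserves semantics when combine(values) gives the same result as calling top_n_sum(values, 2, scale=10) directly.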





[jira] [Resolved] (BEAM-3510) Python Runner API serialization for CombinePerKey drops arguments to CombineFn

2018-01-22 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3510.
---
Resolution: Fixed

> Python Runner API serialization for CombinePerKey drops arguments to CombineFn
> --
>
> Key: BEAM-3510
> URL: https://issues.apache.org/jira/browse/BEAM-3510
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.2.0
>Reporter: Charles Chen
>Assignee: Robert Bradshaw
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the Runner API serialization for CombinePerKey drops the args and 
> kwargs that are passed at runtime to the CombineFn.  We should fix this issue 
> so that the Runner API round-trip preserves semantics.





[jira] [Assigned] (BEAM-3510) Python Runner API serialization for CombinePerKey drops arguments to CombineFn

2018-01-22 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reassigned BEAM-3510:
-

Assignee: Robert Bradshaw  (was: Charles Chen)

> Python Runner API serialization for CombinePerKey drops arguments to CombineFn
> --
>
> Key: BEAM-3510
> URL: https://issues.apache.org/jira/browse/BEAM-3510
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.2.0
>Reporter: Charles Chen
>Assignee: Robert Bradshaw
>Priority: Major
>
> Currently, the Runner API serialization for CombinePerKey drops the args and 
> kwargs that are passed at runtime to the CombineFn.  We should fix this issue 
> so that the Runner API round-trip preserves semantics.





[jira] [Resolved] (BEAM-2906) Dataflow supports portable progress reporting

2018-01-22 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-2906.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Dataflow supports portable progress reporting
> -
>
> Key: BEAM-2906
> URL: https://issues.apache.org/jira/browse/BEAM-2906
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-dataflow
>Reporter: Henning Rohde
>Assignee: Robert Bradshaw
>Priority: Minor
>  Labels: portability
> Fix For: 2.3.0
>
>






[jira] [Assigned] (BEAM-2906) Dataflow supports portable progress reporting

2018-01-22 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reassigned BEAM-2906:
-

Assignee: Robert Bradshaw

> Dataflow supports portable progress reporting
> -
>
> Key: BEAM-2906
> URL: https://issues.apache.org/jira/browse/BEAM-2906
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-dataflow
>Reporter: Henning Rohde
>Assignee: Robert Bradshaw
>Priority: Minor
>  Labels: portability
>






[jira] [Created] (BEAM-3490) Reasonable Python direct runner batch performance.

2018-01-17 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3490:
-

 Summary: Reasonable Python direct runner batch performance.
 Key: BEAM-3490
 URL: https://issues.apache.org/jira/browse/BEAM-3490
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Robert Bradshaw


The plan is to migrate to the FnApiRunner for batch workloads. 





[jira] [Created] (BEAM-3419) Enable iterable side input for beam runners.

2018-01-05 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3419:
-

 Summary: Enable iterable side input for beam runners.
 Key: BEAM-3419
 URL: https://issues.apache.org/jira/browse/BEAM-3419
 Project: Beam
  Issue Type: New Feature
  Components: runner-core
Reporter: Robert Bradshaw
Assignee: Kenneth Knowles






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-3393) Empty flattens cannot be used as side inputs.

2017-12-22 Thread Robert Bradshaw (JIRA)
Robert Bradshaw created BEAM-3393:
-

 Summary: Empty flattens cannot be used as side inputs.
 Key: BEAM-3393
 URL: https://issues.apache.org/jira/browse/BEAM-3393
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Robert Bradshaw
Assignee: Ahmet Altay


{{import apache_beam as beam

with beam.Pipeline() as p:
  main = p | "CM" >> beam.Create([])
  side1 = p | "C1" >> beam.Create([])
  side2 = p | "C2" >> beam.Create([])
  # Flatten two empty PCollections and use the result as a list side input.
  side = (side1, side2) | beam.Flatten()
  res = main | beam.Map(lambda x, side: x, beam.pvalue.AsList(side))
}}
results in 
{{
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/direct/evaluation_context.py", line 84, in add_values
    assert not view.has_result
}}
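
For reference, a plain-Python model of what the pipeline above would presumably be expected to compute, with no Beam involved: flattening two empty collections yields an empty side input, and mapping over an empty main input yields an empty result. The helper names are illustrative only and are not Beam APIs.

```python
from itertools import chain


def flatten(*collections):
    # Model of Flatten: concatenate the inputs, any or all of which may be empty.
    return chain.from_iterable(collections)


def map_with_side(fn, main, side):
    # Model of Map with an AsList side input: the side is materialized once.
    side_list = list(side)
    return [fn(x, side_list) for x in main]


side = flatten([], [])
result = map_with_side(lambda x, side: x, [], side)
print(result)  # []
```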





[jira] [Resolved] (BEAM-3369) Python HEAD fails tests due to a ValueError

2017-12-19 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw resolved BEAM-3369.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Python HEAD fails tests due to a ValueError 
> 
>
> Key: BEAM-3369
> URL: https://issues.apache.org/jira/browse/BEAM-3369
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: Not applicable
>Reporter: Chamikara Jayalath
>Assignee: Robert Bradshaw
>Priority: Critical
> Fix For: 2.3.0
>
>
> tfidf_test.py is failing. The full error message is given below.
> ValueError: The key coder "TupleCoder[BytesCoder, FastPrimitivesCoder]" for 
> GroupByKey operation "GroupByKey" is not deterministic. This may result in 
> incorrect pipeline output. This can be fixed by adding a type hint to the 
> operation preceding the GroupByKey step, and for custom key classes, by 
> writing a deterministic custom Coder. Please see the documentation for more 
> details.
> This seems to be due to the following commit (the test passes without it):
> https://github.com/apache/beam/commit/1acd1ae901eefbcc8249d90e12ca82db0f91e41e#commitcomment-26356503
> Robert, can you take a look?
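
Why a non-deterministic key coder can produce incorrect grouping can be sketched in plain Python. This assumes, as is typical for runners, that keys are compared by their encoded bytes. The encoders below use pickle and json purely as stand-ins and are not Beam coders; all function names are hypothetical.

```python
import json
import pickle


def pickle_encode(key):
    # Byte output follows dict insertion order, so equal keys can encode differently.
    return pickle.dumps(key)


def deterministic_encode(key):
    # Sorting the dict keys makes equal keys encode to identical bytes.
    return json.dumps(key, sort_keys=True).encode("utf-8")


def group_by_encoded_key(pairs, encode):
    """Group values by the encoded form of their key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(encode(key), []).append(value)
    return groups


k1 = {"a": 1, "b": 2}
k2 = {"b": 2, "a": 1}  # equal to k1, but built in a different order
pairs = [(k1, "x"), (k2, "y")]

print(len(group_by_encoded_key(pairs, pickle_encode)))         # two groups
print(len(group_by_encoded_key(pairs, deterministic_encode)))  # one group
```

With the order-sensitive encoder, two equal keys land in separate groups, which is the kind of incorrect output the error message warns about.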





[jira] [Assigned] (BEAM-2937) Fn API combiner support w/ lifting to PGBK

2017-12-18 Thread Robert Bradshaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Bradshaw reassigned BEAM-2937:
-

Assignee: Robert Bradshaw

> Fn API combiner support w/ lifting to PGBK
> --
>
> Key: BEAM-2937
> URL: https://issues.apache.org/jira/browse/BEAM-2937
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model
>Reporter: Henning Rohde
>Assignee: Robert Bradshaw
>  Labels: portability
>
> The FnAPI should support this optimization. Detailed design TBD.
> Once design is ready, expand subtasks similarly to BEAM-2822.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

