[jira] [Created] (BEAM-5720) Default coder breaks with large ints on Python 3
Robert Bradshaw created BEAM-5720: - Summary: Default coder breaks with large ints on Python 3 Key: BEAM-5720 URL: https://issues.apache.org/jira/browse/BEAM-5720 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay The test for `int` also matches values larger than 64 bits, which causes an overflow error later in the code. We should use that coding scheme only for machine-sized ints. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
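A minimal sketch of the kind of fix described (hypothetical names, not Beam's actual coder code): gate the fixed-width encoding on the value fitting a signed 64-bit range, and fall back to an arbitrary-length encoding otherwise.

```python
import struct

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def encode_int(v):
    if INT64_MIN <= v <= INT64_MAX:
        # Machine-sized int: fixed 8-byte big-endian encoding.
        return b'\x00' + struct.pack('>q', v)
    # Larger than 64 bits: variable-length, sign-aware fallback,
    # avoiding the struct.error overflow the fixed scheme would hit.
    n = (v.bit_length() + 8) // 8
    return b'\x01' + v.to_bytes(n, 'big', signed=True)

def decode_int(data):
    if data[0:1] == b'\x00':
        return struct.unpack('>q', data[1:])[0]
    return int.from_bytes(data[1:], 'big', signed=True)
```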
[jira] [Commented] (BEAM-5705) PRs 6557 and 6589 break internal Dataflow tests
[ https://issues.apache.org/jira/browse/BEAM-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646167#comment-16646167 ] Robert Bradshaw commented on BEAM-5705: --- I think this is fixed by https://github.com/apache/beam/pull/6600 > PRs 6557 and 6589 break internal Dataflow tests > > > Key: BEAM-5705 > URL: https://issues.apache.org/jira/browse/BEAM-5705 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.8.0 >Reporter: Charles Chen >Assignee: Ahmet Altay >Priority: Blocker > Fix For: 2.8.0 > > > The PRs [https://github.com/apache/beam/pull/6557] and > [https://github.com/apache/beam/pull/6589] (which fixed postcommits for 6557 > with a temporary workaround) break internal Dataflow tests. The Dataflow > team needs to investigate this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.
[ https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-4858. --- Resolution: Fixed Fix Version/s: 2.8.0 > Clean up _BatchSizeEstimator in element-batching transform. > --- > > Key: BEAM-4858 > URL: https://issues.apache.org/jira/browse/BEAM-4858 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Minor > Fix For: 2.8.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] > non-trivial performance-sensitive logic in element-batching transform. Let's > take a look at > [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]. > > Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on > the type of the keys - whether they are integers or floats. > The keys of key-value pairs contained in {{self._data}} are added as integers > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260], > however, when we 'thin' the collected entries > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279], > the keys will become floats. Surprisingly, using either integer or float > division consistently [in the > comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271] > negatively affects the performance of a custom pipeline I was using to > benchmark these changes. The performance impact likely comes from changes in > the logic that depends on how division is evaluated, not from the > performance of division operation itself. 
> In terms of Python 3 conversion the best course of action that avoids > regression seems to be to preserve the existing Python 2 behavior using > {{old_div}} from {{past.utils.division}}, in the medium term we should clean > up the logic. We may want to add a targeted microbenchmark to evaluate > performance of this code, and maybe cythonize the code, since it seems to be > performance-sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
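The type-dependent division described above can be seen directly. `old_div_like` below is a hypothetical stand-in, shown only to illustrate the Python 2 behavior that `old_div` from the `past` package preserves on Python 3.

```python
def old_div_like(a, b):
    # Hypothetical stand-in for past's old_div: floor division when
    # both operands are ints (Python 2 semantics), true division
    # otherwise. This is why the comparator's result changed once
    # 'thinning' turned the int keys into floats.
    if isinstance(a, int) and isinstance(b, int):
        return a // b
    return a / b
```

On Python 3, plain `/` always true-divides, so `7 / 2` is `3.5` regardless of key type; the conversion question is which behavior the comparator logic actually depends on.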
[jira] [Commented] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.
[ https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644983#comment-16644983 ] Robert Bradshaw commented on BEAM-4858: --- Submitted pr/6375. * Simplifies the aging out of old data. Not only was the old formula hard to understand, but it meant that bad data could stick around and poison the estimates forever. * Adds a variance parameter allowing the batch size to vary over a fixed range giving a broader base for the linear regression. * Uses numpy when available to do the regression. This is both much more efficient and allows for less error-prone expression of more complicated analysis. The algorithm was also changed to: * Eliminates outliers, both using Cook's distance and just throwing out the top (often high-variance and high-influence) 20% when there is sufficient data. * Weight by the inverse of batch size, to provide a more stable fixed size estimate (which the default "overhead" target is sensitive to). These changes were tested against a large TFT pipeline and found to produce more uniform batch sizes and similar, possibly slightly improved, total runtimes and total costs. > Clean up _BatchSizeEstimator in element-batching transform. > --- > > Key: BEAM-4858 > URL: https://issues.apache.org/jira/browse/BEAM-4858 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Minor > Fix For: 2.8.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] > non-trivial performance-sensitive logic in element-batching transform. Let's > take a look at > [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]. > > Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on > the type of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260], > however, when we 'thin' the collected entries > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279], > the keys will become floats. Surprisingly, using either integer or float > division consistently [in the > comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271] > negatively affects the performance of a custom pipeline I was using to > benchmark these changes. The performance impact likely comes from changes in > the logic that depends on how division is evaluated, not from the > performance of division operation itself. > In terms of Python 3 conversion the best course of action that avoids > regression seems to be to preserve the existing Python 2 behavior using > {{old_div}} from {{past.utils.division}}, in the medium term we should clean > up the logic. We may want to add a targeted microbenchmark to evaluate > performance of this code, and maybe cythonize the code, since it seems to be > performance-sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
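The weighted regression mentioned in the comment can be sketched as follows. This is a pure-Python stand-in with illustrative names (the actual change uses numpy when available), fitting time ≈ fixed + per_element × batch_size and weighting points by 1/size to stabilize the fixed-cost estimate.

```python
def weighted_fit(sizes, times):
    # Weighted least squares for times ~ a + b * sizes,
    # with weights 1/size (smaller batches count more).
    ws = [1.0 / s for s in sizes]
    W = sum(ws)
    mx = sum(w * x for w, x in zip(ws, sizes)) / W
    my = sum(w * y for w, y in zip(ws, times)) / W
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, sizes, times))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, sizes))
    b = cov / var
    a = my - b * mx
    return a, b  # (fixed overhead, per-element cost)
```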
[jira] [Created] (BEAM-5702) Avoid reshuffle for zero and one element creates
Robert Bradshaw created BEAM-5702: - Summary: Avoid reshuffle for zero and one element creates Key: BEAM-5702 URL: https://issues.apache.org/jira/browse/BEAM-5702 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay These are commonly used (e.g. for Writes or CombineGlobally with Default) and can be implemented directly on top of impulse rather than with a reshuffle. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
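Conceptually (a hypothetical sketch, not the actual expansion code), a zero- or one-element Create can expand to Impulse followed by a map, with no reshuffle needed because there is at most one element to distribute:

```python
def impulse():
    # Impulse emits exactly one empty-bytes element.
    return [b'']

def create_via_impulse(values):
    # Only valid for 0 or 1 values; larger Creates still need
    # redistribution across workers.
    assert len(values) <= 1
    out = []
    for _ in impulse():
        out.extend(values)  # emit the 0 or 1 values
    return out
```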
[jira] [Created] (BEAM-5664) A canceled pipeline should not return a done status in the jobserver.
Robert Bradshaw created BEAM-5664: - Summary: A canceled pipeline should not return a done status in the jobserver. Key: BEAM-5664 URL: https://issues.apache.org/jira/browse/BEAM-5664 Project: Beam Issue Type: Improvement Components: runner-core Reporter: Robert Bradshaw Assignee: Kenneth Knowles -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-5664) A canceled pipeline should not return a done status in the jobserver.
[ https://issues.apache.org/jira/browse/BEAM-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-5664: -- Labels: portability-flink (was: ) > A canceled pipeline should not return a done status in the jobserver. > - > > Key: BEAM-5664 > URL: https://issues.apache.org/jira/browse/BEAM-5664 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Robert Bradshaw >Assignee: Kenneth Knowles >Priority: Major > Labels: portability-flink > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5500) Portable python sdk worker leaks memory in streaming mode
[ https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636638#comment-16636638 ] Robert Bradshaw commented on BEAM-5500: --- In light of [https://github.com/apache/beam/pull/6517] can we call this closed? > Portable python sdk worker leaks memory in streaming mode > - > > Key: BEAM-5500 > URL: https://issues.apache.org/jira/browse/BEAM-5500 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Micah Wylde >Assignee: Robert Bradshaw >Priority: Major > Labels: portability-flink > Attachments: chart.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > When using the portable python sdk with flink in streaming mode, we see that > the python worker processes steadily increase memory usage until they are OOM > killed. This behavior is consistent across various kinds of streaming > pipelines, including those with fixed windows and global windows. > A simple wordcount-like pipeline demonstrates the issue for us (note this is > run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides > access to kinesis as a portable streaming source): > {code:java} > counts = (p > | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream') > | 'decode' >> beam.FlatMap(decode) # parses from json into python objs > | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1)) > | 'window' >> beam.WindowInto(window.GlobalWindows(), > trigger=AfterProcessingTime(15 * 1000), > accumulation_mode=AccumulationMode.DISCARDING) > | 'group' >> beam.GroupByKey() > | 'count' >> beam.Map(count_ones) > | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x)) > {code} > When run, we see a steady increase in memory usage in the sdk_worker process. > Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory > usage over time and found that it's largely dicts and strings (see attached > chart). > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-5606) Automatically infer gcp project from environment.
Robert Bradshaw created BEAM-5606: - Summary: Automatically infer gcp project from environment. Key: BEAM-5606 URL: https://issues.apache.org/jira/browse/BEAM-5606 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-5521) Cache execution trees in SDK worker
Robert Bradshaw created BEAM-5521: - Summary: Cache execution trees in SDK worker Key: BEAM-5521 URL: https://issues.apache.org/jira/browse/BEAM-5521 Project: Beam Issue Type: Improvement Components: sdk-py-harness Reporter: Robert Bradshaw Assignee: Robert Bradshaw Currently they are re-constructed from the protos for every bundle, which is expensive (especially for 1-element bundles in streaming flink). Care should be taken to ensure the objects can be re-used. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
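A hypothetical sketch of the caching idea (names are illustrative, not the worker's actual API): key the constructed execution tree by its process-bundle-descriptor id, so repeated bundles for the same stage reuse it instead of rebuilding from the protos.

```python
class BundleProcessorCache:
    """Caches built execution trees keyed by descriptor id."""

    def __init__(self, build_fn):
        self._build_fn = build_fn   # descriptor proto -> execution tree
        self._cache = {}

    def get(self, descriptor_id, descriptor):
        # Build once per descriptor; subsequent bundles hit the cache.
        if descriptor_id not in self._cache:
            self._cache[descriptor_id] = self._build_fn(descriptor)
        return self._cache[descriptor_id]
```

The re-use concern in the issue maps to making sure whatever `build_fn` returns is safe to run for many bundles (reset per-bundle state between uses).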
[jira] [Updated] (BEAM-5521) Cache execution trees in SDK worker
[ https://issues.apache.org/jira/browse/BEAM-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-5521: -- Labels: portability-flink (was: ) > Cache execution trees in SDK worker > --- > > Key: BEAM-5521 > URL: https://issues.apache.org/jira/browse/BEAM-5521 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > Labels: portability-flink > > Currently they are re-constructed from the protos for every bundle, which is > expensive (especially for 1-element bundles in streaming flink). > Care should be taken to ensure the objects can be re-used. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-5500) Portable python sdk worker leaks memory in streaming mode
[ https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-5500: -- Labels: portability-flink (was: ) > Portable python sdk worker leaks memory in streaming mode > - > > Key: BEAM-5500 > URL: https://issues.apache.org/jira/browse/BEAM-5500 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Micah Wylde >Assignee: Robert Bradshaw >Priority: Major > Labels: portability-flink > Attachments: chart.png > > > When using the portable python sdk with flink in streaming mode, we see that > the python worker processes steadily increase memory usage until they are OOM > killed. This behavior is consistent across various kinds of streaming > pipelines, including those with fixed windows and global windows. > A simple wordcount-like pipeline demonstrates the issue for us (note this is > run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides > access to kinesis as a portable streaming source): > {code:java} > counts = (p > | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream') > | 'decode' >> beam.FlatMap(decode) # parses from json into python objs > | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1)) > | 'window' >> beam.WindowInto(window.GlobalWindows(), > trigger=AfterProcessingTime(15 * 1000), > accumulation_mode=AccumulationMode.DISCARDING) > | 'group' >> beam.GroupByKey() > | 'count' >> beam.Map(count_ones) > | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x)) > {code} > When run, we see a steady increase in memory usage in the sdk_worker process. > Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory > usage over time and found that it's largely dicts and strings (see attached > chart). > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5509) Python pipeline_options doesn't handle int type
[ https://issues.apache.org/jira/browse/BEAM-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630428#comment-16630428 ] Robert Bradshaw commented on BEAM-5509: --- I don't think we should be representing integral values as floating point in the pipeline options representation (though perhaps we'd have to use strings given that JSON doesn't support ints.) > Python pipeline_options doesn't handle int type > --- > > Key: BEAM-5509 > URL: https://issues.apache.org/jira/browse/BEAM-5509 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Thomas Weise >Assignee: Robert Bradshaw >Priority: Major > Labels: portability-flink > > The int option supplied at the command line is turned into a decimal during > serialization and then the parser in SDK harness fails to restore it as int. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
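The failure mode can be reproduced in miniature. This is a sketch under the assumption that the options pass through a doubles-only numeric representation (as in a protobuf Struct), so an int comes back as a float and a strict int parser rejects it:

```python
def serialize_as_double(v):
    # Stand-in for a wire format whose only number type is double.
    return float(v)

def strict_int_parse(s):
    # Stand-in for the SDK harness side expecting an int.
    return int(s)  # int("4.0") raises ValueError
```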
[jira] [Updated] (BEAM-5509) Python pipeline_options doesn't handle int type
[ https://issues.apache.org/jira/browse/BEAM-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-5509: -- Labels: portability-flink (was: ) > Python pipeline_options doesn't handle int type > --- > > Key: BEAM-5509 > URL: https://issues.apache.org/jira/browse/BEAM-5509 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Thomas Weise >Assignee: Robert Bradshaw >Priority: Major > Labels: portability-flink > > The int option supplied at the command line is turned into a decimal during > serialization and then the parser in SDK harness fails to restore it as int. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5500) Portable python sdk worker leaks memory in streaming mode
[ https://issues.apache.org/jira/browse/BEAM-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630296#comment-16630296 ] Robert Bradshaw commented on BEAM-5500: --- I've tried running this repeatedly in the direct runner, and am only finding the aforementioned leak of ~300 bytes per stage per bundle (which, yes, needs to be resolved). How many bundles/second are you processing? > Portable python sdk worker leaks memory in streaming mode > - > > Key: BEAM-5500 > URL: https://issues.apache.org/jira/browse/BEAM-5500 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Micah Wylde >Assignee: Robert Bradshaw >Priority: Major > Attachments: chart.png > > > When using the portable python sdk with flink in streaming mode, we see that > the python worker processes steadily increase memory usage until they are OOM > killed. This behavior is consistent across various kinds of streaming > pipelines, including those with fixed windows and global windows. > A simple wordcount-like pipeline demonstrates the issue for us (note this is > run on the [Lyft beam fork|https://github.com/lyft/beam/], which provides > access to kinesis as a portable streaming source): > {code:java} > counts = (p > | 'Kinesis' >> FlinkKinesisInput().with_stream('test-stream') > | 'decode' >> beam.FlatMap(decode) # parses from json into python objs > | 'pair_with_one' >> beam.Map(lambda x: (x["event_name"], 1)) > | 'window' >> beam.WindowInto(window.GlobalWindows(), > trigger=AfterProcessingTime(15 * 1000), > accumulation_mode=AccumulationMode.DISCARDING) > | 'group' >> beam.GroupByKey() > | 'count' >> beam.Map(count_ones) > | beam.Map(lambda x: logging.warn("count: %s", str(x)) or x)) > {code} > When run, we see a steady increase in memory usage in the sdk_worker process. > Using [heapy|http://guppy-pe.sourceforge.net/#Heapy] I've analyzed the memory > usage over time and found that it's largely dicts and strings (see attached > chart). 
> > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5468) Allow runner to set worker log level in Python SDK harness.
[ https://issues.apache.org/jira/browse/BEAM-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626973#comment-16626973 ] Robert Bradshaw commented on BEAM-5468: --- Namespaced logging in the Python SDK would be convenient (and could be done automatically based on modules, see [https://gist.github.com/bdarnell/3118509]) but this is somewhat orthogonal. I agree the default should be whatever the runner is using, with an option to set it explicitly (analogous to how Java does it), coordinating such that the harness never asks for logs it's just going to drop. > Allow runner to set worker log level in Python SDK harness. > --- > > Key: BEAM-5468 > URL: https://issues.apache.org/jira/browse/BEAM-5468 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
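The module-based namespacing mentioned in the comment maps naturally onto Python's logger hierarchy; a small sketch (package names purely illustrative):

```python
import logging

# Each module logs via logging.getLogger(__name__); a runner-supplied
# configuration can then set levels per namespace, and loggers without
# an explicit level inherit from the nearest configured ancestor.
logging.getLogger('apache_beam').setLevel(logging.WARNING)
logging.getLogger('apache_beam.io').setLevel(logging.DEBUG)  # subtree override
```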
[jira] [Updated] (BEAM-3622) DirectRunner memory issue with Python SDK
[ https://issues.apache.org/jira/browse/BEAM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3622: -- Component/s: (was: sdk-py-harness) > DirectRunner memory issue with Python SDK > - > > Key: BEAM-3622 > URL: https://issues.apache.org/jira/browse/BEAM-3622 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: yuri krnr >Assignee: Charles Chen >Priority: Major > > After running pipeline for a while in a streaming mode (reading from Pub/Sub > and writing to BigQuery, Datastore and another Pub/Sub) I noticed drastic > memory usage of a process. Using guppy as a profiler I got the following > results: > start > {noformat} > INFO *** MemoryReport Heap: > Partition of a set of 240208 objects. Total size = 34988840 bytes. > Index Count % Size % Cumulative % Kind (class / dict of class) > 0 88289 37 8696984 25 8696984 25 str > 1 5 22 4897352 14 13594336 39 tuple > 2 5083 2 2790664 8 16385000 47 dict (no owner) > 3 1939 1 1749656 5 18134656 52 type > 4 699 0 1723272 5 19857928 57 dict of module > 5 12337 5 1579136 5 21437064 61 types.CodeType > 6 12403 5 1488360 4 22925424 66 function > 7 1939 1 1452616 4 24378040 70 dict of type > 8 677 0 709496 2 25087536 72 dict of 0x1e4d880 > 9 25603 11 614472 2 25702008 73 int > <1103 more rows. Type e.g. '_.more' to view.> > {noformat} > after several hours of running > {noformat} > INFO *** MemoryReport Heap: > Partition of a set of 1255662 objects. Total size = 315029632 bytes. 
> Index Count % Size % Cumulative % Kind (class / dict of class) > 0 95554 8 99755056 32 99755056 32 dict of > > apache_beam.runners.direct.bundle_factory._Bundle > 1 117943 9 54193192 17 153948248 49 dict (no owner) > 2 161068 13 27169296 9 181117544 57 unicode > 3 94571 8 26479880 8 207597424 66 dict of apache_beam.pvalue.PBegin > 4 126461 10 12715336 4 220312760 70 str > 5 44374 4 12424720 4 232737480 74 dict of > apitools.base.protorpclite.messages.FieldList > 6 44374 4 6348624 2 239086104 76 > apitools.base.protorpclite.messages.FieldList > 7 95556 8 6115584 2 245201688 78 > apache_beam.runners.direct.bundle_factory._Bundle > 8 94571 8 6052544 2 251254232 80 apache_beam.pvalue.PBegin > 9 57371 5 5218424 2 256472656 81 tuple > <1187 more rows. Type e.g. '_.more' to view.> > {noformat} > > I see that every bundle still sits in memory and all its data too. why aren't > the gc-ed? > What is the policy for gc for the dataflow processes? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
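The report above used the third-party guppy/heapy tool. As a rough stdlib alternative for the same kind of over-time analysis, `tracemalloc` can diff heap snapshots (a sketch of the technique, not what was run here):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate state that is retained between snapshots, like the
# _Bundle objects accumulating in the report above.
retained = [str(i) for i in range(10000)]

after = tracemalloc.take_snapshot()
# Differences sorted with the largest changes first, attributed
# to the allocating source line.
top_stats = after.compare_to(before, 'lineno')
```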
[jira] [Created] (BEAM-5468) Allow runner to set worker log level in Python SDK harness.
Robert Bradshaw created BEAM-5468: - Summary: Allow runner to set worker log level in Python SDK harness. Key: BEAM-5468 URL: https://issues.apache.org/jira/browse/BEAM-5468 Project: Beam Issue Type: Improvement Components: sdk-py-harness Reporter: Robert Bradshaw Assignee: Robert Bradshaw -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5442) PortableRunner swallows custom options for Runner
[ https://issues.apache.org/jira/browse/BEAM-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16623368#comment-16623368 ] Robert Bradshaw commented on BEAM-5442: --- It seems like we should validate the options we know about, but pass all options through. (Maybe warn about ones we don't know about?) > PortableRunner swallows custom options for Runner > - > > Key: BEAM-5442 > URL: https://issues.apache.org/jira/browse/BEAM-5442 > Project: Beam > Issue Type: Bug > Components: sdk-java-core, sdk-py-core >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Labels: portability, portability-flink > Fix For: 2.8.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The PortableRunner doesn't pass custom PipelineOptions to the executing > Runner. > Example: {{--parallelism=4}} won't be forwarded to the FlinkRunner. > (The option is just removed during proto translation without any warning) > We should allow some form of customization through the options, even for the > PortableRunner. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
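The suggested behavior — validate what we know, warn on what we don't, but forward everything — could look like this sketch (the known-option set and function name are hypothetical):

```python
import warnings

KNOWN_OPTIONS = {'job_name', 'streaming'}  # illustrative, not the real set

def prepare_options(options):
    for name in options:
        if name not in KNOWN_OPTIONS:
            warnings.warn('Unknown pipeline option, passing through: %s' % name)
    return dict(options)  # nothing is dropped during translation
```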
[jira] [Created] (BEAM-5428) Implement cross-bundle state caching.
Robert Bradshaw created BEAM-5428: - Summary: Implement cross-bundle state caching. Key: BEAM-5428 URL: https://issues.apache.org/jira/browse/BEAM-5428 Project: Beam Issue Type: Improvement Components: sdk-py-harness Reporter: Robert Bradshaw Assignee: Robert Bradshaw -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-5395) BeamPython data plane streams data
[ https://issues.apache.org/jira/browse/BEAM-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-5395. --- Resolution: Fixed Fix Version/s: 2.8.0 > BeamPython data plane streams data > -- > > Key: BEAM-5395 > URL: https://issues.apache.org/jira/browse/BEAM-5395 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > Fix For: 2.8.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Currently the default implementation is to buffer all data for the bundle. > Experiments were made splitting at arbitrary byte boundaries, but it appears > that Java requires messages to be split on element boundaries. For now we > should implement that in Python (even if this means not being able to split > up large elements among multiple messages). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-5395) BeamPython data plane streams data
Robert Bradshaw created BEAM-5395: - Summary: BeamPython data plane streams data Key: BEAM-5395 URL: https://issues.apache.org/jira/browse/BEAM-5395 Project: Beam Issue Type: Bug Components: sdk-py-harness Reporter: Robert Bradshaw Assignee: Robert Bradshaw Currently the default implementation is to buffer all data for the bundle. Experiments were made splitting at arbitrary byte boundaries, but it appears that Java requires messages to be split on element boundaries. For now we should implement that in Python (even if this means not being able to split up large elements among multiple messages). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
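The element-boundary constraint can be sketched like this (hypothetical helper, not the actual data plane code): messages flush once a size target is reached, but a single encoded element is never split across messages, so one large element simply yields one large message.

```python
def chunk_elements(encoded_elements, target_size):
    # Pack encoded elements into outgoing messages, flushing whenever
    # the size target is met, always splitting on element boundaries.
    messages, current = [], b''
    for elem in encoded_elements:
        current += elem
        if len(current) >= target_size:
            messages.append(current)
            current = b''
    if current:
        messages.append(current)
    return messages
```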
[jira] [Created] (BEAM-5394) BeamPython on Flink runs at scale
Robert Bradshaw created BEAM-5394: - Summary: BeamPython on Flink runs at scale Key: BEAM-5394 URL: https://issues.apache.org/jira/browse/BEAM-5394 Project: Beam Issue Type: Bug Components: sdk-py-harness Reporter: Robert Bradshaw Assignee: Robert Bradshaw This is an umbrella bug for verifying that BeamPython can run production sized workloads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-5392) GroupByKey on Spark: All values for a single key need to fit in-memory at once
[ https://issues.apache.org/jira/browse/BEAM-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-5392: -- Summary: GroupByKey on Spark: All values for a single key need to fit in-memory at once (was: GroupByKey: All values for a single key need to fit in-memory at once) > GroupByKey on Spark: All values for a single key need to fit in-memory at once > -- > > Key: BEAM-5392 > URL: https://issues.apache.org/jira/browse/BEAM-5392 > Project: Beam > Issue Type: Bug > Components: runner-spark >Affects Versions: 2.6.0 >Reporter: David Moravek >Assignee: David Moravek >Priority: Major > Labels: performance > > Currently, when using GroupByKey, all values for a single key need to fit > in-memory at once. > > There are following issues, that need to be addressed: > a) We can not use Spark's _groupByKey_, because it requires all values to fit > in memory for a single key (it is implemented as "list combiner") > b) _ReduceFnRunner_ iterates over values multiple times in order to group > also by window > > Solution: > > In Dataflow Worker code, there are optimized versions of ReduceFnRunner, that > can take advantage of having elements for a single key sorted by timestamp. > > We can use Spark's `{{repartitionAndSortWithinPartitions}}` in order to meet > this constraint. > > For non-merging windows, we can put window itself into the key resulting in > smaller groupings. > > This approach was already tested in ~100TB input scale on Spark 2.3.x. > (custom Spark runner). > > I'll submit a patch once the Dataflow Worker code donation is complete. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
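A pure-Python sketch of the keying idea described above (for non-merging windows; names illustrative): the shuffle key becomes (key, window), and timestamp ordering within each group lets a ReduceFnRunner-style consumer stream values instead of holding them all in memory.

```python
def group_by_key_and_window(records):
    """records: iterable of (key, window, timestamp, value)."""
    # Stand-in for repartitionAndSortWithinPartitions: order by the
    # composite (key, window) and by timestamp within each group.
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    groups = {}
    for key, window, _ts, value in ordered:
        groups.setdefault((key, window), []).append(value)
    return groups
```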
[jira] [Commented] (BEAM-3106) Consider not pinning all python dependencies, or moving them to requirements.txt
[ https://issues.apache.org/jira/browse/BEAM-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614093#comment-16614093 ] Robert Bradshaw commented on BEAM-3106: --- Our main requirements now specify version ranges (generally guided by semantic versioning); we should unpin our gcp requirements wherever possible as well. > Consider not pinning all python dependencies, or moving them to > requirements.txt > > > Key: BEAM-3106 > URL: https://issues.apache.org/jira/browse/BEAM-3106 > Project: Beam > Issue Type: Wish > Components: build-system >Affects Versions: 2.1.0 > Environment: python >Reporter: Maximilian Roos >Priority: Major > > Currently all python dependencies are [pinned or > capped|https://github.com/apache/beam/blob/master/sdks/python/setup.py#L97] > While there's a good argument for supplying a `requirements.txt` with well > tested dependencies, having them specified in `setup.py` forces them to an > exact state on each install of Beam. This makes using Beam in any environment > with other libraries nigh on impossible. > This is particularly severe for the `gcp` dependencies, where we have > libraries that won't work with an older version (but Beam _does_ work with a > newer version). We have to do a bunch of gymnastics to get the correct > versions installed because of this. Unfortunately, airflow repeats this > practice and conflicts on a number of dependencies, adding further > complication (but, again there is no real conflict). > I haven't seen this practice outside of the Apache & Google ecosystem - for > example no libraries in numerical python do this. Here's a [discussion on > SO|https://stackoverflow.com/questions/28509481/should-i-pin-my-python-dependencies-versions] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5354) Side Inputs seems to be non-working in the sdk-go
[ https://issues.apache.org/jira/browse/BEAM-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613379#comment-16613379 ] Robert Bradshaw commented on BEAM-5354: --- There are some worker harness bugs in handling of side inputs on dataflow that I've been fixing. I'm in the process of pushing out new containers. > Side Inputs seems to be non-working in the sdk-go > - > > Key: BEAM-5354 > URL: https://issues.apache.org/jira/browse/BEAM-5354 > Project: Beam > Issue Type: Bug > Components: sdk-go >Reporter: Tomas Roos >Priority: Major > > Running the contains example fails with > > {code:java} > Output i0 for step was not found. > {code} > This is because of the call to debug.Head (which internally uses SideInput) > Removing the following line > [https://github.com/apache/beam/blob/master/sdks/go/examples/contains/contains.go#L50] > > The pipeline executes well. > > Executed on id's > > go-job-1-1536664417610678545 > vs > go-job-1-1536664934354466938 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.
[ https://issues.apache.org/jira/browse/BEAM-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611858#comment-16611858 ] Robert Bradshaw commented on BEAM-4858: --- I created [https://github.com/apache/beam/pull/6375] which cleans this up. Floating point division is always the correct thing to do here; I'd like to understand under what circumstances this would be made worse. (Note that for a real pipeline, the actual timings can be noisy.) Note that the code in question should not be (that) performance sensitive (with the exception of adding to the batch and checking the batch size), as the complicated logic is called once per batch (which should be reasonably expensive, and is taken into account). > Clean up _BatchSizeEstimator in element-batching transform. > --- > > Key: BEAM-4858 > URL: https://issues.apache.org/jira/browse/BEAM-4858 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Beam Python 3 conversion [exposed|https://github.com/apache/beam/pull/5729] > non-trivial performance-sensitive logic in element-batching transform. Let's > take a look at > [util.py#L271|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271]. > > Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on > the type of the keys - whether they are integers or floats. > The keys of key-value pairs contained in {{self._data}} are added as integers > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L260], > however, when we 'thin' the collected entries > [here|https://github.com/apache/beam/blob/d2ac08da2dccce8930432fae1ec7c30953880b69/sdks/python/apache_beam/transforms/util.py#L279], > the keys will become floats. 
Surprisingly, using either integer or float > division consistently [in the > comparator|https://github.com/apache/beam/blob/e98ff7c96afa2f72b3a98426dc1e9a47224da5c8/sdks/python/apache_beam/transforms/util.py#L271] > negatively affects the performance of a custom pipeline I was using to > benchmark these changes. The performance impact likely comes from changes in > the logic that depends on how division is evaluated, not from the > performance of division operation itself. > In terms of Python 3 conversion the best course of action that avoids > regression seems to be to preserve the existing Python 2 behavior using > {{old_div}} from {{past.utils.division}}, in the medium term we should clean > up the logic. We may want to add a targeted microbenchmark to evaluate > performance of this code, and maybe cythonize the code, since it seems to be > performance-sensitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
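The division behavior this ticket hinges on can be seen directly. A minimal illustration (Python 3 shown; on Python 2, {{/}} applied to two ints would floor, which is why the key types int-vs-float matter):

```python
# On Python 3, / is always true division and // is always floor division,
# regardless of operand type. On Python 2, x2 / x1 floored only when both
# operands were ints -- so 'thinning' the int keys into floats silently
# changed what the comparator computed.
x1, x2 = 2, 7
assert x2 / x1 == 3.5     # true division, even for int operands
assert x2 // x1 == 3      # floor division must be requested explicitly
assert 7.0 // 2.0 == 3.0  # // still floors after keys become floats
```

Under Python 2 semantics the first expression would instead evaluate to 3, which is the behavior {{old_div}} preserves.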
[jira] [Commented] (BEAM-5214) Update Java quickstart to use maven
[ https://issues.apache.org/jira/browse/BEAM-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601920#comment-16601920 ] Robert Bradshaw commented on BEAM-5214: --- Yes, though the initial command to generate the example ({{mvn archetype:generate}}) should probably be changed to something that still works. > Update Java quickstart to use maven > --- > > Key: BEAM-5214 > URL: https://issues.apache.org/jira/browse/BEAM-5214 > Project: Beam > Issue Type: Bug > Components: examples-java, website >Reporter: Robert Bradshaw >Assignee: Reuven Lax >Priority: Major > > The existing quickstart still uses mvn commands. > https://beam.apache.org/get-started/quickstart-java/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-5255) Fix over-aggressive division futurization.
Robert Bradshaw created BEAM-5255: - Summary: Fix over-aggressive division futurization. Key: BEAM-5255 URL: https://issues.apache.org/jira/browse/BEAM-5255 Project: Beam Issue Type: Bug Components: sdk-py-core Affects Versions: 2.6.0 Reporter: Robert Bradshaw Assignee: Ahmet Altay When converting from Python 2 to Python 3, `a / b` becomes `a // b` only for ints, but it is incorrect to do this substitution for floating point division. I noticed this change in the microbenchmarks, but we should do an audit to make sure we haven't broken things elsewhere. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
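A minimal illustration of why the {{a / b}} → {{a // b}} substitution is only safe for int operands:

```python
# The substitution this issue warns about: rewriting / as // preserves
# Python 2 behavior only for ints; for floats it silently floors the result.
a, b = 7.5, 2
assert a / b == 3.75   # correct floating point division
assert a // b == 3.0   # the "futurized" form changes the answer
```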
[jira] [Created] (BEAM-5214) Update Java quickstart to use maven
Robert Bradshaw created BEAM-5214: - Summary: Update Java quickstart to use maven Key: BEAM-5214 URL: https://issues.apache.org/jira/browse/BEAM-5214 Project: Beam Issue Type: Bug Components: examples-java, website Reporter: Robert Bradshaw Assignee: Reuven Lax The existing quickstart still uses mvn commands. https://beam.apache.org/get-started/quickstart-java/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4782) Enforce KV coders for MultiMap side inputs
Robert Bradshaw created BEAM-4782: - Summary: Enforce KV coders for MultiMap side inputs Key: BEAM-4782 URL: https://issues.apache.org/jira/browse/BEAM-4782 Project: Beam Issue Type: Task Components: sdk-py-harness Reporter: Robert Bradshaw Assignee: Robert Bradshaw Non-Python runners don't understand that the (default) FastPrimitivesCoder may be a KV coder. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4781) PTransforms that simply return their input cause portable Flink to crash.
Robert Bradshaw created BEAM-4781: - Summary: PTransforms that simply return their input cause portable Flink to crash. Key: BEAM-4781 URL: https://issues.apache.org/jira/browse/BEAM-4781 Project: Beam Issue Type: Task Components: runner-flink Reporter: Robert Bradshaw Assignee: Aljoscha Krettek E.g. {code:python} class MaybePrint(beam.PTransform): def expand(self, pcoll): if some_flag: pcoll | beam.Map(logging.info) return pcoll {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4729) Conditionally propagate local GCS credentials to locally spawned docker images.
Robert Bradshaw created BEAM-4729: - Summary: Conditionally propagate local GCS credentials to locally spawned docker images. Key: BEAM-4729 URL: https://issues.apache.org/jira/browse/BEAM-4729 Project: Beam Issue Type: Task Components: sdk-java-harness Reporter: Robert Bradshaw Assignee: Luke Cwik -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (BEAM-3883) Python SDK stages artifacts when talking to job server
[ https://issues.apache.org/jira/browse/BEAM-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reopened BEAM-3883: --- Staging code is written, but not yet hooked up. > Python SDK stages artifacts when talking to job server > -- > > Key: BEAM-3883 > URL: https://issues.apache.org/jira/browse/BEAM-3883 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Ben Sidhom >Assignee: Ankur Goenka >Priority: Major > Fix For: 2.5.0 > > Time Spent: 19h 10m > Remaining Estimate: 0h > > The Python SDK does not currently stage its user-defined functions or > dependencies when talking to the job API. Artifacts that need to be staged > include the user code itself, any SDK components not included in the > container image, and the list of Python packages that must be installed at > runtime. > > Artifacts that are currently expected can be found in the harness boot code: > [https://github.com/apache/beam/blob/58e3b06bee7378d2d8db1c8dd534b415864f63e1/sdks/python/container/boot.go#L52.] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-4605) Make GroupByKey the primitive in the Python SDK
[ https://issues.apache.org/jira/browse/BEAM-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518749#comment-16518749 ] Robert Bradshaw commented on BEAM-4605: --- At least the old direct runner relies on the composite structure; an inventory should be taken to see if there are others. > Make GroupByKey the primitive in the Python SDK > --- > > Key: BEAM-4605 > URL: https://issues.apache.org/jira/browse/BEAM-4605 > Project: Beam > Issue Type: Task > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Ahmet Altay >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently it is still a composite, with GBKOnly as a primitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4605) Make GroupByKey the primitive in the Python SDK
Robert Bradshaw created BEAM-4605: - Summary: Make GroupByKey the primitive in the Python SDK Key: BEAM-4605 URL: https://issues.apache.org/jira/browse/BEAM-4605 Project: Beam Issue Type: Task Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay Currently it is still a composite, with GBKOnly as a primitive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-3884) Python SDK supports Impulse as a primitive transform
[ https://issues.apache.org/jira/browse/BEAM-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-3884: - Assignee: Robert Bradshaw (was: Ahmet Altay) > Python SDK supports Impulse as a primitive transform > > > Key: BEAM-3884 > URL: https://issues.apache.org/jira/browse/BEAM-3884 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Ben Sidhom >Assignee: Robert Bradshaw >Priority: Major > > Portable runners require Impulse to be the only root nodes of pipelines. The > Python SDK should provide this for pipeline construction. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4588) Use wheels for numpy
Robert Bradshaw created BEAM-4588: - Summary: Use wheels for numpy Key: BEAM-4588 URL: https://issues.apache.org/jira/browse/BEAM-4588 Project: Beam Issue Type: Task Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay When building the container, one has the line: Skipping bdist_wheel for numpy, due to binaries being disabled for it. This is arguably one of the most expensive packages to build from scratch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4578) Cleanup paths due to obsolete worker/sdk upgrades.
Robert Bradshaw created BEAM-4578: - Summary: Cleanup paths due to obsolete worker/sdk upgrades. Key: BEAM-4578 URL: https://issues.apache.org/jira/browse/BEAM-4578 Project: Beam Issue Type: Task Components: runner-dataflow, sdk-py-core Reporter: Robert Bradshaw Assignee: Thomas Groh -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4565) Hot key fanout should not distribute keys to all shards.
Robert Bradshaw created BEAM-4565: - Summary: Hot key fanout should not distribute keys to all shards. Key: BEAM-4565 URL: https://issues.apache.org/jira/browse/BEAM-4565 Project: Beam Issue Type: Task Components: sdk-java-core, sdk-py-core Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0, 2.0.0, 2.5.0 Reporter: Robert Bradshaw Assignee: Kenneth Knowles The goal is to reduce the number of values sent to a single post-GBK worker. If combiner lifting happens, each bundle will send a single value per sub-key, causing an N-fold blowup in shuffle data and N reducers with the same amount of data to consume as the single reducer in the non-fanout case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
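The fanout mechanism under discussion can be sketched as follows (a hypothetical helper, not Beam's actual implementation):

```python
# Hot key fanout: spread a hot key's elements over `fanout` sub-keys so that
# pre-combining happens on several workers, then recombine per original key.
# The bug this issue describes is applying this to *every* key: with combiner
# lifting, each bundle then emits one value per (key, shard), multiplying
# shuffle data by `fanout` even for cold keys.
import random

def fanout_key(key, fanout):
    """Assign an element of `key` to one of `fanout` random sub-keys."""
    return (key, random.randrange(fanout))

random.seed(0)  # deterministic for the example
sub_key = fanout_key('hot', 5)
```

A cold key should instead map to a single sub-key (fanout of 1), so its values still land on one reducer.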
[jira] [Commented] (BEAM-4494) Migrate website source code to apache/beam [website-migration] branch
[ https://issues.apache.org/jira/browse/BEAM-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512692#comment-16512692 ] Robert Bradshaw commented on BEAM-4494: --- Yes, we could just ask people to make changes elsewhere if we get PRs there. Also, the script above can be used multiple times, i.e. even if changes are done in both places it's just a git merge to get master that has everything. > Migrate website source code to apache/beam [website-migration] branch > - > > Key: BEAM-4494 > URL: https://issues.apache.org/jira/browse/BEAM-4494 > Project: Beam > Issue Type: Sub-task > Components: website >Reporter: Scott Wegner >Assignee: Scott Wegner >Priority: Major > Labels: beam-site-automation-reliability > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-4546) Implement with hot key fanout for combiners
[ https://issues.apache.org/jira/browse/BEAM-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-4546: - Assignee: Robert Bradshaw > Implement with hot key fanout for combiners > --- > > Key: BEAM-4546 > URL: https://issues.apache.org/jira/browse/BEAM-4546 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Ahmet Altay >Assignee: Robert Bradshaw >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-4494) Migrate website source code to apache/beam [website-migration] branch
[ https://issues.apache.org/jira/browse/BEAM-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510373#comment-16510373 ] Robert Bradshaw commented on BEAM-4494: --- Is there any advantage to having a separate branch? There are lots of downsides... > Migrate website source code to apache/beam [website-migration] branch > - > > Key: BEAM-4494 > URL: https://issues.apache.org/jira/browse/BEAM-4494 > Project: Beam > Issue Type: Sub-task > Components: website >Reporter: Scott Wegner >Assignee: Scott Wegner >Priority: Major > Labels: beam-site-automation-reliability > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4524) We should not be using md5 to validate artifact integrity.
Robert Bradshaw created BEAM-4524: - Summary: We should not be using md5 to validate artifact integrity. Key: BEAM-4524 URL: https://issues.apache.org/jira/browse/BEAM-4524 Project: Beam Issue Type: Task Components: beam-model Reporter: Robert Bradshaw Assignee: Kenneth Knowles https://github.com/apache/beam/blob/6f239498e676f471427e17abc4bc5cffba9887c5/model/job-management/src/main/proto/beam_artifact_api.proto#L63 Something like sha256 would probably be sufficient. https://en.wikipedia.org/wiki/MD5#Overview_of_security_issues -- This message was sent by Atlassian JIRA (v7.6.3#76005)
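For reference, computing a SHA-256 digest with Python's hashlib, as the ticket suggests in place of MD5:

```python
# sha256 produces a 256-bit (64 hex character) digest versus md5's 128 bits
# (32 hex characters), and has no known practical collision attacks.
import hashlib

payload = b"artifact payload"  # stand-in for staged artifact bytes
md5_hex = hashlib.md5(payload).hexdigest()
sha256_hex = hashlib.sha256(payload).hexdigest()
```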
[jira] [Commented] (BEAM-2980) BagState.isEmpty needs a tighter spec
[ https://issues.apache.org/jira/browse/BEAM-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484480#comment-16484480 ] Robert Bradshaw commented on BEAM-2980: --- This seems a bit more subtle. Given
{code}
BagState myBag = // empty
ReadableState isMyBagEmpty = myBag.isEmpty();
readResultBefore = myBag.read();
myBag.add(bizzle);
readResultAfter = myBag.read();
bool createBeforeReadAfter = isMyBagEmpty.read();
{code}
I think it makes a lot of sense for readResultAfter to contain bizzle, but the question is whether createBeforeReadAfter is True or False. > BagState.isEmpty needs a tighter spec > - > > Key: BEAM-2980 > URL: https://issues.apache.org/jira/browse/BEAM-2980 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Kenneth Knowles >Assignee: Daniel Mills >Priority: Major > > Consider the following: > {code} > BagState myBag = // empty > ReadableState isMyBagEmpty = myBag.isEmpty(); > myBag.add(bizzle); > bool empty = isMyBagEmpty.read(); > {code} > Should {{empty}} be true or false? We need a consistent answer, across all > kinds of state, when snapshots are required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
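The ambiguity in the comment can be modeled with two toy Python classes (purely illustrative, not the Beam state API): one where the handle returned by isEmpty() reflects later additions, and one where it snapshots the bag at call time:

```python
# Toy model of the two plausible BagState.isEmpty() semantics.
class LiveIsEmpty:
    """The isEmpty() handle reflects the bag's contents at read() time."""
    def __init__(self):
        self._items = []
    def add(self, x):
        self._items.append(x)
    def is_empty(self):
        return lambda: len(self._items) == 0  # read() sees later adds

class SnapshotIsEmpty:
    """The isEmpty() handle captures the state at the point of the call."""
    def __init__(self):
        self._items = []
    def add(self, x):
        self._items.append(x)
    def is_empty(self):
        empty_now = len(self._items) == 0
        return lambda: empty_now  # read() ignores later adds

live, snap = LiveIsEmpty(), SnapshotIsEmpty()
is_empty_live, is_empty_snap = live.is_empty(), snap.is_empty()
live.add('bizzle')
snap.add('bizzle')
# Under "live" semantics the earlier handle now reads False;
# under "snapshot" semantics it still reads True.
```

The spec question is which of these two behaviors runners must guarantee.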
[jira] [Created] (BEAM-4150) Standardize use of PCollection coder proto attribute
Robert Bradshaw created BEAM-4150: - Summary: Standardize use of PCollection coder proto attribute Key: BEAM-4150 URL: https://issues.apache.org/jira/browse/BEAM-4150 Project: Beam Issue Type: Task Components: beam-model Reporter: Robert Bradshaw Assignee: Kenneth Knowles In some places it's expected to be a WindowedCoder, in others the raw ElementCoder. We should use the same convention (decided in discussion to be the raw ElementCoder) everywhere. The WindowCoder can be pulled out of the attached windowing strategy, and the input/output ports should specify the encoding directly rather than read the adjacent PCollection coder fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-4097) Python SDK should set the environment in the job submission protos
[ https://issues.apache.org/jira/browse/BEAM-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-4097: - Assignee: Robert Bradshaw (was: Ahmet Altay) > Python SDK should set the environment in the job submission protos > -- > > Key: BEAM-4097 > URL: https://issues.apache.org/jira/browse/BEAM-4097 > Project: Beam > Issue Type: Task > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-3792) Python submits portable pipelines to the Flink-served endpoint.
[ https://issues.apache.org/jira/browse/BEAM-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3792. --- Resolution: Fixed Fix Version/s: 2.5.0 > Python submits portable pipelines to the Flink-served endpoint. > --- > > Key: BEAM-3792 > URL: https://issues.apache.org/jira/browse/BEAM-3792 > Project: Beam > Issue Type: Sub-task > Components: runner-flink >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > Fix For: 2.5.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3911) Do we want to encourage release pinning?
[ https://issues.apache.org/jira/browse/BEAM-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441725#comment-16441725 ] Robert Bradshaw commented on BEAM-3911: --- No, we don't want to encourage this. > Do we want to encourage release pinning? > > > Key: BEAM-3911 > URL: https://issues.apache.org/jira/browse/BEAM-3911 > Project: Beam > Issue Type: Bug > Components: website >Reporter: Robert Bradshaw >Assignee: Melissa Pashniak >Priority: Major > > The Python instructions recommend adding apache-beam==2.3.0 to one's > setup.py. Instead, we should probably recommend apache-beam<3. It may also be > worth calling out apache-beam can be installed with pip install apache-beam. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-4105) Python Postcommit suite does not start on Jenkins - proto generation failed
[ https://issues.apache.org/jira/browse/BEAM-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441669#comment-16441669 ] Robert Bradshaw commented on BEAM-4105: --- Perhaps the pip is too old? > Python Postcommit suite does not start on Jenkins - proto generation failed > --- > > Key: BEAM-4105 > URL: https://issues.apache.org/jira/browse/BEAM-4105 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Jason Kuster >Priority: Major > > The suite has been steadily failing for last 3 days. Sample logs: > # Tox runs unit tests in a virtual environment > ${LOCAL_PATH}/tox -e ALL -c sdks/python/tox.ini > GLOB sdist-make: > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/setup.py > ERROR: invocation failed (exit code 1), logfile: > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/log/tox-0.log > ERROR: actionid: tox > msg: packaging > cmdargs: ['/usr/bin/python', > local('/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/setup.py'), > 'sdist', '--formats=zip', '--dist-dir', > local('/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/dist')] > /usr/local/lib/python2.7/dist-packages/setuptools/dist.py:397: UserWarning: > Normalizing '2.5.0.dev' to '2.5.0.dev0' > normalized_version, > Regenerating common_urns module. > running sdist > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/gen_protos.py:51: > UserWarning: Installing grpcio-tools is recommended for development. 
> warnings.warn('Installing grpcio-tools is recommended for development.') > WARNING:root:Installing grpcio-tools into > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/.eggs/grpcio-wheels > Process Process-1: > Traceback (most recent call last): > File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in > _bootstrap > self.run() > File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run > self._target(*self._args, **self._kwargs) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/gen_protos.py", > line 149, in _install_grpcio_tools_and_generate_proto_files > shutil.rmtree(build_path) > File "/usr/lib/python2.7/shutil.py", line 239, in rmtree > onerror(os.listdir, path, sys.exc_info()) > File "/usr/lib/python2.7/shutil.py", line 237, in rmtree > names = os.listdir(path) > OSError: [Errno 2] No such file or directory: > '/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/.eggs/grpcio-wheels-build' > Traceback (most recent call last): > File "setup.py", line 235, in > 'test': generate_protos_first(test), > File "/usr/local/lib/python2.7/dist-packages/setuptools/__init__.py", line > 129, in setup > return distutils.core.setup(**attrs) > File "/usr/lib/python2.7/distutils/core.py", line 151, in setup > dist.run_commands() > File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands > self.run_command(cmd) > File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command > cmd_obj.run() > File "setup.py", line 141, in run > gen_protos.generate_proto_files() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/gen_protos.py", > line 97, in generate_proto_files > raise ValueError("Proto generation failed (see log for details).") > ValueError: Proto generation failed (see log for details). 
> ERROR: FAIL could not package project - v = InvocationError('/usr/bin/python > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/setup.py > sdist --formats=zip --dist-dir > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/dist > (see > /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/target/.tox/log/tox-0.log)', > 1) > Build step 'Execute shell' marked build as failure > Sending e-mails to: commits@beam.apache.org > Finished: FAILURE -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4098) Handle WindowInto in the Java SDK Harness
Robert Bradshaw created BEAM-4098: - Summary: Handle WindowInto in the Java SDK Harness Key: BEAM-4098 URL: https://issues.apache.org/jira/browse/BEAM-4098 Project: Beam Issue Type: Task Components: sdk-java-core Reporter: Robert Bradshaw Assignee: Kenneth Knowles -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4097) Python SDK should set the environment in the job submission protos
Robert Bradshaw created BEAM-4097: - Summary: Python SDK should set the environment in the job submission protos Key: BEAM-4097 URL: https://issues.apache.org/jira/browse/BEAM-4097 Project: Beam Issue Type: Task Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4074) Cleanup metrics/counters code
Robert Bradshaw created BEAM-4074: - Summary: Cleanup metrics/counters code Key: BEAM-4074 URL: https://issues.apache.org/jira/browse/BEAM-4074 Project: Beam Issue Type: Task Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay E.g. right now we have
metricbase.Distribution
metric.DelegatingDistribution
metric.cells.DistributionCell
metric.cells.DistributionResult
metric.cells.DistributionData
metric.cells.DistributionAggregator
transforms.cy_combiners.DataflowDistributionCounter
transforms.cy_combiners.DataflowDistributionCounterFn
transforms.cy_dataflow_distribution_counter.DataflowDistributionCounter
plus some code under util/counters. This is true for "ordinary" sum and max counters as well. We should consolidate/simplify this code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3407) Post-commit test fails with "Backing channel 'beam8' is disconnected"
[ https://issues.apache.org/jira/browse/BEAM-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433219#comment-16433219 ] Robert Bradshaw commented on BEAM-3407: --- This is still happening, e.g. [https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/31/console] https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/34/console > Post-commit test fails with "Backing channel 'beam8' is disconnected" > - > > Key: BEAM-3407 > URL: https://issues.apache.org/jira/browse/BEAM-3407 > Project: Beam > Issue Type: Bug > Components: testing >Reporter: Henning Rohde >Assignee: Jason Kuster >Priority: Major > > Example failure below. > https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/3809/console > ... > 2018-01-02T22:46:30.423 [INFO] Excluding io.grpc:grpc-protobuf-lite:jar:1.2.0 > from the shaded jar. > 2018-01-02T22:46:30.423 [INFO] Excluding io.grpc:grpc-stub:jar:1.2.0 from the > shaded jar. > ERROR: Failed to parse POMs > java.io.IOException: Backing channel 'beam8' is disconnected. 
> at > hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:212) > at > hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281) > at com.sun.proxy.$Proxy131.isAlive(Unknown Source) > at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1138) > at hudson.maven.ProcessCache$MavenProcess.call(ProcessCache.java:166) > at > hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:879) > at > hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504) > at hudson.model.Run.execute(Run.java:1724) > at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543) > at hudson.model.ResourceController.execute(ResourceController.java:97) > at hudson.model.Executor.run(Executor.java:421) > Caused by: java.io.IOException: Unexpected termination of the channel > at > hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77) > Caused by: java.io.EOFException > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2675) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3150) > at > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:859) > at java.io.ObjectInputStream.(ObjectInputStream.java:355) > at > hudson.remoting.ObjectInputStreamEx.(ObjectInputStreamEx.java:48) > at > hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35) > at > hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63) > ERROR: Step ‘E-mail Notification’ failed: no workspace for > beam_PostCommit_Java_ValidatesRunner_Spark #3809 > ERROR: beam8 is offline; cannot locate JDK 1.8 (latest) > ERROR: beam8 is offline; cannot locate Maven 3.3.3 > Finished: FAILURE -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4030) Add CombineFn.compact, similar to Java
Robert Bradshaw created BEAM-4030: - Summary: Add CombineFn.compact, similar to Java Key: BEAM-4030 URL: https://issues.apache.org/jira/browse/BEAM-4030 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay Some CombineFns buffer elements in their add_inputs because the cost of a combining operation can be effectively amortized across many elements. However, this introduces the extra (possibly higher) cost of serializing these larger buffers through shuffle. We should add a CombineFn.compact(self, accumulator) method (defaulting to the identity), similar to what the Java SDK provides, which is called when flushing an element from the PGBKCV table. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
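A sketch of the proposed pattern (names modeled loosely on, but not taken from, the actual apache_beam CombineFn API): the accumulator buffers raw inputs, and compact folds the buffer in before the accumulator is serialized:

```python
# Illustrative CombineFn-style class: add_input buffers raw values so the
# (assumed expensive) merge runs once per batch; compact() flushes the buffer
# so raw elements are never serialized through shuffle.
class BufferingMeanFn:
    MAX_BUFFER = 10

    def create_accumulator(self):
        # (running_sum, running_count, buffer_of_unmerged_inputs)
        return (0.0, 0, [])

    def add_input(self, acc, value):
        s, n, buf = acc
        buf = buf + [value]
        if len(buf) >= self.MAX_BUFFER:
            return (s + sum(buf), n + len(buf), [])
        return (s, n, buf)

    def compact(self, acc):
        # Called before shipping the accumulator: fold the buffer in.
        s, n, buf = acc
        return (s + sum(buf), n + len(buf), [])

    def extract_output(self, acc):
        s, n, _ = self.compact(acc)
        return s / n if n else float('nan')

fn = BufferingMeanFn()
acc = fn.create_accumulator()
for v in [1, 2, 3, 4, 5]:
    acc = fn.add_input(acc, v)
compacted = fn.compact(acc)  # buffer folded in: nothing raw left to serialize
```

The default compact would simply return the accumulator unchanged, matching the Java SDK's identity default.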
[jira] [Commented] (BEAM-3737) Key-aware batching function
[ https://issues.apache.org/jira/browse/BEAM-3737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426269#comment-16426269 ] Robert Bradshaw commented on BEAM-3737: --- It seems that GroupByKey() would already give you values batched per key, right? Or are you looking for something you can place before the GBK that enables combiner lifting? > Key-aware batching function > --- > > Key: BEAM-3737 > URL: https://issues.apache.org/jira/browse/BEAM-3737 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chuan Yu Foo >Priority: Major > > I have a CombineFn for which add_input has very large overhead. I would like > to batch the incoming elements into a large batch before each call to > add_input to reduce this overhead. In other words, I would like to do > something like: > {{elements | GroupByKey() | BatchElements() | CombineValues(MyCombineFn())}} > Unfortunately, BatchElements is not key-aware, and can't be used after a > GroupByKey to batch elements per key. I'm working around this by doing the > batching within CombineValues, which makes the CombineFn rather messy. It > would be nice if there were a key-aware BatchElements transform which could > be used in this context. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
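A key-aware batching helper of the kind requested might look like this sketch (hypothetical, operating on the (key, values) pairs a GroupByKey produces):

```python
# Batch the values of each (key, iterable) pair after a GroupByKey, so an
# expensive add_input can be called once per batch instead of once per element.
def batch_grouped_values(grouped, batch_size):
    """grouped: iterable of (key, values) pairs; yields (key, batch) pairs."""
    for key, values in grouped:
        batch = []
        for v in values:
            batch.append(v)
            if len(batch) == batch_size:
                yield key, batch
                batch = []
        if batch:  # flush the final partial batch for this key
            yield key, batch

batches = list(batch_grouped_values([('a', range(5))], 2))
```

Unlike the existing BatchElements, batches here never span keys; a fixed batch_size stands in for BatchElements' adaptive sizing.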
[jira] [Updated] (BEAM-4012) Suppress tracebacks on failure-catching tests
[ https://issues.apache.org/jira/browse/BEAM-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-4012: -- Labels: beginner (was: ) > Suppress tracebacks on failure-catching tests > - > > Key: BEAM-4012 > URL: https://issues.apache.org/jira/browse/BEAM-4012 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Ahmet Altay >Priority: Major > Labels: beginner > > Our tests of assert_that (e.g. in > apache_beam.runners.portability.fn_api_runner_test) dump the expected "error" > to stdout making our tests noisy (and they look like failures). It'd be good > to suppress these in tests (while making sure things are still properly > logged on workers). > There are probably other tests of similar nature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-4012) Suppress tracebacks on failure-catching tests
Robert Bradshaw created BEAM-4012: - Summary: Suppress tracebacks on failure-catching tests Key: BEAM-4012 URL: https://issues.apache.org/jira/browse/BEAM-4012 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay Our tests of assert_that (e.g. in apache_beam.runners.portability.fn_api_runner_test) dump the expected "error" to stdout making our tests noisy (and they look like failures). It'd be good to suppress these in tests (while making sure things are still properly logged on workers). There are probably other tests of similar nature. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3911) Do we want to encourage release pinning?
Robert Bradshaw created BEAM-3911: - Summary: Do we want to encourage release pinning? Key: BEAM-3911 URL: https://issues.apache.org/jira/browse/BEAM-3911 Project: Beam Issue Type: Bug Components: website Reporter: Robert Bradshaw Assignee: Melissa Pashniak The Python instructions recommend adding apache-beam==2.3.0 to one's setup.py. Instead, we should probably recommend apache-beam<3. It may also be worth calling out apache-beam can be installed with pip install apache-beam. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-1442) Performance improvement of the Python DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407244#comment-16407244 ] Robert Bradshaw commented on BEAM-1442: --- As of today, you can get this from pip. Note for avro in particular that https://issues.apache.org/jira/browse/BEAM-2810 is still open. > Performance improvement of the Python DirectRunner > -- > > Key: BEAM-1442 > URL: https://issues.apache.org/jira/browse/BEAM-1442 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Charles Chen >Priority: Major > Labels: gsoc2017, mentor, python > Fix For: 2.4.0 > > > The DirectRunner for Python and Java are intended to act as policy enforcers, > and correctness checkers for Beam pipelines; but there are users that run > data processing tasks in them. > Currently, the Python Direct Runner has less-than-great performance, although > some work has gone into improving it. There are more opportunities for > improvement. > Skills for this project: > * Python > * Cython (nice to have) > * Working through the Beam getting started materials (nice to have) > To start figuring out this problem, it is advisable to run a simple pipeline, > and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions > directly on JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3890) Add withHotKeyFanout to Python
Robert Bradshaw created BEAM-3890: - Summary: Add withHotKeyFanout to Python Key: BEAM-3890 URL: https://issues.apache.org/jira/browse/BEAM-3890 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-1442) Performance improvement of the Python DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-1442. --- Resolution: Fixed While one is rarely "done" with possible performance improvements, I'm going to close this bug as done for 2.4.0 due to the significant improvements that make the Python runner at least execute at reasonable speed now. > Performance improvement of the Python DirectRunner > -- > > Key: BEAM-1442 > URL: https://issues.apache.org/jira/browse/BEAM-1442 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Pablo Estrada >Assignee: Charles Chen >Priority: Major > Labels: gsoc2017, mentor, python > Fix For: 2.4.0 > > > The DirectRunner for Python and Java are intended to act as policy enforcers, > and correctness checkers for Beam pipelines; but there are users that run > data processing tasks in them. > Currently, the Python Direct Runner has less-than-great performance, although > some work has gone into improving it. There are more opportunities for > improvement. > Skills for this project: > * Python > * Cython (nice to have) > * Working through the Beam getting started materials (nice to have) > To start figuring out this problem, it is advisable to run a simple pipeline, > and study the `Pipeline.run` and `DirectRunner.run` methods. Ask questions > directly on JIRA. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3851) Support element timestamps while publishing to Kafka.
[ https://issues.apache.org/jira/browse/BEAM-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3851: -- Fix Version/s: (was: 2.4.0) 2.5.0 > Support element timestamps while publishing to Kafka. > - > > Key: BEAM-3851 > URL: https://issues.apache.org/jira/browse/BEAM-3851 > Project: Beam > Issue Type: Improvement > Components: io-java-kafka >Affects Versions: 2.3.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > Fix For: 2.5.0 > > Time Spent: 40m > Remaining Estimate: 0h > > KafkaIO sink should support using input element timestamp for the message > published to Kafka. Otherwise there is no way for user to influence the > timestamp of the messages in Kafka sink. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-3815) 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or beam-2.4.0
[ https://issues.apache.org/jira/browse/BEAM-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3815. --- Resolution: Fixed > 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or > beam-2.4.0 > -- > > Key: BEAM-3815 > URL: https://issues.apache.org/jira/browse/BEAM-3815 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.4.0 >Reporter: Alan Myrvold >Assignee: Robert Bradshaw >Priority: Blocker > Fix For: 2.4.0 > > > Running dataflow quickstart with the 2.4.0 RC2 uses a container named > beam-master-20180228, should be 2.4.0 or beam-2.4.0, as in previous releases. > For 2.3.0, this version was set on the release branch with > [https://github.com/apache/beam/commit/245cb4cbec9c1dd7d4f9ff8eb3ab38667c5472e9#diff-0032670e9006c6eba52569423c64e6d2] > Perhaps the release guide needs to be updated? > wget > https://repository.apache.org/content/repositories/orgapachebeam-1030/org/apache/beam/beam-runners-google-cloud-dataflow-java/2.4.0/beam-runners-google-cloud-dataflow-java-2.4.0.jar > jar xf beam-runners-google-cloud-dataflow-java-2.4.0.jar > org/apache/beam/runners/dataflow/dataflow.properties > grep container.version org/apache/beam/runners/dataflow/dataflow.properties -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3868) Move state and time logic out of the ParDoEvaluator
[ https://issues.apache.org/jira/browse/BEAM-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3868: -- Fix Version/s: (was: 2.4.0) > Move state and time logic out of the ParDoEvaluator > --- > > Key: BEAM-3868 > URL: https://issues.apache.org/jira/browse/BEAM-3868 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Batkhuyag Batsaikhan >Assignee: Batkhuyag Batsaikhan >Priority: Minor > > ParDoEvaluator has State and Timer logic that ideally belongs to > StatefulParDoEvaluator. If possible, move these logic to its appropriate > place. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3871) Ensure all non-root transforms have a Parent
Robert Bradshaw created BEAM-3871: - Summary: Ensure all non-root transforms have a Parent Key: BEAM-3871 URL: https://issues.apache.org/jira/browse/BEAM-3871 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay Sometimes it is set to None, but even top-level transforms should have it set to the root transform. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3865) Incorrect timestamp on merging window outputs.
Robert Bradshaw created BEAM-3865: - Summary: Incorrect timestamp on merging window outputs. Key: BEAM-3865 URL: https://issues.apache.org/jira/browse/BEAM-3865 Project: Beam Issue Type: Bug Components: sdk-py-core Affects Versions: 2.3.0, 2.2.0 Reporter: Robert Bradshaw Assignee: Ahmet Altay Looks like we're setting multiple watermark holds with one arbitrarily being held. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3815) 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or beam-2.4.0
[ https://issues.apache.org/jira/browse/BEAM-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3815: -- Fix Version/s: 2.4.0 > 2.4.0 RC2 uses java worker version beam-master-20180228, should be 2.4.0 or > beam-2.4.0 > -- > > Key: BEAM-3815 > URL: https://issues.apache.org/jira/browse/BEAM-3815 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.4.0 >Reporter: Alan Myrvold >Assignee: Robert Bradshaw >Priority: Blocker > Fix For: 2.4.0 > > > Running dataflow quickstart with the 2.4.0 RC2 uses a container named > beam-master-20180228, should be 2.4.0 or beam-2.4.0, as in previous releases. > For 2.3.0, this version was set on the release branch with > [https://github.com/apache/beam/commit/245cb4cbec9c1dd7d4f9ff8eb3ab38667c5472e9#diff-0032670e9006c6eba52569423c64e6d2] > Perhaps the release guide needs to be updated? > wget > https://repository.apache.org/content/repositories/orgapachebeam-1030/org/apache/beam/beam-runners-google-cloud-dataflow-java/2.4.0/beam-runners-google-cloud-dataflow-java-2.4.0.jar > jar xf beam-runners-google-cloud-dataflow-java-2.4.0.jar > org/apache/beam/runners/dataflow/dataflow.properties > grep container.version org/apache/beam/runners/dataflow/dataflow.properties -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3812) Avoid pickling PTransforms in proto representation
Robert Bradshaw created BEAM-3812: - Summary: Avoid pickling PTransforms in proto representation Key: BEAM-3812 URL: https://issues.apache.org/jira/browse/BEAM-3812 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay Any transform that requires passing information through the runner protos should have an explicit urn and payload. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3799) Nexmark Query 10 breaks with direct runner
[ https://issues.apache.org/jira/browse/BEAM-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389910#comment-16389910 ] Robert Bradshaw commented on BEAM-3799: --- The direct runner does extra checking. Looks like the benchmark is incorrectly written. > Nexmark Query 10 breaks with direct runner > -- > > Key: BEAM-3799 > URL: https://issues.apache.org/jira/browse/BEAM-3799 > Project: Beam > Issue Type: Bug > Components: runner-direct >Affects Versions: 2.4.0, 2.5.0 >Reporter: Ismaël Mejía >Assignee: Thomas Groh >Priority: Major > Fix For: 2.4.0 > > > While running query 10 with the direct runner like this: > {quote}mvn exec:java -Dexec.mainClass=org.apache.beam.sdk.nexmark.Main > -Pdirect-runner -Dexec.args="--runner=DirectRunner --query=10 > --streaming=false --manageResources=false --monitorJobs=true > --enforceEncodability=true --enforceImmutability=true" -pl 'sdks/java/nexmark' > {quote} > I found that it breaks with the direct runner with following exception (it > works ok with the other runners): > {quote}[WARNING] > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke > (NativeMethodAccessorImpl.java:62) > at sun.reflect.DelegatingMethodAccessorImpl.invoke > (DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke (Method.java:498) > at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:294) > at java.lang.Thread.run (Thread.java:748) > Caused by: org.apache.beam.sdk.util.IllegalMutationException: PTransform > Query10/Query10.UploadEvents/ParMultiDo(Anonymous) mutated value KV{null, > 2015-07-15T00:00:09.999Z shard-3-of-00025 0 ON_TIME null > } after it was output (new value was KV{null, 2015-07-15T00:00:09.999Z > shard-3-of-00025 0 ON_TIME null > }). Values must not be mutated in any way after being output. 
> at > org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.commit > (ImmutabilityCheckingBundleFactory.java:134) > at org.apache.beam.runners.direct.EvaluationContext.commitBundles > (EvaluationContext.java:212) > at org.apache.beam.runners.direct.EvaluationContext.handleResult > (EvaluationContext.java:152) > at > org.apache.beam.runners.direct.QuiescenceDriver$TimerIterableCompletionCallback.handleResult > (QuiescenceDriver.java:258) > at org.apache.beam.runners.direct.DirectTransformExecutor.finishBundle > (DirectTransformExecutor.java:190) > at org.apache.beam.runners.direct.DirectTransformExecutor.run > (DirectTransformExecutor.java:127) > at java.util.concurrent.Executors$RunnableAdapter.call > (Executors.java:511) > at java.util.concurrent.FutureTask.run (FutureTask.java:266) > at java.util.concurrent.ThreadPoolExecutor.runWorker > (ThreadPoolExecutor.java:1149) > at java.util.concurrent.ThreadPoolExecutor$Worker.run > (ThreadPoolExecutor.java:624) > at java.lang.Thread.run (Thread.java:748) > Caused by: org.apache.beam.sdk.util.IllegalMutationException: Value KV{null, > 2015-07-15T00:00:09.999Z shard-3-of-00025 0 ON_TIME null > } mutated illegally, new value was KV{null, 2015-07-15T00:00:09.999Z > shard-3-of-00025 0 ON_TIME null > }. 
Encoding was > rO0ABXNyADZvcmcuYXBhY2hlLmJlYW0uc2RrLm5leG1hcmsucXVlcmllcy5RdWVyeTEwJE91dHB1dEZpbGUWUg9rZM1SvgIABUoABWluZGV4TAAIZmlsZW5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAAxtYXhUaW1lc3RhbXB0ABdMb3JnL2pvZGEvdGltZS9JbnN0YW50O0wABXNoYXJkcQB-AAFMAAZ0aW1pbmd0ADpMb3JnL2FwYWNoZS9iZWFtL3Nkay90cmFuc2Zvcm1zL3dpbmRvd2luZy9QYW5lSW5mbyRUaW1pbmc7eHAAAHBzcgAVb3JnLmpvZGEudGltZS5JbnN0YW50Lci-0MYOnM0CAAFKAAdpTWlsbGlzeHFOjwLrD3QAFHNoYXJkLTAwMDAzLW9mLTAwMDI1fnIAOG9yZy5hcGFjaGUuYmVhbS5zZGsudHJhbnNmb3Jtcy53aW5kb3dpbmcuUGFuZUluZm8kVGltaW5nAAASAAB4cgAOamF2YS5sYW5nLkVudW0AABIAAHhwdAAHT05fVElNRQ, > now > rO0ABXNyADZvcmcuYXBhY2hlLmJlYW0uc2RrLm5leG1hcmsucXVlcmllcy5RdWVyeTEwJE91dHB1dEZpbGUWUg9rZM1SvgIABUoABWluZGV4TAAIZmlsZW5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAAxtYXhUaW1lc3RhbXB0ABdMb3JnL2pvZGEvdGltZS9JbnN0YW50O0wABXNoYXJkcQB-AAFMAAZ0aW1pbmd0ADpMb3JnL2FwYWNoZS9iZWFtL3Nkay90cmFuc2Zvcm1zL3dpbmRvd2luZy9QYW5lSW5mbyRUaW1pbmc7eHAAAHBzcgAVb3JnLmpvZGEudGltZS5JbnN0YW50Lci-0MYOnM0CAAFKAAdpTWlsbGlzeHFOjwLrD3QAFHNoYXJkLTAwMDAzLW9mLTAwMDI1fnIAOG9yZy5hcGFjaGUuYmVhbS5zZGsudHJhbnNmb3Jtcy53aW5kb3dpbmcuUGFuZUluZm8kVGltaW5nAAASAAB4cgAOamF2YS5sYW5nLkVudW0AABIAAHhwdAAHT05fVElNRQ. > at > org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.illegalMutation > (MutationDetectors.java:144) > at > org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.verifyUnmodifiedThrowingCheckedExceptions >
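The IllegalMutationException in the trace above comes from emitting a value and then mutating the same object in place; the enforcing bundle factory re-encodes the value and detects the change. A hypothetical, runner-independent Python sketch of the anti-pattern and its fix (illustration only, not Beam code):

```python
# Anti-pattern: reusing one mutable object across outputs. All emitted
# "values" alias the same dict, so later mutations retroactively change
# what was already output.
def bad_producer(out):
    buf = {'shard': None}
    for i in range(3):
        buf['shard'] = i
        out.append(buf)  # BUG: same dict object emitted every iteration

# Fix: emit a fresh value per output (or a deep copy of the buffer).
def good_producer(out):
    for i in range(3):
        out.append({'shard': i})  # new object each time; safe to mutate buf later
```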
[jira] [Updated] (BEAM-3409) Unexpected behavior of DoFn teardown method running in unit tests
[ https://issues.apache.org/jira/browse/BEAM-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3409: -- Fix Version/s: (was: 2.4.0) 2.5.0 > Unexpected behavior of DoFn teardown method running in unit tests > -- > > Key: BEAM-3409 > URL: https://issues.apache.org/jira/browse/BEAM-3409 > Project: Beam > Issue Type: Bug > Components: runner-direct >Affects Versions: 2.3.0 >Reporter: Alexey Romanenko >Assignee: Romain Manni-Bucau >Priority: Blocker > Labels: test > Fix For: 2.5.0 > > Time Spent: 5h 10m > Remaining Estimate: 0h > > Writing a unit test, I found out a strange behaviour of Teardown method of > DoFn implementation when I run this method in unit tests using TestPipeline. > To be more precise, it doesn’t wait until teardown() method will be finished, > it just exits from this method after about 1 sec (on my machine) even if it > should take longer (very simple example - running infinite loop inside this > method or put thread in sleep). In the same time, when I run the same code > from main() with ordinary Pipeline and direct runner, then it’s ok and it > works as expected - teardown() method will be performed completely despite > how much time it will take. > I created two test cases to reproduce this issue - the first one to run with > main() and the second one to run with junit. They use the same implementation > of DoFn (class LongTearDownFn) and expects that teardown method will be > running at least for SLEEP_TIME ms. In case of running as junit test it's not > a case (see output log). > - run with main() > https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/main/java/TearDown.java > - run with junit > https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/test/java/TearDownTest.java -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3792) Python submits portable pipelines to the Flink-served endpoint.
Robert Bradshaw created BEAM-3792: - Summary: Python submits portable pipelines to the Flink-served endpoint. Key: BEAM-3792 URL: https://issues.apache.org/jira/browse/BEAM-3792 Project: Beam Issue Type: Sub-task Components: runner-flink Reporter: Robert Bradshaw Assignee: Robert Bradshaw -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3780) Add a utility to instantiate a partially unknown coder
[ https://issues.apache.org/jira/browse/BEAM-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386905#comment-16386905 ] Robert Bradshaw commented on BEAM-3780: --- Isn't this capability already present as part of LengthPrefixUnknownCoders? Or is this a proposal to refactor/modify how that works? > Add a utility to instantiate a partially unknown coder > -- > > Key: BEAM-3780 > URL: https://issues.apache.org/jira/browse/BEAM-3780 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Thomas Groh >Assignee: Thomas Groh >Priority: Major > > Coders must be understood by the SDK harness that is encoding or decoding the > associated elements. However, the pipeline runner is capable of constructing > partial coders, where an unknown coder is replaced with a ByteArrayCoder. It > then can decompose the portions of elements it is aware of, without having to > understand the custom element encodings. > > This should go in CoderTranslation, as an alternative to the full-fidelity > rehydration of a coder. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-3409) Unexpected behavior of DoFn teardown method running in unit tests
[ https://issues.apache.org/jira/browse/BEAM-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3409. --- Resolution: Fixed > Unexpected behavior of DoFn teardown method running in unit tests > -- > > Key: BEAM-3409 > URL: https://issues.apache.org/jira/browse/BEAM-3409 > Project: Beam > Issue Type: Bug > Components: runner-direct >Affects Versions: 2.3.0 >Reporter: Alexey Romanenko >Assignee: Romain Manni-Bucau >Priority: Blocker > Labels: test > Fix For: 2.4.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Writing a unit test, I found out a strange behaviour of Teardown method of > DoFn implementation when I run this method in unit tests using TestPipeline. > To be more precise, it doesn’t wait until teardown() method will be finished, > it just exits from this method after about 1 sec (on my machine) even if it > should take longer (very simple example - running infinite loop inside this > method or put thread in sleep). In the same time, when I run the same code > from main() with ordinary Pipeline and direct runner, then it’s ok and it > works as expected - teardown() method will be performed completely despite > how much time it will take. > I created two test cases to reproduce this issue - the first one to run with > main() and the second one to run with junit. They use the same implementation > of DoFn (class LongTearDownFn) and expects that teardown method will be > running at least for SLEEP_TIME ms. In case of running as junit test it's not > a case (see output log). > - run with main() > https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/main/java/TearDown.java > - run with junit > https://github.com/aromanenko-dev/beam-samples/blob/master/runners-tests/src/test/java/TearDownTest.java -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3768) Compile error for Flink translation
Robert Bradshaw created BEAM-3768: - Summary: Compile error for Flink translation Key: BEAM-3768 URL: https://issues.apache.org/jira/browse/BEAM-3768 Project: Beam Issue Type: Test Components: runner-flink Reporter: Robert Bradshaw Assignee: Aljoscha Krettek Fix For: 2.4.0 2018-03-01T21:22:58.234 [INFO] --- maven-compiler-plugin:3.7.0:compile (default-compile) @ beam-runners-flink_2.11 --- 2018-03-01T21:22:58.258 [INFO] Changes detected - recompiling the module! 2018-03-01T21:22:58.259 [INFO] Compiling 75 source files to /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/target/classes 2018-03-01T21:22:59.555 [INFO] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/DedupingOperator.java: Some input files use or override a deprecated API. 2018-03-01T21:22:59.556 [INFO] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/DedupingOperator.java: Recompile with -Xlint:deprecation for details. 2018-03-01T21:22:59.556 [INFO] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/SideInputInitializer.java: Some input files use unchecked or unsafe operations. 2018-03-01T21:22:59.556 [INFO] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/functions/SideInputInitializer.java: Recompile with -Xlint:unchecked for details. 
2018-03-01T21:22:59.556 [INFO] - 2018-03-01T21:22:59.556 [ERROR] COMPILATION ERROR : 2018-03-01T21:22:59.556 [INFO] - 2018-03-01T21:22:59.557 [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_MavenInstall@2/src/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java:[359,11] cannot infer type arguments for org.apache.beam.runners.core.ProcessFnRunner<> reason: cannot infer type-variable(s) InputT,OutputT,RestrictionT (argument mismatch; org.apache.beam.runners.core.DoFnRunner cannot be converted to org.apache.beam.runners.core.SimpleDoFnRunner>,OutputT>) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3479) Add a regression test for the DoFn classloader selection
[ https://issues.apache.org/jira/browse/BEAM-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382865#comment-16382865 ] Robert Bradshaw commented on BEAM-3479: --- Bumping to the next version, as this is non-blocking (per comments on the PR). > Add a regression test for the DoFn classloader selection > > > Key: BEAM-3479 > URL: https://issues.apache.org/jira/browse/BEAM-3479 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Romain Manni-Bucau >Assignee: Romain Manni-Bucau >Priority: Major > Fix For: 2.5.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Follow up task after https://github.com/apache/beam/pull/4235 merge. This > task is about ensuring we test that. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3479) Add a regression test for the DoFn classloader selection
[ https://issues.apache.org/jira/browse/BEAM-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3479: -- Fix Version/s: (was: 2.4.0) 2.5.0 > Add a regression test for the DoFn classloader selection > > > Key: BEAM-3479 > URL: https://issues.apache.org/jira/browse/BEAM-3479 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Romain Manni-Bucau >Assignee: Romain Manni-Bucau >Priority: Major > Fix For: 2.5.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Follow up task after https://github.com/apache/beam/pull/4235 merge. This > task is about ensuring we test that. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-3611) Split KafkaIO.java into smaller files
[ https://issues.apache.org/jira/browse/BEAM-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3611. --- Resolution: Fixed Closing this as resolved, based on an inspection of the PR. > Split KafkaIO.java into smaller files > - > > Key: BEAM-3611 > URL: https://issues.apache.org/jira/browse/BEAM-3611 > Project: Beam > Issue Type: Improvement > Components: io-java-kafka >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Minor > Fix For: 2.4.0 > > Time Spent: 10m > Remaining Estimate: 0h > > KafkaIO.java has grown too big and includes both source and sink > implementation. Better to move these to own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3700) PipelineOptionsFactory leaks memory
[ https://issues.apache.org/jira/browse/BEAM-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3700: -- Fix Version/s: (was: 2.4.0) 2.5.0 > PipelineOptionsFactory leaks memory > --- > > Key: BEAM-3700 > URL: https://issues.apache.org/jira/browse/BEAM-3700 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Romain Manni-Bucau >Assignee: Romain Manni-Bucau >Priority: Major > Fix For: 2.5.0 > > > PipelineOptionsFactory has a lot of cache but no way to reset it. This task > is about adding a public method to be able to control it in integrations > (runners likely). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (BEAM-3689) Direct runner leak a reader for every 10 input records
[ https://issues.apache.org/jira/browse/BEAM-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw closed BEAM-3689. - Resolution: Fixed Based on my reading of the PR, this is fixed. > Direct runner leak a reader for every 10 input records > -- > > Key: BEAM-3689 > URL: https://issues.apache.org/jira/browse/BEAM-3689 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > Fix For: 2.4.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Direct runner reads 10 records at a time from a reader. I think the intention > is to reuse the reader, but it reuses only if the reader is idle initially, > not when the source has messages available. > When I was testing KafkaIO with direct runner it kept opening new reader for > every 10 records and soon ran out of file descriptors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3702) Support system properties source for pipeline options
[ https://issues.apache.org/jira/browse/BEAM-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3702: -- Fix Version/s: (was: 2.4.0) 2.5.0 > Support system properties source for pipeline options > - > > Key: BEAM-3702 > URL: https://issues.apache.org/jira/browse/BEAM-3702 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Romain Manni-Bucau >Assignee: Romain Manni-Bucau >Priority: Major > Fix For: 2.5.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3749) support customized trigger/accumulationMode in BeamSql
[ https://issues.apache.org/jira/browse/BEAM-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3749: -- Fix Version/s: (was: 2.4.0) 2.5.0 > support customized trigger/accumulationMode in BeamSql > -- > > Key: BEAM-3749 > URL: https://issues.apache.org/jira/browse/BEAM-3749 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Xu Mingmin >Assignee: Xu Mingmin >Priority: Major > Fix For: 2.5.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently BeamSql use {{DefaultTrigger}} for aggregation operations. > By adding two options {{withTrigger(Trigger)}} and > {{withAccumulationMode(AccumulationMode)}}, developers can specify their own > aggregation strategies with BeamSql. > [~xumingming] [~kedin] [~kenn] for any comments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3749) support customized trigger/accumulationMode in BeamSql
[ https://issues.apache.org/jira/browse/BEAM-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381530#comment-16381530 ] Robert Bradshaw commented on BEAM-3749: --- I wonder if sink-based triggers would be a better fit here than adding triggering to an SQL statement. > support customized trigger/accumulationMode in BeamSql > -- > > Key: BEAM-3749 > URL: https://issues.apache.org/jira/browse/BEAM-3749 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Xu Mingmin >Assignee: Xu Mingmin >Priority: Major > Fix For: 2.4.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently BeamSql use {{DefaultTrigger}} for aggregation operations. > By adding two options {{withTrigger(Trigger)}} and > {{withAccumulationMode(AccumulationMode)}}, developers can specify their own > aggregation strategies with BeamSql. > [~xumingming] [~kedin] [~kenn] for any comments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3713) Consider moving away from nose to nose2 or pytest.
Robert Bradshaw created BEAM-3713: - Summary: Consider moving away from nose to nose2 or pytest. Key: BEAM-3713 URL: https://issues.apache.org/jira/browse/BEAM-3713 Project: Beam Issue Type: Test Components: sdk-py-core, testing Reporter: Robert Bradshaw Assignee: Ahmet Altay Per https://nose.readthedocs.io/en/latest/, nose is in maintenance mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
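For context on the proposed migration: pytest (and nose2) auto-collect plain test_* functions with bare assert statements, so tests that avoid nose-specific plugins and APIs typically run unchanged. A minimal hypothetical example (the helper is a stand-in, not Beam code):

```python
# test_wordcount.py -- a nose-style test module that pytest can collect
# as-is: plain test_* functions, bare asserts, no framework imports.
def split_words(line):
    """Toy helper standing in for pipeline logic under test."""
    return line.split()

def test_split_words():
    # pytest discovers test_* functions in test_*.py files automatically.
    assert split_words("to be or not") == ["to", "be", "or", "not"]
```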
[jira] [Closed] (BEAM-3205) Publicly document known coder wire formats and their URNs
[ https://issues.apache.org/jira/browse/BEAM-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw closed BEAM-3205. - Resolution: Fixed Fix Version/s: 2.4.0 > Publicly document known coder wire formats and their URNs > - > > Key: BEAM-3205 > URL: https://issues.apache.org/jira/browse/BEAM-3205 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Kenneth Knowles >Assignee: Robert Bradshaw >Priority: Major > Labels: portability > Fix For: 2.4.0 > > > Overarching issue: We need to get our Google Docs, markdown, and email > threads that sketch the Beam model as it is developed into a centralized > place with clear information architecture / navigation, and draw the line > that "if it isn't reachable from here in an obvious way it isn't the spec". > [1] > Specific issue: Which coders are required for a runner and SDK to understand? > Which coders are otherwise considered standardized? What is the abstract > specification for their wire format? > Today we have > https://github.com/apache/beam/blob/master/model/fn-execution/src/test/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml > which is the beginning of a compliance test suite for standardized coders. > This would really benefit from: > - narrative descriptions of the formats, including _abstract_ specification > (not examples) and perhaps motivation > - specification of which are required and which are merely "well known" > - ties into BEAM-3203 in terms of which coders are required to decode to > compatible value in every SDK > - once we have an abstract spec and some examples, and one language has > robust coders that pass the examples, we could turn it around and treat that > implementation as a reference impl for fuzz testing > Any sort of fancy hacking that blends the tests with the narrative is fine, > though mostly I think they'll end up covering disjoint topics. 
> [1] I filed BEAM-2567 and BEAM-2568 and ported > https://beam.apache.org/contribute/runner-guide/, and [~herohde] put together > https://beam.apache.org/contribute/portability/ and > https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-3074) Propagate pipeline protos through Dataflow API from Python
[ https://issues.apache.org/jira/browse/BEAM-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3074. --- Resolution: Fixed Fix Version/s: 2.4.0 > Propagate pipeline protos through Dataflow API from Python > -- > > Key: BEAM-3074 > URL: https://issues.apache.org/jira/browse/BEAM-3074 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Kenneth Knowles >Assignee: Robert Bradshaw >Priority: Major > Labels: portability > Fix For: 2.4.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3644) Speed up Python DirectRunner execution by using the FnApiRunner when possible
[ https://issues.apache.org/jira/browse/BEAM-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3644: -- Description: Local execution of Beam pipelines on the current Python DirectRunner currently suffers from performance issues, which makes it hard for pipeline authors to iterate, especially on medium to large size datasets. We would like to optimize and make this a better experience for Beam users. The FnApiRunner was written as a way of leveraging the portability framework execution code path for local execution for portability development. We've found it also offers great speedups in batch execution, so we propose to switch to use this runner in batch pipelines. For example, WordCount on the Shakespeare dataset with a single CPU core now takes 50 seconds to run, compared to 12 minutes before, a 15x performance improvement that users can get for free, with no pipeline changes. was: Local execution of Beam pipelines on the current Python DirectRunner currently suffers from performance issues, which makes it hard for pipeline authors to iterate, especially on medium to large size datasets. We would like to optimize and make this a better experience for Beam users. In the past few months, Robert implemented the FnApiRunner as a way of leveraging the portability framework execution code path for local execution. We've found great speedups in batch execution, so we propose to switch to use this runner in batch pipelines. For example, WordCount on the Shakespeare dataset with a single CPU core now takes 50 seconds to run, compared to 12 minutes before, a 15x performance improvement that users can get for free, with no pipeline changes. 
> Speed up Python DirectRunner execution by using the FnApiRunner when possible > - > > Key: BEAM-3644 > URL: https://issues.apache.org/jira/browse/BEAM-3644 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Charles Chen >Assignee: Charles Chen >Priority: Major > > Local execution of Beam pipelines on the current Python DirectRunner > currently suffers from performance issues, which makes it hard for pipeline > authors to iterate, especially on medium to large size datasets. We would > like to optimize and make this a better experience for Beam users. > The FnApiRunner was written as a way of leveraging the portability framework > execution code path for local execution for portability development. We've > found it also offers great speedups in batch execution, so we propose to > switch to use this runner in batch pipelines. For example, WordCount on the > Shakespeare dataset with a single CPU core now takes 50 seconds to run, > compared to 12 minutes before, a 15x performance improvement that users can > get for free, with no pipeline changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
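The performance figures quoted above are internally consistent; a quick arithmetic check (using only the numbers stated in the issue):

```python
# Speedup quoted in BEAM-3644: WordCount on the Shakespeare dataset,
# single CPU core, old DirectRunner path vs. the FnApiRunner path.
before_s = 12 * 60  # old DirectRunner execution: 12 minutes
after_s = 50        # FnApiRunner execution: 50 seconds
speedup = before_s / after_s
print(speedup)  # 14.4, i.e. the "15x" quoted above, rounded
```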
[jira] [Created] (BEAM-3625) DoFn.XxxParam does not work for Map and FlatMap
Robert Bradshaw created BEAM-3625: - Summary: DoFn.XxxParam does not work for Map and FlatMap Key: BEAM-3625 URL: https://issues.apache.org/jira/browse/BEAM-3625 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Robert Bradshaw -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3595) Normalize URNs across SDKs and runners.
Robert Bradshaw created BEAM-3595: - Summary: Normalize URNs across SDKs and runners. Key: BEAM-3595 URL: https://issues.apache.org/jira/browse/BEAM-3595 Project: Beam Issue Type: Bug Components: beam-model Reporter: Robert Bradshaw Assignee: Kenneth Knowles -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-2804) support TIMESTAMP in Sort
[ https://issues.apache.org/jira/browse/BEAM-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-2804. --- Resolution: Fixed Fix Version/s: 2.3.0 > support TIMESTAMP in Sort > - > > Key: BEAM-2804 > URL: https://issues.apache.org/jira/browse/BEAM-2804 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Xu Mingmin >Assignee: Shayang Zang >Priority: Minor > Labels: beginner > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-3510) Python Runner API serialization for CombinePerKey drops arguments to CombineFn
[ https://issues.apache.org/jira/browse/BEAM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-3510: -- Fix Version/s: 2.3.0 > Python Runner API serialization for CombinePerKey drops arguments to CombineFn > -- > > Key: BEAM-3510 > URL: https://issues.apache.org/jira/browse/BEAM-3510 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.2.0 >Reporter: Charles Chen >Assignee: Robert Bradshaw >Priority: Major > Fix For: 2.3.0 > > > Currently, the Runner API serialization for CombinePerKey drops the args and > kwargs that are passed at runtime to the CombineFn. We should fix this issue > so that the Runner API round-trip preserves semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-3510) Python Runner API serialization for CombinePerKey drops arguments to CombineFn
[ https://issues.apache.org/jira/browse/BEAM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3510. --- Resolution: Fixed > Python Runner API serialization for CombinePerKey drops arguments to CombineFn > -- > > Key: BEAM-3510 > URL: https://issues.apache.org/jira/browse/BEAM-3510 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.2.0 >Reporter: Charles Chen >Assignee: Robert Bradshaw >Priority: Major > Fix For: 2.3.0 > > > Currently, the Runner API serialization for CombinePerKey drops the args and > kwargs that are passed at runtime to the CombineFn. We should fix this issue > so that the Runner API round-trip preserves semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-3510) Python Runner API serialization for CombinePerKey drops arguments to CombineFn
[ https://issues.apache.org/jira/browse/BEAM-3510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-3510: - Assignee: Robert Bradshaw (was: Charles Chen) > Python Runner API serialization for CombinePerKey drops arguments to CombineFn > -- > > Key: BEAM-3510 > URL: https://issues.apache.org/jira/browse/BEAM-3510 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.2.0 >Reporter: Charles Chen >Assignee: Robert Bradshaw >Priority: Major > > Currently, the Runner API serialization for CombinePerKey drops the args and > kwargs that are passed at runtime to the CombineFn. We should fix this issue > so that the Runner API round-trip preserves semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
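The round-trip requirement described in BEAM-3510 can be illustrated outside Beam. A minimal sketch, not Beam's actual API (the `CombineFnPayload` class and its methods are hypothetical): whatever the runner serializes must carry the user-supplied args and kwargs alongside the CombineFn itself, so that both survive the round trip.

```python
import pickle

# Hypothetical payload wrapper: bundles the combine function together with
# the args/kwargs supplied at pipeline construction time, so a serialization
# round trip cannot drop them.
class CombineFnPayload:
    def __init__(self, fn, args, kwargs):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def apply(self, values):
        return self.fn(values, *self.args, **self.kwargs)

def bounded_sum(values, bound=500):
    # Example combine function taking an extra keyword argument.
    return min(sum(values), bound)

payload = CombineFnPayload(bounded_sum, (), {"bound": 10})
restored = pickle.loads(pickle.dumps(payload))  # simulated runner round trip
assert restored.apply([4, 5, 6]) == 10          # the kwarg survived
```

The bug was the converse of this sketch: the proto representation kept only the CombineFn, so after the round trip `bound` would silently revert to its default.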
[jira] [Resolved] (BEAM-2906) Dataflow supports portable progress reporting
[ https://issues.apache.org/jira/browse/BEAM-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-2906. --- Resolution: Fixed Fix Version/s: 2.3.0 > Dataflow supports portable progress reporting > - > > Key: BEAM-2906 > URL: https://issues.apache.org/jira/browse/BEAM-2906 > Project: Beam > Issue Type: Sub-task > Components: runner-dataflow >Reporter: Henning Rohde >Assignee: Robert Bradshaw >Priority: Minor > Labels: portability > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-2906) Dataflow supports portable progress reporting
[ https://issues.apache.org/jira/browse/BEAM-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-2906: - Assignee: Robert Bradshaw > Dataflow supports portable progress reporting > - > > Key: BEAM-2906 > URL: https://issues.apache.org/jira/browse/BEAM-2906 > Project: Beam > Issue Type: Sub-task > Components: runner-dataflow >Reporter: Henning Rohde >Assignee: Robert Bradshaw >Priority: Minor > Labels: portability > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3490) Reasonable Python direct runner batch performance.
Robert Bradshaw created BEAM-3490: - Summary: Reasonable Python direct runner batch performance. Key: BEAM-3490 URL: https://issues.apache.org/jira/browse/BEAM-3490 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Robert Bradshaw The plan is to migrate to the FnApi Runner for batch workloads. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-3419) Enable iterable side input for beam runners.
Robert Bradshaw created BEAM-3419: - Summary: Enable iterable side input for beam runners. Key: BEAM-3419 URL: https://issues.apache.org/jira/browse/BEAM-3419 Project: Beam Issue Type: New Feature Components: runner-core Reporter: Robert Bradshaw Assignee: Kenneth Knowles -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-3393) Empty flattens cannot be used as side inputs.
Robert Bradshaw created BEAM-3393: - Summary: Empty flattens cannot be used as side inputs. Key: BEAM-3393 URL: https://issues.apache.org/jira/browse/BEAM-3393 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw Assignee: Ahmet Altay

{code}
with beam.Pipeline() as p:
  main = p | "CM" >> beam.Create([])
  side1 = p | "C1" >> beam.Create([])
  side2 = p | "C2" >> beam.Create([])
  side = (side1, side2) | beam.Flatten()
  res = main | beam.Map(lambda x, side: x, beam.pvalue.AsList(side))
{code}

results in

{code}
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/direct/evaluation_context.py", line 84, in add_values
    assert not view.has_result
{code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029)

[jira] [Resolved] (BEAM-3369) Python HEAD fails tests due to a ValueError
[ https://issues.apache.org/jira/browse/BEAM-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw resolved BEAM-3369. --- Resolution: Fixed Fix Version/s: 2.3.0 > Python HEAD fails tests due to a ValueError > > > Key: BEAM-3369 > URL: https://issues.apache.org/jira/browse/BEAM-3369 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: Not applicable >Reporter: Chamikara Jayalath >Assignee: Robert Bradshaw >Priority: Critical > Fix For: 2.3.0 > > > tfidf_test.py is failing. Full error message is given below. > ValueError: The key coder "TupleCoder[BytesCoder, FastPrimitivesCoder]" for > GroupByKey operation "GroupByKey" is not deterministic. This may result in > incorrect pipeline output. This can be fixed by adding a type hint to the > operation preceding the GroupByKey step, and for custom key classes, by > writing a deterministic custom Coder. Please see the documentation for more > details. > Seems to be due to following commit (passes without that). > https://github.com/apache/beam/commit/1acd1ae901eefbcc8249d90e12ca82db0f91e41e#commitcomment-26356503 > Robert, can you take a look ? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
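The hazard behind the ValueError above is that generic fallback coders such as FastPrimitivesCoder (which falls back to pickling for types it does not know) need not map equal values to equal byte strings, while GroupByKey requires exactly that of its key coder. A standalone illustration using plain pickle, independent of Beam:

```python
import pickle

# Two equal dicts with different insertion orders pickle to different bytes.
# A pickle-based key coder would therefore route "equal" keys to different
# groups, which is why Beam rejects non-deterministic key coders.
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}
assert d1 == d2
assert pickle.dumps(d1) != pickle.dumps(d2)
```

The fix suggested in the error message, adding a type hint before the GroupByKey, lets Beam pick a deterministic coder for the key type instead of the pickle fallback.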
[jira] [Assigned] (BEAM-2937) Fn API combiner support w/ lifting to PGBK
[ https://issues.apache.org/jira/browse/BEAM-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-2937: - Assignee: Robert Bradshaw > Fn API combiner support w/ lifting to PGBK > -- > > Key: BEAM-2937 > URL: https://issues.apache.org/jira/browse/BEAM-2937 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Henning Rohde >Assignee: Robert Bradshaw > Labels: portability > > The FnAPI should support this optimization. Detailed design TBD. > Once design is ready, expand subtasks similarly to BEAM-2822. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
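The optimization BEAM-2937 asks for, combiner lifting, can be sketched in plain Python (illustrative only; this is not Beam's API, and the function and parameter names are hypothetical): instead of shipping every element through the GroupByKey, each worker folds its elements into per-key accumulators before the shuffle, and only the accumulators cross the wire to be merged afterwards.

```python
from collections import defaultdict

def lifted_combine_per_key(shards, create, add, merge, extract):
    # Stage 1 (pre-GBK, one pass per worker/shard): fold raw elements into
    # per-key accumulators, so only accumulators reach the shuffle.
    partials = []
    for shard in shards:
        acc = defaultdict(create)
        for key, value in shard:
            acc[key] = add(acc[key], value)
        partials.append(acc)
    # Stage 2 (post-GBK): merge the per-shard accumulators for each key.
    merged = defaultdict(create)
    for acc in partials:
        for key, a in acc.items():
            merged[key] = merge(merged[key], a)
    return {key: extract(a) for key, a in merged.items()}

# Sum-per-key, with a plain int as the accumulator:
shards = [[("a", 1), ("b", 2)], [("a", 3)]]
result = lifted_combine_per_key(
    shards,
    create=int,               # zero accumulator
    add=lambda a, v: a + v,   # add one input element to an accumulator
    merge=lambda a, b: a + b, # merge two accumulators
    extract=lambda a: a,      # the accumulator is already the output
)
assert result == {"a": 4, "b": 2}
```

The lifted form is only valid because `merge` is associative and commutative, which is also what the Fn API support in this issue has to guarantee before applying the optimization.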