[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step
[ https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903135#comment-15903135 ] Eugene Kirpichov commented on BEAM-68: -- I think this needs a discussion on the Beam dev@ mailing list. We should have a general approach to annotating pipeline elements with runner-specific information. I don't think "annotate the pipeline and let runners ignore it" is a good approach, the main reason being that it would violate the abstraction boundary: a pipeline is first constructed in a runner-agnostic way (in fact while the runner is not even available), and only then run. Moreover, the set of all possible runner-specific annotations is not known in advance: while "step parallelism limit" seems relatively generic, suppose, say, Apex allowed you to set an "Apex frobnication level" parameter on a step - it would look pretty weird if the Beam pipeline API had a withApexFrobnicationLevel method on a ParDo, and it would introduce an illegal dependency from Beam to Apex. > Support for limiting parallelism of a step > -- > > Key: BEAM-68 > URL: https://issues.apache.org/jira/browse/BEAM-68 > Project: Beam > Issue Type: New Feature > Components: beam-model >Reporter: Daniel Halperin > > Users may want to limit the parallelism of a step. Two classic use cases are: > - User wants to produce at most k files, so sets > TextIO.Write.withNumShards(k). > - External API only supports k QPS, so user sets a limit of k/(expected > QPS/step) on the ParDo that makes the API call. > Unfortunately, there is no way to do this effectively within the Beam model. > A GroupByKey with exactly k keys will guarantee that only k elements are > produced, but runners are free to break fusion in ways that each element may > be processed in parallel later. > To implement this functionality, I believe we need to add this support to the > Beam Model. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
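The GroupByKey workaround mentioned in the issue description can be sketched outside Beam: assign each element one of exactly k keys, so that grouping can never yield more than k groups. A minimal plain-Java sketch (helper names are hypothetical, for illustration only):

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LimitParallelism {
    // Hypothetical helper: assign each element one of exactly k keys, so a
    // subsequent GroupByKey produces at most k groups (e.g. at most k files).
    static int shardKeyFor(Object element, int k) {
        // Any deterministic or random assignment works, as long as the
        // key space has size k; floorMod handles negative hash codes.
        return Math.floorMod(Objects.hashCode(element), k);
    }

    public static void main(String[] args) {
        int k = 3;
        Map<Integer, List<String>> groups = Stream.of("a", "b", "c", "d", "e")
                .collect(Collectors.groupingBy(e -> shardKeyFor(e, k)));
        // No more than k groups can ever be formed.
        System.out.println(groups.size() <= k); // prints "true"
    }
}
```

As the issue notes, this bounds the number of groups but not downstream parallelism: a runner may still break fusion after the grouping, which is why model-level support is requested.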
[jira] [Assigned] (BEAM-1422) ParDo should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1422: -- Assignee: Eugene Kirpichov (was: Davor Bonaci) > ParDo should comply with PTransform style guide > --- > > Key: BEAM-1422 > URL: https://issues.apache.org/jira/browse/BEAM-1422 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible > > Suggested changes: > - Get rid of the ParDo.Unbound and UnboundMulti classes completely > - Get rid of static methods such as withSideInputs/Outputs() - the only entry > point should be via ParDo.of(). Correspondingly, get rid of the non-static .of(). > - Rename ParDo.Bound and ParDo.BoundMulti respectively to simply ParDo and > ParDoWithSideOutputs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1678) Create MemcachedIO
[ https://issues.apache.org/jira/browse/BEAM-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905236#comment-15905236 ] Eugene Kirpichov commented on BEAM-1678: What would this transform / family of transforms do? > Create MemcachedIO > -- > > Key: BEAM-1678 > URL: https://issues.apache.org/jira/browse/BEAM-1678 > Project: Beam > Issue Type: Improvement > Components: sdk-java-extensions >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1679) create gRPC IO
[ https://issues.apache.org/jira/browse/BEAM-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905227#comment-15905227 ] Eugene Kirpichov commented on BEAM-1679: Discussion currently happening on the mailing list: https://lists.apache.org/thread.html/72bd291cb144c557cb485defcd3c3aaf05e3014da6317395dbeb6de5@%3Cuser.beam.apache.org%3E > create gRPC IO > -- > > Key: BEAM-1679 > URL: https://issues.apache.org/jira/browse/BEAM-1679 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Borisa Zivkovic >Assignee: Jean-Baptiste Onofré > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1581) JsonIO
[ https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905240#comment-15905240 ] Eugene Kirpichov commented on BEAM-1581: I would recommend preserving the decomposition and making this (I suppose FileBased?) source and sink use collections of Strings representing the JSON objects, abstracting only the way a sequence of JSON literals is represented in a file (is it one per line, are they all in one huge JSON array, are they delimited by an empty line, etc.?). The user can then choose which JSON library to use to generate or parse them. > JsonIO > -- > > Key: BEAM-1581 > URL: https://issues.apache.org/jira/browse/BEAM-1581 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Aviem Zur >Assignee: Aviem Zur > > A new IO (with source and sink) which will read/write Json files. > Similar to {{XmlSink}}. > Consider using methods/code (or refactor these) found in {{AsJsons}} and > {{ParseJsons}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1581) JsonIO
[ https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906648#comment-15906648 ] Eugene Kirpichov commented on BEAM-1581: [~aviemzur] - I think the XML source/sink are a bad design in retrospect - as far as I remember, they were both created mostly to demonstrate the whole idea of file-based sources/sinks, before the best practices for transform development shaped up, and they did not really come from unifying experience with an array of XML use cases either. We should not use them as API guidance. The suggestion to have an abstract JsonIO seems to contradict the recommendation from the PTransform Style Guide (see https://beam.apache.org/contribute/ptransform-style-guide/#injecting-user-specified-behavior) to use PTransform composition as an extensibility device whenever possible (instead of inheritance) - and that recommendation is specifically directed at cases like this; the better alternative is to return Strings and let the user compose it with a ParDo parsing the strings. [~eljefe6a] - "File as a self-contained JSON" means there's no JSON-specific logic; it's simply "File as a self-contained String" - we should definitely have that, but under a separate JIRA issue. Aviem / Jesse - could you perhaps come up with a list of common ways in which you have seen people store a collection of stuff in JSON file(s)? I think without that, or while keeping it implicit, we're acting blindly. Let's list all the known use cases and abstract upward from that. > JsonIO > -- > > Key: BEAM-1581 > URL: https://issues.apache.org/jira/browse/BEAM-1581 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Aviem Zur >Assignee: Aviem Zur > > A new IO (with source and sink) which will read/write Json files. > Similarly to {{XmlSource}}/{{XmlSink}}, this IO should have a > {{JsonSource}}/{{JsonSink}} which are a {{FileBasedSource}}/{{FileBasedSink}}. 
> Consider using methods/code (or refactor these) found in {{AsJsons}} and > {{ParseJsons}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
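The composition-over-inheritance approach described in the comments (the IO yields raw JSON strings; the user composes a parsing step) can be sketched with plain-Java stand-ins. Names and the one-object-per-line layout are illustrative assumptions, not the actual API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class JsonLinesSketch {
    // The "IO" half: knows only how JSON literals are laid out in the file
    // (assumed here: one object per line) and yields them as raw Strings.
    static List<String> readJsonLines(String fileContents) {
        return Arrays.stream(fileContents.split("\n"))
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    // The user's half: any JSON library they like, applied via composition
    // (the ParDo analogue), rather than via subclassing the IO.
    static <T> List<T> parseWith(List<String> rawJson, Function<String, T> parser) {
        return rawJson.stream().map(parser).collect(Collectors.toList());
    }
}
```

In a pipeline, `parseWith` corresponds to a `MapElements`/`ParDo` the user attaches after the read; the IO itself never depends on a particular JSON library.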
[jira] [Commented] (BEAM-1678) Create MemcachedIO
[ https://issues.apache.org/jira/browse/BEAM-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905722#comment-15905722 ] Eugene Kirpichov commented on BEAM-1678: Note that all IOs are families of PTransforms (usually one or two - read/write, sometimes more), which is why I referred to this as "transforms". I agree that support of the memcached protocol by many non-cache systems for write makes it reasonable for Beam to include a library for writing things to it - basically something like a PTransform<PCollection<KV<String, V>>, PDone> that writes them to a given memcached-compatible endpoint. Is this all, or do you have something more in mind? > Create MemcachedIO > -- > > Key: BEAM-1678 > URL: https://issues.apache.org/jira/browse/BEAM-1678 > Project: Beam > Issue Type: Improvement > Components: sdk-java-extensions >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
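For the write transform sketched in the comment, the per-element work would boil down to issuing memcached-protocol `set` commands. A minimal sketch of just the command formatting, following the standard memcached text protocol (`set <key> <flags> <exptime> <bytes>`); connection handling, pooling, and the Beam transform wrapper are omitted, and the class name is hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class MemcachedSetCommand {
    // Format one memcached text-protocol "set" command for a key-value pair.
    // Layout: "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n", flags fixed to 0.
    static byte[] setCommand(String key, byte[] value, int expirySeconds) {
        String header = String.format("set %s 0 %d %d\r\n", key, expirySeconds, value.length);
        byte[] h = header.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[h.length + value.length + 2];
        System.arraycopy(h, 0, out, 0, h.length);
        System.arraycopy(value, 0, out, h.length, value.length);
        // Terminate the data block with CRLF, as the protocol requires.
        out[out.length - 2] = '\r';
        out[out.length - 1] = '\n';
        return out;
    }
}
```

A real write transform would batch such commands per bundle and flush them over a client connection in {{@FinishBundle}}.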
[jira] [Created] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish
Eugene Kirpichov created BEAM-1712: -- Summary: TestPipeline.run doesn't actually waitUntilFinish Key: BEAM-1712 URL: https://issues.apache.org/jira/browse/BEAM-1712 Project: Beam Issue Type: Bug Components: runner-flink, sdk-java-core, testing Reporter: Eugene Kirpichov Assignee: Stas Levin Priority: Blocker https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124 - it calls waitUntilFinish() only if 1) run() wasn't called and 2) enableAutoRunIfMissing is true. However, in practice both of these are false. 1) run() is called in most tests. So effectively, if you call .run() at all, then this thing doesn't call waitUntilFinish(). 2) enableAutoRunIfMissing is set to true only via TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit test. This means that, for all tests that use TestPipeline, if the test waits until finish, it's only because of the grace of the particular runner - which is really bad. We're lucky because in practice TestDataflowRunner, TestApexRunner and TestSparkRunner themselves call waitUntilFinish() inside run(). However, TestFlinkRunner doesn't - i.e. there might currently be tests that actually fail on the Flink runner, undetected. The proper fix is to change TestPipeline to always call waitUntilFinish(). Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to make sure Flink is safe. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish
[ https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923329#comment-15923329 ] Eugene Kirpichov commented on BEAM-1712: OK, seems like that is in fact what happens. So this bug then is currently NOT a release blocker because, for various reasons, all test runners including Flink actually wait until finish even though TestPipeline doesn't ask them to. > TestPipeline.run doesn't actually waitUntilFinish > - > > Key: BEAM-1712 > URL: https://issues.apache.org/jira/browse/BEAM-1712 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-java-core, testing >Reporter: Eugene Kirpichov >Assignee: Stas Levin >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish
[ https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov updated BEAM-1712: --- Priority: Major (was: Blocker) > TestPipeline.run doesn't actually waitUntilFinish > - > > Key: BEAM-1712 > URL: https://issues.apache.org/jira/browse/BEAM-1712 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-java-core, testing >Reporter: Eugene Kirpichov >Assignee: Stas Levin > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish
[ https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923326#comment-15923326 ] Eugene Kirpichov commented on BEAM-1712: An optimistic interpretation is probably that Flink runner in batch mode simply waits in .run(). Not sure about streaming mode then. > TestPipeline.run doesn't actually waitUntilFinish > - > > Key: BEAM-1712 > URL: https://issues.apache.org/jira/browse/BEAM-1712 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-java-core, testing >Reporter: Eugene Kirpichov >Assignee: Stas Levin >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish
[ https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923351#comment-15923351 ] Eugene Kirpichov commented on BEAM-1712: As an intermediate step to unifying everything under TestPipeline calling waitUntilFinish, it'd be nice to make the current code less confusing at least: - TestPipeline should remove code that pretends that it calls waitUntilFinish - Test runners (TestBlahRunner) should document that they wait themselves - Tests should not manually call waitUntilFinish I.e. the responsibility should be centralized in one place; right now this place can only be the individual test runners, and eventually it will be TestPipeline. > TestPipeline.run doesn't actually waitUntilFinish > - > > Key: BEAM-1712 > URL: https://issues.apache.org/jira/browse/BEAM-1712 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-java-core, testing >Reporter: Eugene Kirpichov >Assignee: Stas Levin > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish
[ https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923321#comment-15923321 ] Eugene Kirpichov commented on BEAM-1712: https://github.com/apache/beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkRunnerResult.java#L86 seems to imply that Flink tests with PAssert are and always have been vacuous?... [~aljoscha] WDYT? > TestPipeline.run doesn't actually waitUntilFinish > - > > Key: BEAM-1712 > URL: https://issues.apache.org/jira/browse/BEAM-1712 > Project: Beam > Issue Type: Bug > Components: runner-flink, sdk-java-core, testing >Reporter: Eugene Kirpichov >Assignee: Stas Levin >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1573) KafkaIO does not allow using Kafka serializers and deserializers
[ https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1589#comment-1589 ] Eugene Kirpichov commented on BEAM-1573: Yes, this would be a backward-incompatible change. We are sort of fine with making such changes if there is a good reason, while Beam has not yet declared a stable API (i.e. before the first 1.x release), and this would be one of a family of changes bringing Beam into accordance with its own style guide - which we consider a necessary evil. We already removed coders from TextIO as part of that. So yes, please feel free to proceed, and I'll be happy to review the PR. > KafkaIO does not allow using Kafka serializers and deserializers > > > Key: BEAM-1573 > URL: https://issues.apache.org/jira/browse/BEAM-1573 > Project: Beam > Issue Type: Improvement > Components: sdk-java-extensions >Affects Versions: 0.4.0, 0.5.0 >Reporter: peay >Assignee: Raghu Angadi >Priority: Minor > > KafkaIO does not allow overriding the serializer and deserializer settings > of the Kafka consumers and producers it uses internally. Instead, it allows > setting a `Coder`, and has a simple Kafka serializer/deserializer wrapper class > that calls the coder. > I appreciate that allowing the use of Beam coders is good and consistent with the > rest of the system. However, is there a reason to completely disallow > custom Kafka serializers? > This is a limitation when working with an Avro schema registry, for instance, > which requires custom serializers. One can write a `Coder` that wraps a > custom Kafka serializer, but that means two levels of unnecessary wrapping. > In addition, the `Coder` abstraction is not equivalent to Kafka's > `Serializer`, which gets the topic name as input. Using a `Coder` wrapper > would require duplicating the output topic setting in the argument to > `KafkaIO` and when building the wrapper, which is inelegant and error prone. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1573) KafkaIO does not allow using Kafka serializers and deserializers
[ https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896989#comment-15896989 ] Eugene Kirpichov commented on BEAM-1573: At a conference right now, but quick comment: yes, as Raghu said, we're getting rid of Coders as a general parsing mechanism. Use of coders for the purpose for which KafkaIO currently uses them is explicitly forbidden by the Beam PTransform Style Guide https://beam.apache.org/contribute/ptransform-style-guide/#coders . We should replace that with having KafkaIO return byte[] and having convenience utilities for deserializing these byte[] using Kafka deserializers, e.g. by wrapping the code Raghu posted as a utility in the kafka module (packaged, say, as a SerializableFunction). Raghu or @peay, perhaps consider sending a PR to fix this? It seems like it should be rather easy. Though it would merit a short discussion on d...@beam.apache.org first. > KafkaIO does not allow using Kafka serializers and deserializers > > > Key: BEAM-1573 > URL: https://issues.apache.org/jira/browse/BEAM-1573 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Affects Versions: 0.4.0, 0.5.0 >Reporter: peay >Assignee: Raghu Angadi >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
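The convenience utility suggested in the comment - adapting a Kafka-style Deserializer (which needs the topic name) into a serializable function over byte[] - can be sketched as follows. The nested interfaces here are plain-Java stand-ins for Beam's SerializableFunction and Kafka's Deserializer, with simplified signatures; all names are illustrative:

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class KafkaDeserializerWrapper {
    // Stand-in for Beam's SerializableFunction (illustrative).
    interface SerializableFunction<I, O> extends Serializable { O apply(I in); }

    // Stand-in for Kafka's Deserializer, which takes the topic name too.
    interface Deserializer<T> { T deserialize(String topic, byte[] data); }

    // The utility: fix the topic once, so downstream code only sees
    // a function from byte[] to the parsed type.
    static <T> SerializableFunction<byte[], T> wrap(Deserializer<T> d, String topic) {
        return bytes -> d.deserialize(topic, bytes);
    }

    public static void main(String[] args) {
        Deserializer<String> utf8 = (topic, data) -> new String(data, StandardCharsets.UTF_8);
        SerializableFunction<byte[], String> fn = wrap(utf8, "my-topic");
        System.out.println(fn.apply("hello".getBytes(StandardCharsets.UTF_8))); // prints "hello"
    }
}
```

This avoids the double wrapping the issue complains about: the topic is supplied once when the wrapper is built, not duplicated between KafkaIO and a Coder.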
[jira] [Closed] (BEAM-1416) Write transform should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1416. -- > Write transform should comply with PTransform style guide > - > > Key: BEAM-1416 > URL: https://issues.apache.org/jira/browse/BEAM-1416 > Project: Beam > Issue Type: Bug > Components: sdk-java-gcp >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible > Fix For: 0.6.0 > > > Suggested change: remove Bound class - Write itself should be the transform > class. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-1426) SortValues should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1426. -- > SortValues should comply with PTransform style guide > > > Key: BEAM-1426 > URL: https://issues.apache.org/jira/browse/BEAM-1426 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible > Fix For: 0.6.0 > > > Suggested changes: BufferedExternalSorter.Options should name its builder > methods .withBlah(), rather than .setBlah(). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1424) ToString should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889586#comment-15889586 ] Eugene Kirpichov commented on BEAM-1424: Actual backward-incompatible changes: kv renamed to kvs, iterable to iterables. > ToString should comply with PTransform style guide > -- > > Key: BEAM-1424 > URL: https://issues.apache.org/jira/browse/BEAM-1424 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible > Fix For: 0.6.0 > > > Suggested changes: > - Rename of() to elements(), kv() to kvs(), iterable() to iterables() for > consistency between each other and with other Beam similar transforms e.g. > Flatten.Iterables > - These methods should return respectively named transforms: > ToString.Elements, ToString.KVs, ToString.Iterables -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-1424) ToString should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1424. -- > ToString should comply with PTransform style guide > -- > > Key: BEAM-1424 > URL: https://issues.apache.org/jira/browse/BEAM-1424 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-849) Redesign PipelineResult API
[ https://issues.apache.org/jira/browse/BEAM-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890855#comment-15890855 ] Eugene Kirpichov commented on BEAM-849: --- I disagree that "unbounded pipelines" can't finish successfully. - The Dataflow runner supports draining of pipelines, which leads to successful termination. - It is possible to run a pipeline like Create.of(1, 2, 3) + ParDo(do nothing) using a streaming runner, and it should terminate rather than hang. - One might ask "why run such a pipeline with a streaming runner", but it makes a lot more sense if the ParDo is splittable. E.g. Create.of(filename) + ParDo(tail file) + ParDo(process records) could use the low-latency capabilities of a streaming runner, but successfully terminate when the file is somehow "finalized". As a more mundane example, tests in https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java should pass in streaming runners as well as batch runners. - "Unbounded pipeline" is in general not a Beam concept - we should have a batch/streaming-agnostic meaning of "finished" in waitUntilFinish(). I propose the one that the Dataflow runner uses for deciding when a drain is complete: "all watermarks have progressed to infinity". > Redesign PipelineResult API > --- > > Key: BEAM-849 > URL: https://issues.apache.org/jira/browse/BEAM-849 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Pei He > > Current state: > Jira https://issues.apache.org/jira/browse/BEAM-443 addresses > waitUntilFinish() and cancel(). > However, there is additional work around PipelineResult: > need a clearly defined contract and verification across all runners > need to revisit how to handle metrics/aggregators > need to be able to get logs -- This message was sent by Atlassian JIRA (v6.3.15#6346)
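The proposed batch/streaming-agnostic definition of "finished" ("all watermarks have progressed to infinity") is easy to state as a predicate. A toy model, using a hypothetical end-of-time sentinel value:

```java
import java.util.Collection;
import java.util.List;

public class DrainTermination {
    // Hypothetical sentinel for "watermark has advanced to end of time".
    static final long WATERMARK_INFINITY = Long.MAX_VALUE;

    // A pipeline - batch or streaming alike - counts as finished exactly
    // when every step's watermark has progressed to infinity.
    static boolean isFinished(Collection<Long> stepWatermarks) {
        return stepWatermarks.stream().allMatch(w -> w == WATERMARK_INFINITY);
    }

    public static void main(String[] args) {
        System.out.println(isFinished(List.of(WATERMARK_INFINITY, WATERMARK_INFINITY))); // prints "true"
        System.out.println(isFinished(List.of(WATERMARK_INFINITY, 123L)));               // prints "false"
    }
}
```

Under this definition, a drained streaming job and a completed batch job terminate by the same criterion, which is what makes waitUntilFinish() meaningful for both.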
[jira] [Commented] (BEAM-1562) Use a "signal" to stop streaming tests as they finish.
[ https://issues.apache.org/jira/browse/BEAM-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890861#comment-15890861 ] Eugene Kirpichov commented on BEAM-1562: See my comment on BEAM-849: as an alternative, we can use "watermarks progressed to infinity" as a termination signal. > Use a "signal" to stop streaming tests as they finish. > -- > > Key: BEAM-1562 > URL: https://issues.apache.org/jira/browse/BEAM-1562 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Amit Sela >Assignee: Amit Sela > > Streaming tests use a timeout that has to include a large enough buffer to prevent > slow runtimes from stopping before the test completes. > We can introduce a "poison pill" based on an {{EndOfStream}} element that > would be counted in a Metrics/Aggregator to know all data was processed. > Another option would be to follow the slowest WM and stop once it hits > end-of-time. > This requires the Spark runner to stop blocking the execution of a streaming > pipeline until it's complete - which is relevant to the > {{PipelineResult}} (re)design being discussed in BEAM-849 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1578) Runners should put PT overrides into a list rather than map
Eugene Kirpichov created BEAM-1578: -- Summary: Runners should put PT overrides into a list rather than map Key: BEAM-1578 URL: https://issues.apache.org/jira/browse/BEAM-1578 Project: Beam Issue Type: Bug Components: runner-dataflow, runner-direct Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov Priority: Minor It was not clear to me from the code that the order of overrides is important. A Map is not the best data structure for this, especially since no key lookups happen. Referring to https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java and https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java and possibly others. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1594) JOB_STATE_DRAINED should be terminal for waitUntilFinish()
Eugene Kirpichov created BEAM-1594: -- Summary: JOB_STATE_DRAINED should be terminal for waitUntilFinish() Key: BEAM-1594 URL: https://issues.apache.org/jira/browse/BEAM-1594 Project: Beam Issue Type: Bug Components: runner-dataflow Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov Dataflow runner supports pipeline draining for streaming pipelines, which stops consuming input and waits for all watermarks to progress to infinity and then declares the pipeline complete. When watermarks progress to infinity after a drain request, the Dataflow runner will mark the job DRAINED (In the future, when watermarks progress to infinity without a drain request, it will mark the job DONE instead). JOB_STATE_DRAINED is logically a terminal state - no more data will be processed and the pipeline's resources are torn down - and waitUntilFinish() should terminate if the job enters this state. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
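The fix amounts to treating DRAINED like the other terminal states. A minimal sketch of that check (hypothetical enum modeled on Dataflow's job states; not the actual PipelineResult.State API):

```java
public class JobStates {
    enum State { RUNNING, CANCELLING, DRAINING, DONE, FAILED, CANCELLED, DRAINED }

    // DRAINED is logically terminal: no more data will be processed and the
    // pipeline's resources are torn down, so waitUntilFinish() should return
    // once the job reaches it, just as for DONE / FAILED / CANCELLED.
    static boolean isTerminal(State s) {
        switch (s) {
            case DONE:
            case FAILED:
            case CANCELLED:
            case DRAINED:
                return true;
            default:
                return false;
        }
    }
}
```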
[jira] [Assigned] (BEAM-1427) BigQueryIO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1427: -- Assignee: Eugene Kirpichov > BigQueryIO should comply with PTransform style guide > > > Key: BEAM-1427 > URL: https://issues.apache.org/jira/browse/BEAM-1427 > Project: Beam > Issue Type: Bug > Components: sdk-java-gcp >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible > > Suggested changes: > - Remove Read/Write.Bound classes - Read and Write themselves should be the > transform classes > - Remove static builder-like .withBlah() methods > - (optional) use AutoValue -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1915) Remove OldDoFn dependency in ApexGroupByKeyOperator
[ https://issues.apache.org/jira/browse/BEAM-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961586#comment-15961586 ] Eugene Kirpichov commented on BEAM-1915: It happens to also be the last occurrence in the Beam repository. It is also used in the (non-opensourced) Dataflow worker, but we (Dataflow team) can just make a copy there and take care of it on our side - other than that, after this JIRA, OldDoFn can be deleted from Beam altogether. > Remove OldDoFn dependency in ApexGroupByKeyOperator > --- > > Key: BEAM-1915 > URL: https://issues.apache.org/jira/browse/BEAM-1915 > Project: Beam > Issue Type: Task > Components: runner-apex >Reporter: Thomas Weise >Assignee: Eugene Kirpichov >Priority: Minor > > It is the last remaining occurrence in the ApexRunner. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1915) Remove OldDoFn dependency in ApexGroupByKeyOperator
[ https://issues.apache.org/jira/browse/BEAM-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961587#comment-15961587 ] Eugene Kirpichov commented on BEAM-1915: cc: [~kenn] > Remove OldDoFn dependency in ApexGroupByKeyOperator > --- > > Key: BEAM-1915 > URL: https://issues.apache.org/jira/browse/BEAM-1915 > Project: Beam > Issue Type: Task > Components: runner-apex >Reporter: Thomas Weise >Assignee: Eugene Kirpichov >Priority: Minor > > It is the last remaining occurrence in the ApexRunner. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1910) test_using_slow_impl very flaky locally
[ https://issues.apache.org/jira/browse/BEAM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961590#comment-15961590 ] Eugene Kirpichov commented on BEAM-1910: Got the error again. == FAIL: test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) -- Traceback (most recent call last): File "/usr/local/google/home/kirpichov/incubator-beam/sdks/python/apache_beam/coders/slow_coders_test.py", line 41, in test_using_slow_impl import apache_beam.coders.stream AssertionError: ImportError not raised > test_using_slow_impl very flaky locally > --- > > Key: BEAM-1910 > URL: https://issues.apache.org/jira/browse/BEAM-1910 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Eugene Kirpichov >Assignee: Ahmet Altay > > Most times this test fails on my machine when running: > mvn verify -am -T 1C > test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL > ... > ___ summary > > ERROR: docs: commands failed > lint: commands succeeded > ERROR: py27: commands failed > py27cython: commands succeeded > py27gcp: commands succeeded > [ERROR] Command execution failed. 
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 > (Exit value: 1) > at > org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) > at > org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) > at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) > at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) > at > org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185) > at > org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Unfortunately the test doesn't print anything to maven output, so I don't > know what went wrong. I also don't know how to rerun the individual test > myself. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1910) test_using_slow_impl very flaky locally
[ https://issues.apache.org/jira/browse/BEAM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961592#comment-15961592 ] Eugene Kirpichov commented on BEAM-1910: I guess I also would like a way to run mvn verify but skip the python tests - they are not helpful when I'm verifying a Java-only PR. > test_using_slow_impl very flaky locally > --- > > Key: BEAM-1910 > URL: https://issues.apache.org/jira/browse/BEAM-1910 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Eugene Kirpichov >Assignee: Ahmet Altay > > Most times this test fails on my machine when running: > mvn verify -am -T 1C > test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL > ... > ___ summary > > ERROR: docs: commands failed > lint: commands succeeded > ERROR: py27: commands failed > py27cython: commands succeeded > py27gcp: commands succeeded > [ERROR] Command execution failed. > org.apache.commons.exec.ExecuteException: Process exited with an error: 1 > (Exit value: 1) > at > org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) > at > org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) > at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) > at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) > at > 
org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185) > at > org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Unfortunately the test doesn't print anything to maven output, so I don't > know what went wrong. I also don't know how to rerun the individual test > myself. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-1903) Splittable DoFn should report watermarks via ProcessContext
[ https://issues.apache.org/jira/browse/BEAM-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1903. -- Resolution: Fixed Fix Version/s: First stable release > Splittable DoFn should report watermarks via ProcessContext > --- > > Key: BEAM-1903 > URL: https://issues.apache.org/jira/browse/BEAM-1903 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Fix For: First stable release > > > See design document: > https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-1904) Remove DoFn.ProcessContinuation
[ https://issues.apache.org/jira/browse/BEAM-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1904. -- Resolution: Fixed Fix Version/s: First stable release This is effectively done in https://github.com/apache/beam/pull/2455 ; the only thing remaining is to physically remove the class, but I'll do it in a separate Dataflow worker dance. > Remove DoFn.ProcessContinuation > --- > > Key: BEAM-1904 > URL: https://issues.apache.org/jira/browse/BEAM-1904 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Fix For: First stable release > > > Design document: > https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1913) TFRecordIO should comply with PTransform style guide
Eugene Kirpichov created BEAM-1913: -- Summary: TFRecordIO should comply with PTransform style guide Key: BEAM-1913 URL: https://issues.apache.org/jira/browse/BEAM-1913 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java violates a few guidelines from https://beam.apache.org/contribute/ptransform-style-guide/ : - Use of Bound and Unbound types: the Read and Write classes themselves should be PTransforms - Should have one static entry point per verb - read() and write() - Both classes could benefit from AutoValue Basically, perform surgery similar to https://github.com/apache/beam/pull/2149, but on a smaller scale since this is a much smaller connector. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
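The shape the style guide asks for can be sketched without the Beam SDK (hypothetical class and method names; real TFRecordIO would extend PTransform and likely use AutoValue rather than hand-written immutability):

```java
// Sketch of the style-guide shape: one static entry point per verb, and the
// Read class itself is the transform -- no nested Read.Bound type.
public class TFRecordIOSketch {
    public static Read read() {
        return new Read(null);
    }

    public static final class Read {
        private final String filepattern;

        private Read(String filepattern) {
            this.filepattern = filepattern;
        }

        // Builder-style methods return a new immutable instance instead of
        // mutating, so partially-configured transforms can be shared safely.
        public Read from(String filepattern) {
            return new Read(filepattern);
        }

        public String getFilepattern() {
            return filepattern;
        }
    }
}
```

Usage then reads as a single fluent chain: `TFRecordIOSketch.read().from("gs://bucket/records-*")`.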
[jira] [Assigned] (BEAM-1428) KinesisIO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1428: -- Assignee: (was: Davor Bonaci) > KinesisIO should comply with PTransform style guide > --- > > Key: BEAM-1428 > URL: https://issues.apache.org/jira/browse/BEAM-1428 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Eugene Kirpichov > Labels: backward-incompatible, starter > > Suggested changes: > - KinesisIO.Read should be a PTransform itself > - It should have builder methods .withBlah() for setting the parameters, > instead of the current somewhat strange combination of the from() factory > methods and the using() methods -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-644) Primitive to shift the watermark while assigning timestamps
[ https://issues.apache.org/jira/browse/BEAM-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957959#comment-15957959 ] Eugene Kirpichov commented on BEAM-644: --- I don't think there's an SDF-specific issue here: it applies just as well to a non-splittable DoFn that would take a filename as input, and potentially produce elements whose timestamps are behind the timestamp of the input element (filename). Speaking of updates - no, I don't think there've been any updates. Perhaps in practice this can be solved by feeding the SDF elements with a timestamp of "infinite past", and from then on, relying on the SDF's own output watermark reporting. > Primitive to shift the watermark while assigning timestamps > --- > > Key: BEAM-644 > URL: https://issues.apache.org/jira/browse/BEAM-644 > Project: Beam > Issue Type: New Feature > Components: beam-model, beam-model-runner-api >Reporter: Kenneth Knowles > > There is a general need, especially important in the presence of > SplittableDoFn, to be able to assign new timestamps to elements without > making them late or droppable. > - DoFn.withAllowedTimestampSkew is inadequate, because it simply allows one > to produce late data, but does not allow one to shift the watermark so the > new data is on-time. > - For a SplittableDoFn, one may receive an element such as the name of a log > file that contains elements for the day preceding the log file. The timestamp > on the filename must currently be the beginning of the log. If such elements > are constantly flowing, it may be OK, but since we don't know that the element is > coming, in the absence of data the watermark may advance. We need a way to > keep it far enough back even in the absence of data holding it back. > One idea is a new primitive ShiftWatermark / AdjustTimestamps with the > following pieces: > - A constant duration (positive or negative) D by which to shift the > watermark.
> - A function from TimestampedElement to new timestamp that is >= t + D > So, for example, AdjustTimestamps(<-60 minutes>, f) would allow f to make > timestamps up to 60 minutes earlier. > With this primitive added, outputWithTimestamp and withAllowedTimestampSkew > could be removed, simplifying DoFn. > Alternatively, all of this functionality could be bolted on to DoFn. > This ticket is not a proposal, but a record of the issue and ideas that were > mentioned. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
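The contract of the proposed AdjustTimestamps(D, f) primitive - every new timestamp must satisfy t' >= t + D - can be modeled in plain Java (hypothetical names; this is a sketch of the invariant described in the ticket, not a Beam API):

```java
import java.util.function.LongUnaryOperator;

public class AdjustTimestampsSketch {
    // Model of the proposed primitive: the watermark is shifted by a constant
    // D (possibly negative, here in epoch millis), and each element's
    // timestamp may move to any value >= t + D, so the adjusted data is still
    // on-time relative to the shifted watermark.
    static long adjust(long t, long shiftMillis, LongUnaryOperator f) {
        long adjusted = f.applyAsLong(t);
        if (adjusted < t + shiftMillis) {
            throw new IllegalArgumentException("new timestamp must be >= t + D");
        }
        return adjusted;
    }
}
```

For example, AdjustTimestamps(<-60 minutes>, f) admits an f that moves timestamps 30 minutes earlier, but rejects one that moves them back further than the 60-minute shift allows.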
[jira] [Created] (BEAM-1910) test_using_slow_impl very flaky locally
Eugene Kirpichov created BEAM-1910: -- Summary: test_using_slow_impl very flaky locally Key: BEAM-1910 URL: https://issues.apache.org/jira/browse/BEAM-1910 Project: Beam Issue Type: Bug Components: sdk-py Reporter: Eugene Kirpichov Assignee: Ahmet Altay Most times this test fails on my machine when running: mvn verify -am -T 1C test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL ... ___ summary ERROR: docs: commands failed lint: commands succeeded ERROR: py27: commands failed py27cython: commands succeeded py27gcp: commands succeeded [ERROR] Command execution failed. org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185) at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Unfortunately the test doesn't print anything to maven output, so I don't know what went wrong. I also don't know how to rerun the individual test myself. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1581) JSON source and sink
[ https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961285#comment-15961285 ] Eugene Kirpichov commented on BEAM-1581: Yeah, JacksonIO sounds like a good idea. It also carries less risk that users will come to rely on it as "the" way to use JSON from Beam while we introduce more ways to use it with different APIs. > JSON source and sink > > > Key: BEAM-1581 > URL: https://issues.apache.org/jira/browse/BEAM-1581 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Aviem Zur >Assignee: Aviem Zur > > JSON source and sink to read/write JSON files. > Similarly to {{XmlSource}}/{{XmlSink}}, these should be a {{JsonSource}}/{{JsonSink}} > which are a {{FileBasedSource}}/{{FileBasedSink}}. > Consider using methods/code (or refactor these) found in {{AsJsons}} and > {{ParseJsons}} > The {{PCollection}} of objects the user passes to the transform should be > embedded in a valid JSON file > The most common pattern for this is a large object with an array member which > holds all the data objects and other members for metadata. > Examples of public JSON APIs: https://www.sitepoint.com/10-example-json-files/ > Another pattern used is a file which is simply a JSON array of objects. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
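The file shape the ticket describes - a top-level object whose array member holds the data objects - can be illustrated in a few lines (hypothetical helper; the real sink would write elements incrementally via FileBasedSink rather than joining strings in memory):

```java
import java.util.List;

public class JsonFileShape {
    // Illustration of the described pattern: a single top-level object whose
    // "items" array holds the PCollection's elements, with room for sibling
    // metadata members. The element strings are assumed to already be valid
    // JSON objects (e.g. produced by something like AsJsons).
    static String wrap(List<String> jsonObjects) {
        return "{\"items\":[" + String.join(",", jsonObjects) + "]}";
    }
}
```

So `wrap(List.of("{\"a\":1}", "{\"b\":2}"))` produces the single valid JSON document `{"items":[{"a":1},{"b":2}]}`.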
[jira] [Updated] (BEAM-1913) TFRecordIO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov updated BEAM-1913: --- Labels: backward-incompatible starter (was: backwards-incompatible starter) > TFRecordIO should comply with PTransform style guide > > > Key: BEAM-1913 > URL: https://issues.apache.org/jira/browse/BEAM-1913 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov > Labels: backward-incompatible, starter > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java > violates a few guidelines from > https://beam.apache.org/contribute/ptransform-style-guide/ : > - Use of Bound and Unbound types: the Read and Write classes themselves > should be PTransform's > - Should have one static entry point per verb - read() and write() > - Both classes could benefit from AutoValue > Basically, perform similar surgery like in > https://github.com/apache/beam/pull/2149 but on smaller scale since this is a > much smaller connector. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-1914) XML IO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1914: -- Assignee: (was: Davor Bonaci) > XML IO should comply with PTransform style guide > > > Key: BEAM-1914 > URL: https://issues.apache.org/jira/browse/BEAM-1914 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov > Labels: backward-incompatible, starter > > Currently we have XmlSource and XmlSink in the Java SDK. They violate the > PTransform style guide in several respects: > - They should be grouped into an XmlIO class with read() and write() verbs, > like all the other similar connectors > - The source/sink classes should be made private or package-local > - Should get rid of XmlSink.Bound - XmlSink itself should inherit from > FileBasedSink > - Could optionally benefit from AutoValue > See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1914) XML IO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov updated BEAM-1914: --- Labels: backward-incompatible starter (was: ) > XML IO should comply with PTransform style guide > > > Key: BEAM-1914 > URL: https://issues.apache.org/jira/browse/BEAM-1914 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Davor Bonaci > Labels: backward-incompatible, starter > > Currently we have XmlSource and XmlSink in the Java SDK. They violate the > PTransform style guide in several respects: > - They should be grouped into an XmlIO class with read() and write() verbs, > like all the other similar connectors > - The source/sink classes should be made private or package-local > - Should get rid of XmlSink.Bound - XmlSink itself should inherit from > FileBasedSink > - Could optionally benefit from AutoValue > See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1414) CountingInput should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov updated BEAM-1414: --- Labels: backward-incompatible starter (was: backward-incompatible) > CountingInput should comply with PTransform style guide > --- > > Key: BEAM-1414 > URL: https://issues.apache.org/jira/browse/BEAM-1414 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Davor Bonaci > Labels: backward-incompatible, starter > Fix For: First stable release > > > Suggested changes: > - Rename the whole class and its inner transforms to sound more verb-like, > e.g.: GenerateRange.Bounded/Unbounded (as opposed to current > CountingInput.BoundedCountingInput) > - Provide a more unified API between bounded and unbounded cases: > GenerateRange.from(100) should return a GenerateRange.Unbounded; > GenerateRange.from(100).to(200) should return a GenerateRange.Bounded. They > both should accept a timestampFn. The unbounded one _should not_ have a > withMaxNumRecords builder - that's redundant with specifying the range. > - (optional) Use AutoValue -- This message was sent by Atlassian JIRA (v6.3.15#6346)
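The unified API proposed in the ticket can be sketched as follows (hypothetical names taken from the proposal itself; the real transforms would be PTransforms, and the sketch only models the builder chain, not element generation):

```java
// Sketch of the proposed shape: from(100) yields an unbounded generator,
// and chaining .to(200) turns it into a bounded one -- no separate
// withMaxNumRecords builder, since the range already bounds it.
public class GenerateRangeSketch {
    public static Unbounded from(long start) {
        return new Unbounded(start);
    }

    public static final class Unbounded {
        final long start;
        Unbounded(long start) { this.start = start; }

        public Bounded to(long end) {
            return new Bounded(start, end);
        }
    }

    public static final class Bounded {
        final long start, end;
        Bounded(long start, long end) { this.start = start; this.end = end; }

        public long count() { return end - start; }
    }
}
```

`GenerateRangeSketch.from(100).to(200)` thus describes exactly 100 elements, mirroring the proposed GenerateRange.from(100).to(200).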
[jira] [Assigned] (BEAM-1414) CountingInput should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1414: -- Assignee: (was: Davor Bonaci) > CountingInput should comply with PTransform style guide > --- > > Key: BEAM-1414 > URL: https://issues.apache.org/jira/browse/BEAM-1414 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov > Labels: backward-incompatible, starter > Fix For: First stable release > > > Suggested changes: > - Rename the whole class and its inner transforms to sound more verb-like, > e.g.: GenerateRange.Bounded/Unbounded (as opposed to current > CountingInput.BoundedCountingInput) > - Provide a more unified API between bounded and unbounded cases: > GenerateRange.from(100) should return a GenerateRange.Unbounded; > GenerateRange.from(100).to(200) should return a GenerateRange.Bounded. They > both should accept a timestampFn. The unbounded one _should not_ have a > withMaxNumRecords builder - that's redundant with specifying the range. > - (optional) Use AutoValue -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (BEAM-447) Stop referring to types with Bound/Unbound
[ https://issues.apache.org/jira/browse/BEAM-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947688#comment-15947688 ] Eugene Kirpichov edited comment on BEAM-447 at 4/7/17 9:46 PM: --- Remaining instances: TextIO, AvroIO (handled by https://github.com/apache/beam/pull/1927), XmlSink, TFRecordIO. was (Author: jkff): Remaining instances: TextIO, AvroIO (handled by https://github.com/apache/beam/pull/1927), Window (handled by a soon-to-be-sent-for-review PR of mine), XmlSink, TFRecordIO. > Stop referring to types with Bound/Unbound > -- > > Key: BEAM-447 > URL: https://issues.apache.org/jira/browse/BEAM-447 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Thomas Groh >Assignee: Davor Bonaci > Labels: backward-incompatible > Fix For: First stable release > > > Bounded and Unbounded are used to refer to PCollections, and the overlap is > confusing. These classes should be renamed to be both more specific (e.g. > ParDo.LackingDoFnSingleOutput, ParDo.SingleOutput, Window.AssignWindows) > which remove the overlap. > examples: > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L658 > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L868 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1914) XML IO should comply with PTransform style guide
Eugene Kirpichov created BEAM-1914: -- Summary: XML IO should comply with PTransform style guide Key: BEAM-1914 URL: https://issues.apache.org/jira/browse/BEAM-1914 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Davor Bonaci Currently we have XmlSource and XmlSink in the Java SDK. They violate the PTransform style guide in several respects: - They should be grouped into an XmlIO class with read() and write() verbs, like all the other similar connectors - The source/sink classes should be made private or package-local - Should get rid of XmlSink.Bound - XmlSink itself should inherit from FileBasedSink - Could optionally benefit from AutoValue See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-447) Stop referring to types with Bound/Unbound
[ https://issues.apache.org/jira/browse/BEAM-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-447. - Resolution: Duplicate I've filed separate JIRAs for the remaining sub-items. > Stop referring to types with Bound/Unbound > -- > > Key: BEAM-447 > URL: https://issues.apache.org/jira/browse/BEAM-447 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Thomas Groh >Assignee: Davor Bonaci > Labels: backward-incompatible > Fix For: First stable release > > > Bounded and Unbounded are used to refer to PCollections, and the overlap is > confusing. These classes should be renamed to be both more specific (e.g. > ParDo.LackingDoFnSingleOutput, ParDo.SingleOutput, Window.AssignWindows) > which remove the overlap. > examples: > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L658 > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L868 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1979) Outputting to an undeclared side output TupleTag should throw
Eugene Kirpichov created BEAM-1979: -- Summary: Outputting to an undeclared side output TupleTag should throw Key: BEAM-1979 URL: https://issues.apache.org/jira/browse/BEAM-1979 Project: Beam Issue Type: Bug Components: runner-dataflow Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov Currently some runners (perhaps all of them?), when you output to a side output tag that wasn't passed to ParDo.withOutputTags, simply drop the data on the floor (or increment some aggregators). Silent data dropping is never okay, and it's never okay to output to an undeclared tag. So instead this should throw. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
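A minimal sketch of the requested behavior (hypothetical names; in the real runners the declared tags come from ParDo.withOutputTags and the values are TupleTag-typed, not strings):

```java
import java.util.Set;

public class SideOutputCheck {
    // A runner keeping the set of declared output tags and throwing on an
    // undeclared one, instead of silently dropping the element on the floor.
    static void output(Set<String> declaredTags, String tag, Object value) {
        if (!declaredTags.contains(tag)) {
            throw new IllegalArgumentException("Undeclared output tag: " + tag);
        }
        // ... deliver value to the PCollection registered for this tag ...
    }
}
```

Output to a declared tag proceeds normally; output to any other tag fails loudly at the point of the bug rather than surfacing later as missing data.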
[jira] [Assigned] (BEAM-1913) TFRecordIO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1913: -- Assignee: Eugene Kirpichov > TFRecordIO should comply with PTransform style guide > > > Key: BEAM-1913 > URL: https://issues.apache.org/jira/browse/BEAM-1913 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible, starter > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java > violates a few guidelines from > https://beam.apache.org/contribute/ptransform-style-guide/ : > - Use of Bound and Unbound types: the Read and Write classes themselves > should be PTransform's > - Should have one static entry point per verb - read() and write() > - Both classes could benefit from AutoValue > Basically, perform similar surgery like in > https://github.com/apache/beam/pull/2149 but on smaller scale since this is a > much smaller connector. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-1914) XML IO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-1914: -- Assignee: Eugene Kirpichov > XML IO should comply with PTransform style guide > > > Key: BEAM-1914 > URL: https://issues.apache.org/jira/browse/BEAM-1914 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible, starter > > Currently we have XmlSource and XmlSink in the Java SDK. They violate the > PTransform style guide in several respects: > - They should be grouped into an XmlIO class with read() and write() verbs, > like all the other similar connectors > - The source/sink classes should be made private or package-local > - Should get rid of XmlSink.Bound - XmlSink itself should inherit from > FileBasedSink > - Could optionally benefit from AutoValue > See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-1913) TFRecordIO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1913. -- Resolution: Fixed Fix Version/s: First stable release > TFRecordIO should comply with PTransform style guide > > > Key: BEAM-1913 > URL: https://issues.apache.org/jira/browse/BEAM-1913 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Labels: backward-incompatible, starter > Fix For: First stable release > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java > violates a few guidelines from > https://beam.apache.org/contribute/ptransform-style-guide/ : > - Use of Bound and Unbound types: the Read and Write classes themselves > should be PTransform's > - Should have one static entry point per verb - read() and write() > - Both classes could benefit from AutoValue > Basically, perform similar surgery like in > https://github.com/apache/beam/pull/2149 but on smaller scale since this is a > much smaller connector. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1355) HDFS IO should comply with PTransform style guide
[ https://issues.apache.org/jira/browse/BEAM-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973593#comment-15973593 ] Eugene Kirpichov commented on BEAM-1355: Stephen - do you think then, perhaps, that we should delete HDFSFileSource/Sink for the first stable release, as soon as BEAM-2005 is done? > HDFS IO should comply with PTransform style guide > - > > Key: BEAM-1355 > URL: https://issues.apache.org/jira/browse/BEAM-1355 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Eugene Kirpichov >Assignee: Stephen Sisk > Labels: backward-incompatible > > https://github.com/apache/beam/tree/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs > does not comply with > https://beam.apache.org/contribute/ptransform-style-guide/ in a number of > ways: > - It is not packaged as a PTransform (should be: HDFSIO.Read,Write or > something like that) > - Should probably use AutoValue for specifying parameters > Stephen knows about the current state of HDFS IO. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1910) test_using_slow_impl very flaky locally
[ https://issues.apache.org/jira/browse/BEAM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973669#comment-15973669 ] Eugene Kirpichov commented on BEAM-1910: Vikas, any updates? This is heavily interfering with the ability to run mvn verify locally. > test_using_slow_impl very flaky locally > --- > > Key: BEAM-1910 > URL: https://issues.apache.org/jira/browse/BEAM-1910 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Eugene Kirpichov >Assignee: Ahmet Altay > > Most times this test fails on my machine when running: > mvn verify -am -T 1C > test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL > ... > ___ summary > > ERROR: docs: commands failed > lint: commands succeeded > ERROR: py27: commands failed > py27cython: commands succeeded > py27gcp: commands succeeded > [ERROR] Command execution failed. > org.apache.commons.exec.ExecuteException: Process exited with an error: 1 > (Exit value: 1) > at > org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404) > at > org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764) > at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711) > at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289) > at > org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) > at > org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) > at > org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) > at > org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185) > at > 
org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Unfortunately the test doesn't print anything to maven output, so I don't > know what went wrong. I also don't know how to rerun the individual test > myself. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106185#comment-16106185 ] Eugene Kirpichov commented on BEAM-2671: It's overwhelmingly likely the last one as I mentioned above. > CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark > runner > > > Key: BEAM-2671 > URL: https://issues.apache.org/jira/browse/BEAM-2671 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Jean-Baptiste Onofré > Fix For: 2.1.0, 2.2.0 > > > Error message: > Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: > Expected: iterable over [] in any order > but: Not matched: "late" -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105124#comment-16105124 ] Eugene Kirpichov commented on BEAM-2671: I'm not sure about bumping it to 2.2.0. It looks quite dangerous and I'd be hesitant to ignore it without checking how dangerous it actually is. Do we have other tests that verify that Spark runner does not incorrectly drop data in streaming pipelines? > CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark > runner > > > Key: BEAM-2671 > URL: https://issues.apache.org/jira/browse/BEAM-2671 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Jean-Baptiste Onofré > Fix For: 2.1.0, 2.2.0 > > > Error message: > Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: > Expected: iterable over [] in any order > but: Not matched: "late" -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111508#comment-16111508 ] Eugene Kirpichov commented on BEAM-1190: GCS object listing is strongly consistent now. However the issue still applies for other filesystems like S3. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
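The proposed fix amounts to a post-glob existence filter: stat every match and drop the ones that no longer exist. The sketch below uses java.nio as a stand-in for Beam's FileSystems layer; it is illustrative only, not FileBasedSource's actual code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

final class GlobFilter {
    // After expanding the glob, stat every match and drop entries that no
    // longer exist, so an eventually consistent listing cannot surface
    // already-deleted files to the reader.
    static List<Path> dropMissing(List<Path> globbed) {
        return globbed.stream().filter(Files::exists).collect(Collectors.toList());
    }
}
```

On a strongly consistent filesystem the filter is a no-op apart from the extra stat calls, so it only costs anything where the problem can actually occur.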
[jira] [Created] (BEAM-2716) AvroReader should refuse dynamic splits while in the last block
Eugene Kirpichov created BEAM-2716: -- Summary: AvroReader should refuse dynamic splits while in the last block Key: BEAM-2716 URL: https://issues.apache.org/jira/browse/BEAM-2716 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov Priority: Minor AvroReader is able to detect when it's in the last block: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L728 It could also use this information to avoid wastefully producing dynamic splits starting in the range of the current block. One way to do this would be to have OffsetRangeTracker have a "claim range" operation: claim range of [a, b) is, in terms of correctness, equivalent to claiming "a" (it checks whether "a" is within the range), but sets the last claimed position to "b" rather than "a", thus protecting more positions from being split away. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2716) AvroReader should refuse dynamic splits while in the last block
[ https://issues.apache.org/jira/browse/BEAM-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111671#comment-16111671 ] Eugene Kirpichov commented on BEAM-2716: I think this affects only Avro; other file-based formats either don't support dynamic splits or don't know block lengths upfront. > AvroReader should refuse dynamic splits while in the last block > --- > > Key: BEAM-2716 > URL: https://issues.apache.org/jira/browse/BEAM-2716 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov >Priority: Minor > > AvroReader is able to detect when it's in the last block: > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L728 > It could also use this information to avoid wastefully producing dynamic > splits starting in the range of the current block. > One way to do this would be to have OffsetRangeTracker have a "claim range" > operation: claim range of [a, b) is, in terms of correctness, equivalent to > claiming "a" (it checks whether "a" is within the range), but sets the last > claimed position to "b" rather than "a", thus protecting more positions from > being split away. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
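The proposed "claim range" operation can be modeled with a simplified toy tracker (invented names, not Beam's OffsetRangeTracker): claiming [a, b) validates only a, but records b as the last claimed position, so a dynamic split can no longer land inside the range.

```java
// Toy model of an offset range tracker with the proposed claim-range operation.
final class ToyOffsetRangeTracker {
    private final long start;
    private long stop;              // exclusive end; a dynamic split moves it down
    private long lastClaimed = -1;  // positions <= lastClaimed can no longer be split away

    ToyOffsetRangeTracker(long start, long stop) {
        this.start = start;
        this.stop = stop;
    }

    // Ordinary claim: checks that `pos` is in range and protects it.
    boolean tryClaim(long pos) {
        if (pos < start || pos >= stop) return false;
        lastClaimed = pos;
        return true;
    }

    // Claim range [from, to): correctness-wise equivalent to claiming `from`,
    // but records `to` as the last claimed position, shielding the whole
    // range from dynamic splits.
    boolean tryClaimRange(long from, long to) {
        if (!tryClaim(from)) return false;
        lastClaimed = to;
        return true;
    }

    // A dynamic split may only truncate the range strictly after lastClaimed.
    boolean trySplitAt(long pos) {
        if (pos <= lastClaimed || pos <= start || pos >= stop) return false;
        stop = pos;
        return true;
    }
}
```

With this in place, an Avro reader entering its last block would claim the block's whole byte range up front, and any split proposed inside that range would simply be refused.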
[jira] [Closed] (BEAM-2670) ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2670. -- Resolution: Fixed > ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark > runner > -- > > Key: BEAM-2670 > URL: https://issues.apache.org/jira/browse/BEAM-2670 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Eugene Kirpichov >Priority: Minor > Fix For: 2.1.0, 2.2.0 > > > Tests succeed when run alone but fail when the whole ParDoTest is run. May be > related to PipelineOptions reusing / not cleaning. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2712) SerializablePipelineOptions should not call FileSystems.setDefaultPipelineOptions.
Eugene Kirpichov created BEAM-2712: -- Summary: SerializablePipelineOptions should not call FileSystems.setDefaultPipelineOptions. Key: BEAM-2712 URL: https://issues.apache.org/jira/browse/BEAM-2712 Project: Beam Issue Type: Bug Components: runner-apex, runner-core, runner-flink, runner-spark Reporter: Eugene Kirpichov Assignee: Kenneth Knowles https://github.com/apache/beam/pull/3654 introduces SerializablePipelineOptions, which on deserialization calls FileSystems.setDefaultPipelineOptions. This is obviously problematic and racy in case the same process uses SerializablePipelineOptions with different filesystem-related options in them. The reason the PR does this is, Flink and Apex runners were already doing it in their respective SerializablePipelineOptions-like classes (being removed in the PR); and Spark wasn't but probably should have. I believe this is done for the sake of having the proper filesystem options automatically available on workers in all places where any kind of PipelineOptions are used. Instead, all 3 runners should pick a better place to initialize their workers, and explicitly call FileSystems.setDefaultPipelineOptions there. It would be even better if FileSystems.setDefaultPipelineOptions didn't exist at all, but that's a topic for a separate JIRA. CC'ing runner contributors [~aljoscha] [~aviemzur] [~thw] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (BEAM-2712) SerializablePipelineOptions should not call FileSystems.setDefaultPipelineOptions.
[ https://issues.apache.org/jira/browse/BEAM-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-2712: -- Assignee: (was: Kenneth Knowles) > SerializablePipelineOptions should not call > FileSystems.setDefaultPipelineOptions. > -- > > Key: BEAM-2712 > URL: https://issues.apache.org/jira/browse/BEAM-2712 > Project: Beam > Issue Type: Bug > Components: runner-apex, runner-core, runner-flink, runner-spark >Reporter: Eugene Kirpichov > > https://github.com/apache/beam/pull/3654 introduces > SerializablePipelineOptions, which on deserialization calls > FileSystems.setDefaultPipelineOptions. > This is obviously problematic and racy in case the same process uses > SerializablePipelineOptions with different filesystem-related options in them. > The reason the PR does this is, Flink and Apex runners were already doing it > in their respective SerializablePipelineOptions-like classes (being removed > in the PR); and Spark wasn't but probably should have. > I believe this is done for the sake of having the proper filesystem options > automatically available on workers in all places where any kind of > PipelineOptions are used. Instead, all 3 runners should pick a better place > to initialize their workers, and explicitly call > FileSystems.setDefaultPipelineOptions there. > It would be even better if FileSystems.setDefaultPipelineOptions didn't exist > at all, but that's a topic for a separate JIRA. > CC'ing runner contributors [~aljoscha] [~aviemzur] [~thw] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
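The race described above can be reduced to a toy example (hypothetical names, not Beam's classes): each deserialization writes to a process-wide default, so the last options object restored wins, regardless of which pipeline's code reads the default afterwards.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for the global default that FileSystems.setDefaultPipelineOptions mutates.
final class FileSystemDefaults {
    static final AtomicReference<String> DEFAULT = new AtomicReference<>();
}

final class SerializableOptions {
    private final String fsOptions;
    SerializableOptions(String fsOptions) { this.fsOptions = fsOptions; }

    // Stand-in for readObject(): restoring this object mutates global state,
    // which is exactly the problematic pattern the JIRA calls out.
    SerializableOptions deserialize() {
        FileSystemDefaults.DEFAULT.set(fsOptions);
        return this;
    }
}
```

Once two pipelines' options are restored in one process, whichever deserialized last silently owns the default, which is why the JIRA argues for explicit, per-runner worker initialization instead.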
[jira] [Created] (BEAM-2706) Create JdbcIO.readAll()
Eugene Kirpichov created BEAM-2706: -- Summary: Create JdbcIO.readAll() Key: BEAM-2706 URL: https://issues.apache.org/jira/browse/BEAM-2706 Project: Beam Issue Type: Bug Components: sdk-java-extensions Reporter: Eugene Kirpichov Assignee: Jean-Baptiste Onofré Similarly to TextIO.readAll(), AvroIO.readAll(), SpannerIO.readAll(), and in the general spirit of making connectors more dynamic: it would be nice to have JdbcIO.readAll() that reads a PCollection of queries, or perhaps better, a parameterized query with a PCollection of parameter values and a user callback for setting the parameters on a query based on the PCollection element. JB, as the author of JdbcIO, would you be interested in implementing this at some point? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
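The API shape floated above, a parameterized query plus a user callback that binds each element's parameters, could look roughly like the sketch below. Everything here is hypothetical: the real proposal would set parameters on a java.sql.PreparedStatement, but plain strings stand in for statements so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadAllSketch {
    // User callback: given the query template and one input element, produce
    // the bound query. (In real JdbcIO this would bind PreparedStatement
    // parameters rather than format a string.)
    interface ParameterSetter<T> {
        String bind(String query, T element);
    }

    // readAll(): one query execution per element of the input collection.
    static <T> List<String> readAll(String query, List<T> elements, ParameterSetter<T> setter) {
        List<String> bound = new ArrayList<>();
        for (T e : elements) {
            bound.add(setter.bind(query, e));
        }
        return bound;
    }
}
```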
[jira] [Commented] (BEAM-1323) Add parallelism/splitting in JdbcIO
[ https://issues.apache.org/jira/browse/BEAM-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109644#comment-16109644 ] Eugene Kirpichov commented on BEAM-1323: Filed https://issues.apache.org/jira/browse/BEAM-2706 for the latter, and I suggest to close the current JIRA unless we have a compelling use case for which generic built-in splitting support in JdbcIO.read() is both practical and easier to use. > Add parallelism/splitting in JdbcIO > --- > > Key: BEAM-1323 > URL: https://issues.apache.org/jira/browse/BEAM-1323 > Project: Beam > Issue Type: Improvement > Components: sdk-java-extensions >Reporter: Jean-Baptiste Onofré >Assignee: Jean-Baptiste Onofré > > Now, the JDBC IO is basically a {{DoFn}} executed with a {{ParDo}}. So, it > means that parallelism is "limited" and executed on one executor. > We can imagine to create several JDBC {{BoundedSource}}s splitting the SQL > query in subset (for instance using row id paging or any "splitting/limit" > we can figure based on the original SQL query) (something similar to what > Sqoop is doing). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
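The Sqoop-style splitting idea in the description, carving one query into row-id sub-ranges, can be sketched as follows. QuerySplitter and its method are invented names for illustration; real code would quote identifiers and use bound parameters rather than string concatenation.

```java
import java.util.ArrayList;
import java.util.List;

final class QuerySplitter {
    // Split a row-id range into sub-queries that can be read in parallel.
    static List<String> splitByRowId(String table, long minId, long maxId, int splits) {
        List<String> queries = new ArrayList<>();
        // Ceiling of rangeSize / splits, so the last shard is never skipped.
        long span = (maxId - minId + splits) / splits;
        for (long lo = minId; lo <= maxId; lo += span) {
            long hi = Math.min(lo + span - 1, maxId);
            queries.add("SELECT * FROM " + table + " WHERE id BETWEEN " + lo + " AND " + hi);
        }
        return queries;
    }
}
```

This only works when a monotonic, roughly uniform key exists, which is part of why the comment favors the user-driven readAll() approach over generic built-in splitting.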
[jira] [Commented] (BEAM-2684) AmqpIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114002#comment-16114002 ] Eugene Kirpichov commented on BEAM-2684: As an intermediate step, perhaps, add a bunch of logging to the test, so that when someone else hits it on their PR, you have more information to work with? > AmqpIOTest is flaky > --- > > Key: BEAM-2684 > URL: https://issues.apache.org/jira/browse/BEAM-2684 > Project: Beam > Issue Type: Bug > Components: sdk-java-extensions >Reporter: Eugene Kirpichov >Assignee: Jean-Baptiste Onofré > > This test is often timing out, and has been doing that for a while, causing > unrelated PRs to fail randomly. I've gotten into the habit of excluding > sdks/java/io/amqp when running "mvn verify" and I suppose it's not a good > habit :) > Example failure: > https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/13424/console -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (BEAM-2768) Fix bigquery.WriteTables generating non-unique job identifiers
[ https://issues.apache.org/jira/browse/BEAM-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-2768: -- Assignee: Reuven Lax (was: Kenneth Knowles) > Fix bigquery.WriteTables generating non-unique job identifiers > -- > > Key: BEAM-2768 > URL: https://issues.apache.org/jira/browse/BEAM-2768 > Project: Beam > Issue Type: Bug > Components: beam-model >Affects Versions: 2.0.0 >Reporter: Matti Remes >Assignee: Reuven Lax > > This is a result of BigQueryIO not creating unique job ids for batch inserts, > thus BigQuery API responding with a 409 conflict error: > {code:java} > Request failed with code 409, will NOT retry: > https://www.googleapis.com/bigquery/v2/projects//jobs > {code} > The jobs are initiated in a step BatchLoads/SinglePartitionWriteTables, > called by the step's WriteTables ParDo: > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L511-L521 > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L148 > It would probably be a good idea to append a UUID as part of the job id. > Edit: This is a major bug blocking the use of BigQuery as a sink for bounded input. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2768) Fix bigquery.WriteTables generating non-unique job identifiers
[ https://issues.apache.org/jira/browse/BEAM-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127687#comment-16127687 ] Eugene Kirpichov commented on BEAM-2768: Could you tell me more about how you're using BigQueryIO.Write (it has many modes - it would be best if you could show a code snippet where you're applying BigQueryIO.write() in your pipeline, removing all personal data but at least exactly showing all BigQueryIO API methods you're using) and what exact version of Beam SDK you're using? Your links point to the master branch, but the bug description says 2.0.0 - these versions have very different implementations of BigQueryIO.Write. Looking at the current code, the job id *does* contain a random UUID that comes from https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L348. > Fix bigquery.WriteTables generating non-unique job identifiers > -- > > Key: BEAM-2768 > URL: https://issues.apache.org/jira/browse/BEAM-2768 > Project: Beam > Issue Type: Bug > Components: beam-model >Affects Versions: 2.0.0 >Reporter: Matti Remes >Assignee: Reuven Lax > > This is a result of BigQueryIO not creating unique job ids for batch inserts, > thus BigQuery API responding with a 409 conflict error: > {code:java} > Request failed with code 409, will NOT retry: > https://www.googleapis.com/bigquery/v2/projects//jobs > {code} > The jobs are initiated in a step BatchLoads/SinglePartitionWriteTables, > called by the step's WriteTables ParDo: > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L511-L521 > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L148 > It would probably be a good idea to append a UUID as part of the job id. 
> Edit: This is a major bug blocking the use of BigQuery as a sink for bounded input. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
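For reference, the uniqueness scheme discussed in this thread boils down to embedding a random UUID in each load-job id, roughly as sketched below. Method and prefix names are illustrative, not BigQueryIO's actual code.

```java
import java.util.UUID;

final class JobIds {
    // One random UUID per job id makes collisions (and the resulting
    // BigQuery 409 conflict errors) practically impossible, even when the
    // same logical load is submitted more than once.
    static String uniqueLoadJobId(String prefix, int partition) {
        return prefix + "_" + UUID.randomUUID().toString().replace("-", "") + "_" + partition;
    }
}
```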
[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126416#comment-16126416 ] Eugene Kirpichov commented on BEAM-2140: Sorry for the delayed response. The output watermark should be held by the watermark hold set by the SDF: https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java#L378 - does your implementation support output watermark holds? > Fix SplittableDoFn ValidatesRunner tests in FlinkRunner > --- > > Key: BEAM-2140 > URL: https://issues.apache.org/jira/browse/BEAM-2140 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > As discovered as part of BEAM-1763, there is a failing SDF test. We disabled > the tests to unblock the open PR for BEAM-1763. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (BEAM-2700) BigQueryIO should support using file load jobs when using unbounded collections
[ https://issues.apache.org/jira/browse/BEAM-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2700. -- Resolution: Fixed Fix Version/s: 2.2.0 > BigQueryIO should support using file load jobs when using unbounded > collections > --- > > Key: BEAM-2700 > URL: https://issues.apache.org/jira/browse/BEAM-2700 > Project: Beam > Issue Type: Bug > Components: sdk-java-gcp >Affects Versions: 2.2.0 >Reporter: Reuven Lax >Assignee: Reuven Lax > Fix For: 2.2.0 > > > Currently the method used for inserting into BigQuery is based on the input > PCollection. Bounded input uses file load jobs; unbounded input uses > streaming inserts. However, while streaming inserts have far lower latency, > they cost quite a bit more and provide weaker consistency guarantees. > Users should be able to choose which method to use, irrespective of the input > PCollection. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (BEAM-2447) Reintroduce DoFn.ProcessContinuation
[ https://issues.apache.org/jira/browse/BEAM-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2447. -- Resolution: Fixed Fix Version/s: 2.2.0 > Reintroduce DoFn.ProcessContinuation > > > Key: BEAM-2447 > URL: https://issues.apache.org/jira/browse/BEAM-2447 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Fix For: 2.2.0 > > > ProcessContinuation.resume() is useful for tailing files - when we reach > current EOF, we want to voluntarily suspend the process() call rather than > wait for runner to checkpoint us. > In BEAM-1903, DoFn.ProcessContinuation was removed because there was > ambiguity about the semantics of resume() especially w.r.t. the following > situation described in > https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit > : the runner has taken a checkpoint on the tracker, and then the > ProcessElement call returns resume() signaling that the work is still not > done - then there's 2 checkpoints to deal with. > Instead, the proper way to refine this semantics is: > - After checkpoint() on a RestrictionTracker, the tracker MUST fail all > subsequent tryClaim() calls, and MUST succeed in checkDone(). > - After a failed tryClaim() call, the ProcessElement method MUST return stop() > - So ProcessElement can return resume() only *instead* of doing tryClaim() > - Then, if the runner has already taken a checkpoint but tracker has returned > resume(), we do not need to take a new checkpoint - the one already taken > already accurately describes the remainder of the work. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
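The refined contract described above can be modeled with a self-contained toy (not Beam's API): after checkpoint() the tracker fails every tryClaim(), a failed tryClaim() forces stop(), and resume() may be returned only instead of attempting a claim.

```java
final class SdfModel {
    enum Continuation { STOP, RESUME }

    static final class ToyTracker {
        private boolean checkpointed = false;
        private final long end;
        ToyTracker(long end) { this.end = end; }
        // After checkpoint(), every subsequent tryClaim() must fail.
        boolean tryClaim(long pos) { return !checkpointed && pos < end; }
        void checkpoint() { checkpointed = true; }
    }

    // process() claims positions one at a time; it returns RESUME only
    // *instead* of a claim (here: at EOF) and must STOP after a failed claim.
    static Continuation process(ToyTracker tracker, long from, long eof) {
        for (long pos = from; ; pos++) {
            if (pos >= eof) return Continuation.RESUME; // voluntarily suspend, e.g. tailing a file
            if (!tracker.tryClaim(pos)) return Continuation.STOP;
        }
    }
}
```

This resolves the ambiguity in the issue: if the runner has already checkpointed and the call then returns resume(), no second checkpoint is needed, because every position after the checkpoint was refused and the existing checkpoint already describes the remaining work.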
[jira] [Created] (BEAM-2623) Create Watch transform for emitting new elements in a family of growing sets
Eugene Kirpichov created BEAM-2623: -- Summary: Create Watch transform for emitting new elements in a family of growing sets Key: BEAM-2623 URL: https://issues.apache.org/jira/browse/BEAM-2623 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov See http://s.apache.org/beam-watch-transform -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (BEAM-2406) NullPointerException when writing an empty table to BigQuery
[ https://issues.apache.org/jira/browse/BEAM-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2406. -- Resolution: Fixed Fix Version/s: 2.1.0 > NullPointerException when writing an empty table to BigQuery > > > Key: BEAM-2406 > URL: https://issues.apache.org/jira/browse/BEAM-2406 > Project: Beam > Issue Type: Bug > Components: sdk-java-gcp >Affects Versions: 2.0.0 >Reporter: Ben Chambers >Assignee: Reuven Lax >Priority: Minor > Fix For: 2.1.0 > > > Originally reported on Stackoverflow: > https://stackoverflow.com/questions/44314030/handling-empty-pcollections-with-bigquery-in-apache-beam > It looks like if there is no data to write, then WritePartitions will return > a null destination, as explicitly stated in the comments: > https://github.com/apache/beam/blob/v2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WritePartition.java#L126 > But, the ConstantTableDestination doesn't turn that into the constant > destination as the comment promises, instead it returns that `null` > destination: > https://github.com/apache/beam/blob/53c9bf4cd325035fabde192c63652ef6d591b93c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/DynamicDestinationsHelpers.java#L74 > This leads to a null pointer error here since the `tableDestination` is that > null result from calling `getTable`: > https://github.com/apache/beam/blob/v2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L97 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2628) AvroSource.split() sequentially opens every matched file
Eugene Kirpichov created BEAM-2628: -- Summary: AvroSource.split() sequentially opens every matched file Key: BEAM-2628 URL: https://issues.apache.org/jira/browse/BEAM-2628 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov When you do AvroIO.read().from(filepattern), during splitting of AvroSource the filepattern gets expanded into N files, and then for each of the N files we do this: https://github.com/apache/beam/blob/v2.0.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L259 This is very slow. E.g. one job was reading 15,000 files, and it took almost 2 hours to split the source because opening each file and reading schema was taking about 0.5s. I'm not quite sure why we need the file metadata while splitting... -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (BEAM-2640) Introduce Create.ofProvider(ValueProvider)
[ https://issues.apache.org/jira/browse/BEAM-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov updated BEAM-2640: --- Issue Type: New Feature (was: Bug) > Introduce Create.ofProvider(ValueProvider) > -- > > Key: BEAM-2640 > URL: https://issues.apache.org/jira/browse/BEAM-2640 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > When you have a ValueProvider that may or may not be accessible at > construction time, a common task is to wrap it into a single-element > PCollection. This is especially common when migrating an IO connector that > used something like Create.of(query) followed by a ParDo, to having query be > a ValueProvider. > Currently this is done in an icky way (e.g. > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L615) > We should have a convenience helper for that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2640) Introduce Create.ofProvider(ValueProvider)
Eugene Kirpichov created BEAM-2640: -- Summary: Introduce Create.ofProvider(ValueProvider) Key: BEAM-2640 URL: https://issues.apache.org/jira/browse/BEAM-2640 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov When you have a ValueProvider that may or may not be accessible at construction time, a common task is to wrap it into a single-element PCollection. This is especially common when migrating an IO connector that used something like Create.of(query) followed by a ParDo, to having query be a ValueProvider. Currently this is done in an icky way (e.g. https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L615) We should have a convenience helper for that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
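The convenience helper could behave like the toy below, deferring the provider's get() until run time so a value that is inaccessible at construction still becomes a one-element collection when the pipeline runs. Names and types here are illustrative stand-ins for Beam's ValueProvider and Create.ofProvider, not the real API.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

final class CreateSketch {
    interface ValueProviderLike<T> {
        T get();
        boolean isAccessible(); // may be false until "run time"
    }

    // Wrap the provider into a deferred single-element collection: get() is
    // called only when the returned supplier runs, not at construction time.
    static <T> Supplier<List<T>> ofProvider(ValueProviderLike<T> provider) {
        return () -> Collections.singletonList(provider.get());
    }
}
```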
[jira] [Closed] (BEAM-2532) BigQueryIO source should avoid expensive JSON schema parsing for every record
[ https://issues.apache.org/jira/browse/BEAM-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2532. -- Resolution: Fixed Assignee: Neville Li (was: Chamikara Jayalath) Fix Version/s: 2.2.0 > BigQueryIO source should avoid expensive JSON schema parsing for every record > - > > Key: BEAM-2532 > URL: https://issues.apache.org/jira/browse/BEAM-2532 > Project: Beam > Issue Type: Improvement > Components: sdk-java-gcp >Affects Versions: 2.0.0 >Reporter: Marian Dvorsky >Assignee: Neville Li >Priority: Minor > Fix For: 2.2.0 > > > BigQueryIO source converts the schema from JSON for every input row, here: > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java#L159 > This is the performance bottleneck in a simple pipeline with BigQueryIO > source. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (BEAM-2544) AvroIOTest is flaky
[ https://issues.apache.org/jira/browse/BEAM-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2544. -- Resolution: Fixed Fix Version/s: 2.2.0 > AvroIOTest is flaky > --- > > Key: BEAM-2544 > URL: https://issues.apache.org/jira/browse/BEAM-2544 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Alex Filatov >Assignee: Eugene Kirpichov >Priority: Minor > Fix For: 2.2.0 > > > "Write then read" tests randomly fail. > Steps to reproduce: > cd /runners/direct-java > mvn clean compile > mvn surefire:test@validates-runner-tests -Dtest=AvroIOTest > Repeat last step until a failure (on my machine failure rate is approx 1/3). > Example: > [ERROR] > testAvroIOWriteAndReadSchemaUpgrade(org.apache.beam.sdk.io.AvroIOTest) Time > elapsed: 0.198 s <<< ERROR! > java.lang.RuntimeException: java.io.FileNotFoundException: > /var/folders/1c/sl733g5s1g7_4mq61_qmbjx4gn/T/junit3332447750239941326/output.avro > (No such file or directory) > at > org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:340) > at > org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302) > at > org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201) > at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64) > at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297) > at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283) > at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:340) > at > org.apache.beam.sdk.io.AvroIOTest.testAvroIOWriteAndReadSchemaUpgrade(AvroIOTest.java:275) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:321) > at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > at > org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:321) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:27) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55) > 
at > org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137) > at > org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107) > at > org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83) > at > org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75) > at > org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:157) > at >
[jira] [Closed] (BEAM-2628) AvroSource.split() sequentially opens every matched file
[ https://issues.apache.org/jira/browse/BEAM-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2628. -- Resolution: Fixed Fix Version/s: 2.2.0 > AvroSource.split() sequentially opens every matched file > > > Key: BEAM-2628 > URL: https://issues.apache.org/jira/browse/BEAM-2628 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Fix For: 2.2.0 > > > When you do AvroIO.read().from(filepattern), during splitting of AvroSource > the filepattern gets expanded into N files, and then for each of the N files > we do this: > https://github.com/apache/beam/blob/v2.0.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L259 > This is very slow. E.g. one job was reading 15,000 files, and it took almost > 2 hours to split the source because opening each file and reading schema was > taking about 0.5s. > I'm not quite sure why we need the file metadata while splitting... -- This message was sent by Atlassian JIRA (v6.4.14#64029)
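The ticket above describes ~0.5 s of per-file I/O serialized across 15,000 files during split(). One generic mitigation — a hypothetical sketch with stand-in names, not the actual Beam fix — is to fetch the per-file metadata on a thread pool instead of sequentially:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelMetadata {
  // Hypothetical sketch: read per-file metadata (e.g. an Avro schema header)
  // concurrently rather than opening each matched file one at a time.
  // "schema-of-" + f stands in for the real open-and-read-header call.
  static List<String> readHeaders(List<String> files, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> pending = new ArrayList<>();
      for (String f : files) {
        pending.add(pool.submit(() -> "schema-of-" + f));
      }
      List<String> headers = new ArrayList<>();
      for (Future<String> p : pending) {
        headers.add(p.get()); // blocks per file, but the opens overlap; order is preserved
      }
      return headers;
    } finally {
      pool.shutdown();
    }
  }
}
```

With, say, 32 threads, 15,000 half-second opens drop from ~2 hours to a few minutes — though as the comment notes, the cleaner fix may be to not need the metadata during splitting at all.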
[jira] [Closed] (BEAM-2608) Unclosed BoundedReader in TextIO#ReadTextFn#process()
[ https://issues.apache.org/jira/browse/BEAM-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2608. -- Resolution: Fixed Fix Version/s: 2.2.0 > Unclosed BoundedReader in TextIO#ReadTextFn#process() > - > > Key: BEAM-2608 > URL: https://issues.apache.org/jira/browse/BEAM-2608 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Ted Yu >Assignee: Eugene Kirpichov >Priority: Minor > Fix For: 2.2.0 > > > {code} > BoundedSource.BoundedReader reader = > source > .createForSubrangeOfFile(metadata, range.getFrom(), > range.getTo()) > .createReader(c.getPipelineOptions()); > {code} > The reader should be closed upon return. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
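Since BoundedReader implements AutoCloseable, the natural shape of the fix is try-with-resources. A minimal sketch with a stub reader (not the actual TextIO code):

```java
// Stand-in for BoundedSource.BoundedReader, which implements AutoCloseable.
class StubReader implements AutoCloseable {
  static boolean closed = false;
  int records = 0;
  boolean advance() { return ++records <= 3; } // pretend the subrange holds 3 records
  @Override public void close() { closed = true; }
}

class ReadSubrange {
  // try-with-resources guarantees close() runs on every exit path,
  // including exceptions thrown mid-read.
  static int readAll() {
    try (StubReader reader = new StubReader()) {
      int n = 0;
      while (reader.advance()) {
        n++;
      }
      return n;
    }
  }
}
```

In the real ReadTextFn, the reader created by createForSubrangeOfFile(...).createReader(...) would be the resource in the try header.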
[jira] [Created] (BEAM-2656) Introduce AvroIO.readAll()
Eugene Kirpichov created BEAM-2656: -- Summary: Introduce AvroIO.readAll() Key: BEAM-2656 URL: https://issues.apache.org/jira/browse/BEAM-2656 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov TextIO.readAll() is nifty and performant when reading a large number of files. We should similarly have AvroIO.readAll(). Maybe other connectors too in the future. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2680) Improve scalability of the Watch transform
Eugene Kirpichov created BEAM-2680: -- Summary: Improve scalability of the Watch transform Key: BEAM-2680 URL: https://issues.apache.org/jira/browse/BEAM-2680 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov https://github.com/apache/beam/pull/3565 introduces the Watch transform http://s.apache.org/beam-watch-transform. The implementation leaves several scalability-related TODOs: 1) The state stores hashes and timestamps of outputs that have already been output and should be omitted from future polls. We could garbage-collect this state, e.g. dropping elements from "completed" and from addNewAsPending() if their timestamp is more than X behind the watermark. 2) When a poll returns a huge number of elements, we don't necessarily have to add all of them into state.pending - instead we could add only N elements and ignore others, relying on future poll rounds to provide them, in order to avoid blowing up the state. Combined with garbage collection of GrowthState.completed, this would make the transform scalable to very large poll results. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
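TODO (1) above — dropping "completed" entries whose timestamp lags the watermark by more than X — can be sketched in isolation (hypothetical names and a flat map; the real GrowthState is more involved):

```java
import java.util.HashMap;
import java.util.Map;

class CompletedGc {
  // Hypothetical sketch: keep only completed-output hashes whose timestamp is
  // within maxLagMillis of the watermark. Per the ticket's assumption, entries
  // further behind the watermark cannot reappear in a future poll, so they
  // are safe to drop.
  static Map<Integer, Long> gc(Map<Integer, Long> completedHashToTimestamp,
                               long watermarkMillis, long maxLagMillis) {
    Map<Integer, Long> kept = new HashMap<>();
    for (Map.Entry<Integer, Long> e : completedHashToTimestamp.entrySet()) {
      if (e.getValue() >= watermarkMillis - maxLagMillis) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```

Combined with TODO (2) — capping how many pending elements a single poll admits — this bounds state size regardless of poll-result size.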
[jira] [Created] (BEAM-2682) Merge AvroIOTest and AvroIOTransformTest
Eugene Kirpichov created BEAM-2682: -- Summary: Merge AvroIOTest and AvroIOTransformTest Key: BEAM-2682 URL: https://issues.apache.org/jira/browse/BEAM-2682 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov These two tests seem to have exactly the same purpose. They should be merged into AvroIOTest. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2677) AvroIO.read without specifying a schema
Eugene Kirpichov created BEAM-2677: -- Summary: AvroIO.read without specifying a schema Key: BEAM-2677 URL: https://issues.apache.org/jira/browse/BEAM-2677 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov Sometimes it is inconvenient to require the user of AvroIO.read/readAll to specify a Schema for the Avro files they are reading, especially if different files may have different schemas. It is possible to read GenericRecord objects from an Avro file, however it is not possible to provide a Coder for GenericRecord without knowing the schema: a GenericRecord knows its schema so we can encode it into a byte array, but we can not decode it from a byte array without knowing the schema (and encoding the full schema together with every record would be impractical). Instead, a reasonable approach is to treat schemaless GenericRecord as unencodable and use the same approach as JdbcIO - a user-specified parse callback. Suggested API: AvroIO.parseGenericRecords(SerializableFunction<GenericRecord, T> parseFn).from(filepattern). CC: [~mkhadikov] [~reuvenlax] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
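The parse-callback idea — keep the schema-dependent GenericRecord out of any coder boundary and only encode the user-parsed T — can be illustrated generically (a sketch with stand-in types, not the AvroIO implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ParseCallback {
  // The intermediate record (a Map stands in for GenericRecord here) is
  // consumed immediately by parseFn, so no coder for it is ever needed;
  // only the user's T, for which they can supply a coder, crosses the
  // transform boundary.
  static <T> List<T> parseAll(List<Map<String, Object>> records,
                              Function<Map<String, Object>, T> parseFn) {
    List<T> out = new ArrayList<>();
    for (Map<String, Object> r : records) {
      out.add(parseFn.apply(r));
    }
    return out;
  }
}
```

This is the same shape JdbcIO uses: the unencodable ResultSet row never leaves the DoFn, only the parsed value does.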
[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103911#comment-16103911 ] Eugene Kirpichov commented on BEAM-2671: Actually that test is https://issues.apache.org/jira/browse/BEAM-1868 - does that mean it also should be fixed for 2.1.0 RC3? > CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark > runner > > > Key: BEAM-2671 > URL: https://issues.apache.org/jira/browse/BEAM-2671 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Jean-Baptiste Onofré > Fix For: 2.1.0, 2.2.0 > > > Error message: > Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: > Expected: iterable over [] in any order > but: Not matched: "late" -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (BEAM-2670) ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov reassigned BEAM-2670: -- Assignee: Eugene Kirpichov (was: Stas Levin) > ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark > runner > -- > > Key: BEAM-2670 > URL: https://issues.apache.org/jira/browse/BEAM-2670 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Eugene Kirpichov >Priority: Minor > Fix For: 2.1.0, 2.2.0 > > > Tests succeed when run alone but fail when the whole ParDoTest is run. May be > related to PipelineOptions reusing / not cleaning. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2670) ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103815#comment-16103815 ] Eugene Kirpichov commented on BEAM-2670: I found the issue and have a fix. > ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark > runner > -- > > Key: BEAM-2670 > URL: https://issues.apache.org/jira/browse/BEAM-2670 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Stas Levin >Priority: Minor > Fix For: 2.1.0, 2.2.0 > > > Tests succeed when run alone but fail when the whole ParDoTest is run. May be > related to PipelineOptions reusing / not cleaning. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103868#comment-16103868 ] Eugene Kirpichov commented on BEAM-2671: CreateStreamTest.testMultiOutputParDo() is failing as well. > CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark > runner > > > Key: BEAM-2671 > URL: https://issues.apache.org/jira/browse/BEAM-2671 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Jean-Baptiste Onofré > Fix For: 2.1.0, 2.2.0 > > > Error message: > Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: > Expected: iterable over [] in any order > but: Not matched: "late" -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner
[ https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104459#comment-16104459 ] Eugene Kirpichov commented on BEAM-2671: I wonder if these tests have something to do with https://github.com/apache/beam/commit/20820fa5477ffcdd4a9ef2e9340353ed3c5691a9 (part of PR https://github.com/apache/beam/pull/3343 ) - it seems the last stable build of Spark ValidatesRunner was right before this commit https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastStableBuild/ [~aviemzur] , could you take a look perhaps? > CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark > runner > > > Key: BEAM-2671 > URL: https://issues.apache.org/jira/browse/BEAM-2671 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Jean-Baptiste Onofré > Fix For: 2.1.0, 2.2.0 > > > Error message: > Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: > Expected: iterable over [] in any order > but: Not matched: "late" -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2684) AmqpIOTest is flaky
Eugene Kirpichov created BEAM-2684: -- Summary: AmqpIOTest is flaky Key: BEAM-2684 URL: https://issues.apache.org/jira/browse/BEAM-2684 Project: Beam Issue Type: Bug Components: sdk-java-extensions Reporter: Eugene Kirpichov Assignee: Jean-Baptiste Onofré This test is often timing out, and has been doing that for a while, causing unrelated PRs to fail randomly. I've gotten into the habit of excluding sdks/java/io/amqp when running "mvn verify" and I suppose it's not a good habit :) Example failure: https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/13424/console -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (BEAM-1820) Source.getDefaultOutputCoder() should be @Nullable
[ https://issues.apache.org/jira/browse/BEAM-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-1820. -- Resolution: Won't Fix Fix Version/s: Not applicable The consensus on dev@ is that instead, PTransform's should always provide a Coder for their output; and the natural generalization to Source is that sources also always should provide a Coder for their output; letting the user configure the Source with an explicit Coder if necessary. > Source.getDefaultOutputCoder() should be @Nullable > -- > > Key: BEAM-1820 > URL: https://issues.apache.org/jira/browse/BEAM-1820 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Łukasz Gajowy > Labels: easyfix, starter > Fix For: Not applicable > > > Source.getDefaultOutputCoder() returns a coder for elements produced by the > source. > However, the Source objects are nearly always hidden from the user and > instead encapsulated in a transform. Often, an enclosing transform has a > better idea of what coder should be used to encode these elements (e.g. a > user supplied a Coder to that transform's configuration). In that case, it'd > be good if Source.getDefaultOutputCoder() could just return null, and coder > would have to be handled by the enclosing transform or perhaps specified on > the output of that transform explicitly. > Right now there's a bunch of code in the SDK and runners that assumes > Source.getDefaultOutputCoder() returns non-null. That code would need to be > fixed to instead use the coder set on the collection produced by > Read.from(source). > It all appears pretty easy to fix, so this is a good starter item. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2686) PTransform's should always set a Coder on their output PCollections
Eugene Kirpichov created BEAM-2686: -- Summary: PTransform's should always set a Coder on their output PCollections Key: BEAM-2686 URL: https://issues.apache.org/jira/browse/BEAM-2686 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov See discussion on https://lists.apache.org/thread.html/1dde0b5a93c2983cbab5f68ce7c74580102f5bb2baaa816585d7eabb@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2640) Introduce Create.ofProvider(ValueProvider)
[ https://issues.apache.org/jira/browse/BEAM-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093588#comment-16093588 ] Eugene Kirpichov commented on BEAM-2640: Coder inference for this is complicated by the fact that ValueProvider does not expose a TypeDescriptor for the value being provided. Even though it could. I'm gonna start with requiring to explicitly provide a coder. > Introduce Create.ofProvider(ValueProvider) > -- > > Key: BEAM-2640 > URL: https://issues.apache.org/jira/browse/BEAM-2640 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > When you have a ValueProvider that may or may not be accessible at > construction time, a common task is to wrap it into a single-element > PCollection. This is especially common when migrating an IO connector that > used something like Create.of(query) followed by a ParDo, to having query be > a ValueProvider. > Currently this is done in an icky way (e.g. > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L615) > We should have a convenience helper for that. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2643) Add TextIO.read_all() to Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093633#comment-16093633 ] Eugene Kirpichov commented on BEAM-2643: See also https://github.com/apache/beam/pull/3598 > Add TextIO.read_all() to Python SDK > --- > > Key: BEAM-2643 > URL: https://issues.apache.org/jira/browse/BEAM-2643 > Project: Beam > Issue Type: New Feature > Components: sdk-py >Reporter: Chamikara Jayalath > > Java SDK now has TextIO.read_all() API that allows reading a massive number > of files by moving from using the BoundedSource API (which may perform > expensive source operations on the control plane) to using ParDo operations. > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L170 > This API should be added for Python SDK as well. > This form of reading files does not support dynamic work rebalancing for now. > But this should not matter much when reading a massive number of relatively > small files. In the future this API can support dynamic work rebalancing > through Splittable DoFn. > cc: [~jkff] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2641) Improve discoverability of TextIO.readAll() as a replacement of TextIO.read() for large globs
Eugene Kirpichov created BEAM-2641: -- Summary: Improve discoverability of TextIO.readAll() as a replacement of TextIO.read() for large globs Key: BEAM-2641 URL: https://issues.apache.org/jira/browse/BEAM-2641 Project: Beam Issue Type: Improvement Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov TextIO.readAll() dramatically outperforms TextIO.read() when reading very large numbers of files (hundreds of thousands or millions or more). However, it is not obvious that this is what you should use if you have such a filepattern in TextIO.read(). We should take a variety of measures to make it more discoverable, e.g.: * Add a parameter to TextIO.read(), like "withHintManyFiles()" * Log something suggesting the use of that hint when splitting TextIO if the filepattern is very large * Improve documentation * Post something on StackOverflow about this -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2607) Enforce that SDF must return stop() after a failed tryClaim() call
Eugene Kirpichov created BEAM-2607: -- Summary: Enforce that SDF must return stop() after a failed tryClaim() call Key: BEAM-2607 URL: https://issues.apache.org/jira/browse/BEAM-2607 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov https://github.com/apache/beam/pull/3360 reintroduces DoFn.ProcessContinuation with some refinements to its semantics - see https://issues.apache.org/jira/browse/BEAM-2447. One of the refinements is that, if the ProcessElement call unsuccessfully calls tryClaim() on the RestrictionTracker, the call MUST return stop(). The current JIRA is to enforce this automatically. Right now this is not possible because tryClaim() is not formally a method in RestrictionTracker (only concrete classes provide it, but not the base class) and runners can not hook into it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
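The contract being enforced — once tryClaim() fails, the loop exits and the call returns stop() — looks like this in miniature (a toy offset tracker, not Beam's RestrictionTracker API):

```java
// Minimal stand-in for a restriction tracker over offsets [from, to).
class OffsetTracker {
  final long to;
  long lastClaimed;

  OffsetTracker(long from, long to) {
    this.to = to;
    this.lastClaimed = from - 1;
  }

  // Returns false once the offset falls outside the restriction. After a
  // failed tryClaim(), the ProcessElement call MUST NOT do further work
  // and must return stop().
  boolean tryClaim(long offset) {
    if (offset >= to) {
      return false;
    }
    lastClaimed = offset;
    return true;
  }
}

class SdfLoop {
  // Returns the number of claimed positions; the early return on a failed
  // tryClaim() corresponds to returning ProcessContinuation.stop().
  static long process(OffsetTracker tracker, long start) {
    long processed = 0;
    for (long i = start; ; i++) {
      if (!tracker.tryClaim(i)) {
        return processed; // stop(): the failed claim belongs to another split
      }
      processed++;
    }
  }
}
```

Making tryClaim() a method on the RestrictionTracker base class, as the ticket proposes, would let runners wrap the tracker and verify this automatically.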
[jira] [Closed] (BEAM-2511) TextIO should support reading a PCollection of filenames
[ https://issues.apache.org/jira/browse/BEAM-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eugene Kirpichov closed BEAM-2511. -- Resolution: Fixed Fix Version/s: 2.2.0 > TextIO should support reading a PCollection of filenames > > > Key: BEAM-2511 > URL: https://issues.apache.org/jira/browse/BEAM-2511 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > Fix For: 2.2.0 > > > Motivation and proposed implementation in https://s.apache.org/textio-sdf -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068782#comment-16068782 ] Eugene Kirpichov commented on BEAM-2140: Conceptually, watermarks are for PCollections - lower bound on timestamps of new elements that may get added to the collection. However, at the implementation level, watermarks are assigned to transforms: they have an "input watermark" and "output watermark" (I suppose, per input and per output). The difference between the output watermark of a transform producing PC and the input watermark of a transform consuming PC is as follows: the input watermark is held by "pending elements", that we know need to be processed, but yet haven't. The input watermark is also held by the event-time of pending timers set by the transform. In other words, logically the transform's input is (output of the producer of the input) + (timers set by the transform itself), and the input watermark is held by both of these. Currently the input watermark of a transform is held only by _event-time_ timers; however, it makes sense to hold it also by _processing-time_ timers. For that we need to assign them an event-time timestamp. Currently this isn't happening at all (except assigning an "effective timestamp" to output from the timer firing, when it fires - it is assigned from the current input watermark). The suggestion in case of SDF is to use the ProcessContinuation's output watermark as the event-time for the residual timer. We also discussed handling of processing-time timers in batch. Coming from the point of view that things should work exactly the same way in batch - setting a processing-time timer in batch for firing in 5 minutes should actually fire it after 5 minutes, including possibly delaying the bundle until processing-time timers quiesce. 
Motivating use case is, say, using an SDF-based polling continuous glob expander in a batch pipeline - it should process the same set of files it would in a streaming pipeline. A few questions I still do not understand: - Where exactly do the processing-timers get dropped, and on what condition? Kenn says that event-time timers don't get dropped: we just forbid setting them if they would be already "late". - When can an input to the SDF, or a timer set by the SDF be late at all; and should the SDF drop them? Technically a runner is free to drop late data at any point in the pipeline, but in practice it happens after GBKs; and semantically an SDF need not involve a GBK, so it should be allowed to just not drop anything late, no? - like a regular DoFn would (as long as it doesn't leak state) Seems like we also should file JIRAs for the following: - state leakage - handling processing-time timers in batch properly - holding watermark by processing-time timers - allowing the timer API (internals or the user-facing one) to specifying event-time of processing-time timers - more? > Fix SplittableDoFn ValidatesRunner tests in FlinkRunner > --- > > Key: BEAM-2140 > URL: https://issues.apache.org/jira/browse/BEAM-2140 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > As discovered as part of BEAM-1763, there is a failing SDF test. We disabled > the tests to unblock the open PR for BEAM-1763. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068740#comment-16068740 ] Eugene Kirpichov commented on BEAM-2140: So, to elaborate on what Kenn said. We dug a bit deeper into this yesterday and came up with the following conclusions. 1) The reason that this stuff works in Dataflow and Direct runner is that, for running SDF, they use a code path that simply _does not drop late data/timers or GC state_. These happen in LateDataDroppingRunner and ReduceFnRunner and StatefulDoFnRunner - and the path for running ProcessFn does not involve any of these. Aljoscha, maybe you can see why your current codepaths for running ProcessFn in Flink involve dropping of late data / late timers, and make them not involve it? :) (I'm not sure where this dropping happens in Flink) 2) As a consequence, however, state doesn't get GC'd. In practice this means that, if you apply an SDF to input that is in many windows (e.g. to input windowed by fixed or sliding windows), it will slowly leak state. However, in practice this is likely not a huge concern because SDFs are expected to mostly be used when the amount of input is not super large (at least compared to output), and it is usually globally windowed. Especially in streaming use cases. I.e. it can be treated as a "Known issue" rather than "SDF does not work at all". *I would recommend proceeding to implement it in Flink runner with this same known issue*, and then solving the issue uniformly across all runners. Posting this comment for now and writing another on how to do it without state leakage. > Fix SplittableDoFn ValidatesRunner tests in FlinkRunner > --- > > Key: BEAM-2140 > URL: https://issues.apache.org/jira/browse/BEAM-2140 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > As discovered as part of BEAM-1763, there is a failing SDF test. 
We disabled > the tests to unblock the open PR for BEAM-1763. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2536) Simplify specifying coders on PCollectionTuple
Eugene Kirpichov created BEAM-2536: -- Summary: Simplify specifying coders on PCollectionTuple Key: BEAM-2536 URL: https://issues.apache.org/jira/browse/BEAM-2536 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Currently when using a multi-output ParDo, the user usually has to do one of the following: 1) Use anonymous class: new TupleTag<Foo>() {} - in order to reify the Foo type and make coder inference work. In this case, a frequent problem is that the anonymous class captures a large enclosing class, and either doesn't serialize at all, or at least serializes to something bulky. 2) Explicitly do tuple.get(myTag).setCoder(...) Both of these are suboptimal. Could we have e.g. a constructor for TupleTag that explicitly takes a TypeDescriptor? Or even a Coder? Or a family of factory methods for TupleTagList that take these? E.g.: in.apply(ParDo.of(...).withOutputTags(mainTag, TupleTagList.of(side1, FooCoder.of()).and(side2, BarCoder.of()))); I would suggest both: TupleTag constructor should optionally take a TypeDescriptor; and TupleTagList.of() and .and() should optionally take a Coder. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
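The mechanics behind approach 1) are worth spelling out: an anonymous subclass of a generic class bakes its type argument into the class file, so it survives erasure and can be read back reflectively. A minimal stand-in for TupleTag (not Beam's implementation):

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Why "new TupleTag<Foo>() {}" reifies Foo: the anonymous subclass's generic
// superclass is recorded in its class file and is recoverable at runtime.
abstract class Tag<T> {
  Type captured() {
    ParameterizedType pt = (ParameterizedType) getClass().getGenericSuperclass();
    return pt.getActualTypeArguments()[0];
  }
}
```

For example, new Tag<String>() {}.captured() recovers String. The flip side named in the ticket: when the anonymous subclass is created inside an instance method, it also captures a reference to its enclosing instance, which is what makes it serialize badly or not at all - hence the proposal to pass a TypeDescriptor or Coder explicitly instead.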
[jira] [Commented] (BEAM-2532) BigQueryIO source should avoid expensive JSON schema parsing for every record
[ https://issues.apache.org/jira/browse/BEAM-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070428#comment-16070428 ] Eugene Kirpichov commented on BEAM-2532: [~reuvenlax] I think you had some thoughts about how to make this whole thing much faster? Or was it for Write? > BigQueryIO source should avoid expensive JSON schema parsing for every record > - > > Key: BEAM-2532 > URL: https://issues.apache.org/jira/browse/BEAM-2532 > Project: Beam > Issue Type: Improvement > Components: sdk-java-gcp >Affects Versions: 2.0.0 >Reporter: Marian Dvorsky >Assignee: Stephen Sisk >Priority: Minor > > BigQueryIO source converts the schema from JSON for every input row, here: > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java#L159 > This is the performance bottleneck in a simple pipeline with BigQueryIO > source. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
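One straightforward shape for this kind of fix — a generic memoization sketch, not the actual BigQuerySourceBase change — is to parse the JSON once, lazily, and reuse the result for every record:

```java
import java.util.concurrent.atomic.AtomicInteger;

class CachedSchema {
  static final AtomicInteger parseCount = new AtomicInteger();

  private final String schemaJson;
  private volatile String parsed; // stand-in for the parsed schema object

  CachedSchema(String schemaJson) {
    this.schemaJson = schemaJson;
  }

  // Double-checked lazy initialization: the expensive parse happens at most
  // once, no matter how many rows are converted or which thread asks first.
  String get() {
    String p = parsed;
    if (p == null) {
      synchronized (this) {
        if (parsed == null) {
          parseCount.incrementAndGet();
          parsed = "parsed:" + schemaJson; // stand-in for the real JSON parse
        }
        p = parsed;
      }
    }
    return p;
  }
}
```

This moves the per-row cost from a full JSON parse down to a volatile read, which is the difference between the schema parse dominating the pipeline and it disappearing from the profile.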
[jira] [Comment Edited] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065429#comment-16065429 ] Eugene Kirpichov edited comment on BEAM-2140 at 6/27/17 8:38 PM: - Okay, I see that I misunderstood what the watermark hold does, and now I'm not sure how anything works at all (i.e. why timers set by SDF are not constantly dropped) - in direct and dataflow runner :-| For SDF specifically, I think it would make sense to *advance the watermark of the input same as if the DoFn was not splittable* - i.e. consider the input element "consumed" only when the ProcessElement call terminates with no residual restriction. In other words, I guess, set an "input watermark hold" (in addition to output watermark hold)? Is such a thing possible? Does it make equal sense for non-splittable DoFn's that use timers? was (Author: jkff): Okay, I see that I misunderstood what the watermark hold does, and now I'm not sure how anything works at all (i.e. why timers set by SDF are not constantly dropped) - in direct and dataflow runner :-| For SDF specifically, I think it would make sense to **advance the watermark of the input same as if the DoFn was not splittable** - i.e. consider the input element "consumed" only when the ProcessElement call terminates with no residual restriction. In other words, I guess, set an "input watermark hold" (in addition to output watermark hold)? Is such a thing possible? Does it make equal sense for non-splittable DoFn's that use timers? > Fix SplittableDoFn ValidatesRunner tests in FlinkRunner > --- > > Key: BEAM-2140 > URL: https://issues.apache.org/jira/browse/BEAM-2140 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > As discovered as part of BEAM-1763, there is a failing SDF test. We disabled > the tests to unblock the open PR for BEAM-1763. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065429#comment-16065429 ] Eugene Kirpichov commented on BEAM-2140: Okay, I see that I misunderstood what the watermark hold does, and now I'm not sure how anything works at all (i.e. why timers set by SDF are not constantly dropped) - in direct and dataflow runner :-| For SDF specifically, I think it would make sense to **advance the watermark of the input same as if the DoFn was not splittable** - i.e. consider the input element "consumed" only when the ProcessElement call terminates with no residual restriction. In other words, I guess, set an "input watermark hold" (in addition to output watermark hold)? Is such a thing possible? Does it make equal sense for non-splittable DoFn's that use timers? > Fix SplittableDoFn ValidatesRunner tests in FlinkRunner > --- > > Key: BEAM-2140 > URL: https://issues.apache.org/jira/browse/BEAM-2140 > Project: Beam > Issue Type: Bug > Components: runner-flink >Reporter: Aljoscha Krettek >Assignee: Aljoscha Krettek > > As discovered as part of BEAM-1763, there is a failing SDF test. We disabled > the tests to unblock the open PR for BEAM-1763. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065626#comment-16065626 ] Eugene Kirpichov commented on BEAM-2140:
_we can't advance the watermark as though it was non-splittable in the unbounded case_ - why is that / why is it a bad thing that the watermark of the PCollection being fed into the SDF would not advance?
E.g. imagine it's a Create.of(pubsub topic name) + ParDo(read pubsub forever) - is it important to advance the watermark of the Create.of()?
Alternatively, imagine it's: read filepatterns from pubsub + TextIO.readAll().watchForNewFiles().watchFilesForNewEntries(), which involves several SDFs. Would there be a problem with advancing the watermark of the PCollection of filepatterns only after the watch termination conditions of TextIO.readAll() are hit and the filepattern is no longer watched?
Alternatively - worst case, I guess: read Pubsub topic names from Kafka, and read each topic forever. I'd assume that the user would be interested in the advancement of the watermark of the PCollection of pubsub records rather than the PCollection of Pubsub topic names? I'm not sure the Pubsub topic names in Kafka would even need to have meaningful timestamps (rather than the infinite past).
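The argument above - that users care about the watermark of the pubsub records, not of the topic names - can be illustrated with a toy model in which the output watermark of an unbounded SDF comes from the progress of its active readers, decoupled from the timestamps of the elements that triggered them. This is an assumption about how such decoupling could behave, not existing Beam semantics.

```python
from typing import Dict


def output_watermark(per_reader_watermarks: Dict[str, int]) -> float:
    """Combined output watermark of an unbounded SDF: the minimum over all
    active readers' own watermarks (e.g. oldest pending record per topic),
    independent of the timestamps of the topic-name input elements."""
    return min(per_reader_watermarks.values(), default=float("inf"))


# Two topics being read forever; the topic-name elements could carry
# timestamps at the infinite past without affecting this value.
readers = {"topic-a": 1000, "topic-b": 950}
```

Here `output_watermark(readers)` is 950, driven entirely by reader progress.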
[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
[ https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065627#comment-16065627 ] Eugene Kirpichov commented on BEAM-2140:
...Or is the problem that the watermark of the output PCollection has to be smaller than the watermark of the input, so if the input doesn't advance then the output can't advance either?
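The monotonicity constraint asked about here can be stated as a one-line model: if an output watermark may never exceed the input watermark, then a held (non-advancing) input watermark caps the output watermark too, no matter how far the step's own output hold has advanced. A minimal sketch of that constraint:

```python
def effective_output_watermark(input_watermark: float, output_hold: float) -> float:
    # Under the constraint discussed above, the output watermark is capped
    # by the input watermark, regardless of the step's own output hold.
    return min(input_watermark, output_hold)
```

So with an input watermark stuck at 10 and an output hold already at 500, the effective output watermark is still 10 - a stalled input stalls the output.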
[jira] [Commented] (BEAM-673) Data locality for Read.Bounded
[ https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983950#comment-15983950 ] Eugene Kirpichov commented on BEAM-673:
On second thought, this might be related to SDF: processing different restrictions of the same element may have different requirements. Or more like: a design for DoFns giving hints to runners about their resource requirements would need to include some data dependence. I don't have a good idea about how to express it in a way that will be modular and will combine well with the rest of the Beam model and the various tricks runners are allowed to do (such as fusion or materialization).
> Data locality for Read.Bounded
> --
>
> Key: BEAM-673
> URL: https://issues.apache.org/jira/browse/BEAM-673
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Reporter: Amit Sela
> Assignee: Ismaël Mejía
> Fix For: First stable release
>
> In some distributed filesystems, such as HDFS, we should be able to hint to
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
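The locality hint described in the issue (as Spark's NewHadoopRDD does via preferred locations) can be sketched generically: given the hosts that hold a split's data, prefer a worker on one of those hosts and fall back to a remote read otherwise. Function and host names here are hypothetical, not a Beam or Spark API.

```python
from typing import List


def assign_split(split_hosts: List[str], workers: List[str]) -> str:
    """Assign a split to a worker, preferring data-local placement.

    split_hosts: hosts that store the split's data (e.g. HDFS block locations).
    workers: hosts where worker processes are available.
    """
    for host in split_hosts:
        if host in workers:
            return host  # data-local assignment: read from local disk
    return workers[0]    # no local worker: fall back to a remote read
```

A runner honoring such a hint would still be free to ignore it (e.g. under fusion or materialization), which is part of why the comment finds a modular design hard to pin down.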