[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2017-03-09 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903135#comment-15903135
 ] 

Eugene Kirpichov commented on BEAM-68:
--

I think this needs a discussion on the Beam dev@ mailing list. We should have a 
general approach to annotating pipeline elements with runner-specific 
information. I don't think "annotate the pipeline and let runners ignore it" is 
a good approach; the main reason being that this would violate the abstraction 
boundary, where a pipeline is first constructed in a runner-agnostic way (in 
fact while the runner is not even available), and then run. E.g. the set of all 
possible runner-specific annotations is not known in advance: while "step 
parallelism limit" seems relatively generic, suppose if say Apex allowed you to 
set an "Apex frobnication level" parameter on a step - it would look pretty 
weird if the Beam pipeline API had a withApexFrobnicationLevel method on a 
ParDo, and would introduce an illegal dependency from Beam to Apex.

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic use cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionality, I believe we need to add this support to the 
> Beam Model.
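
For concreteness, a minimal sketch of the GroupByKey workaround described above, assuming an input {{PCollection<String>}} named {{input}} and k = 10; as the description notes, runners may still break fusion downstream of the GroupByKey, so this caps the grouping itself, not subsequent processing:

{code:java}
// Funnel all elements through 10 keys so at most 10 groups exist after
// the GroupByKey. Runners may still break fusion afterwards, so this is
// not a hard parallelism guarantee - exactly the gap this issue describes.
PCollection<KV<Integer, Iterable<String>>> sharded = input
    .apply("AssignShard", ParDo.of(new DoFn<String, KV<Integer, String>>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // java.util.concurrent.ThreadLocalRandom; 10 = the desired limit k
        c.output(KV.of(ThreadLocalRandom.current().nextInt(10), c.element()));
      }
    }))
    .apply(GroupByKey.<Integer, String>create());

sharded.apply("CallExternalApi", ParDo.of(
    new DoFn<KV<Integer, Iterable<String>>, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        for (String element : c.element().getValue()) {
          // the rate-limited external call per element would go here
        }
      }
    }));
{code}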



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1422) ParDo should comply with PTransform style guide

2017-03-03 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1422:
--

Assignee: Eugene Kirpichov  (was: Davor Bonaci)

> ParDo should comply with PTransform style guide
> ---
>
> Key: BEAM-1422
> URL: https://issues.apache.org/jira/browse/BEAM-1422
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible
>
> Suggested changes:
> - Get rid of the ParDo.Unbound and ParDo.UnboundMulti classes completely
> - Get rid of static methods such as withSideInputs/Outputs() - the only entry 
> point should be via ParDo.of(). Correspondingly, get rid of the non-static .of().
> - Rename ParDo.Bound and ParDo.BoundMulti to simply ParDo and 
> ParDoWithSideOutputs, respectively.
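
For illustration, the resulting usage would look roughly like this ({{words}} is a hypothetical {{PCollection<String>}} and {{lengthView}} a hypothetical side-input view; the exact fluent method names are per the suggestions above, not a finalized API):

{code:java}
// After the change: ParDo.of(fn) would be the only entry point, returning
// the ParDo transform itself, configured fluently afterwards.
PCollection<Integer> lengths = words.apply(
    ParDo.of(new DoFn<String, Integer>() {
      @ProcessElement
      public void process(ProcessContext c) {
        c.output(c.element().length());
      }
    }).withSideInputs(lengthView)); // configured on the transform, not statically
{code}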



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1678) Create MemcachedIO

2017-03-10 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905236#comment-15905236
 ] 

Eugene Kirpichov commented on BEAM-1678:


What would this transform / family of transforms do?

> Create MemcachedIO
> --
>
> Key: BEAM-1678
> URL: https://issues.apache.org/jira/browse/BEAM-1678
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1679) create gRPC IO

2017-03-10 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905227#comment-15905227
 ] 

Eugene Kirpichov commented on BEAM-1679:


Discussion currently happening on the mailing list: 
https://lists.apache.org/thread.html/72bd291cb144c557cb485defcd3c3aaf05e3014da6317395dbeb6de5@%3Cuser.beam.apache.org%3E

> create gRPC IO
> --
>
> Key: BEAM-1679
> URL: https://issues.apache.org/jira/browse/BEAM-1679
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Borisa Zivkovic
>Assignee: Jean-Baptiste Onofré
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1581) JsonIO

2017-03-10 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905240#comment-15905240
 ] 

Eugene Kirpichov commented on BEAM-1581:


I would recommend preserving the decomposition and making this (I suppose, 
FileBased?) source and sink use collections of String representing the JSON 
objects, abstracting only the way a sequence of JSON literals is represented in 
a file (one per line, all in a huge JSON array, delimited by an empty line, 
etc.).

The user can then choose which JSON library to use to generate or parse them.
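
A sketch of that composition, assuming one JSON object per line, a hypothetical POJO {{MyRecord}}, and the current {{TextIO.Read}} entry point, with the existing Jackson-based {{ParseJsons}} doing the parsing:

{code:java}
// Read raw JSON strings, then parse them with a user-chosen library -
// here Jackson, via the existing sdks/java/extensions/jackson ParseJsons.
PCollection<String> jsonLines =
    p.apply(TextIO.Read.from("/path/to/records.json")); // one JSON object per line
PCollection<MyRecord> records = jsonLines.apply(ParseJsons.of(MyRecord.class));
{code}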

> JsonIO
> --
>
> Key: BEAM-1581
> URL: https://issues.apache.org/jira/browse/BEAM-1581
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> A new IO (with source and sink) which will read/write JSON files.
> Similar to {{XmlSink}}.
> Consider using methods/code (or refactor these) found in {{AsJsons}} and 
> {{ParseJsons}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1581) JsonIO

2017-03-12 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906648#comment-15906648
 ] 

Eugene Kirpichov commented on BEAM-1581:


[~aviemzur] - I think the XML source/sink are a bad design in retrospect - as 
far as I remember, they were both created mostly to demonstrate the idea of 
file-based sources/sinks, before the best practices for transform development 
took shape, and they did not come from unifying experience with an array of XML 
use cases either. We should not use them as API guidance.

The suggestion to have an abstract JsonIO seems to contradict the 
recommendation from the PTransform Style Guide (see 
https://beam.apache.org/contribute/ptransform-style-guide/#injecting-user-specified-behavior)
 to use PTransform composition as an extensibility device whenever possible 
(instead of inheritance) - and that recommendation is specifically directed at 
cases like this; the better alternative is to return Strings and let the user 
compose them with a ParDo that parses the strings.

[~eljefe6a] - "File as a self-contained JSON" means there's no JSON-specific 
logic, it's simply "File as a self-contained String" - we should definitely 
have that, but under a separate JIRA issue.

Aviem / Jesse - could you perhaps come up with a list of common ways in which 
you have seen people store a collection of things in JSON file(s)? Without 
that, or while keeping it implicit, we're acting blindly. Let's list all the 
known use cases and abstract upward from them.

> JsonIO
> --
>
> Key: BEAM-1581
> URL: https://issues.apache.org/jira/browse/BEAM-1581
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> A new IO (with source and sink) which will read/write JSON files.
> Similarly to {{XmlSource}}/{{XmlSink}}, this IO should have a 
> {{JsonSource}}/{{JsonSink}} which are a {{FileBasedSource}}/{{FileBasedSink}}.
> Consider using methods/code (or refactor these) found in {{AsJsons}} and 
> {{ParseJsons}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1678) Create MemcachedIO

2017-03-10 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905722#comment-15905722
 ] 

Eugene Kirpichov commented on BEAM-1678:


Note that all IO's are families of PTransforms (usually one or two - 
read/write, sometimes more), which is why I referred to this as "transforms".

I agree that support of the memcached protocol by many non-cache systems for 
write makes it reasonable to have Beam include a library for writing things to 
it - basically something like a PTransform<PCollection<KV<String, byte[]>>, 
PDone> that writes them to a given memcached-compatible endpoint. Is this all, 
or do you have something more in mind?
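
To make the shape concrete, a rough skeleton of such a transform; {{MemcachedWrite}} and the client plumbing are placeholders for illustration, not a proposed API:

{code:java}
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

/** Sketch of a write to a memcached-compatible endpoint. */
class MemcachedWrite extends PTransform<PCollection<KV<String, byte[]>>, PDone> {
  private final String endpoint; // host:port of the memcached-compatible service

  MemcachedWrite(String endpoint) {
    this.endpoint = endpoint;
  }

  @Override
  public PDone expand(PCollection<KV<String, byte[]>> input) {
    input.apply(ParDo.of(new DoFn<KV<String, byte[]>, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // A memcached client's set(key, value) call against `endpoint`
        // would go here; the choice of client library is an open question.
      }
    }));
    return PDone.in(input.getPipeline());
  }
}
{code}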

> Create MemcachedIO
> --
>
> Key: BEAM-1678
> URL: https://issues.apache.org/jira/browse/BEAM-1678
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish

2017-03-13 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1712:
--

 Summary: TestPipeline.run doesn't actually waitUntilFinish
 Key: BEAM-1712
 URL: https://issues.apache.org/jira/browse/BEAM-1712
 Project: Beam
  Issue Type: Bug
  Components: runner-flink, sdk-java-core, testing
Reporter: Eugene Kirpichov
Assignee: Stas Levin
Priority: Blocker


https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124

it calls waitUntilFinish() only if
1) run() wasn't called, and
2) enableAutoRunIfMissing is true.

However, in practice both of these conditions are false:
1) run() is called in most tests. So effectively, if you call .run() at all, 
TestPipeline doesn't call waitUntilFinish().
2) enableAutoRunIfMissing is set to true only via 
TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit 
test.

This means that, for all tests that use TestPipeline, if the test waits until 
finish, it's only by the grace of the particular runner - which is really bad.

We're lucky because in practice TestDataflowRunner, TestApexRunner, and 
TestSparkRunner call waitUntilFinish() themselves in run().

However, TestFlinkRunner doesn't - i.e. there currently might be tests that 
actually fail in the Flink runner, undetected.

The proper fix is to make TestPipeline always call waitUntilFinish().
Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to 
make sure Flink is safe.
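
The proper fix would amount to something like the following inside TestPipeline (a sketch, not the actual patch):

{code:java}
// Sketch: make TestPipeline.run() itself block until the pipeline
// terminates, instead of trusting each Test*Runner to block in run().
@Override
public PipelineResult run() {
  PipelineResult result = super.run();
  result.waitUntilFinish();
  return result;
}
{code}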



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish

2017-03-13 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923329#comment-15923329
 ] 

Eugene Kirpichov commented on BEAM-1712:


OK, seems like that is in fact what happens. So this bug then is currently NOT 
a release blocker because, for various reasons, all test runners including 
Flink actually wait until finish even though TestPipeline doesn't ask them to.

> TestPipeline.run doesn't actually waitUntilFinish
> -
>
> Key: BEAM-1712
> URL: https://issues.apache.org/jira/browse/BEAM-1712
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core, testing
>Reporter: Eugene Kirpichov
>Assignee: Stas Levin
>Priority: Blocker
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124
> it calls waitUntilFinish() only if
> 1) run() wasn't called, and
> 2) enableAutoRunIfMissing is true.
> However, in practice both of these conditions are false:
> 1) run() is called in most tests. So effectively, if you call .run() at all, 
> TestPipeline doesn't call waitUntilFinish().
> 2) enableAutoRunIfMissing is set to true only via 
> TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit 
> test.
> This means that, for all tests that use TestPipeline, if the test waits 
> until finish, it's only by the grace of the particular runner - which is 
> really bad.
> We're lucky because in practice TestDataflowRunner, TestApexRunner, and 
> TestSparkRunner call waitUntilFinish() themselves in run().
> However, TestFlinkRunner doesn't - i.e. there currently might be tests that 
> actually fail in the Flink runner, undetected.
> The proper fix is to make TestPipeline always call waitUntilFinish().
> Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to 
> make sure Flink is safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish

2017-03-13 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov updated BEAM-1712:
---
Priority: Major  (was: Blocker)

> TestPipeline.run doesn't actually waitUntilFinish
> -
>
> Key: BEAM-1712
> URL: https://issues.apache.org/jira/browse/BEAM-1712
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core, testing
>Reporter: Eugene Kirpichov
>Assignee: Stas Levin
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124
> it calls waitUntilFinish() only if
> 1) run() wasn't called, and
> 2) enableAutoRunIfMissing is true.
> However, in practice both of these conditions are false:
> 1) run() is called in most tests. So effectively, if you call .run() at all, 
> TestPipeline doesn't call waitUntilFinish().
> 2) enableAutoRunIfMissing is set to true only via 
> TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit 
> test.
> This means that, for all tests that use TestPipeline, if the test waits 
> until finish, it's only by the grace of the particular runner - which is 
> really bad.
> We're lucky because in practice TestDataflowRunner, TestApexRunner, and 
> TestSparkRunner call waitUntilFinish() themselves in run().
> However, TestFlinkRunner doesn't - i.e. there currently might be tests that 
> actually fail in the Flink runner, undetected.
> The proper fix is to make TestPipeline always call waitUntilFinish().
> Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to 
> make sure Flink is safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish

2017-03-13 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923326#comment-15923326
 ] 

Eugene Kirpichov commented on BEAM-1712:


An optimistic interpretation is probably that the Flink runner in batch mode 
simply waits in .run(). Not sure about streaming mode, then.

> TestPipeline.run doesn't actually waitUntilFinish
> -
>
> Key: BEAM-1712
> URL: https://issues.apache.org/jira/browse/BEAM-1712
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core, testing
>Reporter: Eugene Kirpichov
>Assignee: Stas Levin
>Priority: Blocker
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124
> it calls waitUntilFinish() only if
> 1) run() wasn't called, and
> 2) enableAutoRunIfMissing is true.
> However, in practice both of these conditions are false:
> 1) run() is called in most tests. So effectively, if you call .run() at all, 
> TestPipeline doesn't call waitUntilFinish().
> 2) enableAutoRunIfMissing is set to true only via 
> TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit 
> test.
> This means that, for all tests that use TestPipeline, if the test waits 
> until finish, it's only by the grace of the particular runner - which is 
> really bad.
> We're lucky because in practice TestDataflowRunner, TestApexRunner, and 
> TestSparkRunner call waitUntilFinish() themselves in run().
> However, TestFlinkRunner doesn't - i.e. there currently might be tests that 
> actually fail in the Flink runner, undetected.
> The proper fix is to make TestPipeline always call waitUntilFinish().
> Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to 
> make sure Flink is safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish

2017-03-13 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923351#comment-15923351
 ] 

Eugene Kirpichov commented on BEAM-1712:


As an intermediate step toward unifying everything under TestPipeline calling 
waitUntilFinish(), it'd be nice to at least make the current code less confusing:

- TestPipeline should remove the code that pretends it calls waitUntilFinish()
- Test runners (TestBlahRunner) should document that they wait themselves
- Tests should not manually call waitUntilFinish()

I.e. the responsibility should be centralized in one place; right now that 
place can only be the individual test runners, and eventually it will be 
TestPipeline.

> TestPipeline.run doesn't actually waitUntilFinish
> -
>
> Key: BEAM-1712
> URL: https://issues.apache.org/jira/browse/BEAM-1712
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core, testing
>Reporter: Eugene Kirpichov
>Assignee: Stas Levin
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124
> it calls waitUntilFinish() only if
> 1) run() wasn't called, and
> 2) enableAutoRunIfMissing is true.
> However, in practice both of these conditions are false:
> 1) run() is called in most tests. So effectively, if you call .run() at all, 
> TestPipeline doesn't call waitUntilFinish().
> 2) enableAutoRunIfMissing is set to true only via 
> TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit 
> test.
> This means that, for all tests that use TestPipeline, if the test waits 
> until finish, it's only by the grace of the particular runner - which is 
> really bad.
> We're lucky because in practice TestDataflowRunner, TestApexRunner, and 
> TestSparkRunner call waitUntilFinish() themselves in run().
> However, TestFlinkRunner doesn't - i.e. there currently might be tests that 
> actually fail in the Flink runner, undetected.
> The proper fix is to make TestPipeline always call waitUntilFinish().
> Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to 
> make sure Flink is safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1712) TestPipeline.run doesn't actually waitUntilFinish

2017-03-13 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923321#comment-15923321
 ] 

Eugene Kirpichov commented on BEAM-1712:


https://github.com/apache/beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/FlinkRunnerResult.java#L86
 seems to imply that Flink tests with PAssert are and always have been 
vacuous?...

[~aljoscha] WDYT?

> TestPipeline.run doesn't actually waitUntilFinish
> -
>
> Key: BEAM-1712
> URL: https://issues.apache.org/jira/browse/BEAM-1712
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core, testing
>Reporter: Eugene Kirpichov
>Assignee: Stas Levin
>Priority: Blocker
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestPipeline.java#L124
> it calls waitUntilFinish() only if
> 1) run() wasn't called, and
> 2) enableAutoRunIfMissing is true.
> However, in practice both of these conditions are false:
> 1) run() is called in most tests. So effectively, if you call .run() at all, 
> TestPipeline doesn't call waitUntilFinish().
> 2) enableAutoRunIfMissing is set to true only via 
> TestPipeline.enableAutoRunIfMissing(), which is called only from its own unit 
> test.
> This means that, for all tests that use TestPipeline, if the test waits 
> until finish, it's only by the grace of the particular runner - which is 
> really bad.
> We're lucky because in practice TestDataflowRunner, TestApexRunner, and 
> TestSparkRunner call waitUntilFinish() themselves in run().
> However, TestFlinkRunner doesn't - i.e. there currently might be tests that 
> actually fail in the Flink runner, undetected.
> The proper fix is to make TestPipeline always call waitUntilFinish().
> Currently testing a quick fix in https://github.com/apache/beam/pull/2240 to 
> make sure Flink is safe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1573) KafkaIO does not allow using Kafka serializers and deserializers

2017-03-06 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1589#comment-1589
 ] 

Eugene Kirpichov commented on BEAM-1573:


Yes, this would be a backward-incompatible change. We are sort of fine with 
making such changes if there is a good reason, and before declaring the Beam 
API stable (the first 1.x release); this would be one of a family of changes 
bringing Beam in accordance with its own style guide - which we consider a 
necessary evil. We already removed coders from TextIO as part of that.

So yes, please feel free to proceed, and I'll be happy to review the PR.

> KafkaIO does not allow using Kafka serializers and deserializers
> 
>
> Key: BEAM-1573
> URL: https://issues.apache.org/jira/browse/BEAM-1573
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Affects Versions: 0.4.0, 0.5.0
>Reporter: peay
>Assignee: Raghu Angadi
>Priority: Minor
>
> KafkaIO does not allow overriding the serializer and deserializer settings 
> of the Kafka consumers and producers it uses internally. Instead, it allows 
> setting a `Coder`, and has a simple Kafka serializer/deserializer wrapper 
> class that calls the coder.
> I appreciate that allowing Beam coders to be used is good and consistent with 
> the rest of the system. However, is there a reason to completely disallow 
> custom Kafka serializers as an alternative?
> This is a limitation when working with an Avro schema registry, for instance, 
> which requires custom serializers. One can write a `Coder` that wraps a 
> custom Kafka serializer, but that means two levels of unnecessary wrapping.
> In addition, the `Coder` abstraction is not equivalent to Kafka's 
> `Serializer`, which gets the topic name as input. Using a `Coder` wrapper 
> would require duplicating the output topic setting in the argument to 
> `KafkaIO` and when building the wrapper, which is inelegant and error-prone.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1573) KafkaIO does not allow using Kafka serializers and deserializers

2017-03-06 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896989#comment-15896989
 ] 

Eugene Kirpichov commented on BEAM-1573:


At a conference right now, but a quick comment: yes, as Raghu said, we're 
getting rid of Coders as a general parsing mechanism. Use of coders for the 
purpose for which KafkaIO currently uses them is explicitly forbidden by the 
Beam PTransform Style Guide: 
https://beam.apache.org/contribute/ptransform-style-guide/#coders

We should replace that with having KafkaIO return byte[], plus convenience 
utilities for deserializing those byte[] using Kafka deserializers - e.g. by 
wrapping the code Raghu posted as a utility in the kafka module (packaged, say, 
as a SerializableFunction).

Raghu or @peay, perhaps consider sending a PR to fix this? It seems like it 
should be rather easy, though it would merit a short discussion on 
d...@beam.apache.org first.
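
A sketch of such a utility ({{KafkaDeserializerFn}} is a hypothetical name); the Kafka {{Deserializer}} is instantiated lazily because Kafka deserializers are generally not Serializable themselves:

{code:java}
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.kafka.common.serialization.Deserializer;

/** Sketch: adapts a Kafka Deserializer into a Beam SerializableFunction. */
class KafkaDeserializerFn<T> implements SerializableFunction<byte[], T> {
  private final Class<? extends Deserializer<T>> deserializerClass;
  private final String topic; // Kafka deserializers receive the topic name
  private transient Deserializer<T> deserializer;

  KafkaDeserializerFn(Class<? extends Deserializer<T>> deserializerClass, String topic) {
    this.deserializerClass = deserializerClass;
    this.topic = topic;
  }

  @Override
  public T apply(byte[] bytes) {
    if (deserializer == null) {
      try {
        deserializer = deserializerClass.newInstance();
      } catch (InstantiationException | IllegalAccessException e) {
        throw new RuntimeException(e);
      }
    }
    return deserializer.deserialize(topic, bytes);
  }
}
{code}

Usage could then look like new KafkaDeserializerFn<>(StringDeserializer.class, "my-topic"), applied to the byte[] values KafkaIO would return.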

> KafkaIO does not allow using Kafka serializers and deserializers
> 
>
> Key: BEAM-1573
> URL: https://issues.apache.org/jira/browse/BEAM-1573
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Affects Versions: 0.4.0, 0.5.0
>Reporter: peay
>Assignee: Raghu Angadi
>Priority: Minor
>
> KafkaIO does not allow overriding the serializer and deserializer settings 
> of the Kafka consumers and producers it uses internally. Instead, it allows 
> setting a `Coder`, and has a simple Kafka serializer/deserializer wrapper 
> class that calls the coder.
> I appreciate that allowing Beam coders to be used is good and consistent with 
> the rest of the system. However, is there a reason to completely disallow 
> custom Kafka serializers as an alternative?
> This is a limitation when working with an Avro schema registry, for instance, 
> which requires custom serializers. One can write a `Coder` that wraps a 
> custom Kafka serializer, but that means two levels of unnecessary wrapping.
> In addition, the `Coder` abstraction is not equivalent to Kafka's 
> `Serializer`, which gets the topic name as input. Using a `Coder` wrapper 
> would require duplicating the output topic setting in the argument to 
> `KafkaIO` and when building the wrapper, which is inelegant and error-prone.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1416) Write transform should comply with PTransform style guide

2017-02-28 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1416.
--

> Write transform should comply with PTransform style guide
> -
>
> Key: BEAM-1416
> URL: https://issues.apache.org/jira/browse/BEAM-1416
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible
> Fix For: 0.6.0
>
>
> Suggested change: remove Bound class - Write itself should be the transform 
> class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1426) SortValues should comply with PTransform style guide

2017-02-28 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1426.
--

> SortValues should comply with PTransform style guide
> 
>
> Key: BEAM-1426
> URL: https://issues.apache.org/jira/browse/BEAM-1426
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible
> Fix For: 0.6.0
>
>
> Suggested changes: BufferedExternalSorter.Options should name its builder 
> methods .withBlah(), rather than .setBlah().



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1424) ToString should comply with PTransform style guide

2017-02-28 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889586#comment-15889586
 ] 

Eugene Kirpichov commented on BEAM-1424:


Actual backward-incompatible changes: kv renamed to kvs, iterable to iterables.

> ToString should comply with PTransform style guide
> --
>
> Key: BEAM-1424
> URL: https://issues.apache.org/jira/browse/BEAM-1424
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible
> Fix For: 0.6.0
>
>
> Suggested changes:
> - Rename of() to elements(), kv() to kvs(), and iterable() to iterables(), for 
> consistency with each other and with similar Beam transforms, e.g. 
> Flatten.Iterables
> - These methods should return correspondingly named transforms: 
> ToString.Elements, ToString.KVs, ToString.Iterables



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1424) ToString should comply with PTransform style guide

2017-02-28 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1424.
--

> ToString should comply with PTransform style guide
> --
>
> Key: BEAM-1424
> URL: https://issues.apache.org/jira/browse/BEAM-1424
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible
> Fix For: 0.6.0
>
>
> Suggested changes:
> - Rename of() to elements(), kv() to kvs(), and iterable() to iterables(), for 
> consistency with each other and with similar Beam transforms, e.g. 
> Flatten.Iterables
> - These methods should return correspondingly named transforms: 
> ToString.Elements, ToString.KVs, ToString.Iterables



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-849) Redesign PipelineResult API

2017-03-01 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890855#comment-15890855
 ] 

Eugene Kirpichov commented on BEAM-849:
---

I disagree that "unbounded pipelines" can't finish successfully.

- Dataflow runner supports draining of pipelines, which leads to successful 
termination.
- It is possible to run a pipeline like Create.of(1, 2, 3) + ParDo(do nothing) 
using a streaming runner, and it should terminate rather than hang.
- One might ask "why run such a pipeline with a streaming runner", but it makes 
a lot more sense if the ParDo is splittable. E.g. Create.of(filename) + 
ParDo(tail file) + ParDo(process records) could use the low-latency 
capabilities of a streaming runner, but successfully terminate when the file is 
somehow "finalized". As a more mundane example - tests in 
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
 should pass in streaming runners as well as batch runners.
- "Unbounded pipeline" is in general not a Beam concept - we should have a 
batch/streaming-agnostic meaning of "finished" in "waitUntilFinished". I 
propose the one that Dataflow runner uses for deciding when drain is completed: 
"all watermarks have progressed to infinity".

> Redesign PipelineResult API
> ---
>
> Key: BEAM-849
> URL: https://issues.apache.org/jira/browse/BEAM-849
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Pei He
>
> Current state: 
> Jira https://issues.apache.org/jira/browse/BEAM-443 addresses 
> waitUntilFinish() and cancel(). 
> However, there is additional work around PipelineResult:
> - need a clearly defined contract and verification across all runners
> - need to revisit how to handle metrics/aggregators
> - need to be able to get logs



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1562) Use a "signal" to stop streaming tests as they finish.

2017-03-01 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890861#comment-15890861
 ] 

Eugene Kirpichov commented on BEAM-1562:


See my comment on BEAM-849: as an alternative, we can use "watermarks 
progressed to infinity" as a termination signal.

> Use a "signal" to stop streaming tests as they finish.
> --
>
> Key: BEAM-1562
> URL: https://issues.apache.org/jira/browse/BEAM-1562
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Amit Sela
>
> Streaming tests use a timeout that has to include a large enough buffer to 
> avoid slow runtimes stopping before the test completes.
> We can introduce a "poison pill" based on an {{EndOfStream}} element that 
> would be counted in a Metrics/Aggregator to know all data was processed.
> Another option would be to follow the slowest watermark and stop once it hits 
> end-of-time.
> This requires the Spark runner to stop blocking the execution of a streaming 
> pipeline until it's complete - which is relevant to the {{PipelineResult}} 
> (re)design being discussed in BEAM-849
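
For illustration, a minimal sketch of the poison-pill variant ({{CountEndOfStream}} and the marker value are hypothetical); the test would poll the counter via PipelineResult.metrics() instead of sleeping on a timeout:

{code:java}
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

/** Sketch: counts sentinel elements so a test knows all data was processed. */
class CountEndOfStream extends DoFn<String, String> {
  static final String END_OF_STREAM = "__EndOfStream__"; // marker (illustrative)

  private final Counter pills = Metrics.counter(CountEndOfStream.class, "endOfStreamSeen");

  @ProcessElement
  public void process(ProcessContext c) {
    if (END_OF_STREAM.equals(c.element())) {
      pills.inc(); // the test polls this counter to decide when to stop
    } else {
      c.output(c.element());
    }
  }
}
{code}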



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1578) Runners should put PT overrides into a list rather than map

2017-02-28 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1578:
--

 Summary: Runners should put PT overrides into a list rather than 
map
 Key: BEAM-1578
 URL: https://issues.apache.org/jira/browse/BEAM-1578
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow, runner-direct
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov
Priority: Minor


It was not clear to me from the code that the order of overrides is important. 
A Map is not the best data structure for this, especially since no key lookups 
happen.

Referring to
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
and
https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java
and possibly others.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1594) JOB_STATE_DRAINED should be terminal for waitUntilFinish()

2017-03-02 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1594:
--

 Summary: JOB_STATE_DRAINED should be terminal for waitUntilFinish()
 Key: BEAM-1594
 URL: https://issues.apache.org/jira/browse/BEAM-1594
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


Dataflow runner supports pipeline draining for streaming pipelines, which stops 
consuming input and waits for all watermarks to progress to infinity and then 
declares the pipeline complete. When watermarks progress to infinity after a 
drain request, the Dataflow runner will mark the job DRAINED (In the future, 
when watermarks progress to infinity without a drain request, it will mark the 
job DONE instead). JOB_STATE_DRAINED is logically a terminal state - no more 
data will be processed and the pipeline's resources are torn down - and 
waitUntilFinish() should terminate if the job enters this state.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1427) BigQueryIO should comply with PTransform style guide

2017-03-02 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1427:
--

Assignee: Eugene Kirpichov

> BigQueryIO should comply with PTransform style guide
> 
>
> Key: BEAM-1427
> URL: https://issues.apache.org/jira/browse/BEAM-1427
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible
>
> Suggested changes:
> - Remove Read/Write.Bound classes - Read and Write themselves should be the 
> transform classes
> - Remove static builder-like .withBlah() methods
> - (optional) use AutoValue



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1915) Remove OldDoFn dependency in ApexGroupByKeyOperator

2017-04-07 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961586#comment-15961586
 ] 

Eugene Kirpichov commented on BEAM-1915:


It happens to also be the last occurrence in the Beam repository. It is also 
used in the (non-opensourced) Dataflow worker, but we (Dataflow team) can just 
make a copy there and take care of it on our side - other than that, after this 
JIRA, OldDoFn can be deleted from Beam altogether.

> Remove OldDoFn dependency in ApexGroupByKeyOperator
> ---
>
> Key: BEAM-1915
> URL: https://issues.apache.org/jira/browse/BEAM-1915
> Project: Beam
>  Issue Type: Task
>  Components: runner-apex
>Reporter: Thomas Weise
>Assignee: Eugene Kirpichov
>Priority: Minor
>
> It is the last remaining occurrence in the ApexRunner.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1915) Remove OldDoFn dependency in ApexGroupByKeyOperator

2017-04-07 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961587#comment-15961587
 ] 

Eugene Kirpichov commented on BEAM-1915:


cc: [~kenn]

> Remove OldDoFn dependency in ApexGroupByKeyOperator
> ---
>
> Key: BEAM-1915
> URL: https://issues.apache.org/jira/browse/BEAM-1915
> Project: Beam
>  Issue Type: Task
>  Components: runner-apex
>Reporter: Thomas Weise
>Assignee: Eugene Kirpichov
>Priority: Minor
>
> It is the last remaining occurrence in the ApexRunner.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1910) test_using_slow_impl very flaky locally

2017-04-07 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961590#comment-15961590
 ] 

Eugene Kirpichov commented on BEAM-1910:


Got the error again.

==
FAIL: test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders)
--
Traceback (most recent call last):
  File 
"/usr/local/google/home/kirpichov/incubator-beam/sdks/python/apache_beam/coders/slow_coders_test.py",
 line 41, in test_using_slow_impl
import apache_beam.coders.stream
AssertionError: ImportError not raised


> test_using_slow_impl very flaky locally
> ---
>
> Key: BEAM-1910
> URL: https://issues.apache.org/jira/browse/BEAM-1910
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Eugene Kirpichov
>Assignee: Ahmet Altay
>
> Most times this test fails on my machine when running:
> mvn verify -am -T 1C
> test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL
> ...
> ___ summary 
> 
> ERROR:   docs: commands failed
>   lint: commands succeeded
> ERROR:   py27: commands failed
>   py27cython: commands succeeded
>   py27gcp: commands succeeded
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
>   at 
> org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
>   at 
> org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
>   at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
>   at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
>   at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>   at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185)
>   at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Unfortunately the test doesn't print anything to maven output, so I don't 
> know what went wrong. I also don't know how to rerun the individual test 
> myself.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1910) test_using_slow_impl very flaky locally

2017-04-07 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961592#comment-15961592
 ] 

Eugene Kirpichov commented on BEAM-1910:


I guess I also would like a way to run mvn verify but skip the python tests - 
they are not helpful when I'm verifying a Java-only PR.

> test_using_slow_impl very flaky locally
> ---
>
> Key: BEAM-1910
> URL: https://issues.apache.org/jira/browse/BEAM-1910
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Eugene Kirpichov
>Assignee: Ahmet Altay
>
> Most times this test fails on my machine when running:
> mvn verify -am -T 1C
> test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL
> ...
> ___ summary 
> 
> ERROR:   docs: commands failed
>   lint: commands succeeded
> ERROR:   py27: commands failed
>   py27cython: commands succeeded
>   py27gcp: commands succeeded
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
>   at 
> org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
>   at 
> org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
>   at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
>   at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
>   at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>   at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185)
>   at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Unfortunately the test doesn't print anything to maven output, so I don't 
> know what went wrong. I also don't know how to rerun the individual test 
> myself.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1903) Splittable DoFn should report watermarks via ProcessContext

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1903.
--
   Resolution: Fixed
Fix Version/s: First stable release

> Splittable DoFn should report watermarks via ProcessContext
> ---
>
> Key: BEAM-1903
> URL: https://issues.apache.org/jira/browse/BEAM-1903
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
> Fix For: First stable release
>
>
> See design document:
> https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1904) Remove DoFn.ProcessContinuation

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1904.
--
   Resolution: Fixed
Fix Version/s: First stable release

This is effectively done in https://github.com/apache/beam/pull/2455 ; the only 
thing remaining is to physically remove the class, but I'll do that in a 
separate Dataflow worker dance.

> Remove DoFn.ProcessContinuation
> ---
>
> Key: BEAM-1904
> URL: https://issues.apache.org/jira/browse/BEAM-1904
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
> Fix For: First stable release
>
>
> Design document:
> https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1913) TFRecordIO should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1913:
--

 Summary: TFRecordIO should comply with PTransform style guide
 Key: BEAM-1913
 URL: https://issues.apache.org/jira/browse/BEAM-1913
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov


https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
 violates a few guidelines from 
https://beam.apache.org/contribute/ptransform-style-guide/ :

- Use of Bound and Unbound types: the Read and Write classes themselves should 
be PTransforms
- Should have one static entry point per verb - read() and write()
- Both classes could benefit from AutoValue

Basically, perform surgery similar to that in 
https://github.com/apache/beam/pull/2149, but on a smaller scale, since this is 
a much smaller connector.
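
For illustration, the entry-point shape the guide calls for (method names follow the pattern of other connectors, not a finalized API; {{p}} is a Pipeline):

{code:java}
// One static entry point per verb, each returning the transform itself.
PCollection<byte[]> records =
    p.apply(TFRecordIO.read().from("/path/to/input.tfrecord"));
records.apply(TFRecordIO.write().to("/path/to/output"));
{code}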



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1428) KinesisIO should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1428:
--

Assignee: (was: Davor Bonaci)

> KinesisIO should comply with PTransform style guide
> ---
>
> Key: BEAM-1428
> URL: https://issues.apache.org/jira/browse/BEAM-1428
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Eugene Kirpichov
>  Labels: backward-incompatible, starter
>
> Suggested changes:
> - KinesisIO.Read should be a PTransform itself
> - It should have builder methods .withBlah() for setting the parameters, 
> instead of the current somewhat strange combination of the from() factory 
> methods and the using() methods



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-644) Primitive to shift the watermark while assigning timestamps

2017-04-05 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15957959#comment-15957959
 ] 

Eugene Kirpichov commented on BEAM-644:
---

I don't think there's an SDF-specific issue here: it applies just as well to a 
non-splittable DoFn that would take a filename as input, and potentially 
produce elements whose timestamps are behind the timestamp of the input element 
(filename).

Speaking of updates - no, I don't think there've been any updates. Perhaps in 
practice this can be solved by feeding the SDF elements with a timestamp of 
"infinite past", and from then on, relying on the SDF's own output watermark 
reporting.

> Primitive to shift the watermark while assigning timestamps
> ---
>
> Key: BEAM-644
> URL: https://issues.apache.org/jira/browse/BEAM-644
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model, beam-model-runner-api
>Reporter: Kenneth Knowles
>
> There is a general need, especially important in the presence of 
> SplittableDoFn, to be able to assign new timestamps to elements without 
> making them late or droppable.
>  - DoFn.withAllowedTimestampSkew is inadequate, because it simply allows one 
> to produce late data, but does not allow one to shift the watermark so the 
> new data is on-time.
>  - For a SplittableDoFn, one may receive an element such as the name of a log 
> file that contains elements for the day preceding the log file. The timestamp 
> on the filename must currently be the beginning of the log. If such elements 
> are constantly flowing, it may be OK, but since we don't know that the element 
> is coming, in the absence of data the watermark may advance. We need a way to 
> keep it far enough back even when there is no data holding it back.
> One idea is a new primitive ShiftWatermark / AdjustTimestamps with the 
> following pieces:
>  - A constant duration (positive or negative) D by which to shift the 
> watermark.
>  - A function from TimestampedElement to new timestamp that is >= t + D
> So, for example, AdjustTimestamps(<-60 minutes>, f) would allow f to make 
> timestamps up to 60 minutes earlier.
> With this primitive added, outputWithTimestamp and withAllowedTimestampSkew 
> could be removed, simplifying DoFn.
> Alternatively, all of this functionality could be bolted on to DoFn.
> This ticket is not a proposal, but a record of the issue and ideas that were 
> mentioned.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1910) test_using_slow_impl very flaky locally

2017-04-07 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1910:
--

 Summary: test_using_slow_impl very flaky locally
 Key: BEAM-1910
 URL: https://issues.apache.org/jira/browse/BEAM-1910
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Eugene Kirpichov
Assignee: Ahmet Altay


Most times this test fails on my machine when running:

mvn verify -am -T 1C

test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL
...
___ summary 
ERROR:   docs: commands failed
  lint: commands succeeded
ERROR:   py27: commands failed
  py27cython: commands succeeded
  py27gcp: commands succeeded
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit 
value: 1)
at 
org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
at 
org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185)
at 
org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Unfortunately the test doesn't print anything to maven output, so I don't know 
what went wrong. I also don't know how to rerun the individual test myself.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1581) JSON source and sink

2017-04-07 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961285#comment-15961285
 ] 

Eugene Kirpichov commented on BEAM-1581:


Yeah, JacksonIO sounds like a good idea. It also carries less risk that users 
will come to rely on it as "the" way to use JSON from Beam while we introduce 
more ways to use JSON with different APIs.

> JSON source and sink
> 
>
> Key: BEAM-1581
> URL: https://issues.apache.org/jira/browse/BEAM-1581
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> JSON source and sink to read/write JSON files.
> Similarly to {{XmlSource}}/{{XmlSink}}, these would be a 
> {{JsonSource}}/{{JsonSink}} which are a {{FileBasedSource}}/{{FileBasedSink}}.
> Consider using methods/code (or refactor these) found in {{AsJsons}} and 
> {{ParseJsons}}.
> The {{PCollection}} of objects the user passes to the transform should be 
> embedded in a valid JSON file.
> The most common pattern for this is a large object with an array member which 
> holds all the data objects, and other members for metadata.
> Examples of public JSON APIs: https://www.sitepoint.com/10-example-json-files/
> Another pattern used is a file which is simply a JSON array of objects.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1913) TFRecordIO should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov updated BEAM-1913:
---
Labels: backward-incompatible starter  (was: backwards-incompatible starter)

> TFRecordIO should comply with PTransform style guide
> 
>
> Key: BEAM-1913
> URL: https://issues.apache.org/jira/browse/BEAM-1913
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>  Labels: backward-incompatible, starter
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
>  violates a few guidelines from 
> https://beam.apache.org/contribute/ptransform-style-guide/ :
> - Use of Bound and Unbound types: the Read and Write classes themselves 
> should be PTransforms
> - Should have one static entry point per verb - read() and write()
> - Both classes could benefit from AutoValue
> Basically, perform surgery similar to that in 
> https://github.com/apache/beam/pull/2149, but on a smaller scale, since this 
> is a much smaller connector.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1914) XML IO should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1914:
--

Assignee: (was: Davor Bonaci)

> XML IO should comply with PTransform style guide
> 
>
> Key: BEAM-1914
> URL: https://issues.apache.org/jira/browse/BEAM-1914
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>  Labels: backward-incompatible, starter
>
> Currently we have XmlSource and XmlSink in the Java SDK. They violate the 
> PTransform style guide in several respects:
> - They should be grouped into an XmlIO class with read() and write() verbs, 
> like all the other similar connectors
> - The source/sink classes should be made private or package-local
> - Should get rid of XmlSink.Bound - XmlSink itself should inherit from 
> FileBasedSink
> - Could optionally benefit from AutoValue
> See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1914) XML IO should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov updated BEAM-1914:
---
Labels: backward-incompatible starter  (was: )

> XML IO should comply with PTransform style guide
> 
>
> Key: BEAM-1914
> URL: https://issues.apache.org/jira/browse/BEAM-1914
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Davor Bonaci
>  Labels: backward-incompatible, starter
>
> Currently we have XmlSource and XmlSink in the Java SDK. They violate the 
> PTransform style guide in several respects:
> - They should be grouped into an XmlIO class with read() and write() verbs, 
> like all the other similar connectors
> - The source/sink classes should be made private or package-local
> - Should get rid of XmlSink.Bound - XmlSink itself should inherit from 
> FileBasedSink
> - Could optionally benefit from AutoValue
> See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1414) CountingInput should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov updated BEAM-1414:
---
Labels: backward-incompatible starter  (was: backward-incompatible)

> CountingInput should comply with PTransform style guide
> ---
>
> Key: BEAM-1414
> URL: https://issues.apache.org/jira/browse/BEAM-1414
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Davor Bonaci
>  Labels: backward-incompatible, starter
> Fix For: First stable release
>
>
> Suggested changes:
> - Rename the whole class and its inner transforms to sound more verb-like, 
> e.g.: GenerateRange.Bounded/Unbounded (as opposed to current 
> CountingInput.BoundedCountingInput)
> - Provide a more unified API between bounded and unbounded cases: 
> GenerateRange.from(100) should return a GenerateRange.Unbounded; 
> GenerateRange.from(100).to(200) should return a GenerateRange.Bounded. They 
> both should accept a timestampFn. The unbounded one _should not_ have a 
> withMaxNumRecords builder - that's redundant with specifying the range.
> - (optional) Use AutoValue
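
To make the proposal concrete, usage might look like this (entirely hypothetical - GenerateRange and withTimestampFn do not exist yet; p is a Pipeline):

{code:java}
// Hypothetical usage of the proposed API; all names are suggestions.
PCollection<Long> bounded = p.apply(GenerateRange.from(100).to(200));

PCollection<Long> unbounded = p.apply(
    GenerateRange.from(100)
        // both variants would accept a timestamp function (name assumed)
        .withTimestampFn(n -> new Instant(n * 1000)));
{code}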



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1414) CountingInput should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1414:
--

Assignee: (was: Davor Bonaci)

> CountingInput should comply with PTransform style guide
> ---
>
> Key: BEAM-1414
> URL: https://issues.apache.org/jira/browse/BEAM-1414
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>  Labels: backward-incompatible, starter
> Fix For: First stable release
>
>
> Suggested changes:
> - Rename the whole class and its inner transforms to sound more verb-like, 
> e.g.: GenerateRange.Bounded/Unbounded (as opposed to current 
> CountingInput.BoundedCountingInput)
> - Provide a more unified API between bounded and unbounded cases: 
> GenerateRange.from(100) should return a GenerateRange.Unbounded; 
> GenerateRange.from(100).to(200) should return a GenerateRange.Bounded. They 
> both should accept a timestampFn. The unbounded one _should not_ have a 
> withMaxNumRecords builder - that's redundant with specifying the range.
> - (optional) Use AutoValue



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (BEAM-447) Stop referring to types with Bound/Unbound

2017-04-07 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947688#comment-15947688
 ] 

Eugene Kirpichov edited comment on BEAM-447 at 4/7/17 9:46 PM:
---

Remaining instances: TextIO, AvroIO (handled by 
https://github.com/apache/beam/pull/1927), XmlSink, TFRecordIO.


was (Author: jkff):
Remaining instances: TextIO, AvroIO (handled by 
https://github.com/apache/beam/pull/1927), Window (handled by a 
soon-to-be-sent-for-review PR of mine), XmlSink, TFRecordIO.

> Stop referring to types with Bound/Unbound
> --
>
> Key: BEAM-447
> URL: https://issues.apache.org/jira/browse/BEAM-447
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Thomas Groh
>Assignee: Davor Bonaci
>  Labels: backward-incompatible
> Fix For: First stable release
>
>
> Bounded and Unbounded are also used to refer to PCollections, and the overlap 
> is confusing. These classes should be renamed to be more specific (e.g. 
> ParDo.LackingDoFnSingleOutput, ParDo.SingleOutput, Window.AssignWindows), 
> which removes the overlap.
> examples:
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L658
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L868



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1914) XML IO should comply with PTransform style guide

2017-04-07 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1914:
--

 Summary: XML IO should comply with PTransform style guide
 Key: BEAM-1914
 URL: https://issues.apache.org/jira/browse/BEAM-1914
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Davor Bonaci


Currently we have XmlSource and XmlSink in the Java SDK. They violate the 
PTransform style guide in several respects:

- They should be grouped into an XmlIO class with read() and write() verbs, 
like all the other similar connectors
- The source/sink classes should be made private or package-local
- Should get rid of XmlSink.Bound - XmlSink itself should inherit from 
FileBasedSink
- Could optionally benefit from AutoValue

See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-447) Stop referring to types with Bound/Unbound

2017-04-07 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-447.
-
Resolution: Duplicate

I've filed separate JIRAs for the remaining sub-items.

> Stop referring to types with Bound/Unbound
> --
>
> Key: BEAM-447
> URL: https://issues.apache.org/jira/browse/BEAM-447
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Thomas Groh
>Assignee: Davor Bonaci
>  Labels: backward-incompatible
> Fix For: First stable release
>
>
> Bounded and Unbounded are also used to refer to PCollections, and the overlap 
> is confusing. These classes should be renamed to be more specific (e.g. 
> ParDo.LackingDoFnSingleOutput, ParDo.SingleOutput, Window.AssignWindows), 
> which removes the overlap.
> examples:
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L658
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L868



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1979) Outputting to an undeclared side output TupleTag should throw

2017-04-14 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-1979:
--

 Summary: Outputting to an undeclared side output TupleTag should 
throw
 Key: BEAM-1979
 URL: https://issues.apache.org/jira/browse/BEAM-1979
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


Currently some runners (perhaps all of them?), when you output to a side 
output tag that wasn't passed to ParDo.withOutputTags, simply drop the data on 
the floor (or increment some aggregators).

Silent data dropping is never okay, and it's never okay to output to an 
undeclared tag. So instead this should throw.
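
For reference, the declared-tags pattern looks roughly like this (a minimal sketch; input is assumed to be a PCollection<String>):

{code:java}
// Tags must be declared via withOutputTags; outputting to any other tag
// should throw rather than silently drop data.
final TupleTag<String> mainTag = new TupleTag<String>() {};
final TupleTag<Integer> lengthTag = new TupleTag<Integer>() {};

PCollectionTuple results = input.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) {
        c.output(c.element());                     // main output
        c.output(lengthTag, c.element().length()); // declared side output: OK
        // c.output(new TupleTag<Integer>() {}, 0); // undeclared: should throw
      }
    }).withOutputTags(mainTag, TupleTagList.of(lengthTag)));
{code}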



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1913) TFRecordIO should comply with PTransform style guide

2017-04-17 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1913:
--

Assignee: Eugene Kirpichov

> TFRecordIO should comply with PTransform style guide
> 
>
> Key: BEAM-1913
> URL: https://issues.apache.org/jira/browse/BEAM-1913
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible, starter
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
>  violates a few guidelines from 
> https://beam.apache.org/contribute/ptransform-style-guide/ :
> - Use of Bound and Unbound types: the Read and Write classes themselves 
> should be PTransforms
> - Should have one static entry point per verb - read() and write()
> - Both classes could benefit from AutoValue
> Basically, perform surgery similar to 
> https://github.com/apache/beam/pull/2149, but on a smaller scale, since this 
> is a much smaller connector.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1914) XML IO should comply with PTransform style guide

2017-04-17 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-1914:
--

Assignee: Eugene Kirpichov

> XML IO should comply with PTransform style guide
> 
>
> Key: BEAM-1914
> URL: https://issues.apache.org/jira/browse/BEAM-1914
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible, starter
>
> Currently we have XmlSource and XmlSink in the Java SDK. They violate the 
> PTransform style guide in several respects:
> - They should be grouped into an XmlIO class with read() and write() verbs, 
> like all the other similar connectors
> - The source/sink classes should be made private or package-local
> - Should get rid of XmlSink.Bound - XmlSink itself should inherit from 
> FileBasedSink
> - Could optionally benefit from AutoValue
> See e.g. the PR with BigQuery fixes https://github.com/apache/beam/pull/2149



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1913) TFRecordIO should comply with PTransform style guide

2017-04-18 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1913.
--
   Resolution: Fixed
Fix Version/s: First stable release

> TFRecordIO should comply with PTransform style guide
> 
>
> Key: BEAM-1913
> URL: https://issues.apache.org/jira/browse/BEAM-1913
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>  Labels: backward-incompatible, starter
> Fix For: First stable release
>
>
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
>  violates a few guidelines from 
> https://beam.apache.org/contribute/ptransform-style-guide/ :
> - Use of Bound and Unbound types: the Read and Write classes themselves 
> should be PTransforms
> - Should have one static entry point per verb - read() and write()
> - Both classes could benefit from AutoValue
> Basically, perform surgery similar to 
> https://github.com/apache/beam/pull/2149, but on a smaller scale, since this 
> is a much smaller connector.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1355) HDFS IO should comply with PTransform style guide

2017-04-18 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973593#comment-15973593
 ] 

Eugene Kirpichov commented on BEAM-1355:


Stephen - do you think then, perhaps, that we should delete HDFSFileSource/Sink 
for the first stable release, as soon as BEAM-2005 is done?

> HDFS IO should comply with PTransform style guide
> -
>
> Key: BEAM-1355
> URL: https://issues.apache.org/jira/browse/BEAM-1355
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Eugene Kirpichov
>Assignee: Stephen Sisk
>  Labels: backward-incompatible
>
> https://github.com/apache/beam/tree/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs
>  does not comply with 
> https://beam.apache.org/contribute/ptransform-style-guide/ in a number of 
> ways:
> - It is not packaged as a PTransform (should be: HDFSIO.Read,Write or 
> something like that)
> - Should probably use AutoValue for specifying parameters
> Stephen knows about the current state of HDFS IO.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1910) test_using_slow_impl very flaky locally

2017-04-18 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973669#comment-15973669
 ] 

Eugene Kirpichov commented on BEAM-1910:


Vikas, any updates? This is heavily interfering with the ability to run mvn 
verify locally.

> test_using_slow_impl very flaky locally
> ---
>
> Key: BEAM-1910
> URL: https://issues.apache.org/jira/browse/BEAM-1910
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Eugene Kirpichov
>Assignee: Ahmet Altay
>
> Most times this test fails on my machine when running:
> mvn verify -am -T 1C
> test_using_slow_impl (apache_beam.coders.slow_coders_test.SlowCoders) ... FAIL
> ...
> ___ summary 
> 
> ERROR:   docs: commands failed
>   lint: commands succeeded
> ERROR:   py27: commands failed
>   py27cython: commands succeeded
>   py27gcp: commands succeeded
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
>   at 
> org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
>   at 
> org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
>   at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
>   at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
>   at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>   at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:185)
>   at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:181)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Unfortunately the test doesn't print anything to maven output, so I don't 
> know what went wrong. I also don't know how to rerun the individual test 
> myself.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner

2017-07-29 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106185#comment-16106185
 ] 

Eugene Kirpichov commented on BEAM-2671:


It's overwhelmingly likely the last one, as I mentioned above.




> CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark 
> runner
> 
>
> Key: BEAM-2671
> URL: https://issues.apache.org/jira/browse/BEAM-2671
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.1.0, 2.2.0
>
>
> Error message:
> Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: 
> Expected: iterable over [] in any order
>  but: Not matched: "late"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner

2017-07-28 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105124#comment-16105124
 ] 

Eugene Kirpichov commented on BEAM-2671:


I'm not sure about bumping it to 2.2.0. It looks quite dangerous, and I'd be 
hesitant to ignore it without checking how serious it actually is. Do we have 
other tests that verify that the Spark runner does not incorrectly drop data in 
streaming pipelines?

> CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark 
> runner
> 
>
> Key: BEAM-2671
> URL: https://issues.apache.org/jira/browse/BEAM-2671
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.1.0, 2.2.0
>
>
> Error message:
> Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: 
> Expected: iterable over [] in any order
>  but: Not matched: "late"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2017-08-02 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111508#comment-16111508
 ] 

Eugene Kirpichov commented on BEAM-1190:


GCS object listing is strongly consistent now. However, the issue still applies 
to other filesystems like S3.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2716) AvroReader should refuse dynamic splits while in the last block

2017-08-02 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2716:
--

 Summary: AvroReader should refuse dynamic splits while in the last 
block
 Key: BEAM-2716
 URL: https://issues.apache.org/jira/browse/BEAM-2716
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov
Priority: Minor


AvroReader is able to detect when it's in the last block:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L728

It could also use this information to avoid wastefully producing dynamic splits 
starting in the range of the current block.

One way to do this would be to have OffsetRangeTracker have a "claim range" 
operation: claim range of [a, b) is, in terms of correctness, equivalent to 
claiming "a" (it checks whether "a" is within the range), but sets the last 
claimed position to "b" rather than "a", thus protecting more positions from 
being split away.
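
A minimal sketch of what the hypothetical operation could look like (field names are simplified stand-ins for OffsetRangeTracker's actual bookkeeping):

{code:java}
// Hedged sketch of the proposed "claim range" operation.
class SimplifiedOffsetRangeTracker {
  private long stopPosition;        // exclusive end of the tracked range
  private long lastClaimedPosition;

  synchronized boolean tryClaimRange(long from, long to) {
    // Correctness-wise equivalent to tryClaim(from): check that `from` is in range...
    if (from >= stopPosition) {
      return false;
    }
    // ...but record `to` as the last claimed position, so [from, to) cannot
    // be split away by a concurrent splitAtFraction() call.
    lastClaimedPosition = to;
    return true;
  }
}
{code}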



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2716) AvroReader should refuse dynamic splits while in the last block

2017-08-02 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111671#comment-16111671
 ] 

Eugene Kirpichov commented on BEAM-2716:


I think this affects only Avro; other file-based formats either don't support 
dynamic splits or don't know block lengths upfront.

> AvroReader should refuse dynamic splits while in the last block
> ---
>
> Key: BEAM-2716
> URL: https://issues.apache.org/jira/browse/BEAM-2716
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>Priority: Minor
>
> AvroReader is able to detect when it's in the last block:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L728
> It could also use this information to avoid wastefully producing dynamic 
> splits starting in the range of the current block.
> One way to do this would be to have OffsetRangeTracker have a "claim range" 
> operation: claim range of [a, b) is, in terms of correctness, equivalent to 
> claiming "a" (it checks whether "a" is within the range), but sets the last 
> claimed position to "b" rather than "a", thus protecting more positions from 
> being split away.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2670) ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner

2017-08-02 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2670.
--
Resolution: Fixed

> ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark 
> runner
> --
>
> Key: BEAM-2670
> URL: https://issues.apache.org/jira/browse/BEAM-2670
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Eugene Kirpichov
>Priority: Minor
> Fix For: 2.1.0, 2.2.0
>
>
> Tests succeed when run alone but fail when the whole ParDoTest is run. This 
> may be related to PipelineOptions being reused / not cleaned up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2712) SerializablePipelineOptions should not call FileSystems.setDefaultPipelineOptions.

2017-08-01 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2712:
--

 Summary: SerializablePipelineOptions should not call 
FileSystems.setDefaultPipelineOptions.
 Key: BEAM-2712
 URL: https://issues.apache.org/jira/browse/BEAM-2712
 Project: Beam
  Issue Type: Bug
  Components: runner-apex, runner-core, runner-flink, runner-spark
Reporter: Eugene Kirpichov
Assignee: Kenneth Knowles


https://github.com/apache/beam/pull/3654 introduces 
SerializablePipelineOptions, which on deserialization calls 
FileSystems.setDefaultPipelineOptions.

This is obviously problematic and racy in case the same process uses 
SerializablePipelineOptions with different filesystem-related options in them.

The reason the PR does this is that the Flink and Apex runners were already 
doing it in their respective SerializablePipelineOptions-like classes (which the 
PR removes); Spark wasn't, but probably should have been.

I believe this is done for the sake of having the proper filesystem options 
automatically available on workers in all places where any kind of 
PipelineOptions are used. Instead, all 3 runners should pick a better place to 
initialize their workers, and explicitly call 
FileSystems.setDefaultPipelineOptions there.

It would be even better if FileSystems.setDefaultPipelineOptions didn't exist 
at all, but that's a topic for a separate JIRA.

CC'ing runner contributors [~aljoscha] [~aviemzur] [~thw]
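
A minimal sketch of the alternative (the setup hook name is an assumption; the explicit call is the point):

{code:java}
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.options.PipelineOptions;

// Hedged sketch: each runner initializes its workers in one explicit,
// well-defined place instead of as a deserialization side effect.
// setupWorker() is a placeholder for whatever per-worker hook a runner has.
class WorkerInit {
  static void setupWorker(PipelineOptions options) {
    FileSystems.setDefaultPipelineOptions(options);
  }
}
{code}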



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2712) SerializablePipelineOptions should not call FileSystems.setDefaultPipelineOptions.

2017-08-01 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-2712:
--

Assignee: (was: Kenneth Knowles)

> SerializablePipelineOptions should not call 
> FileSystems.setDefaultPipelineOptions.
> --
>
> Key: BEAM-2712
> URL: https://issues.apache.org/jira/browse/BEAM-2712
> Project: Beam
>  Issue Type: Bug
>  Components: runner-apex, runner-core, runner-flink, runner-spark
>Reporter: Eugene Kirpichov
>
> https://github.com/apache/beam/pull/3654 introduces 
> SerializablePipelineOptions, which on deserialization calls 
> FileSystems.setDefaultPipelineOptions.
> This is obviously problematic and racy in case the same process uses 
> SerializablePipelineOptions with different filesystem-related options in them.
> The reason the PR does this is that the Flink and Apex runners were already 
> doing it in their respective SerializablePipelineOptions-like classes (which 
> the PR removes); Spark wasn't, but probably should have been.
> I believe this is done for the sake of having the proper filesystem options 
> automatically available on workers in all places where any kind of 
> PipelineOptions are used. Instead, all 3 runners should pick a better place 
> to initialize their workers, and explicitly call 
> FileSystems.setDefaultPipelineOptions there.
> It would be even better if FileSystems.setDefaultPipelineOptions didn't exist 
> at all, but that's a topic for a separate JIRA.
> CC'ing runner contributors [~aljoscha] [~aviemzur] [~thw]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2706) Create JdbcIO.readAll()

2017-08-01 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2706:
--

 Summary: Create JdbcIO.readAll()
 Key: BEAM-2706
 URL: https://issues.apache.org/jira/browse/BEAM-2706
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Reporter: Eugene Kirpichov
Assignee: Jean-Baptiste Onofré


Similarly to TextIO.readAll(), AvroIO.readAll() and SpannerIO.readAll(), and in 
the general spirit of making connectors more dynamic, it would be nice to have a 
JdbcIO.readAll() that reads a PCollection of queries - or, perhaps better, runs 
a parameterized query with a PCollection of parameter values and a user callback 
that sets the parameters on the query based on each PCollection element.

JB, as the author of JdbcIO, would you be interested in implementing this at 
some point?
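
A sketch of what the parameterized-query flavor might look like (all method names below are suggestions, not an existing surface; Order is a placeholder user type):

{code:java}
// Hedged sketch of the proposed JdbcIO.readAll() API.
static PCollection<Order> ordersFor(
    PCollection<String> userIds, JdbcIO.DataSourceConfiguration config) {
  return userIds.apply(
      JdbcIO.<String, Order>readAll()
          .withDataSourceConfiguration(config)
          .withQuery("SELECT * FROM orders WHERE user_id = ?")
          // the user callback sets query parameters from each input element
          .withParameterSetter((userId, statement) -> statement.setString(1, userId))
          // map each result row to an output value
          .withRowMapper(rs -> new Order(rs.getString("id"))));
}
{code}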



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-1323) Add parallelism/splitting in JdbcIO

2017-08-01 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109644#comment-16109644
 ] 

Eugene Kirpichov commented on BEAM-1323:


Filed https://issues.apache.org/jira/browse/BEAM-2706 for the latter, and I 
suggest closing the current JIRA unless we have a compelling use case for which 
generic built-in splitting support in JdbcIO.read() is both practical and easier 
to use.

> Add parallelism/splitting in JdbcIO
> ---
>
> Key: BEAM-1323
> URL: https://issues.apache.org/jira/browse/BEAM-1323
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-extensions
>Reporter: Jean-Baptiste Onofré
>Assignee: Jean-Baptiste Onofré
>
> Now, the JDBC IO is basically a {{DoFn}} executed with a {{ParDo}}, which 
> means that parallelism is "limited" and reads execute on a single executor.
> We could create several JDBC {{BoundedSource}}s splitting the SQL query into 
> subsets (for instance using row id paging or any "splitting/limit" scheme we 
> can derive from the original SQL query), similar to what Sqoop does.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2684) AmqpIOTest is flaky

2017-08-04 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114002#comment-16114002
 ] 

Eugene Kirpichov commented on BEAM-2684:


As an intermediate step, perhaps, add a bunch of logging to the test, so that 
when someone else hits it on their PR, you have more information to work with?

> AmqpIOTest is flaky
> ---
>
> Key: BEAM-2684
> URL: https://issues.apache.org/jira/browse/BEAM-2684
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-extensions
>Reporter: Eugene Kirpichov
>Assignee: Jean-Baptiste Onofré
>
> This test is often timing out, and has been doing that for a while, causing 
> unrelated PRs to fail randomly. I've gotten into the habit of excluding 
> sdks/java/io/amqp when running "mvn verify" and I suppose it's not a good 
> habit :) 
> Example failure: 
> https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/13424/console



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2768) Fix bigquery.WriteTables generating non-unique job identifiers

2017-08-15 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-2768:
--

Assignee: Reuven Lax  (was: Kenneth Knowles)

> Fix bigquery.WriteTables generating non-unique job identifiers
> --
>
> Key: BEAM-2768
> URL: https://issues.apache.org/jira/browse/BEAM-2768
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Affects Versions: 2.0.0
>Reporter: Matti Remes
>Assignee: Reuven Lax
>
> This is a result of BigQueryIO not creating unique job ids for batch inserts, 
> causing the BigQuery API to respond with a 409 conflict error:
> {code:java}
> Request failed with code 409, will NOT retry: 
> https://www.googleapis.com/bigquery/v2/projects//jobs
> {code}
> The jobs are initiated in the step BatchLoads/SinglePartitionWriteTables, 
> called by the step's WriteTables ParDo:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L511-L521
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L148
> It would probably be a good idea to append a UUID as part of the job id.
> Edit: This is a major bug blocking using BigQuery as a sink for bounded input.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2768) Fix bigquery.WriteTables generating non-unique job identifiers

2017-08-15 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127687#comment-16127687
 ] 

Eugene Kirpichov commented on BEAM-2768:


Could you tell us more about how you're using BigQueryIO.Write (it has many 
modes - it would be best if you could show a code snippet of where you apply 
BigQueryIO.write() in your pipeline, removing all personal data but showing 
exactly which BigQueryIO API methods you use), and what exact version of the 
Beam SDK you're using? Your links point to the master branch, but the bug 
description says 2.0.0 - these versions have very different implementations of 
BigQueryIO.Write.

Looking at the current code, the job id *does* contain a random UUID that comes 
from 
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L348.

> Fix bigquery.WriteTables generating non-unique job identifiers
> --
>
> Key: BEAM-2768
> URL: https://issues.apache.org/jira/browse/BEAM-2768
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Affects Versions: 2.0.0
>Reporter: Matti Remes
>Assignee: Reuven Lax
>
> This is a result of BigQueryIO not creating unique job ids for batch inserts, 
> causing the BigQuery API to respond with a 409 conflict error:
> {code:java}
> Request failed with code 409, will NOT retry: 
> https://www.googleapis.com/bigquery/v2/projects//jobs
> {code}
> The jobs are initiated in the step BatchLoads/SinglePartitionWriteTables, 
> called by the step's WriteTables ParDo:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L511-L521
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L148
> It would probably be a good idea to append a UUID as part of the job id.
> Edit: This is a major bug blocking using BigQuery as a sink for bounded input.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-08-14 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126416#comment-16126416
 ] 

Eugene Kirpichov commented on BEAM-2140:


Sorry for the delayed response. The output watermark should be held by the 
watermark hold set by the SDF: 
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java#L378
 - does your implementation support output watermark holds?

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2700) BigQueryIO should support using file load jobs when using unbounded collections

2017-08-14 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2700.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> BigQueryIO should support using file load jobs when using unbounded 
> collections
> ---
>
> Key: BEAM-2700
> URL: https://issues.apache.org/jira/browse/BEAM-2700
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.2.0
>Reporter: Reuven Lax
>Assignee: Reuven Lax
> Fix For: 2.2.0
>
>
> Currently the method used for inserting into BigQuery is based on the input 
> PCollection: bounded input uses file load jobs, unbounded input uses 
> streaming inserts. However, while streaming inserts have far lower latency, 
> they cost quite a bit more and provide weaker consistency guarantees. Users 
> should be able to choose which method to use, irrespective of the input 
> PCollection.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2447) Reintroduce DoFn.ProcessContinuation

2017-07-12 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2447.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Reintroduce DoFn.ProcessContinuation
> 
>
> Key: BEAM-2447
> URL: https://issues.apache.org/jira/browse/BEAM-2447
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
> Fix For: 2.2.0
>
>
> ProcessContinuation.resume() is useful for tailing files - when we reach 
> current EOF, we want to voluntarily suspend the process() call rather than 
> wait for runner to checkpoint us.
> In BEAM-1903, DoFn.ProcessContinuation was removed because there was 
> ambiguity about the semantics of resume() especially w.r.t. the following 
> situation described in 
> https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit
>  : the runner has taken a checkpoint on the tracker, and then the 
> ProcessElement call returns resume() signaling that the work is still not 
> done - then there's 2 checkpoints to deal with.
> Instead, the proper way to refine this semantics is:
> - After checkpoint() on a RestrictionTracker, the tracker MUST fail all 
> subsequent tryClaim() calls, and MUST succeed in checkDone().
> - After a failed tryClaim() call, the ProcessElement method MUST return stop()
> - So ProcessElement can return resume() only *instead* of doing tryClaim()
> - Then, if the runner has already taken a checkpoint but the tracker has 
> returned resume(), we do not need to take a new checkpoint - the one already 
> taken accurately describes the remainder of the work.
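
An illustrative sketch of this contract for a file-tailing SplittableDoFn (reachedEof() and readRecordAt() are placeholders):

{code:java}
// stop() and resume() are DoFn.ProcessContinuation factory methods.
@ProcessElement
public ProcessContinuation process(ProcessContext c, OffsetRangeTracker tracker) {
  for (long offset = tracker.currentRestriction().getFrom(); ; ++offset) {
    if (reachedEof(offset)) {
      // Return resume() *instead of* calling tryClaim(): any checkpoint the
      // runner already took still accurately describes the remaining work.
      return resume();
    }
    if (!tracker.tryClaim(offset)) {
      // After a failed tryClaim() (e.g. following a checkpoint), the
      // ProcessElement method MUST return stop().
      return stop();
    }
    c.output(readRecordAt(offset));
  }
}
{code}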



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2623) Create Watch transform for emitting new elements in a family of growing sets

2017-07-14 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2623:
--

 Summary: Create Watch transform for emitting new elements in a 
family of growing sets
 Key: BEAM-2623
 URL: https://issues.apache.org/jira/browse/BEAM-2623
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


See http://s.apache.org/beam-watch-transform



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2406) NullPointerException when writing an empty table to BigQuery

2017-07-15 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2406.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> NullPointerException when writing an empty table to BigQuery
> 
>
> Key: BEAM-2406
> URL: https://issues.apache.org/jira/browse/BEAM-2406
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-gcp
>Affects Versions: 2.0.0
>Reporter: Ben Chambers
>Assignee: Reuven Lax
>Priority: Minor
> Fix For: 2.1.0
>
>
> Originally reported on Stackoverflow:
> https://stackoverflow.com/questions/44314030/handling-empty-pcollections-with-bigquery-in-apache-beam
> It looks like if there is no data to write, then WritePartitions will return 
> a null destination, as explicitly stated in the comments:
> https://github.com/apache/beam/blob/v2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WritePartition.java#L126
> But the ConstantTableDestination doesn't turn that into the constant 
> destination as the comment promises; instead it returns that `null` 
> destination:
> https://github.com/apache/beam/blob/53c9bf4cd325035fabde192c63652ef6d591b93c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/DynamicDestinationsHelpers.java#L74
> This leads to a null pointer error here since the `tableDestination` is that 
> null result from calling `getTable`:
> https://github.com/apache/beam/blob/v2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L97



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2628) AvroSource.split() sequentially opens every matched file

2017-07-17 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2628:
--

 Summary: AvroSource.split() sequentially opens every matched file
 Key: BEAM-2628
 URL: https://issues.apache.org/jira/browse/BEAM-2628
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


When you do AvroIO.read().from(filepattern), during splitting of AvroSource the 
filepattern gets expanded into N files, and then for each of the N files we do 
this: 
https://github.com/apache/beam/blob/v2.0.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L259

This is very slow. E.g. one job was reading 15,000 files, and it took almost 2 
hours to split the source because opening each file and reading its schema took 
about 0.5s.

I'm not quite sure why we need the file metadata while splitting...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2640) Introduce Create.ofProvider(ValueProvider)

2017-07-19 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov updated BEAM-2640:
---
Issue Type: New Feature  (was: Bug)

> Introduce Create.ofProvider(ValueProvider)
> --
>
> Key: BEAM-2640
> URL: https://issues.apache.org/jira/browse/BEAM-2640
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> When you have a ValueProvider that may or may not be accessible at 
> construction time, a common task is to wrap it into a single-element 
> PCollection. This is especially common when migrating an IO connector that 
> used something like Create.of(query) followed by a ParDo, to having query be 
> a ValueProvider.
> Currently this is done in an icky way (e.g. 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L615)
> We should have a convenience helper for that.
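
With the helper, the pattern becomes (a short sketch; getQuery() is an assumed option on the user's PipelineOptions subclass):

{code:java}
// Wrap a ValueProvider into a single-element PCollection without reading
// its value at construction time.
ValueProvider<String> query = options.getQuery();
PCollection<String> queries =
    pipeline.apply(Create.ofProvider(query, StringUtf8Coder.of()));
{code}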



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2640) Introduce Create.ofProvider(ValueProvider)

2017-07-19 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2640:
--

 Summary: Introduce Create.ofProvider(ValueProvider)
 Key: BEAM-2640
 URL: https://issues.apache.org/jira/browse/BEAM-2640
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


When you have a ValueProvider that may or may not be accessible at 
construction time, a common task is to wrap it into a single-element 
PCollection. This is especially common when migrating an IO connector that 
used something like Create.of(query) followed by a ParDo, to having query be a 
ValueProvider.

Currently this is done in an icky way (e.g. 
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L615)

We should have a convenience helper for that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2532) BigQueryIO source should avoid expensive JSON schema parsing for every record

2017-07-18 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2532.
--
   Resolution: Fixed
 Assignee: Neville Li  (was: Chamikara Jayalath)
Fix Version/s: 2.2.0

> BigQueryIO source should avoid expensive JSON schema parsing for every record
> -
>
> Key: BEAM-2532
> URL: https://issues.apache.org/jira/browse/BEAM-2532
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-gcp
>Affects Versions: 2.0.0
>Reporter: Marian Dvorsky
>Assignee: Neville Li
>Priority: Minor
> Fix For: 2.2.0
>
>
> BigQueryIO source converts the schema from JSON for every input row, here:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java#L159
> This is the performance bottleneck in a simple pipeline with BigQueryIO 
> source.
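
A hedged sketch of one way to fix it - parse once per reader instance and memoize (parseSchema() stands in for the conversion the source already performs):

{code:java}
import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;

// Memoize the parsed schema so the JSON conversion runs once per reader
// instance rather than once per record.
class MemoizedSchema {
  private final Supplier<Object> schema; // Object stands in for TableSchema

  MemoizedSchema(final String jsonSchema) {
    this.schema = Suppliers.memoize(() -> parseSchema(jsonSchema));
  }

  Object schemaForNextRecord() {
    return schema.get(); // parses at most once, then returns the cached value
  }

  private static Object parseSchema(String json) {
    return json; // placeholder for the real JSON-to-schema parsing
  }
}
{code}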



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2544) AvroIOTest is flaky

2017-07-19 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2544.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> AvroIOTest is flaky
> ---
>
> Key: BEAM-2544
> URL: https://issues.apache.org/jira/browse/BEAM-2544
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Alex Filatov
>Assignee: Eugene Kirpichov
>Priority: Minor
> Fix For: 2.2.0
>
>
> "Write then read" tests randomly fail.
> Steps to reproduce:
> cd /runners/direct-java
> mvn clean compile
> mvn surefire:test@validates-runner-tests -Dtest=AvroIOTest
> Repeat last step until a failure (on my machine failure rate is approx 1/3).
> Example:
> [ERROR] 
> testAvroIOWriteAndReadSchemaUpgrade(org.apache.beam.sdk.io.AvroIOTest)  Time 
> elapsed: 0.198 s  <<< ERROR!
> java.lang.RuntimeException: java.io.FileNotFoundException: 
> /var/folders/1c/sl733g5s1g7_4mq61_qmbjx4gn/T/junit3332447750239941326/output.avro
>  (No such file or directory)
>   at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:340)
>   at 
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302)
>   at 
> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:201)
>   at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:283)
>   at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:340)
>   at 
> org.apache.beam.sdk.io.AvroIOTest.testAvroIOWriteAndReadSchemaUpgrade(AvroIOTest.java:275)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
>   at 
> org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:321)
>   at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>   at 
> org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:321)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55)
>   at 
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137)
>   at 
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107)
>   at 
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83)
>   at 
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75)
>   at 
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:157)
>   at 
> 

[jira] [Closed] (BEAM-2628) AvroSource.split() sequentially opens every matched file

2017-07-20 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2628.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> AvroSource.split() sequentially opens every matched file
> 
>
> Key: BEAM-2628
> URL: https://issues.apache.org/jira/browse/BEAM-2628
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
> Fix For: 2.2.0
>
>
> When you do AvroIO.read().from(filepattern), during splitting of AvroSource 
> the filepattern gets expanded into N files, and then for each of the N files 
> we do this: 
> https://github.com/apache/beam/blob/v2.0.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java#L259
> This is very slow. E.g. one job was reading 15,000 files, and it took almost 
> 2 hours to split the source because opening each file and reading its schema 
> took about 0.5s.
> I'm not quite sure why we need the file metadata while splitting...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2608) Unclosed BoundedReader in TextIO#ReadTextFn#process()

2017-07-15 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2608.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Unclosed BoundedReader in TextIO#ReadTextFn#process()
> -
>
> Key: BEAM-2608
> URL: https://issues.apache.org/jira/browse/BEAM-2608
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Ted Yu
>Assignee: Eugene Kirpichov
>Priority: Minor
> Fix For: 2.2.0
>
>
> {code}
> BoundedSource.BoundedReader reader =
> source
> .createForSubrangeOfFile(metadata, range.getFrom(), 
> range.getTo())
> .createReader(c.getPipelineOptions());
> {code}
> The reader should be closed upon return.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2656) Introduce AvroIO.readAll()

2017-07-21 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2656:
--

 Summary: Introduce AvroIO.readAll()
 Key: BEAM-2656
 URL: https://issues.apache.org/jira/browse/BEAM-2656
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


TextIO.readAll() is nifty and performant when reading a large number of files.

We should similarly have AvroIO.readAll(). Maybe other connectors too in the 
future.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2680) Improve scalability of the Watch transform

2017-07-25 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2680:
--

 Summary: Improve scalability of the Watch transform
 Key: BEAM-2680
 URL: https://issues.apache.org/jira/browse/BEAM-2680
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


https://github.com/apache/beam/pull/3565 introduces the Watch transform 
http://s.apache.org/beam-watch-transform.

The implementation leaves several scalability-related TODOs:
1) The state stores hashes and timestamps of outputs that have already been 
output and should be omitted from future polls. We could garbage-collect this 
state, e.g. dropping elements from "completed" and from addNewAsPending() if 
their timestamp is more than X behind the watermark.
2) When a poll returns a huge number of elements, we don't necessarily have to 
add all of them into state.pending - instead we could add only N elements and 
ignore others, relying on future poll rounds to provide them, in order to avoid 
blowing up the state. Combined with garbage collection of 
GrowthState.completed, this would make the transform scalable to very large 
poll results.
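
A hedged sketch of TODO 1 (the Map stands in for GrowthState's hash-to-timestamp bookkeeping, and the horizon is a tunable assumption):

{code:java}
import java.util.Map;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Drop "completed" entries whose timestamps are more than `horizon` behind
// the watermark; such elements can no longer be re-polled as new.
class CompletedSetGc {
  static void collect(Map<Integer, Instant> completed, Instant watermark, Duration horizon) {
    Instant cutoff = watermark.minus(horizon);
    completed.values().removeIf(ts -> ts.isBefore(cutoff));
  }
}
{code}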



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2682) Merge AvroIOTest and AvroIOTransformTest

2017-07-25 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2682:
--

 Summary: Merge AvroIOTest and AvroIOTransformTest
 Key: BEAM-2682
 URL: https://issues.apache.org/jira/browse/BEAM-2682
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


These two tests seem to have exactly the same purpose. They should be merged 
into AvroIOTest.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2677) AvroIO.read without specifying a schema

2017-07-24 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2677:
--

 Summary: AvroIO.read without specifying a schema
 Key: BEAM-2677
 URL: https://issues.apache.org/jira/browse/BEAM-2677
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


Sometimes it is inconvenient to require the user of AvroIO.read/readAll to 
specify a Schema for the Avro files they are reading, especially if different 
files may have different schemas.

It is possible to read GenericRecord objects from an Avro file, however it is 
not possible to provide a Coder for GenericRecord without knowing the schema: a 
GenericRecord knows its schema so we can encode it into a byte array, but we 
can not decode it from a byte array without knowing the schema (and encoding 
the full schema together with every record would be impractical).

Instead, a reasonable approach is to treat schemaless GenericRecord as 
unencodable and use the same approach as JdbcIO - a user-specified parse 
callback.

> Suggested API: AvroIO.parseGenericRecords(SerializableFunction<GenericRecord, T> parseFn).from(filepattern).
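
In use it would look roughly like this (a sketch of the suggested API; Foo is a placeholder user type):

{code:java}
// Each GenericRecord carries its own schema, so files may differ in schema.
PCollection<Foo> records = pipeline.apply(
    AvroIO.parseGenericRecords(
            new SerializableFunction<GenericRecord, Foo>() {
              @Override
              public Foo apply(GenericRecord record) {
                return new Foo(record.get("name").toString());
              }
            })
        .from("gs://bucket/path/*.avro"));
{code}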

CC: [~mkhadikov] [~reuvenlax]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner

2017-07-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103911#comment-16103911
 ] 

Eugene Kirpichov commented on BEAM-2671:


Actually that test is https://issues.apache.org/jira/browse/BEAM-1868 - does 
that mean it also should be fixed for 2.1.0 RC3?

> CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark 
> runner
> 
>
> Key: BEAM-2671
> URL: https://issues.apache.org/jira/browse/BEAM-2671
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.1.0, 2.2.0
>
>
> Error message:
> Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: 
> Expected: iterable over [] in any order
>  but: Not matched: "late"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2670) ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner

2017-07-27 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov reassigned BEAM-2670:
--

Assignee: Eugene Kirpichov  (was: Stas Levin)

> ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark 
> runner
> --
>
> Key: BEAM-2670
> URL: https://issues.apache.org/jira/browse/BEAM-2670
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Eugene Kirpichov
>Priority: Minor
> Fix For: 2.1.0, 2.2.0
>
>
> Tests succeed when run alone but fail when the whole ParDoTest is run. This 
> may be related to PipelineOptions being reused / not cleaned up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2670) ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner

2017-07-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103815#comment-16103815
 ] 

Eugene Kirpichov commented on BEAM-2670:


I found the issue and have a fix.

> ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark 
> runner
> --
>
> Key: BEAM-2670
> URL: https://issues.apache.org/jira/browse/BEAM-2670
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Stas Levin
>Priority: Minor
> Fix For: 2.1.0, 2.2.0
>
>
> Tests succeed when run alone but fail when the whole ParDoTest is run. This 
> may be related to PipelineOptions being reused / not cleaned up.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner

2017-07-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16103868#comment-16103868
 ] 

Eugene Kirpichov commented on BEAM-2671:


CreateStreamTest.testMultiOutputParDo() is failing as well.

> CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark 
> runner
> 
>
> Key: BEAM-2671
> URL: https://issues.apache.org/jira/browse/BEAM-2671
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.1.0, 2.2.0
>
>
> Error message:
> Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: 
> Expected: iterable over [] in any order
>  but: Not matched: "late"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2671) CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark runner

2017-07-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104459#comment-16104459
 ] 

Eugene Kirpichov commented on BEAM-2671:


I wonder if these tests have something to do with 
https://github.com/apache/beam/commit/20820fa5477ffcdd4a9ef2e9340353ed3c5691a9 
(part of PR https://github.com/apache/beam/pull/3343 ) - it seems the last 
stable build of Spark ValidatesRunner was right before this commit: 
https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastStableBuild/

[~aviemzur] , could you take a look perhaps?

> CreateStreamTest.testFirstElementLate validatesRunner test fails on Spark 
> runner
> 
>
> Key: BEAM-2671
> URL: https://issues.apache.org/jira/browse/BEAM-2671
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Etienne Chauchot
>Assignee: Jean-Baptiste Onofré
> Fix For: 2.1.0, 2.2.0
>
>
> Error message:
> Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).out0: 
> Expected: iterable over [] in any order
>  but: Not matched: "late"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2684) AmqpIOTest is flaky

2017-07-26 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2684:
--

 Summary: AmqpIOTest is flaky
 Key: BEAM-2684
 URL: https://issues.apache.org/jira/browse/BEAM-2684
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Reporter: Eugene Kirpichov
Assignee: Jean-Baptiste Onofré


This test often times out, and has been doing so for a while, causing 
unrelated PRs to fail randomly. I've gotten into the habit of excluding 
sdks/java/io/amqp when running "mvn verify", and I suppose that's not a good 
habit :) 

Example failure: 
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/13424/console



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-1820) Source.getDefaultOutputCoder() should be @Nullable

2017-07-26 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-1820.
--
   Resolution: Won't Fix
Fix Version/s: Not applicable

The consensus on dev@ is that, instead, PTransforms should always provide a 
Coder for their output; the natural generalization is that Sources should 
likewise always provide a Coder for their output, letting the user configure 
the Source with an explicit Coder if necessary.
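
A hedged sketch of what that looks like from the user's side (MyIO, Foo, and 
FooCoder are hypothetical names, not actual Beam API):

  // The read transform always sets a coder on its output PCollection, so
  // callers never need output.setCoder(...); if the default is wrong, the
  // user overrides it through the transform's configuration instead:
  PCollection<Foo> out = p.apply(MyIO.read().withCoder(FooCoder.of()));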

> Source.getDefaultOutputCoder() should be @Nullable
> --
>
> Key: BEAM-1820
> URL: https://issues.apache.org/jira/browse/BEAM-1820
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Łukasz Gajowy
>  Labels: easyfix, starter
> Fix For: Not applicable
>
>
> Source.getDefaultOutputCoder() returns a coder for elements produced by the 
> source.
> However, the Source objects are nearly always hidden from the user and 
> instead encapsulated in a transform. Often, an enclosing transform has a 
> better idea of what coder should be used to encode these elements (e.g. a 
> user supplied a Coder to that transform's configuration). In that case, it'd 
> be good if Source.getDefaultOutputCoder() could just return null, and coder 
> would have to be handled by the enclosing transform or perhaps specified on 
> the output of that transform explicitly.
> Right now there's a bunch of code in the SDK and runners that assumes 
> Source.getDefaultOutputCoder() returns non-null. That code would need to be 
> fixed to instead use the coder set on the collection produced by 
> Read.from(source).
> It all appears pretty easy to fix, so this is a good starter item.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2686) PTransforms should always set a Coder on their output PCollections

2017-07-26 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2686:
--

 Summary: PTransforms should always set a Coder on their output 
PCollections
 Key: BEAM-2686
 URL: https://issues.apache.org/jira/browse/BEAM-2686
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


See discussion on 
https://lists.apache.org/thread.html/1dde0b5a93c2983cbab5f68ce7c74580102f5bb2baaa816585d7eabb@%3Cdev.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2640) Introduce Create.ofProvider(ValueProvider)

2017-07-19 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093588#comment-16093588
 ] 

Eugene Kirpichov commented on BEAM-2640:


Coder inference for this is complicated by the fact that ValueProvider does not 
expose a TypeDescriptor for the value being provided, even though it could. I'm 
going to start by requiring an explicitly provided coder.
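
To make that concrete, a sketch of the intended usage with an explicitly 
provided coder (assuming the API lands as Create.ofProvider(provider, coder); 
options.getQuery() stands for any ValueProvider-typed pipeline option):

  ValueProvider<String> query = options.getQuery();  // value may be inaccessible at construction time
  PCollection<String> queries =
      p.apply(Create.ofProvider(query, StringUtf8Coder.of()));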

> Introduce Create.ofProvider(ValueProvider)
> --
>
> Key: BEAM-2640
> URL: https://issues.apache.org/jira/browse/BEAM-2640
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> When you have a ValueProvider that may or may not be accessible at 
> construction time, a common task is to wrap it into a single-element 
> PCollection. This is especially common when migrating an IO connector that 
> used something like Create.of(query) followed by a ParDo, to having query be 
> a ValueProvider.
> Currently this is done in an icky way (e.g. 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L615)
> We should have a convenience helper for that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2643) Add TextIO.read_all() to Python SDK

2017-07-19 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093633#comment-16093633
 ] 

Eugene Kirpichov commented on BEAM-2643:


See also https://github.com/apache/beam/pull/3598

> Add TextIO.read_all() to Python SDK
> ---
>
> Key: BEAM-2643
> URL: https://issues.apache.org/jira/browse/BEAM-2643
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>
> The Java SDK now has a TextIO.readAll() API that allows reading a massive 
> number of files by moving from the BoundedSource API (which may perform 
> expensive source operations on the control plane) to ParDo operations.
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L170
> This API should be added to the Python SDK as well.
> This form of reading files does not support dynamic work rebalancing for now. 
> But this should not matter much when reading a massive number of relatively 
> small files. In the future this API can support dynamic work rebalancing 
> through Splittable DoFn.
> cc: [~jkff]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2641) Improve discoverability of TextIO.readAll() as a replacement of TextIO.read() for large globs

2017-07-19 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2641:
--

 Summary: Improve discoverability of TextIO.readAll() as a 
replacement of TextIO.read() for large globs
 Key: BEAM-2641
 URL: https://issues.apache.org/jira/browse/BEAM-2641
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


TextIO.readAll() dramatically outperforms TextIO.read() when reading very large 
numbers of files (hundreds of thousands, millions, or more).

However, it is not obvious that this is what you should use if you have such a 
filepattern in TextIO.read().

We should take a variety of measures to make it more discoverable, e.g.:

* Add a parameter to TextIO.read(), like "withHintManyFiles()"
* Log something suggesting the use of that hint when splitting TextIO if the 
filepattern is very large
* Improve documentation
* Post something on StackOverflow about this
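
For context, the difference in usage is small; a hedged sketch (TextIO.readAll() 
consumes a PCollection of filepatterns rather than a single filepattern):

  // May be slow to split when the glob expands to millions of files:
  PCollection<String> lines = p.apply(TextIO.read().from("gs://bucket/logs/**"));

  // Expands the glob and reads the files inside ParDos instead:
  PCollection<String> lines2 = p
      .apply(Create.of("gs://bucket/logs/**"))
      .apply(TextIO.readAll());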



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2607) Enforce that SDF must return stop() after a failed tryClaim() call

2017-07-12 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2607:
--

 Summary: Enforce that SDF must return stop() after a failed 
tryClaim() call
 Key: BEAM-2607
 URL: https://issues.apache.org/jira/browse/BEAM-2607
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


https://github.com/apache/beam/pull/3360 reintroduces DoFn.ProcessContinuation 
with some refinements to its semantics - see 
https://issues.apache.org/jira/browse/BEAM-2447.

One of the refinements is that, if a tryClaim() call on the RestrictionTracker 
fails during a ProcessElement call, the call MUST return stop().

This JIRA is about enforcing that automatically. Right now it is not possible 
because tryClaim() is not formally a method on RestrictionTracker (only 
concrete classes provide it, not the base class), so runners cannot hook into 
it.
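
For reference, a hedged sketch of the contract for a typical offset-range SDF 
(readBlockAt is hypothetical; stop() is DoFn.ProcessContinuation.stop()):

  @ProcessElement
  public ProcessContinuation processElement(ProcessContext c, OffsetRangeTracker tracker) {
    for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
      c.output(readBlockAt(i));
    }
    // tryClaim() returned false: the rest of the restriction is no longer
    // ours to process, so the call MUST return stop(), never resume().
    return stop();
  }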



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-2511) TextIO should support reading a PCollection of filenames

2017-07-11 Thread Eugene Kirpichov (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov closed BEAM-2511.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> TextIO should support reading a PCollection of filenames
> 
>
> Key: BEAM-2511
> URL: https://issues.apache.org/jira/browse/BEAM-2511
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
> Fix For: 2.2.0
>
>
> Motivation and proposed implementation in https://s.apache.org/textio-sdf



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-29 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068782#comment-16068782
 ] 

Eugene Kirpichov commented on BEAM-2140:


Conceptually, watermarks are for PCollections - lower bound on timestamps of 
new elements that may get added to the collection.
However, at the implementation level, watermarks are assigned to transforms: 
they have an "input watermark" and "output watermark" (I suppose, per input and 
per output).
The difference between the output watermark of a transform producing a 
PCollection and the input watermark of a transform consuming it is as follows: 
the input watermark is held by "pending elements" that we know need to be 
processed but haven't been yet.
The input watermark is also held by the event-time of pending timers set by the 
transform. In other words, logically the transform's input is (output of the 
producer of the input) + (timers set by the transform itself), and the input 
watermark is held by both of these.

Currently the input watermark of a transform is held only by _event-time_ 
timers; however, it makes sense to hold it also by _processing-time_ timers. 
For that we need to assign them an event-time timestamp. Currently this isn't 
happening at all (except assigning an "effective timestamp" to output from the 
timer firing, when it fires - it is assigned from the current input watermark). 
The suggestion in case of SDF is to use the ProcessContinuation's output 
watermark as the event-time for the residual timer.

We also discussed handling of processing-time timers in batch, coming from the 
point of view that things should work exactly the same way in batch: setting a 
processing-time timer in batch to fire in 5 minutes should actually fire it 
after 5 minutes, possibly delaying the bundle until processing-time timers 
quiesce. The motivating use case is, say, an SDF-based continuously polling 
glob expander in a batch pipeline - it should process the same set of files it 
would in a streaming pipeline.

A few questions I still do not understand:
- Where exactly do processing-time timers get dropped, and on what condition? 
Kenn says that event-time timers don't get dropped: we just forbid setting them 
if they would already be "late". 
- When can an input to the SDF, or a timer set by the SDF be late at all; and 
should the SDF drop them? Technically a runner is free to drop late data at any 
point in the pipeline, but in practice it happens after GBKs; and semantically 
an SDF need not involve a GBK, so it should be allowed to just not drop 
anything late, no? - like a regular DoFn would (as long as it doesn't leak 
state)

Seems like we also should file JIRAs for the following:
- state leakage
- handling processing-time timers in batch properly
- holding watermark by processing-time timers
- allowing the timer API (internals or the user-facing one) to specify the 
event time of processing-time timers
- more?

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-29 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068740#comment-16068740
 ] 

Eugene Kirpichov commented on BEAM-2140:


So, to elaborate on what Kenn said. We dug a bit deeper into this yesterday and 
came up with the following conclusions.

1) The reason that this stuff works in Dataflow and Direct runner is that, for 
running SDF, they use a code path that simply _does not drop late data/timers 
or GC state_. These happen in LateDataDroppingRunner and ReduceFnRunner and 
StatefulDoFnRunner - and the path for running ProcessFn does not involve any of 
these. Aljoscha, maybe you can see why your current codepaths for running 
ProcessFn in Flink involve dropping of late data / late timers, and make them 
not involve it? :) (I'm not sure where this dropping happens in Flink)
2) As a consequence, however, state doesn't get GC'd. In practice this means 
that, if you apply an SDF to input that is in many windows (e.g. to input 
windowed by fixed or sliding windows), it will slowly leak state. However, in 
practice this is likely not a huge concern because SDFs are expected to mostly 
be used when the amount of input is not super large (at least compared to 
output), and it is usually globally windowed. Especially in streaming use 
cases. I.e. it can be treated as a "Known issue" rather than "SDF does not work 
at all". *I would recommend proceeding to implement it in Flink runner with 
this same known issue*, and then solving the issue uniformly across all runners.

Posting this comment for now and writing another on how to do it without state 
leakage.

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2536) Simplify specifying coders on PCollectionTuple

2017-06-28 Thread Eugene Kirpichov (JIRA)
Eugene Kirpichov created BEAM-2536:
--

 Summary: Simplify specifying coders on PCollectionTuple
 Key: BEAM-2536
 URL: https://issues.apache.org/jira/browse/BEAM-2536
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov


Currently when using a multi-output ParDo, the user usually has to do one of 
the following:

1) Use an anonymous class: new TupleTag<Foo>() {} - in order to reify the Foo type 
and make coder inference work. In this case, a frequent problem is that the 
anonymous class captures a large enclosing class, and either doesn't serialize 
at all, or at least serializes to something bulky.
2) Explicitly do tuple.get(myTag).setCoder(...)

Both of these are suboptimal.

Could we have e.g. a constructor for TupleTag that explicitly takes a 
TypeDescriptor? Or even a Coder? Or a family of factory methods for 
TupleTagList that take these? E.g.:
in.apply(ParDo.of(...).withOutputTags(mainTag, TupleTagList.of(side1, 
FooCoder.of()).and(side2, BarCoder.of())));

I would suggest both: TupleTag constructor should optionally take a 
TypeDescriptor; and TupleTagList.of() and .and() should optionally take a Coder.
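
For contrast, a hedged sketch of today's options versus the proposal (Foo, 
FooCoder, results, and the new constructor are illustrative; the 
TypeDescriptor-taking constructor is proposed, not existing API):

  // Today, option 1: the anonymous subclass reifies Foo but may capture a
  // large enclosing class:
  final TupleTag<Foo> side1 = new TupleTag<Foo>() {};

  // Today, option 2: set the coder explicitly on the output:
  results.get(side1).setCoder(FooCoder.of());

  // Proposed: pass the type (or a coder) directly, with no subclassing:
  final TupleTag<Foo> side1b = new TupleTag<>(TypeDescriptor.of(Foo.class));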



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2532) BigQueryIO source should avoid expensive JSON schema parsing for every record

2017-06-30 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070428#comment-16070428
 ] 

Eugene Kirpichov commented on BEAM-2532:


[~reuvenlax] I think you had some thoughts about how to make this whole thing 
much faster? Or was it for Write?

> BigQueryIO source should avoid expensive JSON schema parsing for every record
> -
>
> Key: BEAM-2532
> URL: https://issues.apache.org/jira/browse/BEAM-2532
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-gcp
>Affects Versions: 2.0.0
>Reporter: Marian Dvorsky
>Assignee: Stephen Sisk
>Priority: Minor
>
> BigQueryIO source converts the schema from JSON for every input row, here:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java#L159
> This is the performance bottleneck in a simple pipeline with BigQueryIO 
> source.
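
For illustration, a hedged sketch of the obvious remedy (not the actual Beam 
code): parse the JSON schema lazily, once per reader instance, and reuse it 
for every record:

  import com.google.api.client.json.jackson2.JacksonFactory;
  import com.google.api.services.bigquery.model.TableSchema;
  import java.io.IOException;
  import java.io.Serializable;

  class CachedTableSchema implements Serializable {
    private final String jsonSchema;
    private transient TableSchema schema;  // parsed at most once per instance

    CachedTableSchema(String jsonSchema) { this.jsonSchema = jsonSchema; }

    TableSchema get() throws IOException {
      if (schema == null) {
        schema = JacksonFactory.getDefaultInstance()
            .fromString(jsonSchema, TableSchema.class);
      }
      return schema;
    }
  }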



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065429#comment-16065429
 ] 

Eugene Kirpichov edited comment on BEAM-2140 at 6/27/17 8:38 PM:
-

Okay, I see that I misunderstood what the watermark hold does, and now I'm not 
sure how anything works at all (i.e. why timers set by SDF are not constantly 
dropped) - in direct and dataflow runner :-|

For SDF specifically, I think it would make sense to *advance the watermark of 
the input same as if the DoFn was not splittable* - i.e. consider the input 
element "consumed" only when the ProcessElement call terminates with no 
residual restriction. In other words, I guess, set an "input watermark hold" 
(in addition to output watermark hold)? Is such a thing possible? Does it make 
equal sense for non-splittable DoFn's that use timers?


was (Author: jkff):
Okay, I see that I misunderstood what the watermark hold does, and now I'm not 
sure how anything works at all (i.e. why timers set by SDF are not constantly 
dropped) - in direct and dataflow runner :-|

For SDF specifically, I think it would make sense to **advance the watermark of 
the input same as if the DoFn was not splittable** - i.e. consider the input 
element "consumed" only when the ProcessElement call terminates with no 
residual restriction. In other words, I guess, set an "input watermark hold" 
(in addition to output watermark hold)? Is such a thing possible? Does it make 
equal sense for non-splittable DoFn's that use timers?

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065429#comment-16065429
 ] 

Eugene Kirpichov commented on BEAM-2140:


Okay, I see that I misunderstood what the watermark hold does, and now I'm not 
sure how anything works at all (i.e. why timers set by SDF are not constantly 
dropped) - in direct and dataflow runner :-|

For SDF specifically, I think it would make sense to **advance the watermark of 
the input same as if the DoFn was not splittable** - i.e. consider the input 
element "consumed" only when the ProcessElement call terminates with no 
residual restriction. In other words, I guess, set an "input watermark hold" 
(in addition to output watermark hold)? Is such a thing possible? Does it make 
equal sense for non-splittable DoFn's that use timers?

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065626#comment-16065626
 ] 

Eugene Kirpichov commented on BEAM-2140:


_we can't advance the watermark as though it was non-splittable in the 
unbounded case_ - why is that / why is it a bad thing that the watermark of the 
PCollection being fed into the SDF would not advance? E.g. imagine it's a 
Create.of(pubsub topic name) + ParDo(read pubsub forever) - is it important to 
advance the watermark of the Create.of()?

Alternatively, imagine it's: read filepatterns from pubsub + 
TextIO.readAll().watchForNewFiles().watchFilesForNewEntries(), which has 
several SDFs in it. Would there be a problem with advancing the watermark of 
the PCollection of filepatterns only after the watch termination conditions of 
TextIO.readAll() are hit and this filepattern is no longer watched?

Alternatively - worst case I guess: read Pubsub topic names from Kafka, and 
read each topic forever. I'd assume that the user would be interested in 
advancement of the watermark of the PCollection of pubsub records rather than 
the PCollection of Pubsub topic names? I'm not sure the Pubsub topic names in 
Kafka would even need to have meaningful timestamps (rather than infinite past).

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2140) Fix SplittableDoFn ValidatesRunner tests in FlinkRunner

2017-06-27 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065627#comment-16065627
 ] 

Eugene Kirpichov commented on BEAM-2140:


...Or is the problem that the watermark of the output PCollection has to be 
smaller than watermark of the input, so if input doesn't advance then output 
can't advance either?

> Fix SplittableDoFn ValidatesRunner tests in FlinkRunner
> ---
>
> Key: BEAM-2140
> URL: https://issues.apache.org/jira/browse/BEAM-2140
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Aljoscha Krettek
>Assignee: Aljoscha Krettek
>
> As discovered as part of BEAM-1763, there is a failing SDF test. We disabled 
> the tests to unblock the open PR for BEAM-1763.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-673) Data locality for Read.Bounded

2017-04-25 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983950#comment-15983950
 ] 

Eugene Kirpichov commented on BEAM-673:
---

On second thought, this might be related to SDF: processing different 
restrictions of the same element may have different requirements.

Or, more precisely: a design for DoFns giving hints to runners about their 
resource requirements would need to include some data dependence. I don't have 
a good idea of how to express that in a way that is modular and combines well 
with the rest of the Beam model and the various tricks runners are allowed to 
do (such as fusion or materialization).

> Data locality for Read.Bounded
> --
>
> Key: BEAM-673
> URL: https://issues.apache.org/jira/browse/BEAM-673
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Amit Sela
>Assignee: Ismaël Mejía
> Fix For: First stable release
>
>
> In some distributed filesystems, such as HDFS, we should be able to hint to 
> Spark the preferred locations of splits.
> Here is an example of how Spark does that for Hadoop RDDs:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L249



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

