Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Reuven Lax
As a thought experiment: could this be done by expanding the set into a PCollection and running it through a Distinct (in the global window, trigger every element) transform? On Tue, Jul 11, 2017 at 9:48 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > In the current version, the tra

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Eugene Kirpichov
In the current version, the transform is intended to watch a set that is continuously growing; do you mean a GCS bucket that eventually contains more files than can fit in a state tag? I agree that this will eventually become an issue; I can see a couple of solutions: - I suspect many such sets ar

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Reuven Lax
BTW - I am worried about SDF storing everything in a single tag for watch. The problem is that a streaming pipeline can run "forever." So someone watching a GCS bucket "forever" will eventually crash due to the value getting too large. Is there any reasonable way to garbage collect this state? On Tu

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Eugene Kirpichov
First PR has been submitted - enjoy TextIO.readAll() which reads a PCollection of filenames! I've started working on the SDF-based Watch transform http://s.apache.org/beam-watch-transform, and after that will be able to implement the incremental features in TextIO. On Tue, Jun 27, 2017 at 1:55 PM

Re: MergeBot is here!

2017-07-11 Thread Kenneth Knowles
On Tue, Jul 11, 2017 at 4:25 PM, Robert Bradshaw < rober...@google.com.invalid> wrote: > On Tue, Jul 11, 2017 at 8:51 AM, Kenneth Knowles > wrote: > > I like the idea of controlling squashing or not explicitly in the > mergebot > > invocation. I don't think it needs to be made interactive, but ju

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Dmitry Demeshchuk
Yeah, I think the original point about ValueProviders was to raise my awareness of separation between pipeline build time and run time. Indeed, whether we use ValueProviders or not, we would still need to figure out a way to get the actual credentials values into the FileSystem object. This many a

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Sourabh Bajaj
I'm not sure ValueProviders address the issue of getting credentials to underlying libraries or FileSystem though as they are only exposed at the PTransform level. Eg. If I was using Flink on AWS and reading data from GCS we currently don't have a way for TextIO to get credentials it can use to re

Jenkins Upgrade Sunday

2017-07-11 Thread Jason Kuster
Hi all, An FYI that Infra is updating Jenkins over the weekend. One consequence of this is that MavenJob type builds will need to use Java 8 for execution. We've already been using Java 8 by default for executing our builds, so this shouldn't be an issue for us, except potentially in the cross-JDK

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Ahmet Altay
+1 to the above responses for passing options into PTransforms. As Robert mentioned in the JIRA issue, filesystem plug-ins are in a different category. It is reasonable for them to create credentials based on options/environment variables. We could have a protocol for instantiating file system p

Re: MergeBot is here!

2017-07-11 Thread Robert Bradshaw
On Tue, Jul 11, 2017 at 8:51 AM, Kenneth Knowles wrote: > I like the idea of controlling squashing or not explicitly in the mergebot > invocation. I don't think it needs to be made interactive, but just based > on preparing the PR appropriately. > > I propose this for the default `@asfgit merge`:

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Sourabh Bajaj
We do the latter of treating constants as StaticValueProviders in the pipeline right now. On Tue, Jul 11, 2017 at 4:47 PM Dmitry Demeshchuk wrote: > Thanks a lot for the input, folks! > > Also, thanks for telling me about the concept of ValueProvider, Kenneth! > This was a good reminder to mysel

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Dmitry Demeshchuk
Thanks a lot for the input, folks! Also, thanks for telling me about the concept of ValueProvider, Kenneth! This was a good reminder to myself that some stuff that's described in the Dataflow docs (I discovered https://cloud.google.com/dataflow/docs/templates/creating-templates after having read y

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Robert Bradshaw
Templates, including ValueProviders, were recently added to the Python SDK. +1 to pursuing this train of thought (and as I mentioned on the bug, and has been mentioned here, we don't want to add PipelineOptions access to PTransforms/at construction time). On Tue, Jul 11, 2017 at 3:21 PM, Kenneth K

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Thomas Groh
We'd like to avoid giving PTransforms access to the pipeline options during pipeline construction. There are a few compelling reasons for doing so. The biggest one is that the context in which the pipeline is constructed and the context in which it executes may not be the same. As an example, if I

Re: Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Kenneth Knowles
Hi Dmitry, This is a very worthwhile discussion that has recently come up on StackOverflow, here: https://stackoverflow.com/a/45024542/4820657 We actually recently _removed_ the PipelineOptions from Pipeline.apply in Java since they tend to cause transforms to have implicit changes that make them

Passing pipeline options into PTransforms and Filesystems in Python

2017-07-11 Thread Dmitry Demeshchuk
Hi folks, Sometimes, it would be very useful if PTransforms had access to global pipeline options, such as various credentials, settings and so on. Per conversation in https://issues.apache.org/jira/browse/BEAM-2572, I'd like to kick off a discussion about that. This would be beneficial for at l

Re: Mixed-Language Pipelines

2017-07-11 Thread Ahmet Altay
Thank you Thomas. I think this will especially be great for Python SDK, allowing it to tap into many sources that exist in the Java SDK. I added my comments. Ahmet On Mon, Jul 10, 2017 at 9:58 AM, Thomas Groh wrote: > Hey everyone; > > I've been working on a design for implementing multi-langua

Re: [PROPOSAL] Connectors for memcache and Couchbase

2017-07-11 Thread Ismaël Mejía
Hello again, Thanks Lukasz for the details. We will take a look and discuss with the others on how to achieve this. We hadn’t considered the case of a full scan Read (as Eugene mentions) so now your comments about the snapshot make more sense, however I am still wondering if the snapshot is worth

Re: MergeBot is here!

2017-07-11 Thread Kenneth Knowles
I like the idea of controlling squashing or not explicitly in the mergebot invocation. I don't think it needs to be made interactive, but just based on preparing the PR appropriately. I propose this for the default `@asfgit merge`: Don't squash, but reject merges that have commits that are obvious

Re: MergeBot is here!

2017-07-11 Thread Ismaël Mejía
Thanks a lot Jason, Great that Infra solved (2) so fast. About (3), maybe the extra pause/validation is not needed, because the bot will in principle do its work appropriately; maybe what we could just have is a way to see the git branch with the commits that mergebot will do as part of the rev

[VOTE] Release 2.1.0, release candidate #1

2017-07-11 Thread Jean-Baptiste Onofré
Hi everyone, Please review and vote on the release candidate #1 for the version 2.1.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1

Re: [Proposal] Submitting pipelines to Runners in another language

2017-07-11 Thread Aljoscha Krettek
This looks excellent! Please let me know once we get to actually implement this for a specific runner. Flink in my case, of course! :-) > On 8. Jul 2017, at 00:07, Thomas Groh wrote: > > I left a couple of comments. > > I'm looking forwards to this - it's going to be a good step towards being

Re: MergeBot is here!

2017-07-11 Thread Aljoscha Krettek
+1 This is excellent! > On 10. Jul 2017, at 21:42, Jason Kuster > wrote: > > (quick update re #2 above): ~4 minutes after I reopened the ticket, it's > fixed. > https://github.com/apache/infrastructure-puppet/commit/709944291da5e8aea711cb8578f0594deb45e222 > updates the website to the correct

Jenkins build became unstable: beam_Release_NightlySnapshot #474

2017-07-11 Thread Apache Jenkins Server
See

Re: [BEAM-135] Utilities for "batching" elements in a DoFn

2017-07-11 Thread Etienne Chauchot
Yes, there is now a new PTransform called GroupIntoBatches. Best, Etienne On 11/07/2017 at 02:38, Robert Bradshaw wrote: Sorry, just saw https://github.com/apache/beam/pull/2211 On Mon, Jul 10, 2017 at 5:37 PM, Robert Bradshaw wrote: Any progress on this? On Thu, Mar 9, 2017 at 1: