Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Andres Angel
Awesome Pablo thanks so much!!! AU On Thu, Jul 25, 2019 at 7:48 PM Pablo Estrada wrote: > Thanks for those who tuned in : ) - I feel like I might have spent too > long fiddling with Python code, and not long enough doing setup, testing, > etc. I will try to do another one where I just test

Re: Enhancement for Joining Unbounded PCollections of different WindowFns

2019-07-25 Thread rahul patwari
"*In terms of Join schematic, I think it's hard to reason data completeness since one side of the join is changing*" - As it is possible to apply [Global Windows with Non-Default Trigger] to Unbounded Data Source, say, Kafka, to distinguish this Kafka PCollection from "Slowly Changing lookup

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
Thanks for those who tuned in : ) - I feel like I might have spent too long fiddling with Python code, and not long enough doing setup, testing, etc. I will try to do another one where I just test / setup the environment / lint checks etc. Here are links for: Setting up the Python environment:

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread sridhar inuog
Thanks, Pablo for organizing this session. I found it useful. On Thu, Jul 25, 2019 at 4:56 PM Pablo Estrada wrote: > The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo > This is still happening. > > On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack > wrote: > >> Did I miss the link

Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-25 Thread Rui Wang
Tried to verify RC1 by running Nexmark on Dataflow but found it's broken (at least based commands from Running+Nexmark ). Will try to debug it and rerun the process. -Rui On Thu, Jul 25, 2019 at 2:39 PM Anton Kedin wrote: > Hi

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Pablo Estrada
The link is here: https://www.youtube.com/watch?v=xpIpEO4PUDo This is still happening. On Thu, Jul 25, 2019 at 2:55 PM Innocent Djiofack wrote: > Did I miss the link or this was postponed? > > On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett < > whatwouldausti...@gmail.com> wrote: > >> Pablo, >>

Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-25 Thread Innocent Djiofack
Did I miss the link or this was postponed? On Tue, Jul 23, 2019 at 3:05 PM Austin Bennett wrote: > Pablo, > > Assigned https://issues.apache.org/jira/browse/BEAM-7607 to you, to make > even more likely that it is still around on the 25th :-) > > Cheers, > Austin > > On Tue, Jul 23, 2019 at

[VOTE] Release 2.14.0, release candidate #1

2019-07-25 Thread Anton Kedin
Hi everyone, Please review and vote on the release candidate #3 for the version 2.14.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], *

Re: Enhancement for Joining Unbounded PCollections of different WindowFns

2019-07-25 Thread Rui Wang
To be more clear, I think it's useful if we can achieve the following that you wrote PCollection mainStream = ...; PCollection lookupStream = ...; PCollectionTuple tuple = PCollectionTuple.of(new TupleTag("MainTable"), new TupleTag("LookupTable")); tuple.apply(SqlTransform.query("MainTable JOIN

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 6:34 PM rahul patwari wrote: > > So, If an RPC call has to be performed for a batch of Rows(PCollection), > instead of each Row, the recommended way is to batch the Rows in > startBundle() of >

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Eugene Kirpichov
Hi Gleb, Regarding the future of io.Read: ideally things would go as follows - All runners support SDF at feature parity with Read (mostly this is just the Dataflow runner's liquid sharding and size estimation for bounded sources, and backlog for unbounded sources, but I recall that a couple of

Re: Enhancement for Joining Unbounded PCollections of different WindowFns

2019-07-25 Thread Rui Wang
Hi Rahul, thanks for your detailed writeup. It pretty much summarizes the slow changing table join problem. To your question: "Can we implement SideInputJoin for this case", there are two perspectives. In terms of implementing the slowing changing lookup cache pattern

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Claire McGinty
As far as I/O code re-use, the consensus seems to be to make the SMB module as composable as possible using existing Beam components, ideally as-is or with very basic tweaks. To be clear, what I care about is that WriteFiles(X) and > WriteSmbFiles(X) can share the same X, for X in {Avro, Parquet,

Re: [2.14.0] Release Progress Update

2019-07-25 Thread Anton Kedin
Planning to send out the RC1 within the next couple of hours. Regards, Anton On Thu, Jul 25, 2019 at 1:21 PM Pablo Estrada wrote: > Hi Anton, > are there updates on the release? > Thanks! > -P. > > On Fri, Jul 19, 2019 at 12:33 PM Anton Kedin wrote: > >> Verification build succeeds except for

Re: [2.14.0] Release Progress Update

2019-07-25 Thread Pablo Estrada
Hi Anton, are there updates on the release? Thanks! -P. On Fri, Jul 19, 2019 at 12:33 PM Anton Kedin wrote: > Verification build succeeds except for AWS IO (which has tests hanging). I > will continue the release process as normal and will investigate the AWS IO > issue meanwhile. Will either

Re: Enhancement for Joining Unbounded PCollections of different WindowFns

2019-07-25 Thread rahul patwari
Hi Kenn, If we consider the following two *Unbounded* PCollections: - PCollection1 => [*Non-Global* Window with Default Trigger] - PCollection2 => [Global Window with *Non-Default* Trigger] :) coincidentally turned out to be the opposite Joining these two PCollections in BeamSql currently is not

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

2019-07-25 Thread Kenneth Knowles
We hope it does enter the SQL standard. It is one reason for coming together to write this paper. OVER clause is mentioned often. - TUMBLE can actually just be a function so you don't need OVER or any of the fancy stuff we propose; it is just done to make them all look similar - HOP still

Re: How to expose/use the External transform on Java SDK

2019-07-25 Thread Kenneth Knowles
Top posting just to address the vendoring question. We didn't have vendoring of gRPC until very recently. I think all rationale about keeping it off the SDK surface are obsolete now. It will probably unlock a lot of simplification to just go for it and use gRPC in the core SDK. Notably,

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
Yes. But, GroupIntoBatches works on KV. We are working on PCollection throughout our pipeline. We can convert Row to KV. But, we only have a few keys and a Bounded PCollection. As we have Global windows and a few keys, the opportunity for parallelism is limited to [No. of keys] with Stateful ParDo

Re: Choosing a coder for a class that contains a Row?

2019-07-25 Thread Brian Hulette
I know Reuven has put some thought into evolving schemas, but I'm not sure it's documented anywhere as of now. The only documentation I've come across as I bump around the schema code are some comments deep in RowCoder [1]. Essentially the current serialization format for a row includes a row

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Reuven Lax
Have you looked at the GroupIntoBatches transform? On Thu, Jul 25, 2019 at 9:34 AM rahul patwari wrote: > So, If an RPC call has to be performed for a batch of > Rows(PCollection), instead of each Row, the recommended way is to > batch the Rows in startBundle() of DoFn( >

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
So, If an RPC call has to be performed for a batch of Rows(PCollection), instead of each Row, the recommended way is to batch the Rows in startBundle() of DoFn( https://stackoverflow.com/questions/49094781/yield-results-in-finish-bundle-from-a-custom-dofn/49101711#49101711)? I thought Stateful and

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
Though it's not obvious in the name, Stateful ParDos can only be applied to keyed PCollections, similar to GroupByKey. (You could, however, assign every element to the same key and then apply a Stateful DoFn, though in that case all elements would get processed on the same worker.) On Thu, Jul

Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread rahul patwari
Hi, https://beam.apache.org/blog/2017/02/13/stateful-processing.html gives an example of assigning an arbitrary-but-consistent index to each element on a per key-and-window basis. If the Stateful ParDo is applied on a Non-Keyed PCollection, say, PCollection with Fixed Windows, the state is

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Gleb Kanterov
What is the long-term plan for org.apache.beam.sdk.io.Read? Is it going away in favor of SDF, or we are always going to have both? I was looking into AvroIO.read and AvroIO.readAll, both of them use AvroSource. AvroIO.readAll is using SDF, and it's implemented with ReadAllViaFileBasedSource that

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 12:35 AM Kenneth Knowles wrote: > > From the peanut gallery, keeping a separate implementation for SMB seems > fine. Dependencies are serious liabilities for both upstream and downstream. > It seems like the reuse angle is generating extra work, and potentially > making

Re: Write-through-cache in State logic

2019-07-25 Thread Robert Bradshaw
On Wed, Jul 24, 2019 at 6:21 AM Rakesh Kumar wrote: > > Thanks Robert, > > I stumble on the jira that you have created some time ago > https://jira.apache.org/jira/browse/BEAM-5428 > > You also marked code where code changes are required: >

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 5:31 AM Thomas Weise wrote: > > Hi Jincheng, > > It is very exciting to see this follow-up, that you have done your research > on the current state and that there is the intention to join forces on the > portability effort! > > I have added a few pointers inline. > >

Re: How to expose/use the External transform on Java SDK

2019-07-25 Thread Robert Bradshaw
>From the portability perspective, https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto and the associated services for executing pipelines is about as "core" as it gets, and eventually I'd like to see all runners being portable (even if they have an