Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2019-01-14 Thread Kenneth Knowles
I wanted to use this thread to ping that the change to the user-facing API in order to wrap RestrictionTracker broke the Watch transform, which has been sickbayed for a long time. It would be helpful for experts to weigh in on https://issues.apache.org/jira/browse/BEAM-6352 about how the

Re: Beam Summits!

2019-01-14 Thread Austin Bennett
Great, Nicholas! I've started assembling the formal proposal. Let's get in touch to figure out how much you want to be involved. We'll certainly get a call for volunteers together and can use all the help we can get! Austin P.s. are you local to SF? If yes, let's Meetup at least at the Feb 7

Re: Refactoring Java State Sampler Plan for Ptransform Execution Time metrics (process, finish, start)

2019-01-14 Thread Alex Amato
Sorry, forgot the doc link accidentally. Here it is: https://docs.google.com/document/d/1OlAJf4T_CTL9WRH8lP8uQOfLjWYfm8IpRXSe38g34k4/edit# On Mon, Jan 14, 2019 at 5:18 PM Alex Amato wrote: > Hello, > > I am planning to implement the state sampler in Java, by reusing the state > sampler from

Refactoring Java State Sampler Plan for Ptransform Execution Time metrics (process, finish, start)

2019-01-14 Thread Alex Amato
Hello, I am planning to implement the state sampler in Java, by reusing the state sampler from the Dataflow Runner Harness. It will require a refactoring to do this: 1. Remove Dataflow specific code from the base classes and place it into the relevant subclasses only. 2. Move

Re: Beam JobService Problem

2019-01-14 Thread Ankur Goenka
Thanks Sam for bringing this to the list. As preparation_ids are not reusable, having preparation_id and job_id same makes sense to me for Flink. Another option is to have a subscription for all states/messages on the JobServer. This will be similar to "docker". As the container id is created

Re: Merge of vendored Guava (Some PRs need a rebase)

2019-01-14 Thread Kenneth Knowles
We can enforce at the dependency level, since it is a compile error. I think some IDEs and build tools may allow the compile-time classpath to get polluted by transitive runtime deps, so protecting against bad imports is also a good idea. Kenn On Mon, Jan 14, 2019 at 8:42 AM Ismaël Mejía wrote:

Re: Connection leaks with PostgreSQL instance

2019-01-14 Thread Kenneth Knowles
Hi Jonathan, JdbcIO.write() just invokes this DoFn: https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L765 It establishes a connection in @StartBundle and then in @FinishBundle it commits a batch and closes the connection. If an

Re: Joining an Unbounded Source with a Bounded Source

2019-01-14 Thread Kenneth Knowles
Hi Pierre, Joins using the join library are always per-window, so that is not likely to work well for you. Actually side inputs are the right architecture. What you really want is a map side input (View.asMap()) that does not pull the whole Map into memory. It is up to the runner to implement

Re: Beam JobService Problem

2019-01-14 Thread Maximilian Michels
Hi Sam, Good observation. Looks like we should fix that. Looking at InMemoryJobService, it appears that the state can only be retrieved by the client once the job is running with a job/invocation id associated. Indeed, any messages until that could be lost. For Flink the JobId is generated

Beam JobService Problem

2019-01-14 Thread Sam Rohde
Hello all, While going through the codebase I noticed a problem with the Beam JobService. In particular, the API allows for the possibility of never seeing some messages or states with Get(State|Message)Stream. This is because the Get(State|Message)Stream calls need to have the job id which can

Re: Merge of vendored Guava (Some PRs need a rebase)

2019-01-14 Thread Ismaël Mejía
Not yet, we need to add that too, there are still some tasks to be done like improve the contribution guide with this info, and document how to generate a src build artifact locally since I doubt we can publish that into Apache for copyright reasons. I will message in the future for awareness for

Dealing with expensive jenkins + Dataflow jobs

2019-01-14 Thread Łukasz Gajowy
Hi all, one problem we need to solve while working with load tests we currently develop is that we don't really know how much GCP/Jenkins resources can we occupy. We did some initial testing with beam_Java_LoadTests_GroupByKey_Dataflow_Small[1] and it seems that for: - 1 000 000 000 (~ 23 GB)

Re: Beam Summits!

2019-01-14 Thread Nicholas Audo
I'd be interesting in helping out with a summit in sf! How can I get started? On Sun, Jan 13, 2019, 23:08 Reza Rokni Hi, > > So after a few chats, would love to help enable a event in APAC later in > Q3 / early Q4 timeframe. > > Cheers > > Reza > > On Tue, 8 Jan 2019 at 14:49, Reza Rokni wrote:

Re: Load testing on DirectRunner

2019-01-14 Thread Łukasz Gajowy
Thanks all. We won't run full-size load tests on Direct runner. I defined a separate smoke job with "tiny" versions of load tests for each runner (well, Direct and Dataflow for now, but we can extend this later with other runners). Feel free to comment: https://github.com/apache/beam/pull/7497

Re: AvroIO read from unknown schema to generic record.

2019-01-14 Thread Reuven Lax
If the set of different schemas is fixed, then in expand() you could create a new PCollection for every possible schema, with the appropriate AvroCoder set on each one. You could then use FileIO.matchAll, parse and partition the records to the correct PCollection. However if indeed the set of

Re: Vendoring Calcite

2019-01-14 Thread Gleb Kanterov
Great initiative. I was thinking about making a similar proposal. I tried using Beam SQL in a project that has Calcite dependency, and it doesn't work because Calcite does internal JDBC connection on "jdbc:calcite:" URL, and you can't register two drivers for the same scheme. Not sure how it's

Re: Merge of vendored Guava (Some PRs need a rebase)

2019-01-14 Thread Maximilian Michels
Thanks for the heads up, Ismaël! Great to see the vendored Guava version is used everywhere now. Do we already have a Checkstyle rule that prevents people from using the unvendored Guava? If not, such a rule could be useful because it's almost inevitable that the unvedored Guava will slip

Re: Beam Dependency Check Report (2019-01-14)

2019-01-14 Thread Ismaël Mejía
I don't know if there is an issue with this report but I have the impression that once a dependency update is done, it stops checking it and if a new version of the library is published it is not reported. Also I don't see some libraries updates, What is the current status, are some libs disabled

Beam Dependency Check Report (2019-01-14)

2019-01-14 Thread Apache Jenkins Server
High Priority Dependency Updates Of Beam Python SDK: Dependency Name Current Version Latest Version Release Date Of the Current Used Version Release Date Of The Latest Release JIRA Issue future 0.16.0 0.17.1 2016-10-27

Merge of vendored Guava (Some PRs need a rebase)

2019-01-14 Thread Ismaël Mejía
We merged today the PR [1] that changes most of the code to use our new guava vendored dependency. In practice it means that most of the imports of the classes were changed from `com.google.common.` to `org.apache.beam.vendor.guava.v20_0.com.google.common.` This is a great improvement to fix a

Re: AvroIO read from unknown schema to generic record.

2019-01-14 Thread Gleb Kanterov
One approach could be creating PTransform with expand method that wraps AvroIO and reads AVRO writer schema from one of files matching read pattern. It will work if the set of sources with different schemas is fixed at pipeline construction step. ``` public abstract class GenericAvroIORead

Connection leaks with PostgreSQL instance

2019-01-14 Thread Jonathan Perron
Hello ! My question is maybe mainly GCP-oriented, so I apologize if it is not fully related to the Beam community. We have a streaming pipeline running on Dataflow which writes data to a PostgreSQL instance hosted on Cloud SQL. This database is suffering from connection leak spikes on a