I wanted to use this thread to flag that the change made to the user-facing
API in order to wrap RestrictionTracker broke the Watch transform, which has
been sickbayed for a long time. It would be helpful for experts to weigh in
on https://issues.apache.org/jira/browse/BEAM-6352 about how the
Great, Nicholas!
I've started assembling the formal proposal. Let's get in touch to figure
out how much you want to be involved. We'll certainly get a call for
volunteers together and can use all the help we can get!
Austin
P.S. Are you local to SF? If so, let's meet up at least at the Feb 7
Sorry, I accidentally forgot the doc link. Here it is:
https://docs.google.com/document/d/1OlAJf4T_CTL9WRH8lP8uQOfLjWYfm8IpRXSe38g34k4/edit#
On Mon, Jan 14, 2019 at 5:18 PM Alex Amato wrote:
> Hello,
>
> I am planning to implement the state sampler in Java, by reusing the state
> sampler from
Hello,
I am planning to implement the state sampler in Java, by reusing the state
sampler from the Dataflow Runner Harness. It will require a refactoring to
do this:
1. Remove Dataflow-specific code from the base classes and place it into
   the relevant subclasses only.
2. Move
Thanks Sam for bringing this to the list.
As preparation_ids are not reusable, having preparation_id and job_id be the
same makes sense to me for Flink.
Another option is to have a subscription for all states/messages on the
JobServer.
This will be similar to "docker". As the container id is created
We can enforce this at the dependency level, since it is a compile error. I
think some IDEs and build tools may allow the compile-time classpath to get
polluted by transitive runtime deps, so protecting against bad imports is
also a good idea.
Kenn
On Mon, Jan 14, 2019 at 8:42 AM Ismaël Mejía wrote:
Hi Jonathan,
JdbcIO.write() just invokes this DoFn:
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L765
It establishes a connection in @StartBundle and then in @FinishBundle it
commits a batch and closes the connection. If an
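For illustration, a minimal sketch of that bundle-scoped lifecycle follows. This is not the real JdbcIO.WriteFn; the class name, SQL, and DataSource wiring are placeholders, and the real transform also has batch-size handling not shown here.

```
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch of the pattern described above; not the actual JdbcIO code.
class BatchingWriteFn extends DoFn<String, Void> {
  private final DataSource dataSource; // assumed serializable/provided
  private transient Connection connection;
  private transient PreparedStatement statement;

  BatchingWriteFn(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  @StartBundle
  public void startBundle() throws Exception {
    // One connection per bundle, with autocommit off so writes can be batched.
    connection = dataSource.getConnection();
    connection.setAutoCommit(false);
    statement = connection.prepareStatement("INSERT INTO t (v) VALUES (?)");
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    statement.setString(1, c.element());
    statement.addBatch();
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    // Commit the accumulated batch and release the connection at bundle end.
    statement.executeBatch();
    connection.commit();
    statement.close();
    connection.close();
  }
}
```

The key point is that connection lifetime is tied to bundle boundaries, so connection churn tracks how often the runner starts and finishes bundles.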
Hi Pierre,
Joins using the join library are always per-window, so that is not likely
to work well for you. Actually, side inputs are the right architecture. What
you really want is a map side input (View.asMap()) that does not pull the
whole Map into memory. It is up to the runner to implement
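A minimal sketch of that map side input shape (pipeline and data are made up for illustration):

```
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

Pipeline p = Pipeline.create();

// The lookup data, materialized as a map side input rather than joined.
PCollectionView<Map<String, String>> sideMap =
    p.apply("Lookup", Create.of(KV.of("id1", "Alice"), KV.of("id2", "Bob")))
        .apply(View.asMap());

PCollection<String> joined =
    p.apply("Main", Create.of("id1", "id2"))
        .apply(
            ParDo.of(
                    new DoFn<String, String>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        // How the map view is backed (fully in memory or not)
                        // is the runner's choice, per the point above.
                        String name = c.sideInput(sideMap).get(c.element());
                        c.output(c.element() + " -> " + name);
                      }
                    })
                .withSideInputs(sideMap));
```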
Hi Sam,
Good observation. Looks like we should fix that.
Looking at InMemoryJobService, it appears that the state can only be retrieved
by the client once the job is running and has a job/invocation id associated.
Indeed, any messages sent before that point could be lost.
For Flink the JobId is generated
Hello all,
While going through the codebase I noticed a problem with the Beam
JobService. In particular, the API allows for the possibility of never
seeing some messages or states with Get(State|Message)Stream. This is
because the Get(State|Message)Stream calls need to have the job id which
can
Not yet, we need to add that too. There are still some tasks to be done,
like improving the contribution guide with this info, and documenting how
to generate a source build artifact locally, since I doubt we can publish
that to Apache for copyright reasons.
I will message in the future for awareness for
Hi all,
one problem we need to solve while working with load tests we currently
develop is that we don't really know how many GCP/Jenkins resources we can
occupy. We did some initial testing with
beam_Java_LoadTests_GroupByKey_Dataflow_Small[1] and it seems that for:
- 1 000 000 000 (~ 23 GB)
I'd be interested in helping out with a summit in SF! How can I get
started?
On Sun, Jan 13, 2019, 23:08 Reza Rokni wrote:
> Hi,
>
> So after a few chats, would love to help enable a event in APAC later in
> Q3 / early Q4 timeframe.
>
> Cheers
>
> Reza
>
> On Tue, 8 Jan 2019 at 14:49, Reza Rokni wrote:
Thanks all. We won't run full-size load tests on the Direct runner. I defined a
separate smoke job with "tiny" versions of load tests for each runner
(well, Direct and Dataflow for now, but we can extend this later with other
runners). Feel free to comment: https://github.com/apache/beam/pull/7497
If the set of different schemas is fixed, then in expand() you could create
a new PCollection for every possible schema, with the appropriate AvroCoder
set on each one. You could then use FileIO.matchAll, parse and partition
the records to the correct PCollection.
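A rough sketch of that shape, using a single filepattern with FileIO.match for brevity (the routing predicate, the ParseFn DoFn, and the schema JSON constants are hypothetical; only the Beam and Avro classes are real):

```
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Assume two known schemas, fixed at pipeline construction time.
Schema schemaA = new Schema.Parser().parse(SCHEMA_A_JSON); // hypothetical JSON
Schema schemaB = new Schema.Parser().parse(SCHEMA_B_JSON);

PCollection<FileIO.ReadableFile> files =
    p.apply(FileIO.match().filepattern("gs://bucket/input/*.avro"))
        .apply(FileIO.readMatches());

// Route each matched file to the partition for its schema; the file path is
// used here as a stand-in for real schema detection.
PCollectionList<FileIO.ReadableFile> bySchema =
    files.apply(
        Partition.of(
            2,
            (FileIO.ReadableFile file, int numPartitions) ->
                file.getMetadata().resourceId().toString().contains("typeA")
                    ? 0
                    : 1));

// ParseFn (hypothetical) opens a file and emits its GenericRecords; each
// branch gets the appropriate AvroCoder, as described above.
PCollection<GenericRecord> recordsA =
    bySchema.get(0).apply(ParDo.of(new ParseFn(schemaA))).setCoder(AvroCoder.of(schemaA));
PCollection<GenericRecord> recordsB =
    bySchema.get(1).apply(ParDo.of(new ParseFn(schemaB))).setCoder(AvroCoder.of(schemaB));
```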
However if indeed the set of
Great initiative. I was thinking about making a similar proposal. I tried
using Beam SQL in a project that has Calcite dependency, and it doesn't
work because Calcite makes an internal JDBC connection on the "jdbc:calcite:" URL,
and you can't register two drivers for the same scheme. Not sure how it's
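To illustrate the clash with plain java.sql (the second driver class name is hypothetical; only Calcite's own driver class is real):

```
import java.sql.Connection;
import java.sql.DriverManager;

// DriverManager walks its registered drivers in order and hands the URL to
// the first one that accepts it. Two drivers both claiming "jdbc:calcite:"
// cannot be disambiguated by the URL alone.
Class.forName("org.apache.calcite.jdbc.Driver");            // Calcite's driver
Class.forName("com.example.relocated.calcite.jdbc.Driver"); // hypothetical second copy

// Whichever registered driver matches first serves every "jdbc:calcite:"
// URL, so one of the two copies can never be reached.
Connection conn = DriverManager.getConnection("jdbc:calcite:");
```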
Thanks for the heads up, Ismaël! Great to see the vendored Guava version is used
everywhere now.
Do we already have a Checkstyle rule that prevents people from using the
unvendored Guava? If not, such a rule could be useful because it's almost
inevitable that the unvendored Guava will slip
I don't know if there is an issue with this report, but I have the
impression that once a dependency update is done, it stops checking it, and
if a new version of the library is published it is not reported.
Also, I don't see some library updates. What is the current status? Are
some libs disabled
High Priority Dependency Updates Of Beam Python SDK:

Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
future | 0.16.0 | 0.17.1 | 2016-10-27
Today we merged the PR [1] that changes most of the code to use our new
vendored Guava dependency. In practice this means that most of the
imports of the classes were changed from `com.google.common.` to
`org.apache.beam.vendor.guava.v20_0.com.google.common.`
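For example, with ImmutableList:

```
// Before the PR: direct import of Guava.
import com.google.common.collect.ImmutableList;

// After the PR: the vendored, relocated copy shipped with Beam.
import org.apache.beam.vendor.guava.v20_0.com.google.common.collect.ImmutableList;
```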
This is a great improvement to fix a
One approach could be creating a PTransform with an expand method that wraps
AvroIO and reads the Avro writer schema from one of the files matching the
read pattern. It will work if the set of sources with different schemas is
fixed at pipeline construction time.
```
public abstract class GenericAvroIORead
```
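A fuller sketch of what such a wrapper might look like, assuming the writer schema is read from the first matching file at construction time. Everything beyond AvroIO, FileSystems, and the Avro classes is illustrative, and the class is shown concrete rather than abstract for simplicity:

```
import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// Illustrative wrapper: resolve the writer schema at pipeline-construction
// time from one file matching the pattern, then delegate to AvroIO.
class GenericAvroIORead extends PTransform<PBegin, PCollection<GenericRecord>> {
  private final String filepattern;

  GenericAvroIORead(String filepattern) {
    this.filepattern = filepattern;
  }

  @Override
  public PCollection<GenericRecord> expand(PBegin input) {
    return input.apply(AvroIO.readGenericRecords(readWriterSchema()).from(filepattern));
  }

  // Reads the writer schema from the header of the first matching file.
  private Schema readWriterSchema() {
    try {
      MatchResult match = FileSystems.match(filepattern);
      MatchResult.Metadata first = match.metadata().get(0);
      try (InputStream in =
              Channels.newInputStream(FileSystems.open(first.resourceId()));
          DataFileStream<GenericRecord> stream =
              new DataFileStream<>(in, new GenericDatumReader<>())) {
        return stream.getSchema();
      }
    } catch (Exception e) {
      throw new RuntimeException("Unable to read writer schema from " + filepattern, e);
    }
  }
}
```

Usage would then be something like p.apply(new GenericAvroIORead("gs://bucket/path/*.avro")).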
Hello!
My question is perhaps mainly GCP-oriented, so I apologize if it is not
fully related to the Beam community.
We have a streaming pipeline running on Dataflow which writes data to a
PostgreSQL instance hosted on Cloud SQL. This database is suffering from
connection leak spikes on a