Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Robert Bradshaw
You can literally return a Python tuple of outputs from a composite transform as well. (Dicts with PCollections as values are also supported, if you want things to be named rather than referenced by index.) On Fri, Oct 25, 2019 at 4:06 PM Ahmet Altay wrote: > > Is DoOutputsTuple what you are

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
oking at the PR, here > is the verbiage I added about urgency: > > P0/Blocker: "A P0 issue is more urgent than simply blocking the next release" > P1/Critical: "Most critical bugs should block release" > P2/Major: "No special urgency is associated" > ...

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Robert Bradshaw
I think we'll still need approach (2) for when the pipeline finishes and a runner is tearing down workers. On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels wrote: > > Hi Jincheng, > > Thanks for bringing this up and capturing the ideas in the doc. > > Intuitively, I would have also considered

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Robert Bradshaw
It looks like fn_api_runner_test.py is quite expensive, taking 10-15+ minutes on each version of Python. This test consists of a base class that is basically a validates runner suite, and is then run in several configurations, many more of which (including some expensive ones) have been added

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
We cut a release every 6 weeks, according to schedule, making it easy to plan for, and the release manager typically sends out a warning email to remind everyone. I don't think it makes sense to do that for every ticket. Blockers should be reserved for things we really shouldn't release without.

Re: Interactive Beam Example Failing [BEAM-8451]

2019-10-21 Thread Robert Bradshaw
Thanks for trying this out. Yes, this is definitely something that should be supported (and tested). On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic wrote: > > Hi everyone, > > The interactive beam example using the DirectRunner fails after execution of > the last cell. The recursion limit is

Re: Test failures in python precommit: ZipFileArtifactServiceTest

2019-10-21 Thread Robert Bradshaw
I just merged https://github.com/apache/beam/pull/9845 which should resolve the issue. On Mon, Oct 21, 2019 at 12:58 PM Chad Dombrova wrote: > > thanks! > > On Mon, Oct 21, 2019 at 12:47 PM Kyle Weaver wrote: >> >> This issue is being tracked at >>

Re: Are empty bundles allowed by model?

2019-10-21 Thread Robert Bradshaw
iding the creation of these bundles, but maybe the test > should be modified so that it adheres to the model [1]. > > Jan > > [1] https://github.com/apache/beam/pull/9846 > > On 10/21/19 6:00 PM, Robert Bradshaw wrote: > > Yes, the model allows them. > > > > It also t

Re: Are empty bundles allowed by model?

2019-10-21 Thread Robert Bradshaw
Yes, the model allows them. It also takes less work to avoid them in general (e.g. imagine one reshuffles N elements to M > N workers. A priori, one would "start" a bundle and then try to read all data destined for that worker--postponing this until one knows that the set of data for this worker
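The reshuffle example above can be made concrete with a small sketch in plain Python. This is not Beam code: the function name and the hash-modulo assignment are illustrative assumptions. With N elements spread over M > N workers, at least M - N workers receive nothing, so a runner that starts a bundle per worker before knowing what data arrives would end up with empty bundles.

```python
def assign_to_workers(elements, num_workers):
    """Distribute elements over workers by hash (a hypothetical shuffle policy)."""
    buckets = [[] for _ in range(num_workers)]
    for element in elements:
        buckets[hash(element) % num_workers].append(element)
    return buckets


# Three elements, five workers: some workers necessarily get no data.
buckets = assign_to_workers(range(3), 5)
empty_workers = [i for i, bucket in enumerate(buckets) if not bucket]
print(f"{len(empty_workers)} of {len(buckets)} workers would see an empty bundle")
```

Avoiding the empty bundles would require each worker to wait until it knows whether any data is destined for it, which is the extra work the message refers to.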

Re: Python SDK timestamp precision

2019-10-18 Thread Robert Bradshaw
itAtTimestamp: SDK Native Object -> Timestamp On Fri, May 10, 2019 at 1:33 PM Robert Bradshaw wrote: > On Thu, May 9, 2019 at 9:32 AM PM Kenneth Knowles wrote: > > > From: Robert Bradshaw > > Date: Wed, May 8, 2019 at 3:00 PM > > To: dev > > > >> From: Ke

Re: RFC: Assigning environments to transforms in a pipeline

2019-10-16 Thread Robert Bradshaw
Sounds nice. Is there a design doc (or, perhaps, you could just give an example of what this would look like in this thread)? On Wed, Oct 16, 2019 at 5:51 PM Chad Dombrova wrote: > Hi all, > One of our goals for the portability framework is to be able to assign > different environments to

Re: [design] A streaming Fn API runner for Python

2019-10-15 Thread Robert Bradshaw
Very excited to see this! I've added some comments to the doc. On Tue, Oct 15, 2019 at 3:43 PM Pablo Estrada wrote: > I've just been informed that access wasn't open. I've since opened access > to it. > Thanks > -P. > > On Tue, Oct 15, 2019 at 2:10 PM Pablo Estrada wrote: > >> Hello all, >> I

Re: So much green

2019-10-11 Thread Robert Bradshaw
Very nice to see, thanks for sharing. On Fri, Oct 11, 2019 at 5:44 AM Maximilian Michels wrote: > > Glad to see that we have fixed the recent flakes. Let's keep up the good > work :) > > -Max > > On 10.10.19 23:37, Kenneth Knowles wrote: > > All the cells in the pull request template are green

Re: Python thread pool executor for Apache Beam

2019-10-11 Thread Robert Bradshaw
Can we use a lower default timeout to mitigate this issue in the short term (I'd imagine one second or possibly smaller would be sufficient for our use), and get a fix upstream in the long term? On Fri, Oct 11, 2019 at 9:38 AM Luke Cwik wrote: > > I'm looking for a thread pool that re-uses

Re: Beam Python fails to run on macOS 10.15?

2019-10-10 Thread Robert Bradshaw
Looks like an issue with the protobuf library. Do you know what version of protobuf you're using? (E.g. by running pip freeze.) I don't have Catalina to test this on, but it'd be useful if you could winnow this down to the import that fails. On Thu, Oct 10, 2019 at 8:15 AM Kamil Wasilewski

Re: [spark structured streaming runner] merge to master?

2019-10-10 Thread Robert Bradshaw
On Thu, Oct 10, 2019 at 12:39 AM Etienne Chauchot wrote: > > Hi guys, > > You probably know that there has been for several months an work > developing a new Spark runner based on Spark Structured Streaming > framework. This work is located in a feature branch here: >

Re: [portability] Removing the old portable metrics API...

2019-10-09 Thread Robert Bradshaw
on test coverage could be increased. But we wrote this test > <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_exercise_metrics_pipeline.py> > . > > > On Wed, Oct 9, 2019 at 10:51 AM Luke Cwik wrote: > >> One way w

Re: Please comment on draft comms strategy by Oct 16

2019-10-09 Thread Robert Bradshaw
Probably worth mentioning Slack and StackOverflow as well. On Wed, Oct 9, 2019 at 3:59 PM María Cruz wrote: > > Hi all, > sorry for multiple messages. I realized after sending the first email that a > new thread with a different subject was probably more efficient. > > I created a communication

Re: [portability] Removing the old portable metrics API...

2019-10-07 Thread Robert Bradshaw
Yes, Dataflow still uses the old API, for both counters and for its progress/autoscaling mechanisms. We'd need to convert that over as well (which is on the TODO list but lower than finishing up support for portability in general). On Mon, Oct 7, 2019 at 9:56 AM Robert Burke wrote: > > The Go

Re: [VOTE] Release 2.16.0, release candidate #1

2019-10-04 Thread Robert Bradshaw
OK, this appears to have been a weird config issue on my system (though the error certainly could have been better). As BEAM-8303 has a workaround and all else is looking good, I don't think that's worth another RC. +1 (binding) to this release. On Fri, Oct 4, 2019 at 10:56 AM Robert Bradshaw

Re: Plan for dropping python 2 support

2019-10-04 Thread Robert Bradshaw
>>>> >>>>> > Oh, and one more thing, I think it'd make sense for Apache Beam to >>>>> > sign https://python3statement.org/. The promise is that we'd >>>>> > discontinue Python 2 support *in* 2020, which is not committing us to >>>>> >

Re: [VOTE] Release 2.16.0, release candidate #1

2019-10-04 Thread Robert Bradshaw
The artifact signatures and contents all look good to me. I've also verified the wheels work for the direct runner. However, I'm having an issue with trying to run on dataflow with Python 3.6: python -m apache_beam.examples.wordcount --input gs://clouddfe-robertwb/chicago_taxi_data/eval/data.csv

Re: Dockerhub push denied for py3.6 and py3.7 image

2019-10-02 Thread Robert Bradshaw
Please add robertwb0 as well. On Wed, Oct 2, 2019 at 9:09 AM Ahmet Altay wrote: > > > > On Tue, Oct 1, 2019 at 8:44 PM Pablo Estrada wrote: >> >> When she set up the repo, Hannah requested PMC members to ask for >> privileges, so I did. >> The set of admins currently is just Hannah and myself

Re: Multiple iterations after GroupByKey with SparkRunner

2019-10-01 Thread Robert Bradshaw
For this specific use case, I would suggest this be done via PTransform URNs. E.g. one could have a GroupByKeyOneShot whose implementation is input.apply(GroupByKey.of()).apply(kv -> KV.of(kv.key(), kv.iterator())) A runner would be free to recognize and optimize this in the graph (based

Re: [VOTE] Sign a pledge to discontinue support of Python 2 in 2020.

2019-10-01 Thread Robert Bradshaw
The correct link is https://python3statement.org/ On Tue, Oct 1, 2019 at 10:14 AM Mark Liu wrote: > > +1 > > btw, the link (http://python3stament.org) you provided is broken. > > On Tue, Oct 1, 2019 at 9:44 AM Udi Meiri wrote: >> >> +1 >> >> On Tue, Oct 1, 2019 at 3:22 AM Łukasz Gajowy wrote:

Re: [VOTE] Sign a pledge to discontinue support of Python 2 in 2020.

2019-09-30 Thread Robert Bradshaw
+1 On Mon, Sep 30, 2019 at 5:35 PM David Cavazos wrote: > > +1 > > On Mon, Sep 30, 2019 at 5:27 PM Ahmet Altay wrote: >> >> +1 >> >> On Mon, Sep 30, 2019 at 5:22 PM Valentyn Tymofieiev >> wrote: >>> >>> Hi everyone, >>> >>> Please vote whether to sign a pledge on behalf of Apache Beam to

Re: Why is there no standard boolean coder?

2019-09-27 Thread Robert Bradshaw
.yaml >>> >>> >>> On Fri, Sep 27, 2019 at 5:17 PM Chad Dombrova wrote: >>>> >>>> Are there any dissenting votes to making a BooleanCoder a standard >>>> (portable) coder? >>>> >>>> I'm happy to make a PR to imple

Re: Why is there no standard boolean coder?

2019-09-27 Thread Robert Bradshaw
I had with +Robert Bradshaw a > while ago: We both agreed all of the coders listed in BEAM-7996 should be > implemented in Python, but didn't come to a conclusion on whether or not they > should actually be _standard_ coders, versus just being implicitly standard > as part of row

Re: Collecting feedback for Beam usage

2019-09-26 Thread Robert Bradshaw
user to check what is being sent. >>>> > >>>> > One more heavy-weight option is to also allow user configure and persist >>>> > what information he is ok with sharing. >>>> > >>>> > --Mikhail >>>> > >>>>

Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Robert Bradshaw
arton.com/static/files/trace/profile.html > This information also appears within the build scans that are sent to Gradle. > > Integrating with either of these sources of information would allow us to > figure out whether it's new tasks or old tasks taking longer. > > On Tue, Sep

Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Robert Bradshaw
Does anyone know how to gather stats on where the time is being spent? Several times the idea of consolidating many of the (expensive) validates runner integration tests into a single pipeline, and then running things individually only if that fails, has come up. I think that'd be a big win if

Re: Collecting feedback for Beam usage

2019-09-23 Thread Robert Bradshaw
On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette wrote: > > Would people actually click on that link though? I think Kyle has a point > that in practice users would only find and click on that link when they're > having some kind of issue, especially if the link has "feedback" in it. I think the

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Robert Bradshaw
the Runner adds the Impulse override. That > way also the Python SDK would not have to have separate code paths for > Reads. Or, rather, that the Runner adds the non-Impulse override (in Java and Python). > On 19.09.19 11:46, Robert Bradshaw wrote: > > On Thu, Sep 19, 2019 at 11:22 AM Maxi

Re: Plan for dropping python 2 support

2019-09-19 Thread Robert Bradshaw
Oh, and one more thing, I think it'd make sense for Apache Beam to sign https://python3statement.org/. The promise is that we'd discontinue Python 2 support *in* 2020, which is not committing us to January if we're not ready. Worth a vote? On Thu, Sep 19, 2019 at 3:58 PM Robert Bradshaw wrote

Re: Plan for dropping python 2 support

2019-09-19 Thread Robert Bradshaw
Exactly how long we support Python 2 depends on our users. Other than those that speak up (such as yourself, thanks!), it's hard to get a handle on how many need Python 2 and for how long. (Should we send out a survey? Maybe after some experience with 2.16?) On the one hand, the whole ecosystem

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Robert Bradshaw
> > > > > > > > Kyle Weaver | Software Engineer | github.com/ibzib > > <http://github.com/ibzib> > > > <http://github.com/ibzib> > > > > <http://github.com/ibzib> | kcwea...@google.com >

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Robert Bradshaw
On Tue, Sep 17, 2019 at 1:43 PM Thomas Weise wrote: > > +1 for making --experiments=beam_fn_api default. > > Can the Dataflow runner driver just remove the setting if it is not > compatible? The tricky bit would be undoing the differences in graph construction due to this flag flip. But I would

Re: The state of external transforms in Beam

2019-09-16 Thread Robert Bradshaw
Thanks for bringing this up again. My thoughts on the open questions below. On Mon, Sep 16, 2019 at 11:51 AM Chad Dombrova wrote: > That commit solves 2 problems: > > Adds the pubsub Java deps so that they’re available in our portable pipeline > Makes the coder for the PubsubIO message-holder

Re: How do you write portable runner pipeline on separate python code ?

2019-09-13 Thread Robert Bradshaw
>> non-docker environment, as Docker adds some operational complexity that >>>>> isn't really needed to run a word count example. For example, Yu's >>>>> pipeline >>>>> errored here because the expected Docker container wasn't built before >>>>

Re: [discuss] How we support our users on Slack / Mailing list / StackOverflow

2019-09-06 Thread Robert Bradshaw
I would also suggest SO as the best alternative, especially due to its indexability and searchability. If discussion is needed, the users list (my preference) or slack can be good options, and ideally the resolution is brought back to SO. On Fri, Sep 6, 2019 at 1:10 PM Udi Meiri wrote: > > I

Re: Stop publishing unneeded Java artifacts

2019-09-03 Thread Robert Bradshaw
:sdks:java:testing:expansion-service could be useful to publish for testing as well. On Fri, Aug 30, 2019 at 3:13 PM Lukasz Cwik wrote: > > Google internally relies on being able to get the POM files generated for: > :sdks:java:testing:nexmark > :sdks:java:testing:test-utils > > Generating the

Re: Write-through-cache in State logic

2019-08-27 Thread Robert Bradshaw
Just to clarify, the repeated list of cache tokens in the process bundle request is used to validate reading *and* stored when writing? In that sense, should they just be called version identifiers or something like that? On Tue, Aug 27, 2019 at 11:33 AM Maximilian Michels wrote: > > Thanks.

Re: Write-through-cache in State logic

2019-08-27 Thread Robert Bradshaw
On Sun, Aug 18, 2019 at 7:30 PM Rakesh Kumar wrote: > > not to completely hijack Max's question but a tangential question regarding > LRU cache. > > What is the preferred python library for LRU cache? > I noticed that cachetools [1] is used as one of the dependencies for GCP [2]. >
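For readers unfamiliar with the data structure being asked about, a least-recently-used cache can be sketched with only the standard library. This is a minimal illustration, not what Beam or cachetools does internally; the class and method names are assumptions made for this example.

```python
from collections import OrderedDict


class LRUCache:
    """A small LRU cache: the least recently used entry is evicted first."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touching "a" makes "b" the eviction candidate
cache.put("c", 3)  # capacity exceeded, so "b" is evicted
```

For memoizing a pure function, `functools.lru_cache` from the standard library covers the same idea as a decorator.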

[ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-26 Thread Robert Bradshaw
Hi, Please join me and the rest of the Beam PMC in welcoming a new committer: Valentyn Tymofieiev Valentyn has made numerous contributions to Beam over the last several years (including 100+ pull requests), most recently pushing through the effort to make Beam compatible with Python 3. He is

Re: Brief of interactive Beam

2019-08-26 Thread Robert Bradshaw
On Fri, Aug 23, 2019 at 4:25 PM Ning Kang wrote: > On Aug 23, 2019, at 3:09 PM, Robert Bradshaw wrote: > > Cool, sounds like we're getting closer to the same page. Some more replies > below. > > On Fri, Aug 23, 2019 at 1:47 PM Ning Kang wrote: > >> Thanks for the

Re: Python question about save_main_session

2019-08-23 Thread Robert Bradshaw
I suggest re-writing the test to avoid save_main_session. On Fri, Aug 23, 2019 at 11:57 AM Udi Meiri wrote: > Hi, > I'm trying to get pytest with the xdist plugin to run Beam tests. The > issue is with save_main_session and a dependency of pytest-xdist called > execnet, which triggers this

Re: Brief of interactive Beam

2019-08-23 Thread Robert Bradshaw
nstruction never spans multiple cells (though its implementation might via function calls) so one never has out-of-date transforms dangling off the pipeline object. > This has the downsides of recreating the PCollection objects which are > being used as handles (though perhaps they could be re-iden

Re: Brief of interactive Beam

2019-08-23 Thread Robert Bradshaw
On Wed, Aug 21, 2019 at 3:33 PM GMAIL wrote: > Thanks for the input, Robert! > > On Aug 21, 2019, at 11:49 AM, Robert Bradshaw wrote: > > On Wed, Aug 14, 2019 at 11:29 AM Ning Kang wrote: > >> Ahmet, thanks for forwarding! >> >> >>> My main con

Re: Brief of interactive Beam

2019-08-21 Thread Robert Bradshaw
the design >>>> overview. >>>> >>>> If you have any questions, please feel free to contact me through this >>>> email address! >>>> >>>> Thanks! >>>> >>>> Regards, >>>> Ning. >>>> >>>>

Re: Try to understand "Output timestamps must be no earlier than the timestamp of the current input"

2019-08-20 Thread Robert Bradshaw
The original timestamps are probably being assigned in the watchForNewFiles transform, which is also setting the watermark: https://github.com/apache/beam/blob/release-2.15.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L668 Until https://issues.apache.org/jira/browse/BEAM-644

Re: [PROPOSAL] An initial Schema API in Python

2019-08-20 Thread Robert Bradshaw
On Mon, Aug 19, 2019 at 5:44 PM Ahmet Altay wrote: > > > > On Mon, Aug 19, 2019 at 9:56 AM Brian Hulette wrote: >> >> >> >> On Fri, Aug 16, 2019 at 5:17 PM Chad Dombrova wrote: >> Agreed on float since it seems to trivially map to a double, but I’m >> torn on int still. While I

Re: Java 11 compatibility question

2019-08-09 Thread Robert Bradshaw
On Fri, Aug 9, 2019 at 12:48 PM Michał Walenia wrote: > From what I understand, the Java 8 -> 11 testing isn't in essence similar > to py2 -> py3 checks. > True. Python 3 is in many ways a new language, and much less (and more subtly) backwards compatible. You also can't "link" Python 3 code

Re: Write-through-cache in State logic

2019-08-09 Thread Robert Bradshaw
t; "blocked" call and then issue all the requests together. > > > On Thu, Aug 8, 2019 at 9:42 AM Robert Bradshaw wrote: >> >> On Tue, Aug 6, 2019 at 12:07 AM Thomas Weise wrote: >> > >> > That would add a synchronization point that force

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-09 Thread Robert Bradshaw
>>>> simpler implementation, since we'd no longer have to work within the >>>> constraints of the existing job server infrastructure. The only downside I >>>> can think of is the additional cost of implementing/maintaining jar >>>> creation code in

Re: Beam Python Portable Runner - Adding timeout to JobServer grpc calls

2019-08-09 Thread Robert Bradshaw
If we do provide a configuration value for this, I would make it have a fairly large default and re-use the flag for all RPCs of similar nature, not tweaks for this particular service only. On Fri, Aug 9, 2019 at 2:58 AM Ahmet Altay wrote: > Default plus a flag to override sounds reasonable.

Re: Inconsistent Results with GroupIntoBatches PTransform

2019-08-09 Thread Robert Bradshaw
Could you clarify what you mean by "inconsistent" and "incorrect"? Are elements missing/duplicated, or just batched differently? On Fri, Aug 9, 2019 at 2:18 AM rahul patwari wrote: > > I only ran in Direct runner. I will run in other runners and let you know the > results. > I am not setting

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Robert Bradshaw
On Wed, Aug 7, 2019 at 5:59 PM Thomas Weise wrote: > >> > * The pipeline construction code itself may need access to cluster >> > resources. In such cases the jar file cannot be created offline. >> >> Could you elaborate? > > > The entry point is arbitrary code written by the user, not limited

Re: [PROPOSAL] An initial Schema API in Python

2019-08-08 Thread Robert Bradshaw
On Wed, Aug 7, 2019 at 11:12 PM Brian Hulette wrote: > > Thanks for all the suggestions, I've added responses inline. > > On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova wrote: >> >> There’s a lot of ground to cover here, so I’m going to pull from a few >> different responses. >> >>

Re: Brief of interactive Beam

2019-08-08 Thread Robert Bradshaw
Thanks for the note. Are there any associated documents worth sharing as well? More below. On Wed, Aug 7, 2019 at 9:39 PM Ning Kang wrote: > To whom may concern, > > This is Ning from Google. We are currently making efforts to leverage an > interactive runner under python beam sdk. > > There is

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-08-07 Thread Robert Bradshaw
ownCoders.addLengthPrefixedCoder). However, only a few >>> > coders are defined in StandardCoders. It means that for most coder, a >>> > length will be added to the serialized bytes which is not necessary in my >>> > thoughts. My suggestion is maybe we can add s

Re: Collecting metrics in JobInvocation - BEAM-4775

2019-08-07 Thread Robert Bradshaw
I think the question here is whether PipelineRunner::run is allowed to be blocking. If it is, then the futures make sense (but there's no way to properly cancel it). I'm OK with not being able to return metrics on cancel in this case, or the case the pipeline didn't even start up yet. Otherwise,

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-07 Thread Robert Bradshaw
On Wed, Aug 7, 2019 at 6:20 AM Thomas Weise wrote: > > Hi Kyle, > > [document doesn't have comments enabled currently] > > As noted, worker deployment is an open question. I believe pipeline > submission and worker execution need to be considered together for a complete > deployment story. The

Re: [PROPOSAL] An initial Schema API in Python

2019-08-06 Thread Robert Bradshaw
On Sun, Aug 4, 2019 at 12:03 AM Chad Dombrova wrote: > > Hi, > > This looks like a great feature. > > Is there a plan to eventually support custom field types? > > I assume adding support for dataclasses in python 3.7+ should be trivial to > do in a follow up PR. Do you see any complications

Re: [ANNOUNCE] Beam 2.14.0 Released!

2019-08-02 Thread Robert Bradshaw
Lots of improvements all around. Thank you for pushing this through, Anton! On Fri, Aug 2, 2019 at 1:37 AM Chad Dombrova wrote: > > Nice work all round! I love the release blog format with the highlights and > links to issues. > > -chad > > > On Thu, Aug 1, 2019 at 4:23 PM Anton Kedin wrote:

Re: [ANNOUNCE] New committer: Jan Lukavský

2019-08-01 Thread Robert Bradshaw
Congratulations! On Thu, Aug 1, 2019 at 9:59 AM Jan Lukavský wrote: > Thanks everyone! > > Looking forward to working with this great community! :-) > > Cheers, > > Jan > On 8/1/19 12:18 AM, Rui Wang wrote: > > Congratulations! > > > -Rui > > On Wed, Jul 31, 2019 at 10:51 AM Robin Qiu wrote:

Re: [DISCUSS] Integer coders used in SchemaCoder

2019-07-31 Thread Robert Bradshaw
The standard VARINT coder is used for all sorts of integer values (e.g. the output of the CountElements transform), but the vast majority of them are likely significantly less than a full 64 bits. In Python, declaring an element type to be int will use this. On the other hand, using a VarInt
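The size argument above can be illustrated with a generic base-128 varint, which spends one byte per 7 bits of magnitude instead of a fixed 8 bytes. The sketch below is a textbook encoding for non-negative integers only; it is an assumption-laden illustration and not Beam's actual VARINT coder, and the function names are made up for this example.

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data):
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


# Small counts need far fewer than 8 bytes; only very large values approach it.
for n in (1, 300, 2**31, 2**63 - 1):
    print(n, len(encode_varint(n)))
```

Under this scheme a typical count fits in one or two bytes, which is why a variable-length encoding is attractive when most values are small.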

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-31 Thread Robert Bradshaw
seems > that it's still not supported in the Python SDK Harness. Is there any plan > on that? > > Robert Bradshaw 于2019年7月30日周二 下午12:33写道: > >> On Tue, Jul 30, 2019 at 11:52 AM jincheng sun >> wrote: >> >>> >>>>> Is it possible to a

Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-31 Thread Robert Bradshaw
Jul 31, 2019 at 3:16 AM Pablo Estrada >> wrote: >> > >>> >> > >>> +1 >> > >>> >> > >>> I installed from source, and ran unit tests for Python in 2.7, 3.5, >> 3.6. >> > >>> >> >

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-30 Thread Robert Bradshaw
On Tue, Jul 30, 2019 at 11:52 AM jincheng sun wrote: > >>> Is it possible to add an interface such as `isSelfContained()` to the >>> `Coder`? This interface indicates >>> whether the serialized bytes are self contained. If it returns true, >>> then there is no need to add a prefixing length. >>>
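The role of the length prefix discussed here can be shown with a short sketch. When encoded values are concatenated on a stream, a reader can only split them apart again if each encoding is either self-delimiting or preceded by its length. The helpers below use a fixed 4-byte big-endian header purely for illustration; that header format and the function names are assumptions, not the Fn API wire format.

```python
import io


def write_length_prefixed(stream, payload):
    """Write a 4-byte big-endian length followed by the payload bytes."""
    stream.write(len(payload).to_bytes(4, "big"))
    stream.write(payload)


def read_length_prefixed(stream):
    """Read one length-prefixed payload; raise EOFError if the stream is short."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("no more records")
    size = int.from_bytes(header, "big")
    payload = stream.read(size)
    if len(payload) < size:
        raise EOFError("truncated record")
    return payload


# Without the prefix, b"ab" + b"" + b"cde" would be indistinguishable from b"abcde".
buffer = io.BytesIO()
for value in (b"ab", b"", b"cde"):
    write_length_prefixed(buffer, value)

buffer.seek(0)
decoded = [read_length_prefixed(buffer) for _ in range(3)]
```

A coder whose output already has a known or self-describing size (for example, a fixed-width number) would not need the prefix, which is the saving the proposed `isSelfContained()` hint is after.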

Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-30 Thread Robert Bradshaw
I checked all the artifact signatures and ran a couple test pipelines with the wheels (Py2 and Py3) and everything looked good to me, so +1. On Mon, Jul 29, 2019 at 8:29 PM Valentyn Tymofieiev wrote: > I have checked Python 3 batch and streaming quickstarts on Dataflow runner > using .zip and

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-29 Thread Robert Bradshaw
On Mon, Jul 29, 2019 at 4:14 PM jincheng sun wrote: > Hi Robert, > > Thanks for your detail comments, I would have added a few pointers inline. > > Best, > Jincheng > > Robert Bradshaw 于2019年7月29日周一 下午12:35写道: > >> On Sun, Jul 28, 2019 at 6:51 AM jincheng su

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-29 Thread Robert Bradshaw
, I noticed that FLOAT is not among StandardCoders, while DOUBLE > is among it. StandardCoders is supposed to be some sort of lowest common denominator, but there's no hard and fast criteria. For this example, some languages (e.g. Python) don't have the notion of FLOAT, and using a FLOAT coder

Re: Sort Merge Bucket - Action Items

2019-07-26 Thread Robert Bradshaw
y() >> >> ParDo.of(ReadFileRangesFn(createSource) :: DoFn> OffsetRange>, T>) where >> >> createSource :: String -> FileBasedSource >> >> createSource = AvroSource >> >> >> AvroIO.read without getHintMatchedManyFiles() :: PTransform> P

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
m.identityHashCode(this) in the body of a DoFn might be sufficient. > On Thu, Jul 25, 2019 at 9:54 PM Robert Bradshaw wrote: >> >> Though it's not obvious in the name, Stateful ParDos can only be >> applied to keyed PCollections, similar to GroupByKey. (You could, >>

Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
Though it's not obvious in the name, Stateful ParDos can only be applied to keyed PCollections, similar to GroupByKey. (You could, however, assign every element to the same key and then apply a Stateful DoFn, though in that case all elements would get processed on the same worker.) On Thu, Jul

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Robert Bradshaw
f >>> temp file handling logic lives). Might be hard to decouple either modifying >>> existing code or creating new transforms, unless if we re-write most of >>> FileBasedSink from scratch. >>> >>> Let me know if I'm on the wrong track. >>> >>>

Re: Write-through-cache in State logic

2019-07-25 Thread Robert Bradshaw
ntry and I'll be happy to answer any questions you might have if (well probably when) these pointers are insufficient. > On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw wrote: >> >> This is documented at >> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdz

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 5:31 AM Thomas Weise wrote: > > Hi Jincheng, > > It is very exciting to see this follow-up, that you have done your research > on the current state and that there is the intention to join forces on the > portability effort! > > I have added a few pointers inline. > >

Re: How to expose/use the External transform on Java SDK

2019-07-25 Thread Robert Bradshaw
From the portability perspective, https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto and the associated services for executing pipelines is about as "core" as it gets, and eventually I'd like to see all runners being portable (even if they have an

Re: On Auto-creating GCS buckets on behalf of users

2019-07-23 Thread Robert Bradshaw
On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath wrote: > > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver wrote: >> >> I agree with David that at least clearer log statements should be added. >> >> Udi, that's an interesting idea, but I imagine the sheer number of existing >> flags (including

Re: Write-through-cache in State logic

2019-07-23 Thread Robert Bradshaw
new data. The Beam Java SDK does this for all >>> runners when executed portably[1]. You could port the same logic to the >>> Beam Python SDK as well. >>> >>> 1: >>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/sr

Re: On Auto-creating GCS buckets on behalf of users

2019-07-23 Thread Robert Bradshaw
I think having a single, default, auto-created temporary bucket per project for use in GCP (when running on Dataflow, or running elsewhere but using GCS such as for this BQ load files example), though not ideal, is the best user experience. If we don't want to be automatically creating such things

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov wrote: > > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote: >> >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote: >> > >> > Thanks Robert. Agree with the FileIO point. I'll look into it

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
urce, which we'd like to avoid, but there's no alternative. An SDF, if exposed, would likely be overkill and cumbersome to call (given the reflection machinery involved in invoking DoFns). > I'll file separate PRs for core changes needed for discussion. WDYT? Sounds good. > On Mon, Jul 22, 20

Re: python precommits failing at head

2019-07-22 Thread Robert Bradshaw
This was due to a bad release artifact push. This has now been fixed upstream. On Mon, Jul 22, 2019 at 11:00 AM Robert Bradshaw wrote: > > Looks like https://sourceforge.net/p/docutils/bugs/365/ > > On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli > wrote: >

Re: python precommits failing at head

2019-07-22 Thread Robert Bradshaw
Looks like https://sourceforge.net/p/docutils/bugs/365/ On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli wrote: > Hi everyone, > > The Python PreCommit from the Jenkins job "beam_PreCommit_Python_Cron" is > failing[1]. The task :sdks:python:docs is failing with this traceback: > > Traceback

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote: > > Forking this thread to discuss action items regarding the change. We can keep > technical discussion in the original thread. > > Background: our SMB POC showed promising performance & cost saving > improvements and we'd like to adopt it for

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-18 Thread Robert Bradshaw
t quite following here. Suppose one processes element a, m, and z. Then one decides to split the bundle, but there's not a "range" we can pick for the "other" as this bundle already spans the whole range. But maybe I'm just off in the weeds here. > On Wed, Jul 17, 2019 at 6:

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-17 Thread Robert Bradshaw
>> Because of the merge sort, we can't split or offset seek a bucket file. >>>> Because without persisting the offset index of a key group somewhere, we >>>> can't efficiently skip to a key group without exhausting the previous >>>> ones
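
To make the quoted point concrete, here is a minimal sketch in plain Python. It is not Beam code and not the SMB proposal's API; the function name `merge_key_groups` and the sample data are hypothetical. It merges inputs that are already sorted by key and emits one group per key. Because it only ever reads forward, reaching the group for "z" means consuming the groups for "a" and "m" first, which is the behaviour the quote describes when no offset index is persisted.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_key_groups(*sorted_buckets):
    """Yield (key, [values]) from several inputs already sorted by key.

    Hypothetical helper for illustration only: each input stands in for
    one sorted bucket file and is read strictly front to back.
    """
    # heapq.merge lazily interleaves the inputs in key order.
    merged = heapq.merge(*sorted_buckets, key=itemgetter(0))
    # groupby collects consecutive records that share a key.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Two small "bucket files", each sorted by key (made-up data).
left = [("a", 1), ("m", 2), ("z", 3)]
right = [("a", 10), ("z", 30)]

result = list(merge_key_groups(left, right))
```

A persisted index of key-group offsets would be one way to skip ahead, at the cost of the extra bookkeeping the thread mentions.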

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-17 Thread Robert Bradshaw
possible to >>> binary search for matching keys but that's extra complication. IMO the >>> reader work distribution is better solved by better bucket/shard strategy >>> in upstream writer. >>> >>> References >>> >>> ReadMatche

Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-17 Thread Robert Bradshaw
sh not changed after > squash & force-push > b00 "fixup: Address review comments." - commit hash has changed after > squash, but these commits were never reviewed, > > 5. Author requests another review iteration (PTAL). Since PR still has > a00, pr

Re: [ANNOUNCE] New committer: Robert Burke

2019-07-17 Thread Robert Bradshaw
Congratulations! On Wed, Jul 17, 2019, 12:56 PM Katarzyna Kucharczyk wrote: > Congratulations! :) > > On Wed, Jul 17, 2019 at 12:46 PM Michał Walenia < > michal.wale...@polidea.com> wrote: > >> Congratulations, Robert! :) >> >> On Wed, Jul 17, 2019 at 12:45 PM Łukasz Gajowy >> wrote: >> >>>

Re: Write-through-cache in State logic

2019-07-16 Thread Robert Bradshaw
Python workers also have a per-bundle SDK-side cache. A protocol has been proposed, but hasn't yet been implemented in any SDKs or runners. On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax wrote: > > It's runner dependent. Some runners (e.g. the Dataflow runner) do have such a > cache, though I think

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-16 Thread Robert Bradshaw
f copy-pasted Avro boilerplate. >>> - For compatibility, we can delegate to the new classes from the old ones >>> and remove them in the next breaking release. >>> >>> Re: WriteFiles logic, I'm not sure about generalizing it, but what about >>> s

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Robert Bradshaw
teChannel/ReadableByteChannel, which is the level of >> granularity we need) but the Writers, at least, seem to be mostly >> private-access. Do you foresee them being made public at any point? >> >> - Claire >> >> On Mon, Jul 15, 2019 at 9:31 AM Robert Bradshaw

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Robert Bradshaw
oducer pipelines and >>>> many downstream consumer pipelines. It's not intended to replace >>>> shuffle/join within a single pipeline. On the producer side, by >>>> pre-grouping/sorting data and writing to bucket/shard output files, the >>>> consum

Re: [python] ReadFromPubSub broken in Flink

2019-07-15 Thread Robert Bradshaw
On Mon, Jul 15, 2019 at 5:42 AM Chamikara Jayalath wrote: > > On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova wrote: >> >> Hi Chamikara, why not make this part of the pipeline options? does it really need to vary from transform to transform? >>> >>> It's possible for the same

Re: pickling typing types in Python 3.5+

2019-07-10 Thread Robert Bradshaw
ation on cloudpickle vs dill in Beam, I'll bring it to > the mailing list. > > On Wed, May 15, 2019 at 5:25 AM Robert Bradshaw wrote: >> >> (2) seems reasonable. >> >> On Tue, May 14, 2019 at 3:15 AM Udi Meiri wrote: >> > >> > It seems like pickling

Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-10 Thread Robert Bradshaw
On Wed, Jul 10, 2019 at 5:06 AM Kenneth Knowles wrote: > > My opinion: what is important is that we have a policy for what goes into the > master commit history. This is very simple IMO: each commit should clearly do > something that it states, and a commit should do just one thing. Exactly

Re: Accumulating mode implies that panes are processed in order?

2019-06-26 Thread Robert Bradshaw
On Thu, Jun 27, 2019 at 1:52 AM Rui Wang wrote: >> >> >> AFAIK all streaming runners today practically do provide these panes in >> order; > > Does it refer to "the stage immediately after GBK itself processes fired > panes in order" in streaming runners? Could you share more information? > >
