Re: Stateful ParDo on Non-Keyed PCollection

2019-07-25 Thread Robert Bradshaw
Though it's not obvious in the name, Stateful ParDos can only be applied to keyed PCollections, similar to GroupByKey. (You could, however, assign every element to the same key and then apply a Stateful DoFn, though in that case all elements would get processed on the same worker.) On Thu, Jul 25,

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Robert Bradshaw
l (where the bulk of >>> temp file handing logic lives). Might be hard to decouple either modifying >>> existing code or creating new transforms, unless if we re-write most of >>> FileBasedSink from scratch. >>> >>> Let me know if I'm on the wron

Re: Write-through-cache in State logic

2019-07-25 Thread Robert Bradshaw
IRA entry and I'll be happy to answer any questions you might have if (well probably when) these pointers are insufficient. > On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw wrote: >> >> This is documented at >> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOH

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-25 Thread Robert Bradshaw
On Thu, Jul 25, 2019 at 5:31 AM Thomas Weise wrote: > > Hi Jincheng, > > It is very exciting to see this follow-up, that you have done your research > on the current state and that there is the intention to join forces on the > portability effort! > > I have added a few pointers inline. > > Seve

Re: How to expose/use the External transform on Java SDK

2019-07-25 Thread Robert Bradshaw
>From the portability perspective, https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto and the associated services for executing pipelines is about as "core" as it gets, and eventually I'd like to see all runners being portable (even if they have an option

Re: On Auto-creating GCS buckets on behalf of users

2019-07-23 Thread Robert Bradshaw
On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath wrote: > > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver wrote: >> >> I agree with David that at least clearer log statements should be added. >> >> Udi, that's an interesting idea, but I imagine the sheer number of existing >> flags (including m

Re: Write-through-cache in State logic

2019-07-23 Thread Robert Bradshaw
hen append all your new data. The Beam Java SDK does this for all >>> runners when executed portably[1]. You could port the same logic to the >>> Beam Python SDK as well. >>> >>> 1: >>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/

Re: On Auto-creating GCS buckets on behalf of users

2019-07-23 Thread Robert Bradshaw
I think having a single, default, auto-created temporary bucket per project for use in GCP (when running on Dataflow, or running elsewhere but using GCS such as for this BQ load files example), though not ideal, is the best user experience. If we don't want to be automatically creating such things

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Mon, Jul 22, 2019 at 7:39 PM Eugene Kirpichov wrote: > > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote: >> >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote: >> > >> > Thanks Robert. Agree with the FileIO point. I'll look in

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
based on FileBasedSource, which we'd like to avoid, but there's no alternative. An SDF, if exposed, would likely be overkill and cumbersome to call (given the reflection machinery involved in invoking DoFns). > I'll file separate PRs for core changes needed for discussion. WDYT? So

Re: python precommits failing at head

2019-07-22 Thread Robert Bradshaw
This was due to a bad release artifact push. This has now been fixed upstream. On Mon, Jul 22, 2019 at 11:00 AM Robert Bradshaw wrote: > > Looks like https://sourceforge.net/p/docutils/bugs/365/ > > On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli > wrote: >> >

Re: python precommits failing at head

2019-07-22 Thread Robert Bradshaw
Looks like https://sourceforge.net/p/docutils/bugs/365/ On Sun, Jul 21, 2019 at 11:56 PM Tanay Tummalapalli wrote: > Hi everyone, > > The Python PreCommit from the Jenkins job "beam_PreCommit_Python_Cron" is > failing[1]. The task :sdks:python:docs is failing with this traceback: > > Traceback (

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Robert Bradshaw
On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote: > > Forking this thread to discuss action items regarding the change. We can keep > technical discussion in the original thread. > > Background: our SMB POC showed promising performance & cost saving > improvements and we'd like to adopt it for p

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-18 Thread Robert Bradshaw
values. I'm not quite following here. Suppose one processes element a, m, and z. Then one decides to split the bundle, but there's not a "range" we can pick for the "other" as this bundle already spans the whole range. But maybe I'm just off in the weeds here. >

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-17 Thread Robert Bradshaw
>>> Note on reader block/offset/split requirement >>>> >>>> Because of the merge sort, we can't split or offset seek a bucket file. >>>> Because without persisting the offset index of a key group somewhere, we >>>> ca

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-17 Thread Robert Bradshaw
>>> which may not have the same key distribution. It might be possible to >>> binary search for matching keys but that's extra complication. IMO the >>> reader work distribution is better solved by better bucket/shard strategy >>> in upstream writer

Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-17 Thread Robert Bradshaw
X" - commit hash not changed after > squash & force-push > b00 "fixup: Address review comments." - commit hash has changed after > squash, but these commits were never reviewed, > > 5. Author requests another review iteration (PTAL). Since PR

Re: [ANNOUNCE] New committer: Robert Burke

2019-07-17 Thread Robert Bradshaw
Congratulations! On Wed, Jul 17, 2019, 12:56 PM Katarzyna Kucharczyk wrote: > Congratulations! :) > > On Wed, Jul 17, 2019 at 12:46 PM Michał Walenia < > michal.wale...@polidea.com> wrote: > >> Congratulations, Robert! :) >> >> On Wed, Jul 17, 2019 at 12:45 PM Łukasz Gajowy >> wrote: >> >>> Con

Re: Write-through-cache in State logic

2019-07-16 Thread Robert Bradshaw
Python workers also have a per-bundle SDK-side cache. A protocol has been proposed, but hasn't yet been implemented in any SDKs or runners. On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax wrote: > > It's runner dependent. Some runners (e.g. the Dataflow runner) do have such a > cache, though I think

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-16 Thread Robert Bradshaw
his requires a lot of copy-pasted Avro boilerplate. >>> - For compatibility, we can delegate to the new classes from the old ones >>> and remove them in the next breaking release. >>> >>> Re: WriteFiles logic, I'm not sure about generalizing it, but what about

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Robert Bradshaw
(since they >> operate on a WritableByteChannel/ReadableByteChannel, which is the level of >> granularity we need) but the Writers, at least, seem to be mostly >> private-access. Do you foresee them being made public at any point? >> >> - Claire >> >> On M

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Robert Bradshaw
t;>>> This is for the common pattern of few core data producer pipelines and >>>> many downstream consumer pipelines. It's not intended to replace >>>> shuffle/join within a single pipeline. On the producer side, by >>>> pre-grouping/sorting data

Re: [python] ReadFromPubSub broken in Flink

2019-07-15 Thread Robert Bradshaw
On Mon, Jul 15, 2019 at 5:42 AM Chamikara Jayalath wrote: > > On Sat, Jul 13, 2019 at 7:41 PM Chad Dombrova wrote: >> >> Hi Chamikara, why not make this part of the pipeline options? does it really need to vary from transform to transform? >>> >>> It's possible for the same pipel

Re: pickling typing types in Python 3.5+

2019-07-10 Thread Robert Bradshaw
on cloudpickle vs dill in Beam, I'll bring it to > the mailing list. > > On Wed, May 15, 2019 at 5:25 AM Robert Bradshaw wrote: >> >> (2) seems reasonable. >> >> On Tue, May 14, 2019 at 3:15 AM Udi Meiri wrote: >> > >> > It seems like pick

Re: [DISCUSS] Contributor guidelines for iterating on PRs: when to squash commits.

2019-07-10 Thread Robert Bradshaw
On Wed, Jul 10, 2019 at 5:06 AM Kenneth Knowles wrote: > > My opinion: what is important is that we have a policy for what goes into the > master commit history. This is very simple IMO: each commit should clearly do > something that it states, and a commit should do just one thing. Exactly how

Re: Accumulating mode implies that panes are processed in order?

2019-06-26 Thread Robert Bradshaw
On Thu, Jun 27, 2019 at 1:52 AM Rui Wang wrote: >> >> >> AFAIK all streaming runners today practically do provide these panes in >> order; > > Does it refer to "the stage immediately after GBK itself processes fired > panes in order" in streaming runners? Could you share more information? > > >

Re: Accumulating mode implies that panes are processed in order?

2019-06-26 Thread Robert Bradshaw
There is no promise that panes will arrive in order (especially the further you get "downstream"). Though they may be approximately so, it's dangerous to assume that. You can inspect the sequential index in PaneInfo to determine whether a pane is older than other panes you have seen. On Wed, Jun 2

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Robert Bradshaw
ction and should use it accordingly in subsequent > transforms. > > Thanks, > Cham > > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath > wrote: >> >> >> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw wrote: >>> >>> Good q

Re: Plan for dropping python 2 support

2019-06-26 Thread Robert Bradshaw
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail >>> >>> >>> On Tue, Jun 18, 2019 at 12:03 AM Valentyn Tymofieiev >>> wrote: >>>> >>>> I like the update Ismaël referenced [1], I think we should

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Robert Bradshaw
bundle). However this would be a valid "lazy" implementation. > > On Wed, Jun 26, 2019 at 2:29 PM Robert Bradshaw wrote: >> >> Note also that a "lazy" SDK implementation would be to simply return >> all the timers (as if they were new timers) to runner

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Robert Bradshaw
Good question. I'm not sure what could be done with (5) if it contains no deferred objects (e.g there's nothing to wait on). There is also (6) return PCollection. The advantage of (2) is that one can migrate to (1) or (6) without changing the public API, while giving something to wait on without

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Robert Bradshaw
f > work needed to implement Robert's suggestion. > > Reuven > > On Wed, Jun 26, 2019 at 12:28 PM Robert Bradshaw wrote: >> >> Another option, that is nice from an API perspective but places a >> burden on SDK implementers (and possibly runners), is to

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Robert Bradshaw
ould have to > update the runner while executing it. It is another possible option, but > seems to have issues of its own. > > On 6/26/19 12:28 PM, Robert Bradshaw wrote: > > Another option, that is nice from an API perspective but places a > > burden on SDK implement

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Robert Bradshaw
Another option, that is nice from an API perspective but places a burden on SDK implementers (and possibly runners), is to maintain the ordering of timers by requiring timers to be fired in order, and if any timers are set to fire them immediately before processing later timers. In other words, if

Re: Blogpost Beam Summit 2019

2019-06-26 Thread Robert Bradshaw
Thanks. This is a great write-up. +1 to an official tweet. On Wed, Jun 26, 2019, 6:11 AM Reza Rokni wrote: > Thank you for putting this together! > > On Wed, 26 Jun 2019 at 01:23, Ahmet Altay wrote: > >> Thank you for writing and sharing this. I enjoyed reading it :) I think >> it is worth shar

Re: Question about Python DirectRunner performance improvements

2019-06-25 Thread Robert Bradshaw
On Tue, Jun 25, 2019 at 2:26 PM Ismaël Mejía wrote: > > I stumbled recently into BEAM-3644 [1]. This issue mentions that > Python direct runner saw a great performance gain because of relying > on portability’s FnApiRunner. This seems to me a bit contra-intuitive > considering the extra overhead o

Re: [Discuss] Ideas for Apache Beam presence in social media

2019-06-24 Thread Robert Bradshaw
On Fri, Jun 21, 2019 at 1:02 PM Thomas Weise wrote: > > From what I understand, spreadsheets (not docs) provide the functionality > that we need: https://support.google.com/docs/answer/91588 > > Interested PMC members can subscribe and react to changes in the spreadsheet. > > Lazy consensus requi

Re: Assigning Reviewers in GitHub?

2019-06-24 Thread Robert Bradshaw
This is unfortunate, as the reviewers pulldown at least gives some reasonable suggestions for new users (which is why I suggested it). But, yes, R: @ is the convention. On Fri, Jun 21, 2019 at 5:56 PM Lukasz Cwik wrote: > > Only a few people have permission to update the 'Reviewers' section and I

Re: Request to join Jira as a Contributor - Beam Summit 2019

2019-06-20 Thread Robert Bradshaw
Welcome! You're in. On Thu, Jun 20, 2019 at 11:05 AM Jonas Grabber wrote: > > Hello, > > I'd like to start contributing to Apache Beam. Could you add me as a > Contributor in ASF Jira? > My Jira user ID is: "Jonas Grabber". > > > Best regards, > Jonas

Re: Contributor Registration

2019-06-20 Thread Robert Bradshaw
Welcome! I added you to the contributors group. On Thu, Jun 20, 2019 at 11:03 AM Matt Helm wrote: > > Hi Beam community, > > I'm Matt Helm, a Data Engineer at Shopify. I'm based in Vancouver, Canada. As > part of Beam Summit I'm looking to please start taking issues from Jira. My > username is

Re: [DISCUSS] Portability representation of schemas

2019-06-18 Thread Robert Bradshaw
inline. > > On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw wrote: >> >> On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote: >> > >> > Spoke to Brian about his proposal. It is essentially this: >> > >> > We create PortableSchemaCoder, with a well-k

Re: SparkRunner Combine.perKey performance

2019-06-17 Thread Robert Bradshaw
track that and will try some experiments. > Thanks. > > Jan > > [1] https://issues.apache.org/jira/browse/BEAM-7574 > > On 6/14/19 1:42 PM, Robert Bradshaw wrote: > > On Fri, Jun 14, 2019 at 1:02 PM Jan Lukavský wrote: > >> > Interesting. However, there we should ne

Re: SparkRunner Combine.perKey performance

2019-06-14 Thread Robert Bradshaw
one hour and a periodicity of one minute. If we have one datum per hour, better to not explode. If we have one datum per minute, it's a wash. If we have one datum per second, much better to explode. > On 6/14/19 12:19 PM, Robert Bradshaw wrote: > > On Fri, Jun 14, 2019 at 12:10 PM J

Re: SparkRunner Combine.perKey performance

2019-06-14 Thread Robert Bradshaw
On Fri, Jun 14, 2019 at 12:10 PM Jan Lukavský wrote: > > Hi Robert, > > thanks for the discussion. I will create a JIRA with summary of this. > Some comments inline. > > Jan > > On 6/14/19 10:49 AM, Robert Bradshaw wrote: > > On Thu, Jun 13, 2019 at 8:43 PM Jan L

Re: [DISCUSS] Portability representation of schemas

2019-06-14 Thread Robert Bradshaw
l issues that are still not decided. (I suppose we should be open to throwing whole PRs away in that case.) There are certainly pieces that we'll know that we need (like the ability to serialize a row consistently in all languages) we can get in immediately. > Reuven > > On Thu, Jun

Re: SparkRunner Combine.perKey performance

2019-06-14 Thread Robert Bradshaw
On Thu, Jun 13, 2019 at 8:43 PM Jan Lukavský wrote: > > On 6/13/19 6:10 PM, Robert Bradshaw wrote: > > On Thu, Jun 13, 2019 at 5:28 PM Jan Lukavský wrote: >> >> On 6/13/19 4:31 PM, Robert Bradshaw wrote: >> >> The comment fails to take into account the asymm

Re: SparkRunner Combine.perKey performance

2019-06-13 Thread Robert Bradshaw
On Thu, Jun 13, 2019 at 5:28 PM Jan Lukavský wrote: > On 6/13/19 4:31 PM, Robert Bradshaw wrote: > > The comment fails to take into account the asymmetry between calling > addInput vs. mergeAccumulators. It also focuses a lot on the asymptotic > behavior, when the most common beh

Re: SparkRunner Combine.perKey performance

2019-06-13 Thread Robert Bradshaw
mbineByKeyAndWindow > > Jan > On 6/13/19 3:51 PM, Robert Bradshaw wrote: > > I think the problem is that it never leverages the (traditionally much > cheaper) CombineFn.addInput(old_accumulator, new_value). Instead, it always > calls CombineFn.mergeAccumulators(old_accumula

Re: SparkRunner Combine.perKey performance

2019-06-13 Thread Robert Bradshaw
I think the problem is that it never leverages the (traditionally much cheaper) CombineFn.addInput(old_accumulator, new_value). Instead, it always calls CombineFn.mergeAccumulators(old_accumulator, CombineFn.addInput(CombineFn.createAccumulator(), new_value)). It should be feasible to fix this whil

Re: [DISCUSS] Portability representation of schemas

2019-06-13 Thread Robert Bradshaw
k data channel transfer vs ???, possibly with parameters like "preferred batch size") or standardizing on one physical byte representation for all communication over the boundary.) > > >> >> Can we also just have both of these, with different URNs? >> &g

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Robert Bradshaw
order to analyze it. Schemas don't change this, logical types don't change >> this. However in practice schemas do make it much easier to introspect >> data, and to analyze data. If the tradeoff is that some user might assume >> they understand more than they do about d

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Robert Bradshaw
On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles wrote: > > I believe the schema registry is a transient construction-time concept. I > don't think there's any need for a concept of a registry in the portable > representation. > > I'd rather urn:beam:schema:logicaltype:javasdk not be used whenever

Python dependency compatibility badges

2019-06-12 Thread Robert Bradshaw
We have a pull request to add Pypi dependency compatibility badges to our readme: https://github.com/apache/beam/pull/8791 This looks generally useful, though highlights how out of date we still are in some areas. Any thoughts on this? Given the high visibility, I wanted to get some consensus (even

Re: Plan for dropping python 2 support

2019-06-11 Thread Robert Bradshaw
ed to extend this to beyond 2019. 2 releases > (~ 3 months) after solid python 3 support will very likely put the last > python 2 supporting release to last quarter of 2019 already. > > On Fri, Jun 7, 2019 at 2:15 AM Robert Bradshaw > wrote: > >> I don't think the seco

Re: [DISCUSS] Portability representation of schemas

2019-06-10 Thread Robert Bradshaw
n Mon, Jun 10, 2019 at 2:14 AM Robert Bradshaw > wrote: > >> On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote: >> >>> On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote: >>> >>>> Wouldn't SDK specific types always be under the "co

Re: [DISCUSS] Portability representation of schemas

2019-06-10 Thread Robert Bradshaw
On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote: > On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote: > >> Wouldn't SDK specific types always be under the "coders" component >> instead of the logical type listing? >> >> Offhand, having a separate normalized listing of logical schema types i

Re: Plan for dropping python 2 support

2019-06-07 Thread Robert Bradshaw
;> >> In addition to the statement, keeping a target release and date(if possible) >> or timeline to drop support would also help users to decide when they need >> to work on migrating to Python 3. >> >> Regards, >> - TT >> >> [1] https://python

Re: Plan for dropping python 2 support

2019-06-05 Thread Robert Bradshaw
Until Python 3 support for Beam is officially out of beta and recommended, I don't think we can tell people to stop using Python 2. Given that 2020 is just over 6 months away, that seems a short transition time, so I would guess we'll have to continue supporting Python 2 sometime into 2020. A quic

Re: Timer support in Flink

2019-06-04 Thread Robert Bradshaw
One issue with the fully expanded version is that it's so large it's hard to read. I think it would be useful to make the ~ entries (at least) clickable or with hover tool tips. It would be nice to be able to expand columns individually as well. On Tue, Jun 4, 2019 at 7:20 AM Melissa Pashniak wr

Re: [VOTE] Release 2.13.0, release candidate #2

2019-06-03 Thread Robert Bradshaw
+1 I validated the artifacts and Python 3. On Sat, Jun 1, 2019 at 7:45 PM Ankur Goenka wrote: > > Thanks Ahmet and Luke for validation. > > If no one has objections then I am planning to move ahead without Gearpump > validation as it seems to be broken from past multiple releases. > > Reminder:

Re: Definition of Unified model

2019-05-29 Thread Robert Bradshaw
, that ordering of sources is relevant only for >> (partitioned!) streaming sources and generally always reduces to >> sequence metadata (e.g. offsets). >> >> Jan >> >> On 5/28/19 11:43 AM, Robert Bradshaw wrote: >> > Huge +1 to all Kenn said. >> >

Re: Definition of Unified model

2019-05-29 Thread Robert Bradshaw
ys reduces to > sequence metadata (e.g. offsets). Offsets within a file, unordered between files seems exactly analogous with offsets within a partition, unordered between partitions, right? > On 5/28/19 11:43 AM, Robert Bradshaw wrote: > > Huge +1 to all Kenn said. > > > &

Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Robert Bradshaw
are: > > >> - The product is clearly marked as beta with a big warning. > > >> - It looks like mostly a single person project. For the same reason I also strongly prefer not using a fork for a specific setting. Fork will only have less people looking at it. > > >>

Re: Measuring element sizes in benchmarks

2019-05-28 Thread Robert Bradshaw
I'm not quite following what these sizes are needed for--aren't the benchmarks already tuned to be specific, known sizes? I agree that this can be expensive; especially for benchmarking purposes a 5x overhead means you're benchmarking the sizing code, not the pipeline itself. Beam computes estimat

Re: DISCUSS: Sorted MapState API

2019-05-28 Thread Robert Bradshaw
On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles wrote: > > On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles wrote: >> >> On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote: >>> >>> Some great comments! >>> >>> Aljoscha: absolutely this would have to be implemented by runners to be >>> efficient. W

Re: Proposal: Portability SDKHarness Docker Image Release with Beam Version Release.

2019-05-28 Thread Robert Bradshaw
>>> >>> On Mon, May 27, 2019 at 3:35 PM Maximilian Michels wrote: >>> > >>> > +1 >>> > >>> > On 27.05.19 14:04, Robert Bradshaw wrote: >>> > > Sounds like everyone's onboard with the plan. Any chance we could >>&

Re: Definition of Unified model

2019-05-28 Thread Robert Bradshaw
ed to >>> match the logical ordering. Not only might you have several elements with >>> the same timestamp, but in reality time skew across backend servers can >>> cause the events to have timestamps in reverse order of the actual >>> causality order. >&g

Re: [DISCUSS] Autoformat python code with Black

2019-05-27 Thread Robert Bradshaw
I'm generally in favor of autoformatters, though I haven't looked at how well this particular one works. We might have to go with https://github.com/desbma/black-2spaces given https://github.com/python/black/issues/378 . On Mon, May 27, 2019 at 10:43 PM Pablo Estrada wrote: > > This looks pretty

Re: Hazelcast Jet Runner

2019-05-27 Thread Robert Bradshaw
I also favor explicit opt-in, especially when you're mixing mature and new components. A differently-named, but still published, artifact seems preferable IMHO to long lived branches. I don't have a handle on how problematic this would be in practice (e.g. how would a user know to update the name.

Re: Environments for External Transforms

2019-05-27 Thread Robert Bradshaw
r. +1 And, on a pragmatic note, it'd be good to share the port with the artifact server as well, in which case the job server could say "serve artifacts to me" without having to worry about any intervening port forwarding, etc. that sits between it and the sdk. > On 27.05.19 13:3

Re: Proposal: Portability SDKHarness Docker Image Release with Beam Version Release.

2019-05-27 Thread Robert Bradshaw
Sounds like everyone's onboard with the plan. Any chance we could publish these for the upcoming 2.13 release? On Wed, Feb 6, 2019 at 6:29 PM Łukasz Gajowy wrote: > > +1 to have a registry for images accessible to anyone. For snapshot images, I > agree that gcr + apache-beam-testing project seem

Re: Environments for External Transforms

2019-05-27 Thread Robert Bradshaw
ironment? That would solve the problem of updating the expansion > > service, although it adds additional complexity for bringing up the > > environment. > > > > > > Which environment would be used to perform the expansion? I think > > thi

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Robert Bradshaw
On Fri, May 24, 2019 at 5:32 AM Reuven Lax wrote: > > On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote: >> >> >> >> On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: >>> >>> >>> >>> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: > > A few obvious problems with this code: > 1.

Re: Definition of Unified model

2019-05-23 Thread Robert Bradshaw
t outputs on >> >> checkpoint in sink). That implies that if you don't have sink that is >> >> able to commit outputs atomically on checkpoint, the pipeline >> >> execution should be deterministic upon retries, otherwise shadow >> >> writes

Re: @RequireTimeSortedInput design draft

2019-05-23 Thread Robert Bradshaw
Thanks for writing this up. I think the justification for adding this to the model needs to be that it is useful (you have this covered, though some examples would be nice) and that it's something that can't easily be done by users themselves (specifically, though it can be (relatively) cheaply do

Re: Environments for External Transforms

2019-05-23 Thread Robert Bradshaw
n which to perform the expansion (though this would probably not be offered by most, let alone all, expansion services). > On 23.05.19 11:31, Robert Bradshaw wrote: > > On Wed, May 22, 2019 at 6:17 PM Maximilian Michels > <mailto:m...@apache.org>> wrote: > > > > Hi,

Re: Environments for External Transforms

2019-05-23 Thread Robert Bradshaw
On Thu, May 23, 2019 at 11:07 AM Maximilian Michels wrote: > My motivation was to get rid of the Docker dependency for the Python VR > tests. Similarly to how we use Python's LOOPBACK environment for > executing all non-cross-language tests, I wanted to use Java's EMBEDDED > environment to run th

Re: Environments for External Transforms

2019-05-23 Thread Robert Bradshaw
On Wed, May 22, 2019 at 6:17 PM Maximilian Michels wrote: > Hi, > > Robert and me were discussing on the subject of user-specified > environments for external transforms [1]. We couldn't decide whether > users should have direct control over the environment when they use an > external transform i

Re: [Discussion] A tweak to existing large iterable protocol?

2019-05-22 Thread Robert Bradshaw
ng, but I have trouble seeing how an SDK would take advantage of this (what would trigger the discard of the first page? What complexity would that incur vs. the marginal benefit (if any)?). Mostly this feels like a solution in search of a problem. Is there a problem that you're trying to solv

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Robert Bradshaw
ser themselves, should we elevate this to a property of > (Stateful?)DoFns that the runner can provide? I think a compelling > argument can be made here that we should. > > +1 > > Jan > > On 5/21/19 11:07 AM, Robert Bradshaw wrote: > > On Mon, May 20, 2019 at 5:24 PM

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Robert Bradshaw
>> StatefulParDo in the first place. >>> >>>> >>>> > Pipelines that fail in the "worst case" batch scenario are likely to >>>> degrade poorly (possibly catastrophically) when the watermark falls >>>> behind in streaming mode

Re: [Discussion] A tweak to existing large iterable protocol?

2019-05-21 Thread Robert Bradshaw
The primary con I see with this is that the runner must know ahead of time, when it starts encoding the iterable, whether or not to treat it as a large one (to also cache the part it's encoding to the data channel, or at least some kind of pointer to it). With the current protocol it can be lazy, a

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Robert Bradshaw
t; I see the first two options quite equally good, although the letter one > is probably more time consuming to implement. But it would bring > additional feature to streaming case as well. > > Thanks for any thoughts. > > Jan > > On 5/20/19 12:41 PM, Robert Bradshaw wrote: &g

Re: [BEAM-6673] pre-commit checks failing due to test failure in unrelated Docker module

2019-05-20 Thread Robert Bradshaw
I created https://issues.apache.org/jira/browse/BEAM-7367 On Mon, May 20, 2019 at 3:11 PM Michael Luckey wrote: > > This is most likely caused by Merge of > https://issues.apache.org/jira/browse/BEAM-7349, which was done lately. > > Best, > > michel > > On Mon, May 20, 2019 at 2:49 PM Charith El

Re: [BEAM-6673] pre-commit checks failing due to test failure in unrelated Docker module

2019-05-20 Thread Robert Bradshaw
I have no idea about this failure, but it sounds like you've done due diligence looking into it at this point and it makes sense to ask some reviewers to take a look at your code which can happen in parallel to figuring out the root cuase of this kafka issue before it finally gets submitted. On Mo

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Robert Bradshaw
On Fri, May 17, 2019 at 4:48 PM Jan Lukavský wrote: > > Hi Reuven, > > > How so? AFAIK stateful DoFns work just fine in batch runners. > > Stateful ParDo works in batch as far, as the logic inside the state works for > absolutely unbounded out-of-orderness of elements. That basically > (practica

Re: Problem with gzip

2019-05-15 Thread Robert Bradshaw
On Wed, May 15, 2019 at 8:43 PM Allie Chen wrote: > Thanks all for your reply. I will try each of them and see how it goes. > > The experiment I am working now is similar to > https://stackoverflow.com/questions/48886943/early-results-from-groupbykey-transform, > which tries to get early results

Re: SqlTransform Metadata

2019-05-15 Thread Robert Bradshaw
On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles wrote: > > On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw wrote: >> >> Isn't there an API for concisely computing new fields from old ones? >> Perhaps these expressions could contain references to metadata value &g

Re: PardoLifeCycle: Teardown after failed call to setup

2019-05-15 Thread Robert Bradshaw
? > > > > On Wed, May 15, 2019, 6:51 AM Robert Bradshaw wrote: >> >> This does bring up an interesting question though. Are runners >> violating (the intent of) the spec if they simply abandon/kill workers >> rather than gracefully bringing them down (e.g. so th

Re: PardoLifeCycle: Teardown after failed call to setup

2019-05-15 Thread Robert Bradshaw
This does bring up an interesting question though. Are runners violating (the intent of) the spec if they simply abandon/kill workers rather than gracefully bringing them down (e.g. so that these callbacks can be invoked)? On Tue, May 7, 2019 at 3:55 PM Michael Luckey wrote: > > Thanks Kenn and R

Re: Problem with gzip

2019-05-15 Thread Robert Bradshaw
On Wed, May 15, 2019 at 1:00 PM Robert Bradshaw wrote: >> >> Unfortunately the "write" portion of the reshuffle cannot be >> parallelized more than the source that it's reading from. In my >> experience, generally the read is the bottleneck in this case, but

Re: pickling typing types in Python 3.5+

2019-05-15 Thread Robert Bradshaw
(2) seems reasonable. On Tue, May 14, 2019 at 3:15 AM Udi Meiri wrote: > > It seems like pickling of typing types is broken in 3.5 and 3.6, fixed in 3.7: > https://github.com/python/typing/issues/511 > > Here are my attempts: > https://gist.github.com/udim/ec213305ca865390c391001e8778e91d > > > M

Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-15 Thread Robert Bradshaw
+1 for removing the code given the current state of things. On Wed, May 15, 2019 at 12:32 AM Ruoyun Huang wrote: > > +1 > > From: Daniel Oliveira > Date: Tue, May 14, 2019 at 2:19 PM > To: dev > >> Hello everyone, >> >> I'm calling for a vote on removing the deprecated Java Reference Runner >>

Re: Problem with gzip

2019-05-15 Thread Robert Bradshaw
Unfortunately the "write" portion of the reshuffle cannot be parallelized more than the source that it's reading from. In my experience, generally the read is the bottleneck in this case, but it's possible (e.g. if the input compresses extremely well) that it is the write that is slow (which you se

Re: SqlTransform Metadata

2019-05-15 Thread Robert Bradshaw
Isn't there an API for concisely computing new fields from old ones? Perhaps these expressions could contain references to metadata value such as timestamp. Otherwise, Rather than withMetadata reifying the value as a nested field, with the timestamp, window, etc. at the top level, one could let it

Re: Developing a new beam runner for Twister2

2019-05-15 Thread Robert Bradshaw
I would strongly suggest new runners adapt the portability runner from the start, which will be more forward compatible and more flexible (e.g. supporting other languages). The primary difference is that rather than wrapping individual DoFns, one wraps a "fused" bundle of DoFns (called an Executabl

Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Robert Bradshaw
Congratulations, Pablo! From: Ismaël Mejía Date: Wed, May 15, 2019 at 10:22 AM To: > Congrats Pablo, well deserved, nece to see your work recognized! > > On Wed, May 15, 2019 at 9:59 AM Pei HE wrote: > > > > Congrats, Pablo! > > > > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli > > wrot

Re: Python SDK timestamp precision

2019-05-10 Thread Robert Bradshaw
On Thu, May 9, 2019 at 9:32 AM PM Kenneth Knowles wrote: > From: Robert Bradshaw > Date: Wed, May 8, 2019 at 3:00 PM > To: dev > >> From: Kenneth Knowles >> Date: Wed, May 8, 2019 at 6:50 PM >> To: dev >> >> >> The end-of-window, for firing, can

Re: Streaming pipelines in all SDKs!

2019-05-09 Thread Robert Bradshaw
From: Chamikara Jayalath Date: Thu, May 9, 2019 at 7:49 PM To: dev > From: Maximilian Michels > Date: Thu, May 9, 2019 at 9:21 AM > To: > >> Thanks for sharing your ideas for load testing! >> >> > According to other contributors knowledge/experience: I noticed that >> > streaming with KafkaIO

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Robert Bradshaw
ll of that with a RowCoder we understood to designate the type(s), right? > On Thu, May 9, 2019 at 10:16 AM Robert Bradshaw wrote: >> >> On the flip side, Schemas are equivalent to the space of Coders with >> the addition of a RowCoder and the ability to materialize to som

<    3   4   5   6   7   8   9   10   11   12   >