Re: Handling large values

2018-11-28 Thread Robert Bradshaw
On Wed, Nov 28, 2018 at 11:57 PM Lukasz Cwik wrote: > > Re-adding +datapls-portability-t...@google.com > +datapls-unified-wor...@google.com > > On Wed, Nov 28, 2018 at 2:23 PM Robert Bradshaw wrote: >> >> Thanks for bringing this to the list. More below. >>

Re: Handling large values

2018-11-28 Thread Robert Bradshaw
Thanks for bringing this to the list. More below. On Wed, Nov 28, 2018 at 11:10 PM Kenneth Knowles wrote: > FWIW I deliberately limited the thread to not mix public and private > lists, so people intending private replies do not accidentally send to > dev@beam. > > I've left them on this time,

Re: Evolving a Coder for an added field

2018-11-26 Thread Robert Bradshaw
nding of the Coder machinery to be able >>> to design a solution, so I'd need to hand this off or simply leave it in >>> the Jira backlog. >>> >>> [0] https://github.com/apache/beam/pull/6914 >>> >>> >>> On Tue, Nov 6, 2018 at 4:38 AM Rob

Re: Reading CSV from google cloud storage to Data Flow

2018-11-26 Thread Robert Bradshaw
The same holds true in Python: Read the files with TextIO and follow with a Map operation that splits the lines into records. This, of course, only works if you don't have newlines within your records. In that case, you may need to use a DoFn that takes as input each filename and reads the
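The line-splitting step described above can be sketched with the standard `csv` module. The function name `parse_csv_line` is hypothetical; in a pipeline it would presumably be handed to `beam.Map` after `beam.io.ReadFromText`. As the message notes, this only holds when no record contains an embedded newline.

```python
import csv


def parse_csv_line(line):
    """Split a single CSV line into a list of fields, honoring quoted commas."""
    # csv.reader accepts any iterable of strings, so wrap the one line in a list.
    return next(csv.reader([line]))


if __name__ == "__main__":
    for line in ['a,b,c', '1,"x, y",3']:
        print(parse_csv_line(line))
```

Using `csv.reader` rather than `line.split(",")` keeps quoted fields that contain commas intact.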

Re: [DISCUSS] Reverting commits on green post-commit status

2018-11-20 Thread Robert Bradshaw
to figure out >> whether the problem can be solved upstream or downstream, or with a >> combination of both. >> >> I think Thomas wanted to address this latter case. It seems like we're >> all more or less on the same page. The core problem is more related to >> co

Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-20 Thread Robert Bradshaw
mon and seems to be supported by Java (BigDecimal), > Python (decimal module) and Go (via shopspring/decimal). B is a close > second since many languages can convert it. > Any reason to not just use double? (Do we need arbitrary/fixed precision for anything?) > On Tue, Nov 20, 2018 at
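As a small illustration of what is at stake in the double-versus-decimal question, the sketch below compares binary floating point with Python's `decimal` module (the module named in the quoted text). It is only an example of the general trade-off, not a statement about what Beam schemas do.

```python
from decimal import Decimal

# Binary doubles cannot represent most decimal fractions exactly,
# so the sum picks up a small rounding error.
float_sum = 0.1 + 0.2

# Decimal keeps the base-10 digits exactly as written.
decimal_sum = Decimal("0.1") + Decimal("0.2")

if __name__ == "__main__":
    print(float_sum == 0.3)
    print(decimal_sum == Decimal("0.3"))
```

Whether that difference matters depends on whether exact base-10 values (currency, for instance) need to round-trip through the field.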

Re: E-mail Organization

2018-11-20 Thread Robert Bradshaw
I was about to suggest tags in subject lines as well. Easier to see in email listings than anything in the body. On Mon, Nov 19, 2018 at 7:22 PM Lukasz Cwik wrote: > Putting the tags in the subject line is inline with the style of what we > currently do using [DISCUSS], [VOTE], [BEAM-YYY] so I

Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-20 Thread Robert Bradshaw
ly we could have an unbounded >>> PCollection goto a BoundedPerElement DoFn and that will produce an >>> unbounded PCollection. Restrictions.IsBounded is used during pipeline >>> execution to inform the runner whether a restriction being returned is >>> bounde

Re: [DISCUSS] Reverting commits on green post-commit status

2018-11-19 Thread Robert Bradshaw
If something breaks Beam's post (or especially pre) commit tests, I agree that rollback is typically the best option and can be done quickly. The situation is totally different if it breaks downstream projects, in which case Kenn's three points are good criteria for determining if we should roll back,

Re: Python profiling

2018-11-16 Thread Robert Bradshaw
le runners, either disable container cleanup >> (using --retainDockerContainers=true) or use remote distributed file >> system path. >> >> On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw >> wrote: >> >>> Any portable runner should pick it up automatical

Spotless and lint precommit

2018-11-13 Thread Robert Bradshaw
I really like how Spotless runs separately and quickly for Java code. Should we do the same for Python lint?

Re: [PROPOSAL] ParquetIO support for Python SDK

2018-11-13 Thread Robert Bradshaw
Was there resolution on how to handle row group size, given that it's hard to pick a decent default? IIRC, the ideal was to base this on byte sizes; will this be in v1 or will there be other parameter(s) that we'll have to support going forward? On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee wrote:

Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-09 Thread Robert Bradshaw
ifies causing problems for so many users. >> >> Reuven >> >> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw wrote: >>> >>> Yes, microseconds is a good compromise for covering a long enough >>> timespan that there's little reason it could be
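The timespan trade-off behind the microseconds suggestion can be checked with a little arithmetic. The sketch below assumes timestamps are stored as a signed 64-bit count of ticks since an epoch; that storage choice is an assumption made for illustration, not something stated in the thread.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
MAX_INT64 = 2**63 - 1  # largest value of a signed 64-bit integer (assumed storage)


def years_representable(ticks_per_second):
    """Years on one side of the epoch that fit in a signed 64-bit tick count."""
    return MAX_INT64 / ticks_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for name, ticks in [("millis", 10**3), ("micros", 10**6), ("nanos", 10**9)]:
        print(name, round(years_representable(ticks)))
```

Under that assumption microseconds cover a few hundred thousand years either side of the epoch, while nanoseconds cover only a few hundred.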

Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-09 Thread Robert Bradshaw
+1 approve. On Fri, Nov 9, 2018 at 2:47 AM Ahmet Altay wrote: > > Hi all, > > Please review the following statement: > > "2.7.0 branch will be marked as the long-term-support (LTS) release branch. > This branch will be supported for a window of 6 months starting from the day > it is marked as

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Robert Bradshaw
I'd assume you're compiling the code with Cython as well? (If you're using the default containers, that should be fine.) On Fri, Nov 9, 2018 at 12:09 AM Robert Bradshaw wrote: > > Very cool to hear of this progress on Samza! > > Python protocol buffers are extraordinarily slow (lots o

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Robert Bradshaw
Very cool to hear of this progress on Samza! Python protocol buffers are extraordinarily slow (lots of reflection, attributes lookups, and bit fiddling for serialization/deserialization that is certainly not Python's strong point). Each bundle processed involves multiple protos being constructed

Re: Can't define a pytype alias from Beam's PCollection type.

2018-11-08 Thread Robert Bradshaw
On Wed, Nov 7, 2018 at 10:30 PM Zach Moshe wrote: > > (Adding the public Beam-dev group) > > On Wed, Nov 7, 2018 at 2:26 PM Zach Moshe wrote: >> >> Hi, >> I've noticed that `beam.core.pvalue.PCollection` doesn't support a >> `__getitem__()` that returns a `GenericMeta` type (like regular types

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-11-08 Thread Robert Bradshaw
There are two questions here: (A) What do we do in the short term? I think adding every runner option to every SDK is not sustainable (n*m work, assuming every SDK knows about every runner), and having a patchwork of options that were added as one-offs to SDKs is not desirable either. Furthermore,

Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-07 Thread Robert Bradshaw
I think that not returning the user's specific subclass should be fine. Does the removal of markDone imply that the consumer always knows a "final" key to claim on any given restriction? On Wed, Nov 7, 2018 at 1:45 AM Lukasz Cwik wrote: > > I have started to work on how to change the user facing

Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-07 Thread Robert Bradshaw
ownstream consequences for all runners. >>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía wrote: >>>> >>>> +1 to more precision even to the nano level, probably via Reuven's >>>> proposal of a different internal representation. >>>>

Re: Evolving a Coder for an added field

2018-11-06 Thread Robert Bradshaw
> > > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré wrote: >> >> It makes sense to have a more concrete URN including the version. >> >> Good idea Robert. >> >> Regards >> JB >> >> On 05/11/2018 16:52, Robert Bradshaw wrote: >

Re: Stackoverflow Questions

2018-11-06 Thread Robert Bradshaw
People who ask on SO probably expect to be answered on SO. I don't think it makes sense to subscribe users to automated emails like this, but a weekly (or maybe even daily) summary to dev would probably be helpful in effectively getting these questions answered. On Tue, Nov 6, 2018 at 9:27 AM

Re: [DISCUSS] More precision supported by DATETIME field in Schema

2018-11-06 Thread Robert Bradshaw
+1 to offering more granular timestamps in general. I think it will be odd if setting the element timestamp from a row DATETIME field is lossy, so we should seriously consider upgrading that as well. On Tue, Nov 6, 2018 at 6:42 AM Charles Chen wrote: > > One related issue that came up before is

Re: Evolving a Coder for an added field

2018-11-05 Thread Robert Bradshaw
I think we'll want to allow upgrades across SDK versions. A runner should be able to recognize when a coder (or any other aspect of the pipeline) has changed and adapt/reject accordingly. (Until we remove coders from sources/sinks, there's also possibly the expectation that one should be able to

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-11-05 Thread Robert Bradshaw
? For example, are we going to cut regular patch releases > for supported branch (release-2.7.0) within the supported period that fixes > known issues ? My preference is to keep existing policy on this regard. > > Thanks, > Cham > > On Mon, Nov 5, 2018 at 5:12 AM Robert Bradsha

Re: [PROPOSAL] Additional design for the Beam Python State and Timers API

2018-11-05 Thread Robert Bradshaw
; are required to be processed serially (generally key+window) These probably don't merit a new (pair of) named operation(s). The motivation to add them was "why can I use state in a DoFn, but not in Map/FlatMap?" which could be justified by the above. As for the side input question, de

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-26 Thread Robert Bradshaw
Thanks Tim! This was my only hesitation, and sounds like we're in the clear here. +1 (binding) On Fri, Oct 26, 2018 at 5:05 PM Tim Robertson wrote: > > A colleague and I tested on 2.7.0 and 2.8.0RC1: > > 1. Quickstart on Spark/YARN/HDFS (CDH 5.12.0) (commented in spreadsheet) > 2. Our Avro to

Re: Unbalanced FileIO writes on Flink

2018-10-26 Thread Robert Bradshaw
an be overriden? However, > it is not used by the WritesFiles code. > > > -Max > > On 26.10.18 11:41, Robert Bradshaw wrote: > > I think it's worth adding a URN for the operation of distributing > > "evenly" into an "appropriate" number of shards.

Re: Follow up ideas, to simplify creating MonitoringInfos.

2018-10-24 Thread Robert Bradshaw
Thanks for bringing this to the list; it's a good question. I think the difficulty comes from trying to statically define a list of possibilities that should instead be runtime values. E.g. we're currently up to about a dozen distinct types, and having a setter for each is both verbose and

Re: Java Precommit duration

2018-10-23 Thread Robert Bradshaw
On Tue, Oct 23, 2018 at 11:28 PM Kenneth Knowles wrote: > Hi all, > > Java Precommit duration is about 1h15. That is quite a burden. Especially > if something gets broken. > I'm in favor of (simple!) build breaks going in before precommits finish, on the promise that the offending test(s)

Re: [PROPOSAL] Move sorting to sdks-java-core

2018-10-23 Thread Robert Bradshaw
I like the idea of asking for a coder for T with properties X. (E.g. the order-preserving one may not be the most efficient, so a poor default, but required in some cases.) Note that if we go the route of secondary-key-extraction, we don't even need a full coder here, just an order-preserving
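To make "order-preserving" concrete: an encoding has that property when comparing encoded bytes gives the same order as comparing the original values. The sketch below uses fixed-width unsigned integers via `struct`; the helper names are hypothetical and this is not Beam's coder API.

```python
import struct


def encode_big_endian(n):
    """Fixed-width big-endian bytes: byte order matches numeric order."""
    return struct.pack(">Q", n)


def encode_little_endian(n):
    """Fixed-width little-endian bytes: same size, but byte order differs."""
    return struct.pack("<Q", n)


values = [1, 256, 2, 65536, 255]
by_big = sorted(values, key=encode_big_endian)
by_little = sorted(values, key=encode_little_endian)

if __name__ == "__main__":
    print(by_big)
    print(by_little)
```

Both encodings are valid for round-tripping values, which is why the ordering property has to be requested explicitly rather than assumed of whatever coder happens to be the default.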

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-10-22 Thread Robert Bradshaw
d Blog >> <http://rmannibucau.wordpress.com> | Github >> <https://github.com/rmannibucau> | LinkedIn >> <https://www.linkedin.com/in/rmannibucau> | Book >> <https://www.packtpub.com/application-development/java-ee-8-high-performance> >> >> >

Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-18 Thread Robert Bradshaw
and often. The cost of test failures in postsubmit is *significantly* higher, we should only put stuff we can't test earlier there. (If we do move things, I would suggest we keep at least some of the gcp and py3 tests in presubmit if we can't afford to run the whole suite). > On Thu, Oct 18, 2

Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-18 Thread Robert Bradshaw
all the different environments. On Wed, Oct 17, 2018 at 10:17 AM Udi Meiri wrote: > >> On Wed, Oct 17, 2018 at 1:38 AM Robert Bradshaw >> wrote: >> >>> On Tue, Oct 16, 2018 at 12:48 AM Udi Meiri wrote: >>> >>>> Hi, >>>> >>>

Re: [PROPOSAL] Move sorting to sdks-java-core

2018-10-18 Thread Robert Bradshaw
+1 to splitting out the Hadoop deps. As has been said, there's no need to move it to core for runners to optimize this. But perhaps a case could be made that this belongs in core? (On the other hand, recent discussions indicate a desire to make core even smaller.) Also, +1 to re-thinking the

Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-18 Thread Robert Bradshaw
probably less of a burden requiring this for developers, and would simplify some of our code. However, there's probably only a small subset that merits testing with Cython and without. > On Thu, Oct 18, 2018 at 12:45 AM Robert Bradshaw > wrote: > >> We run the full suite of Python unit t

Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-18 Thread Robert Bradshaw
ns the same set of tests, one with --streaming and the other >>> without. This should be able to work for Python as well. >>> >>> The Worker API had some updates in the latest Gradle release but still >>> seems rough to use. >>> >>> On Wed, Oct 17, 2018 at

Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-18 Thread Robert Bradshaw
On Wed, Oct 17, 2018 at 7:17 PM Udi Meiri wrote: > On Wed, Oct 17, 2018 at 1:38 AM Robert Bradshaw > wrote: > >> On Tue, Oct 16, 2018 at 12:48 AM Udi Meiri wrote: >> >>> Hi, >>> >>> In light of increasing Python pre-commit times due to the

Re: Java > 8 support

2018-10-18 Thread Robert Bradshaw
On Thu, Oct 18, 2018 at 4:55 AM Kenneth Knowles wrote: > Just to add to what Luke said: The reason we had those Java 8-only modules > was because some underlying tech (example: Gearpump) required Java 8. If an > engine requires something then it is OK for a user who chooses the runner > for that

Re: Integrating Stateful DoFns from the Python SDK

2018-10-17 Thread Robert Bradshaw
my stateful DoFn: > > > >.with_output_types(typehints.KV[K, V]) > > > > For some reason `.with_input_types(typehints.KV[K, V])` on my stateful > > DoFn did not work. > > > > Until we enforce KV during pipeline construction, we will have to throw > >

Re: [PROPOSAL] allow the users to anticipate the support of features in the targeted runner.

2018-10-17 Thread Robert Bradshaw
On Wed, Oct 17, 2018 at 3:17 PM Kenneth Knowles wrote: > On Wed, Oct 17, 2018 at 3:12 AM Maximilian Michels wrote: > >> A dry-run feature would be useful, i.e. the user can run an inspection >> on the pipeline to see if it contains any features which are not >> supported by the Runner. >> > >

Re: Integrating Stateful DoFns from the Python SDK

2018-10-17 Thread Robert Bradshaw
Yes, we should be enforcing keyness (and use of KeyCoder with) stateful DoFns, similar to what we do for GBKs. See e.g. https://github.com/apache/beam/pull/6304#issuecomment-421935375 (This possibly relates to a long-standing issue that the coder inference should be moved up into construction, or

Re: [PROPOSAL] Using Bazel and Docker for Python SDK development and tests

2018-10-17 Thread Robert Bradshaw
On Tue, Oct 16, 2018 at 12:48 AM Udi Meiri wrote: > Hi, > > In light of increasing Python pre-commit times due to the added Python 3 > tests, > I thought it might be time to re-evaluate the tools used for Python tests > and development, and propose an alternative. > > Currently, we use

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-16 Thread Robert Bradshaw
to be nested within an option. This >>> is amplified by there being two Runners the user needs to be aware of, >>> i.e. PortableRunner and the actual Runner (Dataflow/Flink/Spark..). >>> >>> I feel like we would eventually replicate all options in the SDK b

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-10-16 Thread Robert Bradshaw
nt >>> to find out what caused this change. >>> >>> I believe we can improve our commit guidelines in this way and it should >>> help to have commit history more clean and easy to read. >>> >>> On 1 Oct 2018, at 06:34, Kenneth Knowles wrote: >

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-16 Thread Robert Bradshaw
its JSON form? > > On Mon, Oct 15, 2018 at 2:41 PM Robert Bradshaw > wrote: > >> On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik wrote: >> > >> > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw >> wrote: >> >> >> >>

Re: Rethinking Timers as PCollections

2018-10-16 Thread Robert Bradshaw
eems like the pun of "PCollection" for so many purposes is hitting its limit. >> >> Timers should fire according to just the watermark of the data input, but nevertheless are a hold on GC and also output watermark. >> >> Kenn >> >> On Thu, Oct 4, 20

Re: [Proposal] Euphoria DSL - looking for reviewers

2018-10-16 Thread Robert Bradshaw
Ideally one (or all) of you can become committers [1], which I think should be the goal. While for the time being this would involve the overhead of getting existing committers to sign off on PRs (which can be reviewed by others as well), this can actually be beneficial as it will be a forcing

Re: Why not adding all coders into ModelCoderRegistrar?

2018-10-16 Thread Robert Bradshaw
Any coder added to the ModelCoderRegistrar requires support from *all* SDKs, which is why that set is chosen sparingly. Could you clarify exactly what you're trying to achieve? It sounds like there's some case where you know the SDK will submit a KV with a Void and/or VarIntCoder in the key, and

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-15 Thread Robert Bradshaw
On Mon, Oct 15, 2018 at 11:30 PM Lukasz Cwik wrote: > > On Mon, Oct 15, 2018 at 1:17 PM Robert Bradshaw wrote: >> >> On Mon, Oct 15, 2018 at 7:50 PM Lukasz Cwik wrote: >> > >> > I agree with the sentiment for better error checking. >> > >>

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-15 Thread Robert Bradshaw
feature that SDKs may want to support, but I wouldn't want to require this complexity for bootstrapping an SDK. Regarding always keeping runner options separate, +1, though I'm not sure the line is always clear. > On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw wrote: >> >> On Mon

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-15 Thread Robert Bradshaw
On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels wrote: > > I agree that the current approach breaks the pipeline options contract > because "unknown" options get parsed in the same way as options which > have been defined by the user. FWIW, I think we're already breaking this "contract."

Re: Splitting the repo

2018-10-12 Thread Robert Bradshaw
CI but the dev are never affected by >>> that and the build does not mess up their machines as well. >>> >>> Today the main blocker is that default "profile" (script) is not matching >>> dev persona and therefore there is no real hope to have external

Re: post-commit failure emails

2018-10-12 Thread Robert Bradshaw
I agree the Jenkins emails are spammy (to the point that I honestly can't follow all of them). +1 to emailing "suspects" as defined by those that impacted the build in the window when it turned from green to red. On Fri, Oct 12, 2018 at 12:55 AM Udi Meiri wrote: > > The email trigger is setup to trigger on

Re: Python SDK: .options deprecation

2018-10-12 Thread Robert Bradshaw
Correct. Among other things, we don't want to expose the choice of runner during pipeline construction (perhaps it's even deferred), or characteristics like streaming vs. batch (the runner should be able to make this choice on its own). This was not yet pushed all the way through in Python as it

Re: [DISCUSS] - Separate JIRA notifications to a new mailing list

2018-10-11 Thread Robert Bradshaw
Huge +1 from me too. On Thu, Oct 11, 2018 at 2:42 PM Jean-Baptiste Onofré wrote: > > +1 > > We are doing the same in Karaf as well. > > Regards > JB > > On 11/10/2018 14:35, Colm O hEigeartaigh wrote: > > Hi all, > > > > Apologies in advance if this has already been discussed (and rejected). > >

Re: Splitting the repo

2018-10-10 Thread Robert Bradshaw
> >> > For the clean point it is quite linked to the build tools and fake env >> > for not native modules for the build tool (go for gradle which is java >> > first for instance). This is why having a real build which is natural >> > per language would be

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-10-10 Thread Robert Bradshaw
s a branch where we will cherry-pick some important fixes in > >>> the future and where we will cut release. It's the approach I use in > >>> other Apache projects (especially Karaf) and it works fine. > >> > >> > >> JB, does Karaf has a documented p

Re: [DISCUSS] Gradle for the build ?

2018-10-10 Thread Robert Bradshaw
Some rough stats (because I was curious): The gradle files have been edited by ~79 unique contributors over 696 distinct commits, whereas the maven ones were edited (over a longer time period) by ~130 unique contributors over 1389 commits [1]. This doesn't capture how much effort was put into

Re: Splitting the repo

2018-10-10 Thread Robert Bradshaw
o introduce core-sql, core-schema, >> core-sdf, ... >> >> It's not a huge effort, and would allow us to move forward on Beam "more >> API oriented" approach. >> >> Regards >> JB >> >> On 10/10/2018 10:12, Robert Bradshaw wrote: >> >

Re: Splitting the repo

2018-10-10 Thread Robert Bradshaw
interface/SPI. >> >> Our users would then be able to pick the part of the core they want, >> resulting with lighter artifacts, and for us, it gives a more flexible >> approach. >> >> Regards >> JB >> >> On 10/10/2018 10:26, Robert Bradshaw wrote

Re: Splitting the repo

2018-10-10 Thread Robert Bradshaw
> I discussed with Luke and Reuven to introduce core-sql, core-schema, > > core-sdf, ... > > > > It's not a huge effort, and would allow us to move forward on Beam > "more > > API oriented" approach. > > > > R

Re: Splitting the repo

2018-10-10 Thread Robert Bradshaw
nted" approach. > > Regards > JB > > On 10/10/2018 10:12, Robert Bradshaw wrote: > > Hi everyone, > > > > While IMHO it's too early to even be able to split the repo, it's not too > > early to talk about it, and I wanted to spin this off to keep the o

Re: [DISCUSS] Gradle for the build ?

2018-10-10 Thread Robert Bradshaw
On Wed, Oct 10, 2018 at 8:03 AM Jean-Baptiste Onofré wrote: > Hi Robert, > > about your point about we never fully build the project, even if I > agree, it's what we "sold" with Gradle. > Because, with Maven you can also build a single module without problem. > Good incremental support for the

Splitting the repo

2018-10-10 Thread Robert Bradshaw
Hi everyone, While IMHO it's too early to even be able to split the repo, it's not too early to talk about it, and I wanted to spin this off to keep the other thread focused. In particular, I am trying to figure out exactly what is hoped to be gained by splitting things up. In my experience, a

Re: [DISCUSS] Gradle for the build ?

2018-10-09 Thread Robert Bradshaw
On Tue, Oct 9, 2018 at 10:04 AM Jean-Baptiste Onofré wrote: > Hi guys, > > I know that's a hot topic, but I have to bring this discussion on the > table. > Thank you for bringing this up and revisiting it now that we have some experience. > Some months ago, we discussed about migrating our

Re: Beam website sources migrated to apache/beam

2018-10-08 Thread Robert Bradshaw
> >> We hope the new contribution experience will be seamless and make your > >> website contributions the best part of your day. If you find any rough > >> edges or areas for improvement, please add them to the fit-and-finish > >> list here: https://issues.apache.

Re: A new contributor

2018-10-05 Thread Robert Bradshaw
Be glad to do that. Done. On Fri, Oct 5, 2018 at 12:03 PM Gleb Kanterov wrote: > Hi all, > > My name is Gleb and I work on Data Infrastructure at Spotify. We use > Apache Beam and develop spotify/scio . > Time-to-time I create JIRA issues and submit pull

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-10-05 Thread Robert Bradshaw
On Fri, Oct 5, 2018 at 3:59 AM Chamikara Jayalath wrote: > > On Thu, Oct 4, 2018 at 9:39 AM Ahmet Altay wrote: > >> I agree that LTS releases require more thought. Thank you for raising >> these questions. What other open questions do we have related LTS releases? >> >> One way to do this would

Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-04 Thread Robert Bradshaw
+1 to cutting the release. I agree that the LTS label requires more discussion. I think it boils down to the question of whether we are comfortable with encouraging people to not upgrade to the latest Beam. It probably boils down to creating a list of (potential) blockers and then going from

Re: Rethinking Timers as PCollections

2018-10-04 Thread Robert Bradshaw
hich seems to be already the case). Timers as > separate PCollections seems elegant but less practical to me. > > -Max > > [Disclaimer: I could be wrong since I just thought about this in more > detail] > > On 20.09.18 00:28, Robert Bradshaw wrote: > > On Wed, Sep 19, 2018 at

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-29 Thread Robert Bradshaw
>>>> wrote: >>>> >>>>> I brought up this discussion a few months ago from the other side: I >>>>> don't like my commits being squashed. I try to create logical commits that >>>>> each passes tests and could be broken up into multiple

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-28 Thread Robert Bradshaw
ure. > > > https://lists.apache.org/thread.html/8d29e474e681ab9123280164d95075bb8b0b91486b66d3fa25ed20c2@%3Cdev.beam.apache.org%3E > > Andrew > > On Fri, Sep 28, 2018 at 7:29 AM Chamikara Jayalath > wrote: > >> >> >> On Thu, Sep 27, 2018 at 9:51 AM Robert Bradshaw >

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Robert Bradshaw
Something here on the Beam side is clearly linear in the input size, as if there's a bottleneck where we're not able to get any parallelization. Is the Spark variant running in parallel? On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) wrote: > Hi > I have completed my test. > 1. Spark

Re: Removing documentation for old Beam versions

2018-09-27 Thread Robert Bradshaw
he CI process, then I'm in favor of that. It looks cleaner to >>> not mingle source and generated files in the same repo. Otherwise we can do >>> the asf-site branch in the main repo and get rid of docs from it once we >>> found a better solution. >>> >>> >

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-27 Thread Robert Bradshaw
I agree that we should create a good pointer for cleaning up PRs, and request (though not require) that authors do it. It's unfortunate though that squashing during a review makes things difficult to follow, so adds one more round trip. We could consider for those PRs that make sense as a single

Re: Removing documentation for old Beam versions

2018-09-26 Thread Robert Bradshaw
h from a > Git repo, SVN, or a UI-based CMS interface. > > On Wed, Sep 26, 2018 at 9:45 AM Robert Bradshaw > wrote: > >> I am also definitely in favor of a single repository. Perhaps I'm just >> misunderstanding why the generated must be put in a git repository at >> al

Re: Removing documentation for old Beam versions

2018-09-26 Thread Robert Bradshaw
requisite for this effort. The goal of this work is to >>>>>>> improve the reliability of automation for contributing website changes. >>>>>>> At >>>>>>> last measure, only about half of beam-site PR merges use Mergebot >>>

Re: [VOTE] Release 2.7.0, release candidate #3

2018-09-26 Thread Robert Bradshaw
+1 (binding), same verification as before. On Wed, Sep 26, 2018 at 7:36 AM Charles Chen wrote: > To clarify, the only difference between RC2 and RC3 is the Python fix > https://github.com/apache/beam/pull/6494. > > This means that the Java validations from RC2 should carry over, though I >

Re: [VOTE] Release 2.7.0, release candidate #2

2018-09-25 Thread Robert Bradshaw
+1 (binding) I verified all the signatures and hashes, as well as one of the Python wheels, and that we're not shipping gradle[w] but otherwise the content matches the git repo (except a SNAPSHOT vs version change to the source). The changes [1] look minimal compared to RC1, so most of the

Re: [jira] [Commented] (BEAM-5468) Allow runner to set worker log level in Python SDK harness.

2018-09-24 Thread Robert Bradshaw
to set worker log level in Python SDK harness. > > --- > > > > Key: BEAM-5468 > > URL: https://issues.apache.org/jira/browse/BEAM-5468 > > Project: Beam > > Issue Type: Improvement

Re: Compatibility Matrix vs Runners in the code base

2018-09-21 Thread Robert Bradshaw
I don't know that we need to limit the matrix to runners in the Beam codebase (in fact, I could envision a world where most runners live in an upstream codebase), but at the very least each of these runners should have a link to a page about using that runner with Beam. On Fri, Sep 21, 2018 at

Re: Retiring runners?

2018-09-21 Thread Robert Bradshaw
Glad to hear Gearpump is still alive. It is hard to measure how much of a burden these additional runners are at the moment. I suggest that if it comes to a point that non-trivial changes are needed, we reach out to the list. If no one agrees to support it, we could disable the tests and, after

Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-20 Thread Robert Bradshaw
Congratulations Kenn! And thank you, Davor, for the hard work you've put in these last several years. On Thu, Sep 20, 2018 at 9:50 AM Tim Robertson wrote: > Thank you to Davor all the PMC - I can only imagine how much work it has > been to get Beam to where it is today. > > Congratulations

Re: Rethinking Timers as PCollections

2018-09-19 Thread Robert Bradshaw
On Wed, Sep 19, 2018 at 11:54 PM Lukasz Cwik wrote: > > On Wed, Sep 19, 2018 at 2:46 PM Robert Bradshaw > wrote: > >> On Wed, Sep 19, 2018 at 8:31 PM Lukasz Cwik wrote: >> >>> *How does modelling a timer as a PCollection help the Beam model?* >>> >

Re: Rethinking Timers as PCollections

2018-09-19 Thread Robert Bradshaw
tation. The special treatment >> (and slight confusion) at the graph level perhaps was an early warning >> sign, discovering the extra complexity wiring this in a runner should be a >> reason to revisit. >> >> Conceptually timers are special state, they are certainly more state

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-09-19 Thread Robert Bradshaw
GBK - ExecutableStage - GBK - ... (or is it not always a digraph of this form, possibly with branching)? > > Thanks, > Thomas > > > On Fri, Sep 14, 2018 at 2:56 AM Robert Bradshaw > wrote: > >> Currently the best solution we've come up with is that we must process an

Rethinking Timers as PCollections

2018-09-19 Thread Robert Bradshaw
TLDR Perhaps we should revisit https://s.apache.org/beam-portability-timers in light of the fact that Timers are more like State than PCollections. -- While looking at implementing State and Timers in the Python SDK, I've been revisiting the ideas presented at

Re: Proposal for Beam Python User State and Timer APIs

2018-09-19 Thread Robert Bradshaw
And its implementation of the Fn API is on its way too: https://github.com/apache/beam/pull/6349 https://github.com/apache/beam/pull/6433 On Tue, Sep 18, 2018 at 6:56 PM Charles Chen wrote: > An update: the reference DirectRunner implementation of (and common > execution code for) the Python

Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Robert Bradshaw
There are known performance issues with Beam on Spark that are being worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036. It's possible you're hitting something different, but it would be worth investigating. See also

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Robert Bradshaw
On Mon, Sep 17, 2018 at 2:02 AM Austin Bennett wrote: > Do we currently maintain a finer grained list of compatibility between > execution/runner versions and beam versions? Is this only really a concern > with recent Flink (sounded like at least Spark jump, too)? I see the > capability

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Robert Bradshaw
> 4. In the long run, we want a stable abstraction layer for each > Runner > > that, ideally, is maintained by the upstream of the execution > > engine. In > > the short run, this is probably not realistic, as the shared > libraries > >

Re: PTransforms and Fusion

2018-09-15 Thread Robert Bradshaw
On Tue, Sep 11, 2018 at 7:01 PM Henning Rohde wrote: > > Empty pipelines have neither subtransforms nor a spec, which is what I > don't think is useful > There's nothing preventing them from having a spec, display data, etc. They're useful because they (more) faithfully represent the user's

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-15 Thread Robert Bradshaw
+1 (binding) On Sat, Sep 15, 2018 at 6:44 AM Tim wrote: > +1 > > On 15 Sep 2018, at 01:23, Yifan Zou wrote: > > +1 > > On Fri, Sep 14, 2018 at 4:20 PM David Morávek > wrote: > >> +1 >> >> >> >> On 15 Sep 2018, at 00:59, Anton Kedin wrote: >> >> +1 >> >> On Fri, Sep 14, 2018 at 3:22 PM Alan

Re: SparkRunner - GroupByKey

2018-09-14 Thread Robert Bradshaw
is that the user needs to use to get decent performance as it simply doesn't scale (in many directions). Fortunately in this case, though you have to be a bit more careful about things, it is not less efficient. On Fri, Sep 14, 2018 at 4:10 PM Robert Bradshaw wrote: > >> >> If

Re: SparkRunner - GroupByKey

2018-09-14 Thread Robert Bradshaw
If Spark supports producing grouped elements in timestamp order, a more intelligent ReduceFnRunner can be used. (We take advantage of that in Dataflow for example.) For non-merging windows, you could also put the window itself (or some subset thereof) into the key resulting in smaller groupings.
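
The point about non-merging windows can be illustrated with a minimal, runner-independent sketch in plain Python. It is not Beam or Spark code: the fixed 60-second window width, the `(key, timestamp, value)` element layout, and the helper names are all assumptions made for illustration. It only shows that keying by `(key, window)` instead of `key` alone yields smaller groups.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed (non-merging) window width


def window_start(timestamp, size=WINDOW_SIZE):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def group_by_key(elements):
    """Group (key, timestamp, value) triples by key only."""
    groups = defaultdict(list)
    for key, timestamp, value in elements:
        groups[key].append((timestamp, value))
    return dict(groups)


def group_by_key_and_window(elements, size=WINDOW_SIZE):
    """Group by the composite (key, window start), giving smaller groupings."""
    groups = defaultdict(list)
    for key, timestamp, value in elements:
        groups[(key, window_start(timestamp, size))].append((timestamp, value))
    return dict(groups)


# Hypothetical sample data: (key, event timestamp in seconds, value).
elements = [
    ("user1", 5, "a"),
    ("user1", 65, "b"),
    ("user1", 70, "c"),
    ("user2", 10, "d"),
]

by_key = group_by_key(elements)
by_key_and_window = group_by_key_and_window(elements)
```

With key-only grouping, all of a key's values across every window land in one group; with the composite key, each group holds the values of a single window. This only works when windows do not merge, since the window assignment must be final before the grouping happens.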

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-09-14 Thread Robert Bradshaw
Currently the best solution we've come up with is that we must process an unbounded number of bundles concurrently to avoid deadlock. Especially in the batch case, this may be wasteful as we bring up workers for many stages that are not actually executable until upstream stages finish. Since it

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 10:02 AM Romain Manni-Bucau wrote: > > On Fri, Sep 14, 2018 at 09:48, Robert Bradshaw > wrote: > >> On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau >> wrote: >> >>> Well IBM runner is outside Beam for instance so this is

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau wrote: > Well IBM runner is outside Beam for instance so this is not really a point > IMHO. > > My view is simple: > 1. does this module bring anything to Beam as a project: I understand your > answer as a no (please clarify if I'm wrong) > As

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-13 Thread Robert Bradshaw
the SDK >>> environment) from the runner (job service + shared common runner libs + >>> Flink/Spark/Dataflow/Apex/Samza/...). >>> >>> Dataflow would be highly invested in having the appropriate tooling >>> within Apache Beam to support multiple SDK versi
