Splitting the repo

2018-10-10 Thread Robert Bradshaw
Hi everyone, While IMHO it's too early to even be able to split the repo, it's not too early to talk about it, and I wanted to spin this off to keep the other thread focused. In particular, I am trying to figure out exactly what is hoped to be gained by splitting things up. In my experience, a sin

Re: [DISCUSS] Gradle for the build ?

2018-10-09 Thread Robert Bradshaw
On Tue, Oct 9, 2018 at 10:04 AM Jean-Baptiste Onofré wrote: > Hi guys, > > I know that's a hot topic, but I have to bring this discussion on the > table. > Thank you for bringing this up and revisiting it now that we have some experience. > Some months ago, we discussed about migrating our bui

Re: Beam website sources migrated to apache/beam

2018-10-08 Thread Robert Bradshaw
> >> > >> We hope the new contribution experience will be seamless and make your > >> website contributions the best part of your day. If you find any rough > >> edges or areas for improvement, please add them to the fit-and-finish > >> list here: https:/

Re: A new contributor

2018-10-05 Thread Robert Bradshaw
Be glad to do that. Done. On Fri, Oct 5, 2018 at 12:03 PM Gleb Kanterov wrote: > Hi all, > > My name is Gleb and I work on Data Infrastructure at Spotify. We use > Apache Beam and develop spotify/scio . > Time-to-time I create JIRA issues and submit pull requests.

Re: What is required for LTS releases? (was: [PROPOSAL] Prepare Beam 2.8.0 release)

2018-10-05 Thread Robert Bradshaw
On Fri, Oct 5, 2018 at 3:59 AM Chamikara Jayalath wrote: > > On Thu, Oct 4, 2018 at 9:39 AM Ahmet Altay wrote: > >> I agree that LTS releases require more thought. Thank you for raising >> these questions. What other open questions do we have related LTS releases? >> >> One way to do this would

Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-04 Thread Robert Bradshaw
+1 to cutting the release. I agree that the LTS label requires more discussion. I think it boils down to the question of whether we are comfortable with encouraging people to not upgrade to the latest Beam. It probably boils down to creating a list of (potential) blockers and then going from there

Re: Rethinking Timers as PCollections

2018-10-04 Thread Robert Bradshaw
a spec for a DoFn (which seems to be already the case). Timers as > separate PCollections seems elegant but less practical to me. > > -Max > > [Disclaimer: I could be wrong since I just thought about this in more > detail] > > On 20.09.18 00:28, Robert Bradshaw wrote: > >

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-29 Thread Robert Bradshaw
>> >>>> On Fri, Sep 28, 2018 at 8:21 AM Andrew Pilloud >>>> wrote: >>>> >>>>> I brought up this discussion a few months ago from the other side: I >>>>> don't like my commits being squashed. I try to create logical com

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-28 Thread Robert Bradshaw
in the > future. > > > https://lists.apache.org/thread.html/8d29e474e681ab9123280164d95075bb8b0b91486b66d3fa25ed20c2@%3Cdev.beam.apache.org%3E > > Andrew > > On Fri, Sep 28, 2018 at 7:29 AM Chamikara Jayalath > wrote: > >> >> >> On Thu, Sep 27, 2018 at 9:51 AM Robert B

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Robert Bradshaw
Something here on the Beam side is clearly linear in the input size, as if there's a bottleneck where we're not able to get any parallelization. Is the Spark variant running in parallel? On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) wrote: > Hi > I have completed my test. > 1. Spark paramet

Re: Removing documentation for old Beam versions

2018-09-27 Thread Robert Bradshaw
complicate the CI process, then I'm in favor of that. It looks cleaner to >>> not mingle source and generated files in the same repo. Otherwise we can do >>> the asf-site branch in the main repo and get rid of docs from it once we >>> found a better solution. >>

Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-27 Thread Robert Bradshaw
I agree that we should create a good pointer for cleaning up PRs, and request (though not require) that authors do it. It's unfortunate though that squashing during a review makes things difficult to follow, so adds one more round trip. We could consider for those PRs that make sense as a single l

Re: Are there plans for removing joda-time from the beam java SDK?

2018-09-27 Thread Robert Bradshaw
As long as joda stays anywhere in the public API (and removing it would be a backwards incompatible change) we can't drop it as a dependency. While we could provide java.time overloads for time-accepting methods, time-returning methods can't be as transparently interchangeable. I'm not sure whethe

Re: Removing documentation for old Beam versions

2018-09-26 Thread Robert Bradshaw
n publish from a > Git repo, SVN, or a UI-based CMS interface. > > On Wed, Sep 26, 2018 at 9:45 AM Robert Bradshaw > wrote: > >> I am also definitely in favor of a single repository. Perhaps I'm just >> misunderstanding why the generated must be put in a git repository

Re: Removing documentation for old Beam versions

2018-09-26 Thread Robert Bradshaw
>>>> shouldn't be a prerequisite for this effort. The goal of this work is to >>>>>>> improve the reliability of automation for contributing website changes. >>>>>>> At >>>>>>> last measure, only about half of beam-si

Re: [VOTE] Release 2.7.0, release candidate #3

2018-09-26 Thread Robert Bradshaw
+1 (binding), same verification as before. On Wed, Sep 26, 2018 at 7:36 AM Charles Chen wrote: > To clarify, the only difference between RC2 and RC3 is the Python fix > https://github.com/apache/beam/pull/6494. > > This means that the Java validations from RC2 should carry over, though I > reran

Re: [VOTE] Release 2.7.0, release candidate #2

2018-09-25 Thread Robert Bradshaw
+1 (binding) I verified all the signatures and hashes, as well as one of the Python wheels, and that we're not shipping gradle[w] but otherwise the content matches the git repo (except a SNAPSHOT vs version change to the source). The changes [1] look minimal compared to RC1, so most of the verifi
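
As an illustration of the hash check mentioned in this message, here is a minimal sketch in Python. It assumes a release artifact that ships with a published SHA-512 hex digest; the function names and the idea of a digest followed by a file name are assumptions for illustration, not taken from the Beam release tooling.

```python
import hashlib
import hmac


def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of data as a lowercase hex string."""
    return hashlib.sha512(data).hexdigest()


def matches_published_digest(data: bytes, published: str) -> bool:
    """Compare the computed digest with a published one.

    Assumption: the published value is a hex string, optionally followed
    by a file name (as in typical checksum files), in either letter case.
    """
    parts = published.split()
    if not parts:
        return False
    # compare_digest gives a constant-time comparison of the two strings.
    return hmac.compare_digest(sha512_hex(data), parts[0].lower())
```

In practice the bytes would come from reading the downloaded archive, and a signature check (e.g. with GPG) would be a separate step that this sketch does not cover.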

Re: [jira] [Commented] (BEAM-5468) Allow runner to set worker log level in Python SDK harness.

2018-09-24 Thread Robert Bradshaw
Issue Type: Improvement > > Components: sdk-py-harness > >Reporter: Robert Bradshaw > >Assignee: Robert Bradshaw > >Priority: Major > > > > > > > -- > This message was sent by Atlassian JIRA > (v7.6.3#76005) >

Re: Compatibility Matrix vs Runners in the code base

2018-09-21 Thread Robert Bradshaw
I don't know that we need to limit the matrix to runners in the Beam codebase (in fact, I could envision a world where most runners live in an upstream codebase), but at the very least each of these runners should have a link to a page about using that runner with Beam. On Fri, Sep 21, 2018 at 10:

Re: Retiring runners?

2018-09-21 Thread Robert Bradshaw
Glad to hear Gearpump is still alive. It is hard to measure how much of a burden these additional runners are at the moment. I suggest that if it comes to a point that non-trivial changes are needed, we reach out to the list. If no one agrees to support it, we could disable the tests and, after so

Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-20 Thread Robert Bradshaw
Congratulations Kenn! And thank you, Davor, for the hard work you've put in these last several years. On Thu, Sep 20, 2018 at 9:50 AM Tim Robertson wrote: > Thank you to Davor all the PMC - I can only imagine how much work it has > been to get Beam to where it is today. > > Congratulations Kenn!

Re: Rethinking Timers as PCollections

2018-09-19 Thread Robert Bradshaw
On Wed, Sep 19, 2018 at 11:54 PM Lukasz Cwik wrote: > > On Wed, Sep 19, 2018 at 2:46 PM Robert Bradshaw > wrote: > >> On Wed, Sep 19, 2018 at 8:31 PM Lukasz Cwik wrote: >> >>> *How does modelling a timer as a PCollection help the Beam model?* >>> >

Re: Rethinking Timers as PCollections

2018-09-19 Thread Robert Bradshaw
> Runner interaction and overall implementation. The special treatment >> (and slight confusion) at the graph level perhaps was an early warning >> sign, discovering the extra complexity wiring this in a runner should be a >> reason to revisit. >> >> Conceptually

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-09-19 Thread Robert Bradshaw
ge - GBK - ExecutableStage - GBK - ... (or is it not always a digraph of this form, possibly with branching)? > > Thanks, > Thomas > > > On Fri, Sep 14, 2018 at 2:56 AM Robert Bradshaw > wrote: > >> Currently the best solution we've come up with is that we must pr

Rethinking Timers as PCollections

2018-09-19 Thread Robert Bradshaw
TLDR Perhaps we should revisit https://s.apache.org/beam-portability-timers in light of the fact that Timers are more like State than PCollections. -- While looking at implementing State and Timers in the Python SDK, I've been revisiting the ideas presented at https://s.apache.org/beam-portabilit

Re: Proposal for Beam Python User State and Timer APIs

2018-09-19 Thread Robert Bradshaw
And its implementation of the Fn API is on its way too: https://github.com/apache/beam/pull/6349 https://github.com/apache/beam/pull/6433 On Tue, Sep 18, 2018 at 6:56 PM Charles Chen wrote: > An update: the reference DirectRunner implementation of (and common > execution code for) the Python

Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Robert Bradshaw
There are known performance issues with Beam on Spark that are being worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's possible you're hitting something different, but would be worth investigating. See also https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performan

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Robert Bradshaw
On Mon, Sep 17, 2018 at 2:02 AM Austin Bennett wrote: > Do we currently maintain a finer grained list of compatibility between > execution/runner versions and beam versions? Is this only really a concern > with recent Flink (sounded like at least Spark jump, too)? I see the > capability matrix:

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Robert Bradshaw
> > > > > 4. In the long run, we want a stable abstraction layer for each > Runner > > that, ideally, is maintained by the upstream of the execution > > engine. In > > the short run, this is probably not realistic, as the shared > libra

Re: PTransforms and Fusion

2018-09-15 Thread Robert Bradshaw
On Tue, Sep 11, 2018 at 7:01 PM Henning Rohde wrote: > > Empty pipelines have neither subtransforms or a spec, which is what I > don't think is useful > There's nothing preventing them from having a spec, display data, etc. They're useful because they (more) faithfully represent the user's inten

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-15 Thread Robert Bradshaw
+1 (binding) On Sat, Sep 15, 2018 at 6:44 AM Tim wrote: > +1 > > On 15 Sep 2018, at 01:23, Yifan Zou wrote: > > +1 > > On Fri, Sep 14, 2018 at 4:20 PM David Morávek > wrote: > >> +1 >> >> >> >> On 15 Sep 2018, at 00:59, Anton Kedin wrote: >> >> +1 >> >> On Fri, Sep 14, 2018 at 3:22 PM Alan My

Re: SparkRunner - GroupByKey

2018-09-14 Thread Robert Bradshaw
ts like this that the user needs to use to get decent performance as it simply doesn't scale (in many directions). Fortunately in this case, though you have to be a bit more careful about things, it is not less efficient. On Fri, Sep 14, 2018 at 4:10 PM Robert Bradshaw wrote: > >>

Re: SparkRunner - GroupByKey

2018-09-14 Thread Robert Bradshaw
If Spark supports producing grouped elements in timestamp order, a more intelligent ReduceFnRunner can be used. (We take advantage of that in Dataflow for example.) For non-merging windows, you could also put the window itself (or some subset thereof) into the key resulting in smaller groupings. I
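
The suggestion of putting the window into the key can be sketched without any Beam or Spark APIs. The snippet below is a plain-Python illustration only, assuming fixed (non-merging) windows identified by their start timestamp; `window_start` and `group_by_key_and_window` are hypothetical names, not part of either project.

```python
from collections import defaultdict


def window_start(timestamp: int, size: int) -> int:
    """Start of the fixed window of the given size that contains timestamp."""
    return timestamp - (timestamp % size)


def group_by_key_and_window(elements, size):
    """Group (key, timestamp, value) triples by the pair (key, window start).

    Because fixed windows never merge, the window can safely be folded into
    the grouping key, so each group holds one key in one window only.
    """
    groups = defaultdict(list)
    for key, timestamp, value in elements:
        groups[(key, window_start(timestamp, size))].append(value)
    return dict(groups)
```

Grouping on the composite key means a hot key is split into one group per window rather than one large group spanning all windows, which is the "smaller groupings" effect described above. This would not be valid for merging windows such as sessions, where the final window is not known per element.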

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-09-14 Thread Robert Bradshaw
Currently the best solution we've come up with is that we must process an unbounded number of bundles concurrently to avoid deadlock. Especially in the batch case, this may be wasteful as we bring up workers for many stages that are not actually executable until upstream stages finish. Since it may

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 10:02 AM Romain Manni-Bucau wrote: > > Le ven. 14 sept. 2018 à 09:48, Robert Bradshaw a > écrit : > >> On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau >> wrote: >> >>> Well IBM runner is outside Beam for instance so this is

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau wrote: > Well IBM runner is outside Beam for instance so this is not really a point > IMHO. > > My view is simple: > 1. does this module bring anything to Beam as a project: I understand your > answer as a no (please clarify if I'm wrong) > As h

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-13 Thread Robert Bradshaw
t a user writes with and how it runs (the SDK and the SDK >>> environment) from the runner (job service + shared common runner libs + >>> Flink/Spark/Dataflow/Apex/Samza/...). >>> >>> Dataflow would be highly invested in having the appropriate tooling >>>

Re: [jira] [Commented] (BEAM-5354) Side Inputs seems to be non-working in the sdk-go

2018-09-13 Thread Robert Bradshaw
In the short term, try passing the experiment runner_harness_container_image=gcr.io/dataflow-build/robertwb/harness:latest On Thu, Sep 13, 2018 at 2:08 PM Tomas Roos (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/BEAM-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:co

Re: [portablility] metrics interrogations

2018-09-13 Thread Robert Bradshaw
On Thu, Sep 13, 2018 at 10:06 AM Etienne Chauchot wrote: > Hi, > Ben, thanks for clarifying. It was indeed a terminology misunderstanding > > Le mercredi 12 septembre 2018 à 15:10 -0700, Ben Chambers a écrit : > > I think there is a confusion of terminology here. Let me attempt to > clarify as a

Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-12 Thread Robert Bradshaw
On Wed, Sep 12, 2018 at 5:27 PM Alexey Romanenko wrote: > Thank you everybody for your feedback! > > I think we can conclude that the most popular option, according to > discussion above, is number 3. Not sure if we need to do a separate vote > for that but, please, let me know if we need. > > So

Re: [portablility] metrics interrogations

2018-09-12 Thread Robert Bradshaw
y "metric cells" in the Java Runner Harness.) > Le mardi 11 septembre 2018 à 17:53 +0200, Robert Bradshaw a écrit : > On Mon, Sep 10, 2018 at 11:07 AM Etienne Chauchot > wrote: > > Hi all, > > @Luke, @Alex I have a general question related to metrics in the Fn API: > as

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-12 Thread Robert Bradshaw
The target audience is people who want to use the latest Beam but do not want to use the latest version of the runner, right? I think this will be somewhat (though not entirely) addressed by Beam LTS releases, where those not wanting to upgrade the runner at least have a well-supported version of

Re: [portablility] metrics interrogations

2018-09-11 Thread Robert Bradshaw
On Mon, Sep 10, 2018 at 11:07 AM Etienne Chauchot wrote: > Hi all, > > @Luke, @Alex I have a general question related to metrics in the Fn API: > as the communication between runner harness and SDK harness is done on a > bundle basis. When the runner harness sends data to the sdk harness to > exe

Re: How to implement repartition.

2018-09-11 Thread Robert Bradshaw
Does Reshuffle do what you want? On Tue, Sep 11, 2018, 3:46 PM devinduan(段丁瑞) wrote: > Hi all: > I recently start studying the Beam on spark runner. > I want to implement a method *repartition* similar to Spark > *rdd.repartition()* , but I can't find a solution. > Could anyone help

Re: PTransforms and Fusion

2018-09-11 Thread Robert Bradshaw
in a few other areas (think Docker environments). I think we >> need to provide the flexibility and enable, not prevent alternatives and >> therefore -1 for B1 (unsurprisingly :). >> >> [1] >> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d7

Re: PTransforms and Fusion

2018-09-10 Thread Robert Bradshaw
A) I think it's a bug to not handle empty PTransforms (which are useful at pipeline construction, and may still have meaning in terms of pipeline structure, e.g. for visualization). Note that such transforms, if truly composite, can't output any PCollections that do not appear in their inputs (whic

Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-07 Thread Robert Bradshaw
ost complicated for us if the final > goal it to merge “hadoop-input-format” and “hadoop-output-format” together. > > On 7 Sep 2018, at 13:45, Robert Bradshaw wrote: > > Agree about not impacting users. Perhaps I misread (3), isn't it fully > backwards compatible as well? >

Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-07 Thread Robert Bradshaw
Agree about not impacting users. Perhaps I misread (3), isn't it fully backwards compatible as well? On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré wrote: > Hi, > > in order to limit the impact for the existing users on Beam 2.x series, > I would go for (1). > > Regards > JB > > On 06/09/20

Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-07 Thread Robert Bradshaw
I think it makes sense to keep *hadoop-file-system* separate, as it's common to use HDFS even if one is not using any of the other hadoop (mapreduce) libraries. On the other hand, it makes a lot of sense to me to put the hadoop read and write into the same module, probably going with option (3) whe

Re: Beam Schemas: current status

2018-09-06 Thread Robert Bradshaw
a Registry.). Any source can integrate with an external > registry and use it to set the schema on the output. > > Reuven > > On Fri, Aug 31, 2018 at 12:44 PM Robert Bradshaw > wrote: > >> On Fri, Aug 31, 2018 at 5:01 PM Alexey Romanenko < >> aromanenko...

Re: Beam Schemas: current status

2018-08-31 Thread Robert Bradshaw
On Fri, Aug 31, 2018 at 5:01 PM Alexey Romanenko wrote: > Thanks Reuven for updating community with this, great work! > > One small question about IO integration. What kind of integration this is > supposed to be? > Two IOs that I would love to see benefit from schemas are BigQuery and Avro (and

Re: Beam Schemas: current status

2018-08-31 Thread Robert Bradshaw
On Thu, Aug 30, 2018 at 5:15 PM Reuven Lax wrote: > Some answer inline: > On Thu, Aug 30, 2018 at 7:56 AM Ismaël Mejía wrote: > >> Thanks Reuven for the excellent summary and thanks to all the guys who >> worked in the Schema/SQL improvements. This is great for usability. I >> really like the id

Re: Beam Schemas: current status

2018-08-31 Thread Robert Bradshaw
On Fri, Aug 31, 2018 at 11:22 AM Maximilian Michels wrote: > Thanks Reuven. That's an OK restriction. Apache Flink also requires > non-final fields to be able to generate TypeInformation (~=Schema) from > PoJos. > > I agree that it's not very intuitive for Users. > > I suppose it would work to as

Re: SplittableDoFn

2018-08-31 Thread Robert Bradshaw
Thanks for taking this up. I added some comments to the doc. A European-friendly time for discussion would be great. On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik wrote: > I came up with a proposal[1] for a progress model solely based off of the > backlog and that splits should be based upon the r

Re: Process JobBundleFactory for portable runner

2018-08-27 Thread Robert Bradshaw
On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels wrote: > > Thanks for your proposal Henning. +1 for explicit environment messages. > I'm not sure how important it is to support cross-platform pipelines. I > can foresee future use but I wouldn't consider it essential. One may want to execute T

Re: Bootstrapping Beam's Job Server

2018-08-27 Thread Robert Bradshaw
" proposal. Do you mean that > the process startup script listens for an RPC from the Runner to bring > up SDK harnesses as needed? > > I agree this would be helpful to know the required parameter, e.g. you > mentioned the Fn Api network configuration. > > On 23.08.18 17:07,

Re: [Proposal] Track non-code contributions in Jira

2018-08-24 Thread Robert Bradshaw
Jira is basically a fancy TODO list; if folks think it would be helpful for tracking these kinds of contributions (e.g. there's a lot of stuff that needs to be done for a successful meetup, or things like "write a blog post about X") I think it's worth a try. I don't know how useful it'd be for ope

Re: Process JobBundleFactory for portable runner

2018-08-24 Thread Robert Bradshaw
I think "external" still needs some way (I was suggesting grpc) to pass the control address, etc. to whatever starts up the workers. Also, +1 to making this a URN. Embedded makes sense too. On Fri, Aug 24, 2018 at 6:00 AM Thomas Weise wrote: > > Option #3 "external" would fit the Kubernetes use c

Re: Bootstrapping Beam's Job Server

2018-08-23 Thread Robert Bradshaw
ldn't this be also handled > by the shell script? Good point :). I still think it'd be nice to make this option more explicit, as it doesn't even require starting up (or managing) a subprocess. > On 23.08.18 14:13, Robert Bradshaw wrote: > > On Thu, Aug 23, 2018 at 1:

Re: Bootstrapping Beam's Job Server

2018-08-23 Thread Robert Bradshaw
On Thu, Aug 23, 2018 at 1:54 PM Maximilian Michels wrote: > > Big +1. Process-based execution should be simple to reason about for > users. +1. In fact, this is exactly what the Python local job server does, with running Docker simply being a particular command line that's passed down here. http

Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Robert Bradshaw
amic" output (where the user can choose a destination > file based on the input record). In practice if a pipeline is not using > dynamic destinations the full graph is still generated, but much of that > graph is never used (empty PCollections). > > On Wed, Aug 22, 2018 at

Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Robert Bradshaw
filed to see which stages and/or operations are taking up the > > time? > > Not yet. I'm browsing through the spark DAG produced which I've committed [1] > and reading the code. > > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro > > On Wed, Aug 22,

Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Robert Bradshaw
I agree that this is concerning. Some of the complexity may have also been introduced to accommodate writing files in Streaming mode, but it seems we should be able to execute this as a single Map operation. Have you profiled to see which stages and/or operations are taking up the time? On Wed, Au

Re: Bug or confusing python code? Are these the same element count metrics?

2018-08-21 Thread Robert Bradshaw
On Tue, Aug 21, 2018 at 2:05 AM Alex Amato wrote: > > I discovered something while trying to update test_progress_metrics in > fn_api_runner_tests.py to inspect the returned MonitoringInfos in addition to > the already returned metrics format. > > This metric appears to be added twice using the

Re: Travis apache credentials

2018-08-21 Thread Robert Bradshaw
into travis-cli. I'm not sure who has the right >>> permission to perform this operation. >>> 2. We have a common svn credential or one of the beam committers would like >>> to put his(or her) credential into beam-wheels. >>> >>> If we can confi

Re: Removing documentation for old Beam versions

2018-08-20 Thread Robert Bradshaw
On Sun, Aug 5, 2018 at 5:28 AM Thomas Weise wrote: > > Yes, I think the separation of generated code will need to occur prior to > completing the merge and switching the web site to the main repo. > > There should be no reason to check generated documentation into either of the > repos/branches.

Travis apache credentials

2018-08-20 Thread Robert Bradshaw
Boyuan set up a nice repository for building Python wheels at https://github.com/apache/beam-wheels . Does anyone know if it would be possible to get SVN credentials for Travis so every user wouldn't have to fork the repository and put their own in?

Re: Should Beam Python throw an error if DoFn returns a string?

2018-08-17 Thread Robert Bradshaw
On Fri, Aug 17, 2018 at 8:16 PM Pablo Estrada wrote: > > Beam Python expects DoFns to return an iterable that contains the actual > output elements. This is documented, and visible in examples, but it is also > a bit counter-intuitive. > > We should definitely add a check in _OutputProcessor[1]
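
To illustrate why a string return value is surprising here: a string is itself an iterable of characters, so code that iterates over the result emits one element per character. The following plain-Python sketch shows the kind of check being discussed; `check_dofn_output` and `emit` are hypothetical helpers for illustration and are not the actual `_OutputProcessor` code.

```python
def check_dofn_output(result):
    """Reject str/bytes results, which would be iterated item by item."""
    if isinstance(result, (str, bytes)):
        raise TypeError(
            "Expected an iterable of output elements, got %s; "
            "wrap a single value in a list instead." % type(result).__name__
        )
    return result


def emit(result):
    """Collect the output elements of one call (None means no output)."""
    if result is None:
        return []
    return list(check_dofn_output(result))
```

Without the check, `list("hello")` would silently produce five one-character elements, whereas `emit(["hello"])` produces the single element the author most likely intended.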

Re: [Discussion] Clarify the support story for released Beam versions

2018-08-16 Thread Robert Bradshaw
On Thu, Aug 16, 2018 at 7:56 PM Ahmet Altay wrote: > > Thank you Alexey, Robert for the feedback! > > On Thu, Aug 16, 2018 at 5:26 AM, Robert Bradshaw wrote: >> >> I think this is a worthwhile thing to do. I would just have one change: I >> think that rather than d

Re: [Discussion] Clarify the support story for released Beam versions

2018-08-16 Thread Robert Bradshaw
I think this is a worthwhile thing to do. I would just have one change: I think that rather than deciding each Nth release is an LTS, we should do so at the time of the release based on the time since the last LTS, the number of LTS releases currently in flight, and whether the accumulated feature

Re: Apache Beam Python Wheels Repository

2018-08-15 Thread Robert Bradshaw
For background, a separate repository that contains reference to Brett's Travis/AppVeyor build scripts + project configuration information is the de facto standard for out of the box support for a wide variety of platform-specific Python wheels (aka binary builds). We also need a secure place to p

Re: [VOTE] Community Examples Repository

2018-08-09 Thread Robert Bradshaw
(3) In particular, I see a lot of value for (quoting the proposal) """ Since then, there have been numerous updates, increased Python parity, and new features that do not have accompanying examples employing best practices and demonstrating an end-to-end experience for new users. We would like to

Re: Community Examples Repository

2018-08-02 Thread Robert Bradshaw
I have to admit I'm generally -1 on moving examples to a separate repository. In particular, I think it would actually inhibit the stated goals of increasing visibility and better keeping them up to date, and for all the reasons we just migrated the beam-site directory in. It seems the primary moti

Re: Let's start getting rid of BoundedSource

2018-07-17 Thread Robert Bradshaw
On Sun, Jul 15, 2018 at 2:20 PM Eugene Kirpichov wrote: > Hey beamers, > > I've always wondered whether the BoundedSource implementations in the Beam > SDK are worth their complexity, or whether they rather could be converted > to the much easier to code ParDo style, which is also more modular an

Re: Ability to read from UTF-16 or UTF-32 encoded files?

2018-07-07 Thread Robert Bradshaw
Currently TextIO scans for newlines to find line (record) boundaries, but this can occur as part of a character for UTF-16 or UTF-32. It could be certainly adapted to look for multi-byte patterns (with the right offset) but this would be more complicated. Fortunately, the default of UTF-8 handles
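
The byte-level issue can be demonstrated with the Python standard library alone. This is a small sketch, not TextIO code: it shows that the byte 0x0A (newline) can appear inside a non-newline character in UTF-16 or UTF-32, but never inside a multi-byte UTF-8 sequence. The helper name is hypothetical.

```python
NEWLINE_BYTE = b"\n"  # the single byte 0x0A


def contains_newline_byte(text: str, encoding: str) -> bool:
    """True if the encoded form of text contains the byte 0x0A anywhere."""
    return NEWLINE_BYTE in text.encode(encoding)


# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) contains no newline,
# yet its UTF-16 code unit 0x010A includes the byte 0x0A.
tricky = "\u010a"

in_utf16 = contains_newline_byte(tricky, "utf-16-le")
in_utf32 = contains_newline_byte(tricky, "utf-32-le")
in_utf8 = contains_newline_byte(tricky, "utf-8")

# A naive byte-level split therefore cuts a UTF-16 record in the wrong place.
naive_pieces = "a\u010ab\n".encode("utf-16-le").split(NEWLINE_BYTE)
```

A scanner for UTF-16 would need to match the full two-byte newline pattern at an even offset, which is the "multi-byte patterns (with the right offset)" adaptation mentioned above.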

Re: Portable pipelines on Flink

2018-07-06 Thread Robert Bradshaw
On Wed, Jul 4, 2018 at 1:06 AM Thomas Weise wrote: > [subject change for discussion fork] > > Thanks for the steps. I'm able to run the Python wordcount example, though > it fails with local file output. Did you test with distributed FS or local > FS? > It doesn't work with local FS because the

Re: Invite to comment on the @RequiresStableInput design doc

2018-07-02 Thread Robert Bradshaw
cessElement. Since both are on the same DoFn, > I'm not sure how you would represent this as a separate transform. > > On Mon, Jul 2, 2018 at 5:05 PM Robert Bradshaw > wrote: > >> Thanks for the writeup. >> >> I'm wondering with, rather than phrasing this as a

Re: Invite to comment on the @RequiresStableInput design doc

2018-07-02 Thread Robert Bradshaw
Thanks for the writeup. I'm wondering, rather than phrasing this as an annotation on DoFn methods that gets plumbed down through the portability representation, if it would make more sense to introduce a new, primitive "EnsureStableInput" transform. For those runners whose reshuffle provide s

Re: [Design Proposal] Improving Beam code review

2018-06-27 Thread Robert Bradshaw
Thanks for writing this up! I especially like the idea of automatically assigning code reviewers, e.g. via https://help.github.com/articles/about-codeowners/ On Wed, Jun 27, 2018 at 11:10 AM Scott Wegner wrote: > > Thanks for putting together this proposal Huygaa. Overall looks good to me; I > ad

Re: Going on leave for a bit

2018-06-27 Thread Robert Bradshaw
Enjoy your time with your family! - Robert On Wed, Jun 27, 2018 at 7:56 AM Etienne Chauchot wrote: > > Congrats Kenn ! > > Spend some good time with your family. > > Etienne > > Le lundi 25 juin 2018 à 22:42 -0700, Kenneth Knowles a écrit : > > Hi friends, > > I think I did not mention on dev@ at

Re: [PROPOSAL] Merge samza-runner to master

2018-06-22 Thread Robert Bradshaw
e shouldn't _start_ any runner down the legacy path. But >>> this is runner predates portability. I don't think the Java SDK is ready to >>> provide feature parity, much less adequate performance, so it doesn't seem >>> reasonable to require using it. Comm

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #2

2018-06-22 Thread Robert Bradshaw
I was waiting for the gradlew jar file to be removed from the source release, but I figured it made more sense to just jump in and help out. $ git show-ref github/release-2.5.0 093ac64bc2d59f3d58c0f0e7f65f2a36eaf26a4a refs/remotes/github/release-2.5.0 $ wget https://github.com/apache/beam/archive

Re: [PROPOSAL] Merge samza-runner to master

2018-06-21 Thread Robert Bradshaw
Neat to see a new runner on board! I would like to make it a requirement for all new runners to support the portability API, but given that it's still somewhat of a moving target, and you have ongoing work in this direction, that may not be a hard requirement. I'm a bit concerned that there is ar

Re: scala/scio

2018-06-21 Thread Robert Bradshaw
I might go so far as to say Scio *is* the official Scala API for Beam. We point to it on our website, and have no plans to create another. It just happens to not be maintained and released by us. On Thu, Jun 21, 2018 at 7:37 AM Jean-Baptiste Onofré wrote: > > Hi Alistair, > > we discussed several

Re: Broken links to releases

2018-06-20 Thread Robert Bradshaw
So when we send out announcements, etc. we should be linking to archive.apache.org so as not to break them in the future then. I suppose we could add another manual step to change the links for old releases on the website when we do new releases (though the less work to give the release manager the

Re: Broken links to releases

2018-06-19 Thread Robert Bradshaw
In that case, should we always be linking to the archive? Or is it standard practice to change your links when things get archived? (You can't, of course, change links to announcement emails and it's hard to hunt down all blog posts, etc. so I'd much rather have stable release pointers from the sta

Re: CSVSplitter - Splittable DoFn

2018-06-18 Thread Robert Bradshaw
pr 24, 2018 at 3:26 PM Eugene Kirpichov >>> wrote: >>> >>>> Robert - you're right, but this is a pathological case. It signals that >>>> there *might* be cases where we'll need to scan the whole file, however for >>>> pract

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-12 Thread Robert Bradshaw
tors: contributors writing just for each other > > And you also have > > - wiki/users: users writing for users > > That's interesting. > Yep. We don't have to start wiki/users right away, but it could be useful down the line. > On Mon, Jun 11, 2018 at 2:30 PM Robert

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-11 Thread Robert Bradshaw
On Fri, Jun 8, 2018 at 2:18 PM Kenneth Knowles wrote: > I disagree strongly here - I don't think the wiki will have appropriate > polish for users. Even if carefully polished I don't think the presentation > style is right, and it is not flexible. Power users will find it, of course. > I wasn't

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-08 Thread Robert Bradshaw
The use of wiki vs. docs vs. the (repo-backed) website seems to be one of convenience vs. polish, and totally orthogonal to dev vs. user-facing stuff. I'm not opposed to a wiki, but personally I think a lot of our dev-facing docs (e.g. testing, ptransform style guide, portability overview) benefit

Re: Proposal: keeping precommit times fast

2018-06-07 Thread Robert Bradshaw
No, this isn't the kind of thing that should require a vote (unless someone really wants a vote). On Thu, Jun 7, 2018 at 9:29 AM Udi Meiri wrote: > Would I need a vote on installing this plugin, or can I just open a ticket > to infra? > > On Wed, Jun 6, 2018, 16:18 Rober

Re: Proposal: keeping precommit times fast

2018-06-06 Thread Robert Bradshaw
to document the >>>>> requirements >>>>> for a test to run as pre-commit, and start enforcing it for new tests. >>>>> >>>>> >>>>> >> On Fri, May 18, 2018 at 3:25 PM Henning Rohde >>>>> wrote: >&

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Robert Bradshaw
Are there JIRAs filed for these? I have yet to have a corrupt cache, but it would be nice to know how to avoid and fix it. Did --no-parallel make the ErrorProne error go away? On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau wrote: > Also maybe deactivate the daemon (--no-daemon) since its cac

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Robert Bradshaw
Thank you JB! Glad to see this finally rolling out. I don't see the Python artifacts; did you mean to stage them in https://dist.apache.org/repos/dist/dev/beam/2.5.0/? If you want help building wheels, let me know. On Wed, Jun 6, 2018 at 1:50 AM Etienne Chauchot wrote: > Thanks JB for all your

Re: Initial contributor experience

2018-06-05 Thread Robert Bradshaw
This is great, thanks! I added some comments to the doc. On Tue, Jun 5, 2018 at 1:49 PM Griselda Cuevas wrote: > +user@ in case someone has had similar experiences. > > Thanks for documenting this Austin & Pablo! > > If any other folks would like to participate in improving the "First > contribu

Re: [VOTE] Use probot/stale to automatically manage stale pull requests

2018-06-01 Thread Robert Bradshaw
+1 On Fri, Jun 1, 2018 at 1:43 PM Andrew Pilloud wrote: > +1 > > On Fri, Jun 1, 2018 at 1:31 PM Huygaa Batsaikhan > wrote: > >> +1 >> >> On Fri, Jun 1, 2018 at 1:17 PM Henning Rohde wrote: >> >>> +1 >>> >>> On Fri, Jun 1, 2018 at 10:16 AM Chamikara Jayalath >>> wrote: >>> +1 (non-binding

Re: [VOTE] Code Review Process

2018-06-01 Thread Robert Bradshaw
+1 On Fri, Jun 1, 2018 at 12:06 PM Chamikara Jayalath wrote: > +1 > > Thanks, > Cham > > On Fri, Jun 1, 2018 at 11:36 AM Jason Kuster > wrote: > >> +1 >> >> On Fri, Jun 1, 2018 at 11:36 AM Ankur Goenka wrote: >> >>> +1 >>> >>> On Fri, Jun 1, 2018 at 11:28 AM Charles Chen wrote: >>> +1 >>

Re: Reducing Committer Load for Code Reviews

2018-05-31 Thread Robert Bradshaw
+1, this is what I was going to propose. Code review serves two related, but distinct purposes. The first is just getting a second set of eyes on the code to improve quality (call this the LGTM). This can be done by anyone. The second is vetting whether this contribution, in its current form, shou

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-31 Thread Robert Bradshaw
I think it makes sense to cut the release and get the ball rolling, and iff the ParquetIO/S3 issue turns out to be simple, we cherry-pick, otherwise we add a note. On Thu, May 31, 2018 at 1:56 AM Jean-Baptiste Onofré wrote: > Hi, > > Regarding RabbitMqIO, Eugene provided new feedback last night

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-30 Thread Robert Bradshaw
On Wed, May 30, 2018 at 12:59 PM Ahmet Altay wrote: > Thank you JB. > > For clarification, are you referring to the following items: > - RabbitMqIO - https://github.com/apache/beam/pull/1729 > - ParquetIO on HDFS/S3 - https://issues.apache.org/jira/browse/BEAM-4421 > > If the above mapping is co
