Re: Call for help: let's add Splittable DoFn to Spark, Flink and Apex runners

2017-04-08 Thread Thomas Weise
Nice work Aljoscha!

Update WRT ApexRunner: We merged some prep work in the ParDoOperator to
weed out remnants of OldDoFn. I have almost all the changes ready to add
support for Splittable DoFn (for the most part those follow the Flink
runner changes). The final piece missing to support the feature (based on
observations from the test failures) is the timer internals.

Thanks,
Thomas


On Sat, Apr 1, 2017 at 1:17 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hey all,
>
> The Flink PR has been merged, and thus - Flink becomes the first
> distributed runner to support Splittable DoFn!!!
> Thank you, Aljoscha!
>
> Looking forward to Spark and Apex, and continuing work on Dataflow.
> I'll also send proposals about a couple of new ideas related to SDF next
> week.
>
> On Thu, Mar 30, 2017 at 9:08 AM Amit Sela  wrote:
>
> > I will not be able to make it this weekend, too busy. Let's chat at the
> > beginning of next week and see what's on my plate.
> >
> > On Tue, Mar 28, 2017 at 5:44 PM Aljoscha Krettek 
> > wrote:
> >
> > > Thanks for the offers, guys! The code is finished, though. I only need
> > > to do the last touch-ups.
> > >
> > > On Tue, Mar 28, 2017, at 09:16, JingsongLee wrote:
> > > > Hi Aljoscha,
> > > > I would like to work on the Flink runner with you.
> > > >
> > >
> > > > Best,
> > > > JingsongLee
> > > > ------------------------------------------------------------------
> > > > From: Jean-Baptiste Onofré
> > > > Time: 2017 Mar 28 (Tue) 14:04
> > > > To: dev
> > > > Subject: Re: Call for help: let's add Splittable DoFn to Spark, Flink
> > > > and Apex runners
> > > > Hi Aljoscha,
> > > >
> > > > do you need some help on this ?
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 03/28/2017 08:00 AM, Aljoscha Krettek wrote:
> > > > > Hi,
> > > > > sorry for being so slow but I’m currently traveling.
> > > > >
> > > > > The Flink code works but I think it could benefit from some
> > > > > refactoring to make the code nice and maintainable.
> > > > >
> > > > > Best,
> > > > > Aljoscha
> > > > >
> > > > > On Tue, Mar 28, 2017, at 07:40, Jean-Baptiste Onofré wrote:
> > > > >> I add myself on the Spark runner.
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >> On 03/27/2017 08:18 PM, Eugene Kirpichov wrote:
> > > > >>> Hi all,
> > > > >>>
> > > > >>> Let's continue the ~bi-weekly sync-ups about state of SDF support
> > > > >>> in Spark/Flink/Apex runners.
> > > > >>>
> > > > >>> Spark:
> > > > >>> Amit, Aviem, Ismaël - when would be a good time for you; does same
> > > > >>> time work (8am PST this Friday)? Who else would like to join?
> > > > >>>
> > > > >>> Flink:
> > > > >>> I pinged the PR, but - Aljoscha, do you think it's worth discussing
> > > > >>> anything there over a videocall?
> > > > >>>
> > > > >>> Apex:
> > > > >>> Thomas - how about same time next Monday? (9:30am PST) Who else
> > > > >>> would like to join?
> > > > >>>
> > > > >>> On Mon, Mar 20, 2017 at 9:59 AM Eugene Kirpichov <
> > > kirpic...@google.com>
> > > > >>> wrote:
> > > > >>>
> > > >  Meeting notes:
> > > >  Thomas and I had a video call and we pretty much walked through the
> > > >  implementation of SDF in the runner-agnostic part and in the direct
> > > >  runner.
> > > >  Flink and Apex are pretty similar, so likely
> > > >  https://github.com/apache/beam/pull/2235 (the Flink PR) will give a
> > > >  very good guideline as to how to do this in Apex.
> > > >  Will talk again in ~2 weeks; and will involve +David Yan, who is also
> > > >  on Apex and currently conveniently works on the Google Dataflow team
> > > >  and, from in-person conversation, was interested in being involved :)
> > > > 
> > > >  On Mon, Mar 20, 2017 at 7:34 AM Eugene Kirpichov <
> > > kirpic...@google.com>
> > > >  wrote:
> > > > 
> > > >  Thomas - yes, 9:30 works, shall we do that?
> > > > 
> > > >  JB - excellent! You can start experimenting already, using the direct
> > > >  runner!
> > > > 
> > > >  On Mon, Mar 20, 2017, 2:26 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> > > >  wrote:
> > > > 
> > > >  Hi Eugene,
> > > > 
> > > >  Thanks for the meeting notes !
> > > > 
> > > >  I will be in the next call and Ismaël also provided me some updates.
> > > > 
> > > >  I will sync with Amit on the Spark runner and start to experiment and
> > > >  test SDF on the JMS IO.
> > > > 
> > > >  Thanks !
> > > >  Regards
> > > >  JB
> > > > 
> > > >  On 03/17/2017 04:36 PM, Eugene Kirpichov wrote:
> > > > > Meeting notes from today's call with Amit, Aviem and Ismaël:
> > > > >
> > > > > Spark has 2 types of stateful operators; a cheap one intended for
> > > > > updating
> > > >
> > > > 

Re: [PROPOSAL] Standard IO Metrics

2017-04-08 Thread Jean-Baptiste Onofré

Thanks Aviem,

let me take a look on the document.

Regards
JB

On 04/08/2017 09:04 PM, Aviem Zur wrote:

Hi all,

We are currently in the process of introducing IO metrics to Beam.

Questions have been raised as to what the metrics names should be, and if
they should be standard across different IOs.

I've written this up as a proposal found here:
https://s.apache.org/standard-io-metrics

As usual, this document is commentable, please go over it and make comments
where appropriate.



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[PROPOSAL] Standard IO Metrics

2017-04-08 Thread Aviem Zur
Hi all,

We are currently in the process of introducing IO metrics to Beam.

Questions have been raised as to what the metrics names should be, and if
they should be standard across different IOs.

I've written this up as a proposal found here:
https://s.apache.org/standard-io-metrics

As usual, this document is commentable, please go over it and make comments
where appropriate.


Re: Proposed Splittable DoFn API changes

2017-04-08 Thread Aljoscha Krettek
+1

I was too busy with traveling and preparations for Flink Forward but I wanted 
to retroactively confirm that these are good changes. :-)
> On 7. Apr 2017, at 22:43, Jean-Baptiste Onofré  wrote:
> 
> Hi Eugene,
> 
> thanks for the update and nice example.
> 
> I plan to start to refactor/experiment on some IOs.
> 
> Regards
> JB
> 
> On 04/08/2017 02:44 AM, Eugene Kirpichov wrote:
>> The changes are in.
>> 
>> Also included is a handy change that allows one to skip implementing the
>> NewTracker method if the restriction type implements HasDefaultTracker,
>> leaving ProcessElement and GetInitialRestriction as the only two required
>> methods.
>> 
>> E.g. here's what a minimal SDF example looks like now - splittably pairing
>> a string with every number in 0..100:
>> 
>>  class CountFn extends DoFn<String, KV<String, Long>> {
>>    @ProcessElement
>>    public void process(ProcessContext c, OffsetRangeTracker tracker) {
>>      for (long i = tracker.currentRestriction().getFrom();
>>          tracker.tryClaim(i); ++i) {
>>        c.output(KV.of(c.element(), i));
>>      }
>>    }
>> 
>>    @GetInitialRestriction
>>    public OffsetRange getInitialRange(String element) {
>>      return new OffsetRange(0, 100);
>>    }
>>  }
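For illustration, here is a simplified, self-contained model of the HasDefaultTracker idea described in the message above. These are not the actual Beam SDK interfaces - the names and signatures are assumptions for this sketch - but they show how a restriction type that builds its own tracker lets the DoFn author drop the NewTracker method.

```java
// Simplified, self-contained model of the HasDefaultTracker idea - NOT the
// real Beam SDK interfaces; names and signatures here are assumptions.
interface HasDefaultTracker<TrackerT> {
  // The restriction creates its own tracker, so no separate @NewTracker
  // method is needed on the DoFn.
  TrackerT newTracker();
}

class OffsetRange implements HasDefaultTracker<OffsetRangeTracker> {
  final long from;
  final long to;
  OffsetRange(long from, long to) { this.from = from; this.to = to; }
  long getFrom() { return from; }
  @Override
  public OffsetRangeTracker newTracker() { return new OffsetRangeTracker(this); }
}

class OffsetRangeTracker {
  private final OffsetRange range;
  OffsetRangeTracker(OffsetRange range) { this.range = range; }
  OffsetRange currentRestriction() { return range; }
  // A claim succeeds only while the position is inside [from, to); the
  // processing loop stops at the first failed claim.
  boolean tryClaim(long position) { return position < range.to; }
}

public class HasDefaultTrackerDemo {
  public static void main(String[] args) {
    OffsetRange range = new OffsetRange(0, 5);
    OffsetRangeTracker tracker = range.newTracker(); // no @NewTracker needed
    long claimed = 0;
    for (long i = tracker.currentRestriction().getFrom();
        tracker.tryClaim(i); ++i) {
      claimed++; // stand-in for c.output(KV.of(c.element(), i))
    }
    System.out.println(claimed); // prints 5
  }
}
```

In the CountFn example above, the same claim-loop shape applies: the framework obtains the tracker from the restriction and the user's loop runs until tryClaim fails.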
>> 
>> 
>> On Thu, Apr 6, 2017 at 3:16 PM Eugene Kirpichov 
>> wrote:
>> 
>>> FWIW, here's a pull request implementing these changes:
>>> https://github.com/apache/beam/pull/2455
>>> 
>>> On Wed, Apr 5, 2017 at 4:55 PM Eugene Kirpichov 
>>> wrote:
>>> 
>>> Hey all,
>>> 
>>> From the recent experience in continuing the implementation of Splittable
>>> DoFn, I would like to propose a few changes to its API. They get rid of a
>>> bug, make parts of its semantics better defined and easier for a user to
>>> get right, and reduce the assumptions about the runner implementation.
>>> 
>>> In short:
>>> - Add c.updateWatermark() and report watermark continuously via this
>>> method.
>>> - Make SDF.@ProcessElement return void, which is simpler for users though
>>> it doesn't allow resuming after a specified time.
>>> - Declare that SDF.@ProcessElement must guarantee that after it returns,
>>> the entire tracker.currentRestriction() was processed.
>>> - Add a bool RestrictionTracker.done() method to enforce the bullet above.
>>> - For resuming after specified time, use regular DoFn with state and
>>> timers API.
>>> 
>>> The only downside is the removal (from SDF) of ability to suspend the call
>>> for a certain amount of time - the suggestion is that, if you need that,
>>> you should use a regular DoFn and the timers API.
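As a rough sketch of the proposed contract (using simplified stand-in classes, not actual Beam code - the class shape here is an assumption): the runner can call done() after the user's @ProcessElement returns, and fail if the restriction was not fully claimed.

```java
// Simplified stand-in for the proposed RestrictionTracker.done() contract -
// not actual Beam code; the class shape here is an assumption for the sketch.
class ClaimTracker {
  private final long from;
  private final long to;
  private long lastClaimed;
  ClaimTracker(long from, long to) {
    this.from = from;
    this.to = to;
    this.lastClaimed = from - 1;
  }
  // Claims succeed only inside [from, to); the caller must loop until a
  // claim fails.
  boolean tryClaim(long position) {
    if (position >= to) {
      return false;
    }
    lastClaimed = position;
    return true;
  }
  // True once every offset in [from, to) has been claimed, i.e. the entire
  // restriction was processed before ProcessElement returned.
  boolean done() {
    return lastClaimed >= to - 1;
  }
}

public class DoneContractDemo {
  public static void main(String[] args) {
    ClaimTracker tracker = new ClaimTracker(0, 3);
    // User code: keep claiming until tryClaim fails.
    for (long i = 0; tracker.tryClaim(i); ++i) {
      // process offset i
    }
    // Runner-side enforcement after @ProcessElement returns.
    System.out.println(tracker.done()); // prints true
  }
}
```

A DoFn that returned early (e.g. stopped claiming at offset 1) would leave done() false, which is exactly the bug class this check is meant to catch.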
>>> 
>>> Please see the full proposal in the following doc and comment there & vote
>>> on this thread.
>>> 
>>> https://docs.google.com/document/d/1BGc8pM1GOvZhwR9SARSVte-20XEoBUxrGJ5gTWXdv3c/edit?usp=sharing
>>> 
>>> 
>>> I am going to concurrently start prototyping some parts of this proposal,
>>> because the current implementation is simply wrong and this proposal is the
>>> only way to fix it that I can think of, but I will adjust my implementation
>>> based on the discussion. I believe this proposal should not affect runner
>>> authors - I can make all the necessary changes myself.
>>> 
>>> Thanks!
>>> 
>>> 
>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: IO ITs: Hosting Docker images

2017-04-08 Thread Ted Yu
+1

> On Apr 7, 2017, at 10:46 PM, Jean-Baptiste Onofré  wrote:
> 
> Hi Stephen,
> 
> I think we should go to 1 and 4:
> 
> 1. Try to use existing images providing what we need. If we don't find 
> existing image, we can always ask and help other community to provide so.
> 4. If we don't find a suitable image, and waiting for this image, we can 
> store the image in our own "IT dockerhub".
> 
> Regards
> JB
> 
>> On 04/08/2017 01:03 AM, Stephen Sisk wrote:
>> Wanted to see if anyone else had opinions on this, and to provide a quick
>> update.
>> 
>> I think for both elasticsearch and HIFIO we can find existing, supported
>> images that could serve those purposes - HIFIO is looking like it'll be
>> able to do so for cassandra, which was proving tricky.
>> 
>> So to summarize my current proposed solutions: (ordered by my preference)
>> 1. (new) Strongly urge people to find existing docker images that meet our
>> image criteria - regularly updated/security checked
>> 2. Start using helm
>> 3. Push our docker images to docker hub
>> 4. Host our own public container registry
>> 
>> S
>> 
>>> On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk  wrote:
>>> 
>>> I'd like to hear what direction folks want to go in, and from there look
>>> at the options. I think for some of these options (like running our own
>>> public registry), they may be able to and it's something we should look at,
>>> but I don't assume they have time to work on this type of issue.
>>> 
>>> S
>>> 
>>> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik 
>>> wrote:
>>> 
>>> Is this something that Apache infra could help us with?
>>> 
>>> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk 
>>> wrote:
>>> 
 Summary:
 
 For IO ITs that use data stores that need custom docker images in order to
 run, we can't currently use them in a kubernetes cluster (which is where we
 host our data stores.) I have a couple options for how to solve this and am
 looking for feedback from folks involved in creating IO ITs/opinions on
 kubernetes.
 
 
 Details:
 
 We've discussed in the past that we'll want to allow developers to submit
 just a dockerfile, and then we'll use that when creating the data store on
 kubernetes. This is the case for ElasticsearchIO and I assume more data
 stores in the future will want to do this. It's also looking like it'll be
 necessary to use custom docker images for the HadoopInputFormatIO's
 cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
 image you can use out of the box.
 
 In either case, in order to retrieve a docker image, kubernetes needs a
 container registry - it will read the docker images from there. A simple
 private container registry doesn't work because kubernetes config files are
 static - this means that if local devs try to use the kubernetes files,
 they point at the private container registry and they wouldn't be able to
 retrieve the images since they don't have access. They'd have to manually
 edit the files, which in theory is an option, but I don't consider that to
 be acceptable since it feels pretty unfriendly (it is simple, so if we
 really don't like the below options we can revisit it.)
 
 Quick summary of the options
 
 ===
 
 We can:
 
 * Start using something like k8 helm - this adds more dependencies, adds a
 small amount of complexity (this is my recommendation, but only by a
 little)
 
 * Start pushing images to docker hub - this means they'll be publicly
 visible and raises the bar for maintenance of those images
 
 * Host our own public container registry - this means running our own
 public service with costs, etc..
 
 Below are detailed discussions of these options. You can skip to the "My
 thoughts on this" section if you're not interested in the details.
 
 
 1. Templated kubernetes images
 
 =
 
 Kubernetes (k8) does not currently have built-in support for parameterizing
 scripts - there's an issue open for this[1], but it doesn't seem to be
 very active.
 
 There are tools like Kubernetes helm that allow users to specify parameters
 when running their kubernetes scripts. They also enable a lot more (they're
 probably closer to a package manager like apt-get) - see this
 description[3] for an overview.
 
 I'm open to other options besides helm, but it seems to be the officially
 supported one.
 
 How the world would look using helm:
 
 * When developing an IO IT, someone (either the developer or one of us),
 would need to create a chart (the name for the helm script) - it's
 basically another set of config files but in theory is as simple