I think so too, but `pc.apply(AddHints.of(hint1, hint2, hint3))` is a bit ambiguous to me (does it affect the previous collection?). Maybe AddHints.on(collection, hint1, hint2, ...) is an acceptable compromise? Less fluent, but not ambiguous (based on the same pattern as views).
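Just to illustrate what I mean, a rough sketch - AddHints, the Hint type and the side registry are all hypothetical here, none of this exists in Beam today:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.beam.sdk.values.PCollection;

// Hypothetical sketch: hints live in a side registry keyed by the
// PCollection, so PCollection itself stays immutable and the targeted
// collection is named explicitly at the call site.
public final class AddHints {

  private static final Map<PCollection<?>, List<Hint>> REGISTRY =
      new ConcurrentHashMap<>();

  public static <T> PCollection<T> on(PCollection<T> collection, Hint... hints) {
    REGISTRY.computeIfAbsent(collection, c -> new ArrayList<>())
        .addAll(Arrays.asList(hints));
    return collection; // same instance back: no new node in the pipeline graph
  }

  // Runners would (hypothetically) look hints up here at translation time.
  public static List<Hint> hintsFor(PCollection<?> collection) {
    return REGISTRY.getOrDefault(collection, Collections.emptyList());
  }

  // Hypothetical hint type: an opaque key/value pair.
  public static final class Hint {
    public final String key;
    public final Object value;

    public Hint(String key, Object value) {
      this.key = key;
      this.value = value;
    }
  }
}

Usage would then be AddHints.on(pc, hint1, hint2), so there is no doubt about which collection is targeted.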
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
2018-01-30 21:00 GMT+01:00 Reuven Lax <[email protected]>:
I think the hints would logically be metadata in the PCollection, just like coder and schema.
On Jan 30, 2018 11:57 AM, "Jean-Baptiste Onofré" <[email protected]> wrote:
Great idea for AddHints.of()!

What would be the resulting PCollection? Just a PCollection of hints, or the pc elements + hints?

Regards
JB
On 30/01/2018 20:52, Reuven Lax wrote:
I think adding hints for runners is reasonable, though hints should always be assumed to be optional - they shouldn't change the semantics of the program (otherwise you destroy the portability promise of Beam). However, there are many types of hints that some runners might find useful (e.g. this step needs more memory; this step runs ML algorithms and should run on a machine with GPUs; etc.).

Robert has mentioned in the past that we should try to keep PCollection an immutable object, and not introduce new setters on it. We slightly break this already today with PCollection.setCoder, and that has caused some problems. Hints can be set on PTransforms though, and propagated to that PTransform's output PCollections. This is nearly as easy to use, as we can implement a helper PTransform that can be used to set hints. I.e.

pc.apply(AddHints.of(hint1, hint2, hint3))

is no harder than calling pc.addHint().
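A minimal sketch of what that helper could look like, assuming a hypothetical Hint type and a not-yet-existing way to attach metadata to the output - only the pass-through PTransform shape is real, current Beam API:

import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Sketch: passes elements through unchanged, then would attach the given
// hints to the output PCollection as metadata.
public final class AddHints<T> extends PTransform<PCollection<T>, PCollection<T>> {

  private final List<Hint> hints; // Hint is hypothetical

  private AddHints(List<Hint> hints) {
    this.hints = hints;
  }

  public static <T> AddHints<T> of(Hint... hints) {
    return new AddHints<>(Arrays.asList(hints));
  }

  @Override
  public PCollection<T> expand(PCollection<T> input) {
    PCollection<T> output =
        input
            .apply("PassThrough", ParDo.of(new DoFn<T, T>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                c.output(c.element());
              }
            }))
            .setCoder(input.getCoder());
    // output.setHints(hints); // hypothetical setter, does not exist today
    return output;
  }
}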
Reuven
On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré <[email protected]> wrote:
Maybe I should have started the discussion on the user mailing list: it would be great to have user feedback on this, even if I got your points.

Sometimes I have the feeling that whatever we are proposing and discussing, it doesn't go anywhere. At some point, to attract more people, we have to get ideas from different perspectives/standpoints.

Thanks for the feedback anyway.
Regards
JB
On 30/01/2018 20:27, Romain Manni-Bucau wrote:
2018-01-30 19:52 GMT+01:00 Kenneth Knowles <[email protected]>:
I generally like having certain "escape hatches" that are well designed and limited in scope, and anything that turns out to be important becomes first-class. But this one I don't really like, because the use cases belong elsewhere. Of course, they creep, so you should assume they will be unbounded in how much gets stuffed into them. And the definition of a "hint" is that deleting it does not change semantics, just performance/monitoring/UI etc., but this does not seem to be true here.

"spark.persist" for idempotent replay in a sink:
- this is already @RequiresStableInput
- it is not a hint, because if you don't persist, your results are incorrect
- it is a property of a DoFn / transform, not a PCollection
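For context, a sketch of how the annotation is used on a sink DoFn - the class name and element types here are made up, only @RequiresStableInput itself is the real mechanism (runner support was still rolling out at the time):

import org.apache.beam.sdk.transforms.DoFn;

// @RequiresStableInput tells the runner that the input to this method must
// be stable (replay-identical) across retries, e.g. because the write to
// the external sink is not idempotent.
class WriteToExternalSinkFn extends DoFn<String, Void> {
  @RequiresStableInput
  @ProcessElement
  public void processElement(ProcessContext c) {
    // perform the non-idempotent write of c.element() here
  }
}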
Let's put this last point - hint placement on a DoFn vs a PCollection - aside, since we'll manage to make it work wherever we store it ;).
schema:
- should be first-class
Except it doesn't make sense everywhere. It is exactly like saying "implement this" and two lines later "it doesn't do anything for you". If you think wider about schemas you will want to do far more - like getting them from the previous step, etc. - which makes it not an API thing. However, with some runners like Spark, being able to specify it will enable optimizing the execution. There is a clear mismatch between a consistent, user-friendly, generic and portable API, and a runtime-specific, runner-specific implementation.

This is all fine as an issue for a portable API, and it is why all EE APIs have a map to pass properties somewhere, so I don't see why Beam wouldn't fall into that exact same bucket, since it embraces the drawbacks of portability and we have already hit this for several releases.
step parallelism (you didn't mention it, but most runners need some control):
- this is a property of the data and the pipeline together, not just the pipeline
Good one, but this can be configured from the pipeline or even a transform. This doesn't mean the data is not important - you are more than right on that point - just that it is configurable without referencing the data (using ranges is a trivial example, even if not the most efficient one).
So I just don't actually see a use case for free-form hints that we haven't already covered.
There are several cases, even in the direct runner, to be able to industrialize it:
- use that particular executor instance
- debug these infos for that transform
etc.

As a high-level design, I think it is good to bring hints to Beam, to avoid adding an ad-hoc solution each time and risking losing the portability of the main API.
Kenn
On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau <[email protected]> wrote:
Lukasz, the point is that you have the choice to either bring all specificities into the main API, which makes most of the API unusable or unimplemented, or the opposite, to not support anything. Introducing hints will allow some runners to have some features eagerly - or just some very specific things - and once mainstream they can find a place in the main API. This is saner than the opposite, since some specificities can never find a good place.

The little thing we need to take care of here is to avoid introducing feature flipping, i.e. supporting a feature that is not doable with another runner. It should really be about tuning a runner's execution (like the schema in Spark).
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>
2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <[email protected]>:
Good point Luke: in that case, the hint will be ignored by the runner if it is not meant for it. The hint can be generic (not specific to a runner). It could be interesting for the schema support or IOs, not specific to a runner.

What do you mean by gathering PTransforms/PCollections, and where?

Thanks!
Regards
JB
On 30/01/2018 18:35, Lukasz Cwik wrote:
If the hint is required to run the person's pipeline well, how do you expect that the person will be able to migrate their pipeline to another runner?

A lot of hints like "spark.persist" are really the user trying to tell us something about the PCollection, like it is very small. I would prefer if we gathered this information about PTransforms and PCollections instead of runner-specific knobs, since then each runner can choose how best to map such a property onto its internal representation.
On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <[email protected]> wrote:
Hi,

As part of the discussion about schema, Romain mentioned hints. I think it's worth having an explanation about that, especially because it could be wider than schema.

Today, to give information to the runner, we use PipelineOptions. The runner can use these options and apply them to all inner representations of the PCollections in the runner.

For instance, for the Spark runner, the persistence storage level (memory, disk, ...) can be defined via pipeline options. Then, the Spark runner automatically decides if RDDs have to be persisted (using the storage level defined in the pipeline options), for instance if the same POutput/PCollection is read several times.
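Something along these lines, assuming the usual String[] args - I am quoting the option from memory, so treat the exact getter/setter on SparkPipelineOptions as approximate:

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Pipeline-wide knob: the Spark runner applies this storage level to every
// RDD it decides to persist; there is no per-PCollection control.
SparkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
options.setStorageLevel("MEMORY_ONLY"); // option name recalled from the Spark runner, approximate
Pipeline pipeline = Pipeline.create(options);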
However, the user doesn't have any way to provide an indication to the runner about a specific PCollection.

Imagine the user has a pipeline like this: pipeline.apply().apply().apply(). We have three PCollections involved in this pipeline. It's not currently possible to give indications on how the runner should "optimize" and deal with the second PCollection only.
The idea is to add a method on the PCollection:

PCollection.addHint(String key, Object value);

For instance:

collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
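Spelled out on the three-apply example above, with placeholder transforms and element types, and addHint being the proposed (not yet existing) method:

// Only the intermediate PCollection carries the hint; the others are untouched.
PCollection<A> first = pipeline.apply(readSource());         // placeholder
PCollection<B> second = first.apply(firstTransform());       // placeholder
second.addHint("spark.persist", StorageLevel.MEMORY_ONLY);   // proposed API
PCollection<C> third = second.apply(secondTransform());      // placeholder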
I see three direct usages of this:

1. Related to schema: the schema definition could be a hint.

2. Related to the IO: add headers for the IO and tell the runner how to specifically process a collection. In Apache Camel, we have headers on the message and properties on the exchange, similar to this. It allows giving the Camel component some indication of how to process some messages. We can imagine the same for the IO (using the PCollection hints to react accordingly).

3. Related to runner optimization: I see, for instance, a way to use RDDs or dataframes in the Spark runner, or even specific optimizations like persist. I had lots of questions from Spark users saying: "in my Spark job, I know where and how I should use persist (rdd.persist()), but I can't do such an optimization using Beam". So it could be a good improvement.
Thoughts?
Regards
JB
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com