Re: [DISCUSSION] Add hint/option on PCollection

Romain Manni-Bucau Tue, 30 Jan 2018 13:30:17 -0800

Not sure how it fits in terms of API yet but +1 for the high level view.
Makes perfect sense.


On 30 Jan 2018 21:41, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

> Hi Robert,
>
> Good point and idea for the Composite transform. It would apply nicely to
> all transforms built on composites.
>
> I also agree that the hint is more on the transform than the PCollection
> itself.
>
> Thanks !
> Regards
> JB
>
> On 30/01/2018 21:26, Robert Bradshaw wrote:
>
>> Many hints make more sense for PTransforms (the computation itself)
>> than for PCollections. In addition, when we want properties attached
>> to PCollections themselves, it often makes sense to let these be
>> provided by the producing PTransform (e.g. coders and schemas are
>> often functions of the input metadata and the operation itself, and
>> can't just be set arbitrarily).
>>
>> Also, we already have a perfectly standard way of nesting transforms
>> (or even sets of transforms), namely composite transforms. In terms of
>> API design I would propose writing a composite transform that applies
>> constraints/hints/requirements to all its inner transforms. This
>> translates nicely to the Fn API as well.
>>
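To make the composite idea above concrete, here is a minimal plain-Java sketch. It is not Beam code: Step, withHints and the hint keys "memoryMb" and "gpu" are hypothetical names invented for illustration, and the rule that a hint already set on an inner step wins over an inherited one is an assumption, not something settled in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of hints carried by a composite; none of these types are Beam classes. */
public class WithHintsSketch {

    /** Hypothetical stand-in for a transform: a named step with optional nested steps. */
    static class Step {
        final String name;
        final List<Step> inner = new ArrayList<>();
        final Map<String, Object> hints = new LinkedHashMap<>();

        Step(String name, Step... children) {
            this.name = name;
            Collections.addAll(inner, children);
        }
    }

    /** Hypothetical composite: wraps a step and pushes the hints down to every nested step. */
    static Step withHints(Step wrapped, Map<String, Object> hints) {
        Step composite = new Step("WithHints(" + wrapped.name + ")", wrapped);
        propagate(composite, hints);
        return composite;
    }

    // Assumption: a hint already present on a step is kept; inherited values only fill gaps.
    private static void propagate(Step step, Map<String, Object> hints) {
        for (Map.Entry<String, Object> e : hints.entrySet()) {
            step.hints.putIfAbsent(e.getKey(), e.getValue());
        }
        for (Step child : step.inner) {
            propagate(child, hints);
        }
    }

    public static void main(String[] args) {
        Step parse = new Step("Parse");
        Step score = new Step("Score");
        score.hints.put("memoryMb", 8192); // set directly on the inner step
        Step myTransform = new Step("MyTransform", parse, score);

        Map<String, Object> hints = new LinkedHashMap<>();
        hints.put("memoryMb", 2048);
        hints.put("gpu", true);

        Step hinted = withHints(myTransform, hints);
        System.out.println(hinted.name + " " + hinted.hints);
        System.out.println(parse.name + " " + parse.hints);
        System.out.println(score.name + " " + score.hints);
    }
}
```

In this model the elements and the wrapped steps are untouched; only metadata is added, so dropping the wrapper would leave the computation the same.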
>> On Tue, Jan 30, 2018 at 12:14 PM, Kenneth Knowles <k...@google.com> wrote:
>>
>>> It seems like most of these use cases are hints on a PTransform and not a
>>> PCollection, no? CPU, memory, expected parallelism, etc. are. Then you
>>> could just have:
>>>      pc.apply(WithHints(myTransform, <hints>))
>>>
>>> For a PCollection, hints that might make sense are bits like total size,
>>> element size, and throughput. All things that the Dataflow folks have
>>> said should be measured instead of hinted. But I understand that we
>>> shouldn't force runners to do infeasible things like build a whole
>>> no-knobs service on top of a super-knobby engine.
>>>
>>> Incidentally for portability, we have this "environment" object that is
>>> basically the docker URL of an SDK harness that can execute a function.
>>> We always intended that same area of the proto (exact fields TBD) to have
>>> things like requirements for CPU, memory, GPUs, disk, etc. It is likely a
>>> good place for hints.
>>>
>>> BTW good idea to ask users@ for their pain points and bring them back
>>> to the dev list to motivate feature design discussions.
>>>
>>> Kenn
>>>
>>> On Tue, Jan 30, 2018 at 12:00 PM, Reuven Lax <re...@google.com> wrote:
>>>
>>>>
>>>> I think the hints would logically be metadata in the pcollection, just
>>>> like coder and schema.
>>>>
>>>> On Jan 30, 2018 11:57 AM, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
>>>>
>>>>>
>>>>> Great idea for AddHints.of()!
>>>>>
>>>>> What would be the resulting PCollection? Just a PCollection of hints,
>>>>> or the pc elements + hints?
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 30/01/2018 20:52, Reuven Lax wrote:
>>>>>
>>>>>>
>>>>>> I think adding hints for runners is reasonable, though hints should
>>>>>> always be assumed to be optional - they shouldn't change semantics of
>>>>>> the program (otherwise you destroy the portability promise of Beam).
>>>>>> However, there are many types of hints that some runners might find
>>>>>> useful (e.g. this step needs more memory; this step runs ML algorithms
>>>>>> and should run on a machine with GPUs; etc.).
>>>>>>
>>>>>> Robert has mentioned in the past that we should try and keep
>>>>>> PCollection an immutable object, and not introduce new setters on it.
>>>>>> We slightly break this already today with PCollection.setCoder, and
>>>>>> that has caused some problems. Hints can be set on PTransforms though,
>>>>>> and propagate to that PTransform's output PCollections. This is nearly
>>>>>> as easy to use, however, as we can implement a helper PTransform that
>>>>>> can be used to set hints. I.e.
>>>>>>
>>>>>> pc.apply(AddHints.of(hint1, hint2, hint3))
>>>>>>
>>>>>> is no harder than calling pc.addHint().
>>>>>>
>>>>>> Reuven
>>>>>>
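One way to picture the AddHints.of(...) shape described above, and the "metadata like coder and schema" reading of its result, is the plain-Java sketch below. It is not Beam code: Hinted, addHints, plan and the key "other.runner.knob" are hypothetical names made up for illustration, and the toy plan method only stands in for a runner that honors the hints it recognizes and ignores the rest.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of hints as read-only metadata; none of these types are Beam classes. */
public class AddHintsSketch {

    /** Hypothetical immutable collection: elements plus hint metadata, with no setters. */
    static final class Hinted<T> {
        final List<T> elements;
        final Map<String, Object> hints;

        Hinted(List<T> elements, Map<String, Object> hints) {
            this.elements = Collections.unmodifiableList(elements);
            this.hints = Collections.unmodifiableMap(hints);
        }

        /** Returns a new collection with the same elements and the extra hints merged in. */
        Hinted<T> addHints(Map<String, Object> extra) {
            Map<String, Object> merged = new LinkedHashMap<>(hints);
            merged.putAll(extra);
            return new Hinted<>(elements, merged);
        }
    }

    /** Toy "runner": honors the one hint it knows and ignores every other key. */
    static String plan(Hinted<?> pc) {
        Object level = pc.hints.get("spark.persist");
        return level == null ? "recompute" : "persist:" + level;
    }

    public static void main(String[] args) {
        Hinted<String> pc = new Hinted<String>(
                Arrays.asList("a", "b"), new LinkedHashMap<String, Object>());

        Map<String, Object> extra = new LinkedHashMap<>();
        extra.put("spark.persist", "MEMORY_ONLY");
        extra.put("other.runner.knob", 3); // unknown to the toy runner, so ignored

        Hinted<String> hinted = pc.addHints(extra);

        // Same elements either way; only the execution plan differs.
        System.out.println(pc.elements + " -> " + plan(pc));
        System.out.println(hinted.elements + " -> " + plan(hinted));
    }
}
```

The original collection is never mutated here, which matches the concern about not adding setters, and removing the hints changes the plan but not the elements.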
>>>>>> On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré
>>>>>> <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>      Maybe I should have started the discussion on the user mailing
>>>>>>      list: it would be great to have user feedback on this, even if I
>>>>>>      got your points.
>>>>>>
>>>>>>      Sometimes I have the feeling that whatever we are proposing and
>>>>>>      discussing, it doesn't go anywhere. At some point, to attract more
>>>>>>      people, we have to get ideas from different perspectives/standpoints.
>>>>>>
>>>>>>      Thanks for the feedback anyway.
>>>>>>
>>>>>>      Regards
>>>>>>      JB
>>>>>>
>>>>>>      On 30/01/2018 20:27, Romain Manni-Bucau wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>          2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com>:
>>>>>>
>>>>>>
>>>>>>               I generally like having certain "escape hatches" that are well
>>>>>>               designed and limited in scope, and anything that turns out to be
>>>>>>               important becomes first-class. But this one I don't really like
>>>>>>               because the use cases belong elsewhere. Of course, they creep so
>>>>>>               you should assume they will be unbounded in how much gets stuffed
>>>>>>               into them. And the definition of a "hint" is that deleting it
>>>>>>               does not change semantics, just performance/monitoring/UI etc.,
>>>>>>               but this does not seem to be true.
>>>>>>
>>>>>>               "spark.persist" for idempotent replay in a sink:
>>>>>>                 - this is already @RequiresStableInput
>>>>>>                 - it is not a hint because if you don't persist your
>>>>>>                   results are incorrect
>>>>>>                 - it is a property of a DoFn / transform not a PCollection
>>>>>>
>>>>>>
>>>>>>          Let's put this last point aside since we'll manage to make it
>>>>>>          work wherever we store it ;).
>>>>>>
>>>>>>
>>>>>>               schema:
>>>>>>                 - should be first-class
>>>>>>
>>>>>>
>>>>>>          Except it doesn't make sense everywhere. It is exactly like
>>>>>>          saying "implement this" and 2 lines later "it doesn't do
>>>>>>          anything for you". If you think more broadly about schemas you
>>>>>>          will want to do far more - like getting them from the previous
>>>>>>          step etc. - which makes it not an API thing. However, with some
>>>>>>          runners like Spark, being able to specify it will make it
>>>>>>          possible to optimize the execution. There is a clear mismatch
>>>>>>          between a consistent, user-friendly, generic and portable API
>>>>>>          and a runtime, runner-specific implementation.
>>>>>>
>>>>>>          This is all fine as an issue for a portable API, and it is why
>>>>>>          all EE APIs have a map to pass properties somewhere, so I don't
>>>>>>          see why Beam wouldn't fall into that exact same bucket, since it
>>>>>>          embraces the drawbacks of portability and we have already been
>>>>>>          hitting them for several releases.
>>>>>>
>>>>>>
>>>>>>               step parallelism (you didn't mention but most runners need
>>>>>>               some control):
>>>>>>                 - this is a property of the data and the pipeline
>>>>>>                   together, not just the pipeline
>>>>>>
>>>>>>
>>>>>>          Good one, but this can be configured from the pipeline or even
>>>>>>          a transform. This doesn't mean the data is not important - and
>>>>>>          you are more than right on that point - just that it is
>>>>>>          configurable without referencing the data (using ranges is a
>>>>>>          trivial example even if not the most efficient).
>>>>>>
>>>>>>
>>>>>>               So I just don't actually see a use case for free-form hints
>>>>>>               that we haven't already covered.
>>>>>>
>>>>>>
>>>>>>          There are several cases, even in the direct runner, to be able
>>>>>>          to industrialize it:
>>>>>>          - use that particular executor instance
>>>>>>          - debug this info for that transform
>>>>>>
>>>>>>          etc...
>>>>>>
>>>>>>          As a high level design I think it is good to bring hints to
>>>>>>          Beam to avoid adding an ad-hoc solution each time and taking
>>>>>>          the risk of losing the portability of the main API.
>>>>>>
>>>>>>
>>>>>>               Kenn
>>>>>>
>>>>>>               On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau
>>>>>>               <rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>>                   Lukasz, the point is that you have the choice to either
>>>>>>                   bring all specificities to the main API, which makes most
>>>>>>                   of the API not usable or not implemented, or the opposite,
>>>>>>                   not support anything. Introducing hints will allow some
>>>>>>                   runners to get some features early - or just some very
>>>>>>                   specific things - and once a feature is mainstream it can
>>>>>>                   find a place in the main API. This is saner than the
>>>>>>                   opposite since some specificities can never find a good
>>>>>>                   place.
>>>>>>
>>>>>>                   The little thing we need to take care of with that is to
>>>>>>                   avoid introducing feature flipping, such as supporting a
>>>>>>                   feature not doable with another runner. It should really
>>>>>>                   be about tuning a runner execution (like the schema in
>>>>>>                   Spark).
>>>>>>
>>>>>>
>>>>>>                   Romain Manni-Bucau
>>>>>>                   @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>>>>>                   <https://rmannibucau.metawerx.net/> | Old Blog
>>>>>>                   <http://rmannibucau.wordpress.com> | Github
>>>>>>                   <https://github.com/rmannibucau> | LinkedIn
>>>>>>                   <https://www.linkedin.com/in/rmannibucau>
>>>>>>
>>>>>>                   2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré
>>>>>>                   <j...@nanthrax.net>:
>>>>>>
>>>>>>                       Good point Luke: in that case, the hint will be
>>>>>>                       ignored by the runner if the hint is not meant for it.
>>>>>>                       The hint can be generic (not specific to a runner). It
>>>>>>                       could be interesting for the schema support or IOs,
>>>>>>                       not specific to a runner.
>>>>>>
>>>>>>                       What do you mean by gathering PTransforms/PCollections,
>>>>>>                       and where?
>>>>>>
>>>>>>                       Thanks !
>>>>>>                       Regards
>>>>>>                       JB
>>>>>>
>>>>>>                       On 30/01/2018 18:35, Lukasz Cwik wrote:
>>>>>>
>>>>>>                           If the hint is required to run the person's
>>>>>>                           pipeline well, how do you expect that the person
>>>>>>                           will be able to migrate their pipeline to another
>>>>>>                           runner?
>>>>>>
>>>>>>                           A lot of hints like "spark.persist" are really the
>>>>>>                           user trying to tell us something about the
>>>>>>                           PCollection, like it is very small. I would prefer
>>>>>>                           if we gathered this information about PTransforms
>>>>>>                           and PCollections instead of runner specific knobs
>>>>>>                           since then each runner can choose how best to map
>>>>>>                           such a property on their internal representation.
>>>>>>
>>>>>>                           On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste
>>>>>>                           Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>                                Hi,
>>>>>>
>>>>>>                                As part of the discussion about schema, Romain
>>>>>>                                mentioned hints. I think it's worth having an
>>>>>>                                explanation about that, especially as it could
>>>>>>                                be wider than schema.
>>>>>>
>>>>>>                                Today, to give information to the runner, we
>>>>>>                                use PipelineOptions. The runner can use these
>>>>>>                                options, and apply them to all inner
>>>>>>                                representations of the PCollection in the
>>>>>>                                runner.
>>>>>>
>>>>>>                                For instance, for the Spark runner, the
>>>>>>                                persistence storage level (memory, disk, ...)
>>>>>>                                can be defined via pipeline options.
>>>>>>
>>>>>>                                Then, the Spark runner automatically decides
>>>>>>                                whether RDDs have to be persisted (using the
>>>>>>                                storage level defined in the pipeline options),
>>>>>>                                for instance if the same POutput/PCollection is
>>>>>>                                read several times.
>>>>>>
>>>>>>                                However, the user doesn't have any way to tell
>>>>>>                                the runner how to deal with a specific
>>>>>>                                PCollection.
>>>>>>
>>>>>>                                Imagine the user has a pipeline like this:
>>>>>>                                pipeline.apply().apply().apply(). We have three
>>>>>>                                PCollections involved in this pipeline. It's
>>>>>>                                not currently possible to indicate how the
>>>>>>                                runner should "optimize" and deal with the
>>>>>>                                second PCollection only.
>>>>>>
>>>>>>                                The idea is to add a method on the PCollection:
>>>>>>
>>>>>>                                PCollection.addHint(String key, Object value);
>>>>>>
>>>>>>                                For instance:
>>>>>>
>>>>>>                                collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>>>>>
>>>>>>                                I see three direct uses of this:
>>>>>>
>>>>>>                                1. Related to schema: the schema definition
>>>>>>                                could be a hint.
>>>>>>                                2. Related to the IO: add headers for the IO
>>>>>>                                and the runner on how to specifically process a
>>>>>>                                collection. In Apache Camel, we have headers on
>>>>>>                                the message and properties on the exchange
>>>>>>                                similar to this. It allows giving some
>>>>>>                                indication of how to process some messages in
>>>>>>                                the Camel component. We can imagine the same
>>>>>>                                for the IO (using the PCollection hints to
>>>>>>                                react accordingly).
>>>>>>                                3. Related to runner optimization: I see for
>>>>>>                                instance a way to use RDD or dataframe in the
>>>>>>                                Spark runner, or even specific optimizations
>>>>>>                                like persist. I had a lot of questions from
>>>>>>                                Spark users saying: "in my Spark job, I know
>>>>>>                                where and how I should use persist
>>>>>>                                (rdd.persist()), but I can't do such
>>>>>>                                optimization using Beam". So it could be a good
>>>>>>                                improvement.
>>>>>>
>>>>>>                                Thoughts ?
>>>>>>
>>>>>>                                Regards
>>>>>>                                JB
>>>>>>                                --
>>>>>>                                Jean-Baptiste Onofré
>>>>>>                                jbono...@apache.org
>>>>>>                                http://blog.nanthrax.net
>>>>>>                                Talend - http://www.talend.com
>>>>>>
>>>
