Did we actually reach consensus here? :)

On Tue, Jan 30, 2018 at 1:29 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
Not sure how it fits in terms of API yet, but +1 for the high-level view. Makes perfect sense.

On 30 Jan 2018 at 21:41, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

Hi Robert,

Good point, and a good idea for the composite transform. It would apply nicely to all transforms based on composites.

I also agree that the hint belongs on the transform rather than on the PCollection itself.

Thanks!
Regards
JB

On 30/01/2018 21:26, Robert Bradshaw wrote:

Many hints make more sense for PTransforms (the computation itself) than for PCollections. In addition, when we want properties attached to PCollections themselves, it often makes sense to let these be provided by the producing PTransform (e.g. coders and schemas are often functions of the input metadata and the operation itself, and can't just be set arbitrarily).

Also, we already have a perfectly standard way of nesting transforms (or even sets of transforms), namely composite transforms. In terms of API design, I would propose writing a composite transform that applies constraints/hints/requirements to all its inner transforms. This translates nicely to the Fn API as well.

On Tue, Jan 30, 2018 at 12:14 PM, Kenneth Knowles <k...@google.com> wrote:

It seems like most of these use cases are hints on a PTransform and not a PCollection, no? CPU, memory, expected parallelism, etc. are. Then you could just have:

    pc.apply(WithHints(myTransform, <hints>))

For a PCollection, hints that might make sense are bits like total size, element size, and throughput: all things that the Dataflow folks have said should be measured instead of hinted. But I understand that we shouldn't force runners to do infeasible things, like build a whole no-knobs service on top of a super-knobby engine.
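[Editor's note: the composite-transform idea above (a wrapper that applies hints to what it encloses, with the producing transform supplying the output's metadata) can be sketched with toy stand-ins. None of `Collection`, `Transform`, or `withHints` below is real Beam API; they are illustrative names only.]

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy stand-ins for Beam concepts, sketching Robert's idea of a composite
// that applies hints to the transform it wraps. Not Beam API.
final class WithHintsSketch {

    /** A value plus hint metadata, standing in for a PCollection. */
    static final class Collection<T> {
        final T value;
        final Map<String, Object> hints;
        Collection(T value, Map<String, Object> hints) {
            this.value = value;
            this.hints = Collections.unmodifiableMap(new HashMap<>(hints));
        }
    }

    /** A single-method stand-in for a PTransform. */
    interface Transform<I, O> {
        Collection<O> expand(Collection<I> input);
    }

    /**
     * Hypothetical composite: runs the inner transform, then merges the
     * extra hints into the output's metadata. Hints stay advisory; a
     * runner that does not understand a key simply ignores it.
     */
    static <I, O> Transform<I, O> withHints(Transform<I, O> inner,
                                            Map<String, Object> extra) {
        return input -> {
            Collection<O> out = inner.expand(input);
            Map<String, Object> merged = new HashMap<>(out.hints);
            merged.putAll(extra);
            return new Collection<O>(out.value, merged);
        };
    }
}
```

Here the hints ride along on the output metadata precisely because the wrapping transform supplies them, matching Robert's point that PCollection properties are best provided by the producing PTransform.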
Incidentally, for portability we have this "environment" object that is basically the Docker URL of an SDK harness that can execute a function. We always intended that same area of the proto (exact fields TBD) to hold things like requirements for CPU, memory, GPUs, disk, etc. It is likely a good place for hints.

BTW, good idea to ask users@ for their pain points and bring them back to the dev list to motivate feature design discussions.

Kenn

On Tue, Jan 30, 2018 at 12:00 PM, Reuven Lax <re...@google.com> wrote:

I think the hints would logically be metadata in the PCollection, just like coder and schema.

On Jan 30, 2018 11:57 AM, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

Great idea for AddHints.of()!

What would be the resulting PCollection? Just a PCollection of hints, or the pc elements plus the hints?

Regards
JB

On 30/01/2018 20:52, Reuven Lax wrote:

I think adding hints for runners is reasonable, though hints should always be assumed to be optional: they shouldn't change the semantics of the program (otherwise you destroy the portability promise of Beam). However, there are many types of hints that some runners might find useful (e.g. this step needs more memory; this step runs ML algorithms and should run on a machine with GPUs; etc.).

Robert has mentioned in the past that we should try to keep PCollection an immutable object and not introduce new setters on it. We slightly break this already today with PCollection.setCoder, and that has caused some problems. Hints can be set on PTransforms though, and propagate to that PTransform's output PCollections.
This is nearly as easy to use, however, since we can implement a helper PTransform that sets hints. I.e.

    pc.apply(AddHints.of(hint1, hint2, hint3))

is no harder than calling pc.addHint().

Reuven

On Tue, Jan 30, 2018 at 11:39 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Maybe I should have started the discussion on the user mailing list: it would be great to have user feedback on this, even if I got your points.

Sometimes I have the feeling that whatever we propose and discuss, it doesn't go anywhere. At some point, to attract more people, we have to get ideas from different perspectives/standpoints.

Thanks for the feedback anyway.

Regards
JB

On 30/01/2018 20:27, Romain Manni-Bucau wrote:

2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com>:

> I generally like having certain "escape hatches" that are well designed
> and limited in scope, where anything that turns out to be important
> becomes first-class. But this one I don't really like, because the use
> cases belong elsewhere. Of course, hints creep, so you should assume
> they will be unbounded in how much gets stuffed into them. And the
> definition of a "hint" is that deleting it does not change semantics,
> just performance/monitoring/UI etc., but this does not seem to be true
> here.
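[Editor's note: Reuven's AddHints.of(...) helper above could look roughly like the sketch below. All names (`Hint`, `HintedValue`, `addHints`) are hypothetical, not Beam API. The point it illustrates is that the helper is an identity operation on the data that returns a new, hinted view, so the collection itself stays immutable.]

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of the pc.apply(AddHints.of(...)) pattern. Rather than a setter
// on an immutable collection, applying the helper yields a new value whose
// metadata carries the hints. All names here are hypothetical.
final class AddHintsSketch {

    /** A key/value hint, e.g. new Hint("spark.persist", "MEMORY_ONLY"). */
    static final class Hint {
        final String key;
        final Object value;
        Hint(String key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Immutable value-plus-hints pair, standing in for a PCollection. */
    static final class HintedValue<T> {
        final T value;
        final Map<String, Object> hints;
        HintedValue(T value, Map<String, Object> hints) {
            this.value = value;
            this.hints = Collections.unmodifiableMap(new HashMap<>(hints));
        }
    }

    /** AddHints.of(...) equivalent: identity on the data, new hints on the metadata. */
    static <T> HintedValue<T> addHints(HintedValue<T> pc, Hint... hints) {
        Map<String, Object> merged = new HashMap<>(pc.hints);
        for (Hint h : hints) {
            merged.put(h.key, h.value);
        }
        return new HintedValue<T>(pc.value, merged);
    }
}
```

The original collection is never mutated, which sidesteps the PCollection.setCoder-style problems Reuven mentions.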
> "spark.persist" for idempotent replay in a sink:
> - this is already @RequiresStableInput
> - it is not a hint, because if you don't persist, your results are incorrect
> - it is a property of a DoFn / transform, not a PCollection

Let's put this last point aside, since we'll manage to make it work wherever we store it ;).

> schema:
> - should be first-class

Except it doesn't make sense everywhere. It is exactly like saying "implement this" and two lines later "it doesn't do anything for you". If you think more widely about schemas you will want to do far more (like getting them from the previous step, etc.), which makes it not an API thing. However, with a runner like Spark, being able to specify the schema enables optimizing the execution. There is a clear mismatch between a consistent, user-friendly, generic and portable API and a runtime, runner-specific implementation.

This is all a known issue for any portable API, and it is why all EE APIs have a map to pass properties somewhere, so I don't see why Beam wouldn't fall into that exact same bucket, since it embraces the drawbacks of portability and we have already hit this over several releases.

> step parallelism (you didn't mention it, but most runners need some control):
> - this is a property of the data and the pipeline together, not just the
> pipeline

Good one, but this can be configured from the pipeline or even a transform.
This doesn't mean the data is not important (and you are more than right on that point), just that it is configurable without referencing the data; using ranges is a trivial example, even if not the most efficient one.

> So I just don't actually see a use case for free-form hints that we
> haven't already covered.

There are several cases needed to industrialize it, even in the direct runner:
- use that particular executor instance
- debug these infos for that transform
etc.

As a high-level design, I think it is good to bring hints to Beam, to avoid adding ad-hoc solutions each time and risking losing the portability of the main API.

Kenn

On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

Lukasz, the point is that you have the choice either to bring all specificities into the main API, which makes most of the API unusable or unimplementable, or the opposite: not support anything. Introducing hints allows some features (or just some very specific things) to land eagerly for some runners, and once mainstream they can find a place in the main API. This is saner than the opposite, since some specificities can never find a good place.

The little thing we need to take care of with this is to avoid introducing feature flipping, i.e. supporting some feature that is not doable with another runner.
It should really be about tuning a runner's execution (like the schema in Spark).

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau>

2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

Good point, Luke: in that case, the hint will be ignored by the runner if the hint is not for it. The hint can also be generic (not specific to a runner); it could be interesting for the schema support or for IOs.

What do you mean by gathering PTransforms/PCollections, and where?

Thanks!
Regards
JB

On 30/01/2018 18:35, Lukasz Cwik wrote:

If a hint is required to run a person's pipeline well, how do you expect that person to be able to migrate their pipeline to another runner?

A lot of hints like "spark.persist" are really the user trying to tell us something about the PCollection, for example that it is very small.
I would prefer that we gather this information about PTransforms and PCollections instead of adding runner-specific knobs, since then each runner can choose how best to map such a property onto its internal representation.

On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi,

As part of the discussion about schemas, Romain mentioned hints. I think it's worth having an explanation about that, especially as it could be wider than schemas.

Today, to give information to the runner, we use PipelineOptions. The runner can use these options and apply them to all inner representations of the PCollections in the runner.

For instance, for the Spark runner, the persistence storage level (memory, disk, ...) can be defined via pipeline options.

Then, the Spark runner automatically decides whether RDDs have to be persisted (using the storage level defined in the pipeline options), for instance if the same POutput/PCollection is read several times.

However, the user doesn't have any way to provide an indication to the runner on how to deal with a specific PCollection.

Imagine the user has a pipeline like this: pipeline.apply().apply().apply().
We have three PCollections involved in this pipeline. It's not currently possible to give indications of how the runner should "optimize" and deal with the second PCollection only.

The idea is to add a method on PCollection:

    PCollection.addHint(String key, Object value);

For instance:

    collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);

I see three direct usages of this:

1. Related to schemas: the schema definition could be a hint.

2. Related to IOs: add headers for the IO and the runner describing how to specifically process a collection. In Apache Camel, we have something similar with headers on the message and properties on the exchange; they make it possible to tell a Camel component how to process some messages. We can imagine the same for IOs (using the PCollection hints to react accordingly).

3. Related to runner optimization: I see, for instance, a way to choose between RDDs and dataframes in the Spark runner, or even specific optimizations like persist. I have had lots of questions from Spark users saying: "in my Spark job, I know where and how I should use persist (rdd.persist()), but I can't do such an optimization using Beam". So it could be a good improvement.

Thoughts?
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
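[Editor's note: JB's proposed addHint API, combined with the runner-side contract discussed in the thread (a runner reads only the hint keys it understands and ignores the rest), can be sketched as below. `HintedCollection` and `persistLevel` are illustrative names, not Beam or Spark API; the string "MEMORY_ONLY" stands in for Spark's StorageLevel.]

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed PCollection.addHint(String, Object), plus the
// runner-side behavior from the thread: a runner looks up only the hint
// keys it knows and silently ignores everything else. All names here are
// illustrative, not real Beam or Spark API.
final class HintLookupSketch {

    /** Mutable hint bag, standing in for the proposed PCollection API. */
    static final class HintedCollection {
        private final Map<String, Object> hints = new HashMap<>();

        HintedCollection addHint(String key, Object value) {
            hints.put(key, value);
            return this; // fluent, mirroring collection.addHint(...)
        }

        Object hint(String key) {
            return hints.get(key);
        }
    }

    /**
     * How a Spark runner might consume "spark.persist": use the hinted
     * storage level when present, otherwise fall back to a default.
     * Hints aimed at other runners never reach this code path.
     */
    static String persistLevel(HintedCollection pc) {
        Object level = pc.hint("spark.persist");
        return (level instanceof String) ? (String) level : "NONE";
    }
}
```

Because unknown keys are simply never consulted, a pipeline carrying "spark.persist" runs unchanged on a runner that ignores it, which is the optional-hint property Reuven and Lukasz insist on.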