Re: [DISCUSSION] Add hint/option on PCollection

Romain Manni-Bucau Tue, 30 Jan 2018 11:27:57 -0800

2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com>:

> I generally like having certain "escape hatches" that are well designed
> and limited in scope, and anything that turns out to be important becomes
> first-class. But this one I don't really like because the use cases belong
> elsewhere. Of course, they creep so you should assume they will be
> unbounded in how much gets stuffed into them. And the definition of a
> "hint" is that deleting it does not change semantics, just
> performance/monitor/UI etc but this does not seem to be true.
>
> "spark.persist" for idempotent replay in a sink:
>  - this is already @RequiresStableInput
>  - it is not a hint because if you don't persist your results are incorrect
>  - it is a property of a DoFn / transform not a PCollection
>


Let's put this last point aside since we'll manage to make it working
wherever we store it ;).


>
> schema:
>  - should be first-class
>

Except it doesn't make sense everywhere. It is exactly like saying
"implement this" and 2 lines later "it doesn't do anything for you". If you
think wider on schema you will want to do far more - like getting them from
the previous step etc... - which makes it not an API thing. However, with
some runner like spark, being able to specifiy it will enable to optimize
the execution. There is a clear mismatch between a consistent and user
friendly generic and portable API, and a runtime, runner specific,
implementation.

This is all fine as an issue for a portable API and why all EE API have a
map to pass properties somewhere so I don't see why beam wouldn't fall in
that exact same bucket since it embraces the drawback of the portability
and we already hit it since several releases.


>
> step parallelism (you didn't mention but most runners need some control):
>  - this is a property of the data and the pipeline together, not just the
> pipeline
>

Good one but this can be configured from the pipeline or even a transform.
This doesn't mean the data is not important - and you are more than right
on that point, just that it is configurable without referencing the data
(using ranges is a trivial example even if not the most efficient).


>
> So I just don't actually see a use case for free-form hints that we
> haven't already covered.
>

There are several cases, even in the direct runner to be able to
industrialize it:
- use that particular executor instance
- debug these infos for that transform

etc...

As a high level design I think it is good to bring hints to beam to avoid
to add ad-hoc solution each time and take the risk to loose the portability
of the main API.


>
> Kenn
>
> On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau <rmannibu...@gmail.com
> > wrote:
>
>> Lukasz, the point is that you have to choice to either bring all
>> specificities to the main API which makes most of the API not usable or
>> implemented or the opposite, not support anything. Introducing hints will
>> allow to have eagerly for some runners some features - or just some very
>> specific things - and once mainstream it can find a place in the main API.
>> This is saner than the opposite since some specificities can never find a
>> good place.
>>
>> The little thing we need to take care with that is to avoid to introduce
>> some feature flipping as support some feature not doable with another
>> runner. It should really be about runing a runner execution (like the
>> schema in spark).
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> | Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau>
>>
>> 2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Good point Luke: in that case, the hint will be ignored by the runner if
>>> the hint is not for him. The hint can be generic (not specific to a
>>> runner). It could be interesting for the schema support or IOs, not
>>> specific to a runner.
>>>
>>> What do you mean by gathering PTransforms/PCollections and where ?
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>> On 30/01/2018 18:35, Lukasz Cwik wrote:
>>>
>>>> If the hint is required to run the persons pipeline well, how do you
>>>> expect that the person we be able to migrate their pipeline to another
>>>> runner?
>>>>
>>>> A lot of hints like "spark.persist" are really the user trying to tell
>>>> us something about the PCollection, like it is very small. I would prefer
>>>> if we gathered this information about PTransforms and PCollections instead
>>>> of runner specific knobs since then each runner can choose how best to map
>>>> such a property on their internal representation.
>>>>
>>>> On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>>>> <mailto:j...@nanthrax.net>> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     As part of the discussion about schema, Romain mentioned hint. I
>>>>     think it's
>>>>     worth to have an explanation about that and especially it could be
>>>>     wider than
>>>>     schema.
>>>>
>>>>     Today, to give information to the runner, we use PipelineOptions.
>>>>     The runner can
>>>>     use these options, and apply for all inner representation of the
>>>>     PCollection in
>>>>     the runner.
>>>>
>>>>     For instance, for the Spark runner, the persistence storage level
>>>>     (memory, disk,
>>>>     ...) can be defined via pipeline options.
>>>>
>>>>     Then, the Spark runner automatically defines if RDDs have to be
>>>>     persisted (using
>>>>     the storage level defined in the pipeline options), for instance if
>>>>     the same
>>>>     POutput/PCollection is read several time.
>>>>
>>>>     However, the user doesn't have any way to provide indication to the
>>>>     runner to
>>>>     deal with a specific PCollection.
>>>>
>>>>     Imagine, the user has a pipeline like this:
>>>>     pipeline.apply().apply().apply(). We
>>>>     have three PCollections involved in this pipeline. It's not
>>>>     currently possible
>>>>     to give indications how the runner should "optimized" and deal with
>>>>     the second
>>>>     PCollection only.
>>>>
>>>>     The idea is to add a method on the PCollection:
>>>>
>>>>     PCollection.addHint(String key, Object value);
>>>>
>>>>     For instance:
>>>>
>>>>     collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>>>
>>>>     I see three direct usage of this:
>>>>
>>>>     1. Related to schema: the schema definition could be a hint
>>>>     2. Related to the IO: add headers for the IO and the runner how to
>>>>     specifically
>>>>     process a collection. In Apache Camel, we have headers on the
>>>>     message and
>>>>     properties on the exchange similar to this. It allows to give some
>>>>     indication
>>>>     how to process some messages on the Camel component. We can imagine
>>>>     the same of
>>>>     the IO (using the PCollection hints to react accordingly).
>>>>     3. Related to runner optimization: I see for instance a way to use
>>>>     RDD or
>>>>     dataframe in Spark runner, or even specific optimization like
>>>>     persist. I had lot
>>>>     of questions from Spark users saying: "in my Spark job, I know where
>>>>     and how I
>>>>     should use persist (rdd.persist()), but I can't do such optimization
>>>>     using
>>>>     Beam". So it could be a good improvements.
>>>>
>>>>     Thoughts ?
>>>>
>>>>     Regards
>>>>     JB
>>>>     --
>>>>     Jean-Baptiste Onofré
>>>>     jbono...@apache.org <mailto:jbono...@apache.org>
>>>>     http://blog.nanthrax.net
>>>>     Talend - http://www.talend.com
>>>>
>>>>
>>>>
>>
>

Re: [DISCUSSION] Add hint/option on PCollection

Reply via email to