I generally like having certain "escape hatches" that are well designed and
limited in scope, where anything that turns out to be important becomes
first-class. But I don't really like this one, because the use cases belong
elsewhere. Of course, escape hatches creep, so you should assume they will
be unbounded in how much gets stuffed into them. And the definition of a
"hint" is that deleting it does not change semantics, only
performance/monitoring/UI etc., but that does not seem to be true here.

"spark.persist" for idempotent replay in a sink:
 - this is already @RequiresStableInput
 - it is not a hint because if you don't persist your results are incorrect
 - it is a property of a DoFn / transform not a PCollection
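
For reference, a rough sketch of how this looks on a DoFn today (Record,
WriteFn and writeIdempotently are illustrative names, not real API):

  static class WriteFn extends DoFn<Record, Void> {
    @ProcessElement
    @RequiresStableInput // runner must replay identical input on retry
    public void processElement(ProcessContext c) {
      // because the input is stable across retries, this write can be
      // made idempotent
      writeIdempotently(c.element());
    }
  }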

schema:
 - should be first-class

step parallelism (you didn't mention it, but most runners need some control):
 - this is a property of the data and the pipeline together, not just of the
pipeline

So I just don't actually see a use case for free-form hints that we haven't
already covered.

Kenn

On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Lukasz, the point is that you have the choice either to bring all
> specificities into the main API, which makes most of the API unusable or
> unimplemented, or the opposite: not support anything. Introducing hints
> would allow some runners to have some features eagerly - or just some very
> specific things - and once a feature is mainstream it can find a place in
> the main API. This is saner than the opposite, since some specificities
> can never find a good place.
>
> The one thing we need to take care of is to avoid introducing feature
> flipping, i.e. supporting a feature in one runner that is not doable with
> another runner. It should really be about tuning a runner's execution
> (like the schema in Spark).
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau>
>
> 2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>
>> Good point Luke: in that case, the hint will be ignored by the runner if
>> the hint is not addressed to it. The hint can also be generic (not
>> specific to a runner): it could be interesting for schema support or IOs,
>> which are not specific to a runner.
>>
>> What do you mean by gathering information about PTransforms/PCollections,
>> and where?
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 30/01/2018 18:35, Lukasz Cwik wrote:
>>
>>> If the hint is required to run the person's pipeline well, how do you
>>> expect that person to be able to migrate their pipeline to another
>>> runner?
>>>
>>> A lot of hints like "spark.persist" are really the user trying to tell
>>> us something about the PCollection, for instance that it is very small.
>>> I would prefer that we gathered this information about PTransforms and
>>> PCollections instead of runner-specific knobs, since then each runner
>>> can choose how best to map such a property onto its internal
>>> representation.
>>>
>>> On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net
>>> <mailto:j...@nanthrax.net>> wrote:
>>>
>>>     Hi,
>>>
>>>     As part of the discussion about schemas, Romain mentioned hints. I
>>>     think it's worth having an explanation about that, especially since
>>>     it could be wider than schemas.
>>>
>>>     Today, to give information to the runner, we use PipelineOptions.
>>>     The runner can use these options, and apply them to all inner
>>>     representations of the PCollections in the runner.
>>>
>>>     For instance, for the Spark runner, the persistence storage level
>>>     (memory, disk,
>>>     ...) can be defined via pipeline options.
>>>
>>>     Then, the Spark runner automatically decides whether RDDs have to be
>>>     persisted (using the storage level defined in the pipeline options),
>>>     for instance if the same POutput/PCollection is read several times.
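>>>
>>>     Roughly, today this looks like the following (a sketch assuming the
>>>     Spark runner's SparkPipelineOptions; the exact option name may
>>>     differ):
>>>
>>>     SparkPipelineOptions options =
>>>         PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
>>>     options.setStorageLevel("MEMORY_ONLY"); // applies to every persisted RDD
>>>     Pipeline pipeline = Pipeline.create(options);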
>>>
>>>     However, the user doesn't have any way to provide indications to the
>>>     runner about how to deal with a specific PCollection.
>>>
>>>     Imagine the user has a pipeline like this:
>>>     pipeline.apply().apply().apply(). We have three PCollections
>>>     involved in this pipeline. It's not currently possible to give
>>>     indications about how the runner should "optimize" and deal with the
>>>     second PCollection only.
>>>
>>>     The idea is to add a method on the PCollection:
>>>
>>>     PCollection.addHint(String key, Object value);
>>>
>>>     For instance:
>>>
>>>     collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>>
>>>     I see three direct usages of this:
>>>
>>>     1. Related to schema: the schema definition could be a hint
>>>     2. Related to the IO: add headers for the IO and the runner on how
>>>     to specifically process a collection. In Apache Camel, we have
>>>     headers on the message and properties on the exchange, similar to
>>>     this. It allows giving the Camel component some indication of how
>>>     to process some messages. We can imagine the same for the IO (using
>>>     the PCollection hints to react accordingly).
>>>     3. Related to runner optimization: I see for instance a way to use
>>>     RDDs or dataframes in the Spark runner, or even specific
>>>     optimizations like persist. I had lots of questions from Spark users
>>>     saying: "in my Spark job, I know where and how I should use persist
>>>     (rdd.persist()), but I can't do such an optimization using Beam". So
>>>     it could be a good improvement (see the sketch below).
>>>
>>>     Thoughts ?
>>>
>>>     Regards
>>>     JB
>>>     --
>>>     Jean-Baptiste Onofré
>>>     jbono...@apache.org <mailto:jbono...@apache.org>
>>>     http://blog.nanthrax.net
>>>     Talend - http://www.talend.com
>>>
>>>
>>>
>