There have been suggestions in the past for Dataflow 1.x to extend
PipelineOptions to be usable per PTransform. So when you apply a PTransform,
you can also provide a set of options that apply to it and all
subtransforms contained within. This is the closest suggestion to what
you're describing that I can think of, but I don't think it's a good fit.

I don't really have a concrete approach for how hints would be passed
through, since I believe that with a sufficiently powerful runner none of
these hints are actually useful: the system should adapt during execution.

Until something is designed, this could be tried out within a specific
runner first, by adding methods to that runner like:
runner.addHint(PValue, String key, String value)
runner.addHint(PTransform, String key, String value)
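
For illustration, a rough sketch of what such a runner-local experiment
could look like (the HintRegistry name and its storage are assumptions,
nothing like it exists in Beam today):

// Hypothetical, runner-local only: a small registry the runner's translator
// consults; none of these names exist in Beam, this is just a sketch.
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PValue;

class HintRegistry {
  private final Map<PValue, Map<String, String>> valueHints = new HashMap<>();
  private final Map<PTransform<?, ?>, Map<String, String>> transformHints = new HashMap<>();

  void addHint(PValue value, String key, String hint) {
    valueHints.computeIfAbsent(value, v -> new HashMap<>()).put(key, hint);
  }

  void addHint(PTransform<?, ?> transform, String key, String hint) {
    transformHints.computeIfAbsent(transform, t -> new HashMap<>()).put(key, hint);
  }

  // The translator asks for a hint when it reaches the corresponding PValue.
  String getHint(PValue value, String key, String defaultValue) {
    return valueHints.getOrDefault(value, Collections.emptyMap())
        .getOrDefault(key, defaultValue);
  }
}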



On Tue, Jan 30, 2018 at 9:42 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Good point Luke: in that case, the hint would simply be ignored by a
> runner it is not intended for. The hint can also be generic (not specific
> to a runner); it could be interesting for schema support or for IOs rather
> than for a particular runner.
>
> What do you mean by gathering information about PTransforms/PCollections,
> and where?
>
> Thanks !
> Regards
> JB
>
> On 30/01/2018 18:35, Lukasz Cwik wrote:
>
>> If the hint is required to run the person's pipeline well, how do you
>> expect that person to be able to migrate their pipeline to another
>> runner?
>>
>> A lot of hints like "spark.persist" are really the user trying to tell us
>> something about the PCollection, for example that it is very small. I
>> would prefer that we gather this information about PTransforms and
>> PCollections instead of adding runner-specific knobs, since then each
>> runner can choose how best to map such a property onto its internal
>> representation.
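>>
>> For example (purely hypothetical key and value names, nothing
>> standardized today), instead of a runner-specific knob such as:
>>
>>     collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>
>> the user would state a property of the data:
>>
>>     collection.addHint("beam.collection.size", "SMALL");
>>
>> and each runner would decide on its own whether that means persisting,
>> broadcasting, or doing nothing at all.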
>>
>> On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>     Hi,
>>
>>     As part of the discussion about schemas, Romain mentioned hints. I
>>     think it's worth explaining that idea, especially since it could be
>>     wider than schemas.
>>
>>     Today, to give information to the runner, we use PipelineOptions. The
>>     runner can use these options and apply them to every internal
>>     representation of a PCollection in the runner.
>>
>>     For instance, for the Spark runner, the persistence storage level
>>     (memory, disk,
>>     ...) can be defined via pipeline options.
>>
>>     Then, the Spark runner automatically decides whether RDDs have to be
>>     persisted (using the storage level defined in the pipeline options),
>>     for instance if the same POutput/PCollection is read several times.
>>
>>     However, the user doesn't have any way to give the runner an
>>     indication about how to deal with a specific PCollection.
>>
>>     Imagine the user has a pipeline like this:
>>     pipeline.apply().apply().apply(). Three PCollections are involved in
>>     this pipeline. It's not currently possible to give indications on how
>>     the runner should optimize and deal with only the second PCollection.
>>
>>     The idea is to add a method on the PCollection:
>>
>>     PCollection.addHint(String key, Object value);
>>
>>     For instance:
>>
>>     collection.addHint("spark.persist", StorageLevel.MEMORY_ONLY);
>>
>>     I see three direct usages of this:
>>
>>     1. Related to schema: the schema definition could be a hint.
>>     2. Related to the IOs: add headers telling the IO and the runner how
>>     to specifically process a collection. In Apache Camel, we have headers
>>     on the message and properties on the exchange, similar to this. They
>>     give an indication of how to process some messages in a Camel
>>     component. We can imagine the same for the IOs (using the PCollection
>>     hints to react accordingly).
>>     3. Related to runner optimization: I see for instance a way to use
>>     RDDs or dataframes in the Spark runner, or even a specific
>>     optimization like persist (a rough sketch follows below). I have had a
>>     lot of questions from Spark users saying: "in my Spark job, I know
>>     where and how I should use persist (rdd.persist()), but I can't do
>>     such an optimization with Beam". So it could be a good improvement.
>>
>>     Thoughts ?
>>
>>     Regards
>>     JB
>>     --
>>     Jean-Baptiste Onofré
>>     jbono...@apache.org
>>     http://blog.nanthrax.net
>>     Talend - http://www.talend.com
>>
>>
>>
