Maybe I should have started the discussion on the user mailing list: it
would be great to have user feedback on this, even if I got your points.
Sometime, I have the feeling that whatever we are proposing and
discussing, it doesn't go anywhere. At some point, to attract more
people, we have to get ideas from different perspective/standpoint.
Thanks for the feedback anyway.
Regards
JB
On 30/01/2018 20:27, Romain Manni-Bucau wrote:
2018-01-30 19:52 GMT+01:00 Kenneth Knowles <k...@google.com
<mailto:k...@google.com>>:
I generally like having certain "escape hatches" that are well
designed and limited in scope, and anything that turns out to be
important becomes first-class. But this one I don't really like
because the use cases belong elsewhere. Of course, they creep so you
should assume they will be unbounded in how much gets stuffed into
them. And the definition of a "hint" is that deleting it does not
change semantics, just performance/monitor/UI etc but this does not
seem to be true.
"spark.persist" for idempotent replay in a sink:
- this is already @RequiresStableInput
- it is not a hint because if you don't persist your results are
incorrect
- it is a property of a DoFn / transform not a PCollection
Let's put this last point aside since we'll manage to make it working
wherever we store it ;).
schema:
- should be first-class
Except it doesn't make sense everywhere. It is exactly like saying
"implement this" and 2 lines later "it doesn't do anything for you". If
you think wider on schema you will want to do far more - like getting
them from the previous step etc... - which makes it not an API thing.
However, with some runner like spark, being able to specifiy it will
enable to optimize the execution. There is a clear mismatch between a
consistent and user friendly generic and portable API, and a runtime,
runner specific, implementation.
This is all fine as an issue for a portable API and why all EE API have
a map to pass properties somewhere so I don't see why beam wouldn't fall
in that exact same bucket since it embraces the drawback of the
portability and we already hit it since several releases.
step parallelism (you didn't mention but most runners need some
control):
- this is a property of the data and the pipeline together, not
just the pipeline
Good one but this can be configured from the pipeline or even a
transform. This doesn't mean the data is not important - and you are
more than right on that point, just that it is configurable without
referencing the data (using ranges is a trivial example even if not the
most efficient).
So I just don't actually see a use case for free-form hints that we
haven't already covered.
There are several cases, even in the direct runner to be able to
industrialize it:
- use that particular executor instance
- debug these infos for that transform
etc...
As a high level design I think it is good to bring hints to beam to
avoid to add ad-hoc solution each time and take the risk to loose the
portability of the main API.
Kenn
On Tue, Jan 30, 2018 at 9:55 AM, Romain Manni-Bucau
<rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
Lukasz, the point is that you have to choice to either bring all
specificities to the main API which makes most of the API not
usable or implemented or the opposite, not support anything.
Introducing hints will allow to have eagerly for some runners
some features - or just some very specific things - and once
mainstream it can find a place in the main API. This is saner
than the opposite since some specificities can never find a good
place.
The little thing we need to take care with that is to avoid to
introduce some feature flipping as support some feature not
doable with another runner. It should really be about runing a
runner execution (like the schema in spark).
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github
<https://github.com/rmannibucau> | LinkedIn
<https://www.linkedin.com/in/rmannibucau>
2018-01-30 18:42 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net
<mailto:j...@nanthrax.net>>:
Good point Luke: in that case, the hint will be ignored by
the runner if the hint is not for him. The hint can be
generic (not specific to a runner). It could be interesting
for the schema support or IOs, not specific to a runner.
What do you mean by gathering PTransforms/PCollections and
where ?
Thanks !
Regards
JB
On 30/01/2018 18:35, Lukasz Cwik wrote:
If the hint is required to run the persons pipeline
well, how do you expect that the person we be able to
migrate their pipeline to another runner?
A lot of hints like "spark.persist" are really the user
trying to tell us something about the PCollection, like
it is very small. I would prefer if we gathered this
information about PTransforms and PCollections instead
of runner specific knobs since then each runner can
choose how best to map such a property on their internal
representation.
On Tue, Jan 30, 2018 at 2:21 AM, Jean-Baptiste Onofré
<j...@nanthrax.net <mailto:j...@nanthrax.net>
<mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> wrote:
Hi,
As part of the discussion about schema, Romain
mentioned hint. I
think it's
worth to have an explanation about that and
especially it could be
wider than
schema.
Today, to give information to the runner, we use
PipelineOptions.
The runner can
use these options, and apply for all inner
representation of the
PCollection in
the runner.
For instance, for the Spark runner, the persistence
storage level
(memory, disk,
...) can be defined via pipeline options.
Then, the Spark runner automatically defines if
RDDs have to be
persisted (using
the storage level defined in the pipeline options),
for instance if
the same
POutput/PCollection is read several time.
However, the user doesn't have any way to provide
indication to the
runner to
deal with a specific PCollection.
Imagine, the user has a pipeline like this:
pipeline.apply().apply().apply(). We
have three PCollections involved in this pipeline.
It's not
currently possible
to give indications how the runner should
"optimized" and deal with
the second
PCollection only.
The idea is to add a method on the PCollection:
PCollection.addHint(String key, Object value);
For instance:
collection.addHint("spark.persist",
StorageLevel.MEMORY_ONLY);
I see three direct usage of this:
1. Related to schema: the schema definition could
be a hint
2. Related to the IO: add headers for the IO and
the runner how to
specifically
process a collection. In Apache Camel, we have
headers on the
message and
properties on the exchange similar to this. It
allows to give some
indication
how to process some messages on the Camel
component. We can imagine
the same of
the IO (using the PCollection hints to react
accordingly).
3. Related to runner optimization: I see for
instance a way to use
RDD or
dataframe in Spark runner, or even specific
optimization like
persist. I had lot
of questions from Spark users saying: "in my Spark
job, I know where
and how I
should use persist (rdd.persist()), but I can't do
such optimization
using
Beam". So it could be a good improvements.
Thoughts ?
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org <mailto:jbono...@apache.org>
<mailto:jbono...@apache.org <mailto:jbono...@apache.org>>
http://blog.nanthrax.net
Talend - http://www.talend.com