(Warning: I may just be rehashing Davor's email in different language,
but I think it might help.)
I think we could use a few definitions here (aside: do we have a
definitions page? I could not find one):
- a *primitive* is a fundamental component of the Beam programming
model. It should mean the same thing in every language and it should not
be expressible in terms of other primitives. If you look at the Beam
Capability Matrix
<http://beam.incubator.apache.org/capability-matrix/> and the Beam
Technical Vision
<https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing#>
you will see things like GroupByKey and ParDo listed among the primitives.
- an *SDK* is a language-specific implementation of the Beam model. An
SDK will usually have at least 2 runners: a "testing and experimenting"
runner that is likely to be single-process and in-memory; and a "real"
runner that scales to large datasets and may use multiple cores and be
distributed. In some languages, an SDK may have more than two runners (Java:
Direct, InProcess, Cloud Dataflow, Flink, Spark).
I think that Log does not fit well with the definition for a Primitive.
It is likely to have somewhat different definitions and semantics across
languages, and we can already implement it as a Composite PTransform
containing a ParDo+DoFn.
Next, I do think that each SDK needs to have a standard way to log.
This needs to go above and beyond a Logging primitive -- log
messages inside of DoFns or libraries of composite transforms, for
instance, should all be loggable the same way and end up being processed
in a coherent manner.
This standardized logging should not be runner-specific, because we
do not want library authors to have to specialize their logging code to
the intended runner, but it should be "interceptable" by runners to
provide specialized behavior. In Java, for instance, we recommend slf4j
loggers, but each runner can register customized logging handlers.
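To make the "standard API in library code, interceptable by the runner" idea concrete, here is a rough sketch. It uses java.util.logging as a stand-in for slf4j so that it runs on a bare JDK; WordFilterLib and CollectingHandler are made-up names for illustration, not part of any SDK or runner, and a real runner would presumably hook into whatever backend slf4j is bound to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Library code: logs through the standard API and knows nothing about runners. */
final class WordFilterLib {
    static final Logger LOG = Logger.getLogger("example.WordFilterLib");

    static boolean keep(String word) {
        boolean keep = word.length() > 5;
        if (!keep) {
            LOG.info("dropping short word: " + word);
        }
        return keep;
    }
}

/** Hypothetical runner-side handler: collects records so the runner can route them. */
final class CollectingHandler extends Handler {
    final List<String> messages = new ArrayList<>();

    @Override
    public synchronized void publish(LogRecord record) {
        if (isLoggable(record)) {
            messages.add(record.getLevel() + " " + record.getLoggerName()
                    + ": " + record.getMessage());
        }
    }

    @Override
    public void flush() {}

    @Override
    public void close() {}
}

final class InterceptDemo {
    public static void main(String[] args) {
        // The "runner" registers its handler once, on the root logger.
        CollectingHandler handler = new CollectingHandler();
        Logger.getLogger("").addHandler(handler);

        // Library code runs unchanged; its log records reach the runner's handler.
        WordFilterLib.keep("tiny");
        WordFilterLib.keep("elephant");
        handler.messages.forEach(System.out::println);
    }
}
```

The library never mentions a runner, and swapping the handler changes where the messages end up without touching WordFilterLib.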
Third, we get to the question of whether most SDKs should include a Log
PTransform in the "standard library" along with say, the ability to read
and write from text files.
We could make a reasonable argument for yes, if there were common
patterns such as Ismael suggested, e.g., "log if some condition holds",
perhaps with some frequency.
We should think somewhat carefully about what these patterns are:
it's easy to dramatically slow down a pipeline by logging too much, and
it's hard to implement certain primitives in a global way. E.g., "log
every 1 second" would naturally mean "1 second ... per process" if we
had many processes, unless we have a distributed rate limiter.
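A minimal sketch of why "every 1 second" ends up meaning "per process": the limiter below keeps its state in local memory, so each worker enforces the limit on its own. PerProcessRateLimitedLog is a hypothetical helper written for this email, not an existing SDK class, and the two instances in the demo merely stand in for two separate worker processes.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical helper: lets at most one message through per interval, per instance. */
final class PerProcessRateLimitedLog {
    private final long intervalNanos;
    private final LongSupplier clock;      // injectable so the behavior can be tested
    private final Consumer<String> sink;   // where accepted messages go
    private final AtomicLong nextAllowed;  // earliest time the next message may pass

    PerProcessRateLimitedLog(long intervalNanos, LongSupplier clock, Consumer<String> sink) {
        this.intervalNanos = intervalNanos;
        this.clock = clock;
        this.sink = sink;
        this.nextAllowed = new AtomicLong(clock.getAsLong());
    }

    /** Returns true if the message was emitted, false if it was dropped. */
    boolean log(String message) {
        long now = clock.getAsLong();
        long allowedAt = nextAllowed.get();
        // Only one thread can win the compare-and-set for a given interval.
        if (now - allowedAt >= 0 && nextAllowed.compareAndSet(allowedAt, now + intervalNanos)) {
            sink.accept(message);
            return true;
        }
        return false;
    }
}

final class RateLimitDemo {
    public static void main(String[] args) {
        long oneSecond = 1_000_000_000L;
        // Each limiter has its own state, just as each worker process would.
        PerProcessRateLimitedLog workerA =
                new PerProcessRateLimitedLog(oneSecond, System::nanoTime, System.out::println);
        PerProcessRateLimitedLog workerB =
                new PerProcessRateLimitedLog(oneSecond, System::nanoTime, System.out::println);

        workerA.log("A: first");   // passes
        workerA.log("A: second");  // dropped if still within the same second
        workerB.log("B: first");   // passes: B does not see A's state
    }
}
```

With N workers this yields up to N messages per second. Getting a true global limit would need shared state outside the workers, which is the distributed rate limiter mentioned above.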
Hope that makes sense / hope it helps,
Thanks!
Dan
On Sun, Mar 20, 2016 at 1:05 PM, Jean-Baptiste Onofré <[email protected]> wrote:
Welcome aboard and great idea ;)
Probably a DoFn is easier and more straightforward in your case.
Anyway, it would be a good addition to the SDK (as primitives), or
at least in a "Data Integration DSL" (as I thought first).
Regards
JB
On 03/20/2016 09:01 PM, Ismaël Mejía wrote:
Hello,
I agree with you JB, Log is a more appropriate name for the case of 'print'.
We can definitely create a richer transform with your ideas, and we will
discuss the details later on when we start to work together.
The more abstract case, which I call Debug since I didn't find a better name,
is a general transform that can be the base of many others that produce side
effects but don't change the data passing through the PTransform. That's why I
consider it a different (more abstract) transform per se. I implemented the
general predicate + function application just to prove my point; the Log/print
case was just a test of a specific case.
Since I am new to the Dataflow model, I don't know which unintended
consequences this transform can have (or which good practices a transform
with side effects must follow). Additionally, I have not thought about how to
support more advanced features of the model (e.g. side inputs/outputs). Any
ideas?
But well, this is my hello world in the Dataflow model, so we'll see what's to
come :)
-Ismaël
On Sun, Mar 20, 2016 at 4:18 PM, Jean-Baptiste Onofré <[email protected]> wrote:
Hi,
thanks for the update.
IMHO, I would name Debug transform as Log:
.apply(Log.withLevel("DEBUG"))
.apply(Log.withLevel("INFO").withPattern("%d %m ..."))
.apply(Log.withLevel("WARN").withMessage("Foo").withStream("System.out"))
It would be more flexible and related to the actual behavior.
I would mimic a bit the Camel log component for instance.
If you don't mind, I will do it with you.
Thanks
Regards
JB
On 03/20/2016 12:07 PM, Ismaël Mejía wrote:
Hi,
The code of the transform is here, in a playground for Beam experiments I
created (it is a bit alpha for the moment, and it does not have comments):
https://github.com/iemejia/beam-playground/blob/master/src/main/java/org/apache/beam/transforms/Debug.java
Since my initial goal was more of a test scenario in the
DirectPipelineRunner, I haven't yet considered more advanced logging
capabilities or the possible issues of distribution (serialization, in
particular of dependencies, as well as exceptions, etc.), but of course it is
something I expect to improve if there is interest. Do you see some immediate
things to improve so I can try it with the distributed runners? (I want to do
this also as an excuse to try the FlinkRunner.)
Best,
-Ismael
On Sun, Mar 20, 2016 at 11:13 AM, Jean-Baptiste Onofré <[email protected]> wrote:
By the way, for the "Integration" DSL, in addition to an explicit debug
transform, it would make sense to have an implicit "Tracer". It's something
that I planned: it would allow us to sample a PCollection if the pipeline
tracer is enabled (like we do in a Camel route with the tracer).
Regards
JB
On 03/20/2016 10:14 AM, Ismaël Mejía wrote:
Hello,
I just started playing with Beam and I wanted to debug what happens between
transforms in pipelines. I wrote a simple 'Debug' transform for this.
The idea is to apply a function based on a predicate to any element in a
collection without changing the collection, or in other words, a transform
that does not transform but produces side effects.
The idea is better illustrated with this simple example:
.apply(FlatMapElements.via((String text) -> Arrays.asList(text.split(" ")))
    .withOutputType(new TypeDescriptor<String>() {}))
.apply(Debug
    .when((String s) -> s.startsWith("A"))
    .with((String s) -> {
        System.out.println(s);
        return null;
    }))
.apply(Filter.byPredicate((String text) -> text.length() > 5))
.apply(Debug.print()); // sugared method, same as above
I think this can be useful (at least for debugging purposes). Is there
something like this already in the SDK? If this is not the case, can you
please give me some feedback/ideas to improve my transform?
Thanks,
-Ismael
ps. You can find the code of the first version of the transform here:
https://github.com/iemejia/beam-playground/blob/master/src/main/java/org/apache/beam/transforms/Debug.java
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com