Re: Debugging pipelines

Jean-Baptiste Onofré Sun, 20 Mar 2016 23:59:45 -0700

Hi Dan,

great explanation ;)


We are all in the same track ;)

Regards
JB

On 03/20/2016 10:10 PM, Dan Halperin wrote:

(Warning: I may just be rehashing Davor's email in different language,
but I think it might help.)

I think we could use a few definitions here:  (aside: do we have a
definitions page? I could not find one).

- a *primitive* is a fundamental component of the Beam programming
model. It should mean the same thing in every language and it should not
be expressible in terms of other primitives. If you look at the Beam
Compatibility Matrix
<http://beam.incubator.apache.org/capability-matrix/> and the Beam
Technical Vision
<https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing#>
you will see things like GroupByKey and ParDo listed among the primitives.

- an *SDK* is a language-specific implementation of the Beam model. An
SDK will usually have at least 2 runners: a "testing and experimenting"
runner that is likely to be single-process and in-memory; and a "real"
runner that scale to large datasets and may use multiple cores and be
distributed. In some languages, an SDK may have multiple runners (Java:
Direct,InProcess,Cloud Dataflow,Flink,Spark).

I think that Log does not fit well with the definition for a Primitive.
It is likely to have somewhat different definitions and semantics across
languages, and we can already implement it as a Composite PTransform
containing a ParDo+DoFn.

Next, I do think that each SDK needs to have a standard way to log.
    This needs to go above and beyond a Logging primitive -- log
messages inside of DoFns or libraries of composite transforms, for
instance, should all be loggable the same way and end up being processed
in a coherent manner.
    This standardized logging should not be runner-specific, because we
do not want library authors to have to specialize their logging code to
the intended runner, but it should be "interceptable" by runners to
provide specialized behavior. In Java, for instance, we commend slf4j
loggers, but each runner can register customized logging handlers.

Third, we get to the question of whether most SDKs should include a Log
PTransform in the "standard library" along with say, the ability to read
and write from text files.
     We could make a reasonable argument for yes, if there we common
patterns such as Ismael suggested, e.g., "log if some condition holds",
perhaps with some frequency.
     We should think somewhat carefully about what these patterns are:
it's easy to dramatically slow down a pipeline by logging too much, and
it's hard to implement certain primitives in a global way. E.g., "log
every 1 second" would naturally mean "1 second ... per process" if we
had many processes, unless we have a distributed rate limiter.

Hope that makes sense / hope it helps,

Thanks!
Dan

On Sun, Mar 20, 2016 at 1:05 PM, Jean-Baptiste Onofré <[email protected]
<mailto:[email protected]>> wrote:

    Welcome aboard and great idea ;)

    Probably DoFn is easier and straight forward in your case.

    Anyway, it would be a good addition to the SDK (as primitives), or
    at least in a "Data Integration DSL" (as I thought first).

    Regards
    JB

    On 03/20/2016 09:01 PM, Ismaël Mejía wrote:

        Hello,

        I agree with you JB, Log is a more appropriate name for the case
        of 'print',
        we can definitely create a richer transform with your ideas, and
        we will
        discuss the details later on when we start to work together.

        The more abstract case which I call Debug since I didn't find a
        better
        name is a
        general transform that can be the base of many others who
        produce side
        effects
        but don't change the data in the PTransform, that's why I
        consider it a
        different (more abstract) Transform per se, and I implemented
        the general
        predicate + function application just to prove my point, and the
        Log/print case
        was just a test of a specific case.

        Since I am new to the Dataflow model I don't know which unintended
        consequences
        this transform can have (or which good practices a transform that
        side-effects must
        take care of), aditionally I have not thought about how to
        support more
        advanced features of the model (e.g. side inputs/outputs). Any
        ideas ?

        But well, this is my hello world in the Dataflow model, so we'll see
        what's to
        come :)

        -Ismaël



        On Sun, Mar 20, 2016 at 4:18 PM, Jean-Baptiste Onofré
        <[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>> wrote:

             Hi,

             thanks for the update.

             IMHO, I would name Debug transform as Log:

             .apply(Log.withLevel("DEBUG"))
             .apply(Log.withLevel("INFO").withPattern("%d %m ..."))

        .apply(Log.withLevel("WARN").withMessage("Foo").withStream("System.out")

             It would more flexible and related to the actual behavior.

             I would mimic a bit the Camel log component for instance.

             If you don't mind, I will do it with you.

             Thanks
             Regards
             JB

             On 03/20/2016 12:07 PM, Ismaël Mejía wrote:

                 Hi,

                 The code of the transform is here in a playground for Beam
                 experiments I
                 created (it is a bit alpha for the moment, and it does
        not have
                 comments):

        
https://github.com/iemejia/beam-playground/blob/master/src/main/java/org/apache/beam/transforms/Debug.java

                 Since my initial goal was more of a test scenario in the
                 DirectPipelineRunner I haven't considered yet more
        advanced logging
                 capabilities and the possible issues of distribution
                 (serialization, in
                 particular of dependencies, as well as exceptions,
        etc), but of
                 course
                 it is something I expect to improve if there is
        interest. Do you see
                 some immediate things to improve to try it with the
        distributed
                 runners
                 (I want to do this, as a excuse also to  try the
        FlinkRunner).

                 Best,
                 -Ismael


                 On Sun, Mar 20, 2016 at 11:13 AM, Jean-Baptiste Onofré
                 <[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>
                 <mailto:[email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>>> wrote:

                      By the way, for the "Integration" DSL, in addition of
                 explicit debug
                      transform, it would make sense to have an implicit
                 "Tracer". It's
                      something that I planned: it would allow us to
        have sampling on
                      PCollection if the pipeline tracer is enabled
        (like we do
                 in a Camel
                      route with the tracer).

                      Regards
                      JB

                      On 03/20/2016 10:14 AM, Ismaël Mejía wrote:

                          Hello,

                          I just started playing with Beam and I wanted
        to debug
                 what happens
                          between transforms in pipelines. I wrote a
        simple 'Debug'
                          transform for
                          this.
                          The idea is to apply a function based on a
        predicate to any
                          element in a
                          collection without changing the collection, or
        in other
                 words, a
                          transform that
                          does not transform but produces side effects.

                          The idea is better illustrated with this
        simple example:

                                .apply(FlatMapElements.via((String text) ->
                          Arrays.asList(text.split(" ")))
                                  .withOutputType(new
        TypeDescriptor<String>() {
                                 }))
                                .apply(Debug
                                  .when((String s) -> s.startsWith("A"))
                                  .with((String s) -> {
                                    System.out.println(s);
                                    return null;
                                  }));
                                .apply(Filter.byPredicate((String text) ->
                 text.length() > 5))
                                .apply(Debug.print());  // sugared
        method, same
                 as above

                          I think this can be useful (at least for debugging
                 purposes), is
                          there
                          something
                          like this already in the SDK ? If this is not
        the case,
                 can you
                          please
                          give me some
                          feedback/ideas to improve my transform.

                          Thanks,
                          -Ismael

                          ps. You can find the code of the first version
        of the
                 transform
                          here:
        
https://github.com/iemejia/beam-playground/blob/master/src/main/java/org/apache/beam/transforms/Debug.java



                      --
                      Jean-Baptiste Onofré
        [email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>
                 <mailto:[email protected]
        <mailto:[email protected]> <mailto:[email protected]
        <mailto:[email protected]>>>
        http://blog.nanthrax.net
                      Talend - http://www.talend.com



             --
             Jean-Baptiste Onofré
        [email protected] <mailto:[email protected]>
        <mailto:[email protected] <mailto:[email protected]>>
        http://blog.nanthrax.net
             Talend - http://www.talend.com



    --
    Jean-Baptiste Onofré
    [email protected] <mailto:[email protected]>
    http://blog.nanthrax.net
    Talend - http://www.talend.com


--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Debugging pipelines

Reply via email to