Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-09-08 Thread Alex Amato
Hi,

Just wanted to mention that I updated this document with one detail
https://s.apache.org/beam-gcp-debuggability


Date           Changes
Sept 8, 2020   - Clarified that the InstructionRequest/Control Channel will be
                 used in “Proposal: SDKHs to Report non-bundle metrics.”
May 15, 2020   - Completed review with beam dev list.


PTAL, and LMK what you think

On Fri, May 15, 2020 at 6:02 PM Alex Amato  wrote:

> Thanks everyone. I was able to collect a lot of good feedback from
> everyone who contributed. I am going to wrap it up for now and label the
> design as "Design Finalized (Unimplemented)".
>
> I really believe we have made a much better design than I initially wrote
> up. I couldn't have done it without the help of everyone who offered their
> time, energy and viewpoints. :)
>
> Thanks again, please let me know if you see any major issues with the
> design still. I think I have enough information to begin some
> implementation as soon as I have some time in the coming weeks.
> Alex
>
> https://s.apache.org/beam-gcp-debuggability
> https://s.apache.org/beam-histogram-metrics
>
> On Thu, May 14, 2020 at 5:22 PM Alex Amato  wrote:
>
>> Thanks to all who have spent their time on this, there were many great
>> suggestions, just another reminder that tomorrow I will be finalizing the
>> documents, unless there are any major objections left. Please take a look
>> at it if you are interested.
>>
>> I will still welcome feedback at any time :).
>>
>> But I believe we have gathered enough information to produce a good
>> design, which I will start to work on soon.
>> I will begin to build the necessary subset of the new features proposed
>> to support the BigQueryIO metrics use case, proposed.
>> I will likely start with the python SDK first.
>>
>> https://s.apache.org/beam-gcp-debuggability
>> https://s.apache.org/beam-histogram-metrics
>>
>>
>> On Wed, May 13, 2020 at 3:07 PM Alex Amato  wrote:
>>
>>> Thanks again for more feedback :). I have iterated on things again. I'll
>>> report back at the end of the week. If there are no major disagreements
>>> still, I'll close the discussion, believe it to be in a good enough state
>>> to start some implementation. But welcome feedback.
>>>
>>> Latest changes are changing the exponential format to allow denser
>>> buckets. Using only two MonitoringInfoSpec now for all of the IOs to use.
>>> Requiring some labels, but allowing optional
>>> ones for specific IOs to provide more contents.
>>>
>>> https://s.apache.org/beam-gcp-debuggability
>>> https://s.apache.org/beam-histogram-metrics
>>>
>>> On Mon, May 11, 2020 at 4:24 PM Alex Amato  wrote:
>>>
>>>> Thanks for the great feedback so far :). I've included many new ideas,
>>>> and made some revisions. Both docs have changed a fair bit since the
>>>> initial mail out.
>>>>
>>>> https://s.apache.org/beam-gcp-debuggability
>>>> https://s.apache.org/beam-histogram-metrics
>>>>
>>>> PTAL and let me know what you think, and hopefully we can resolve major
>>>> issues by the end of the week. I'll try to finalize things by then, but of
>>>> course always stay open to your great ideas. :)
>>>>
>>>> On Wed, May 6, 2020 at 6:19 PM Alex Amato  wrote:
>>>>
>>>>> Thanks everyone so far for taking a look so far :).
>>>>>
>>>>> I am hoping to have this finalize the two reviews by the end of next
>>>>> week, May 15th.
>>>>>
>>>>> I'll continue to follow up on feedback and make changes, and I will
>>>>> add some more mentions to the documents to draw attention
>>>>>
>>>>> https://s.apache.org/beam-gcp-debuggability
>>>>> https://s.apache.org/beam-histogram-metrics
>>>>>
>>>>> On Wed, May 6, 2020 at 10:00 AM Luke Cwik  wrote:
>>>>>
>>>>>> Thanks, also took a look and left some comments.
>>>>>>
>>>>>> On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I created another design document. This time for GCP IO
>>>>>>> Debuggability Metrics. Which defines some new metrics to collect in the
>>>>>>> GCP IO libraries. This is for monitoring request counts and request
>>>>>>> latencies.
>>>>>>>
>>>>>>> Please take a look and let me know what you think:
>>>>>>> https://s.apache.org/beam-gcp-debuggability
>>>>>>>
>>>>>>> I also sent out a separate design yesterday (
>>>>>>> https://s.apache.org/beam-histogram-metrics) which is related as
>>>>>>> this document uses a Histogram style metric :).
>>>>>>>
>>>>>>> I would love some feedback to make this feature the best possible :D,
>>>>>>> Alex
>>>>>>>
>>>>>>


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-09-08 Thread Alex Amato
Hi Again, Just reviving this thread to mention that I updated the doc with
a few sections:
https://s.apache.org/beam-histogram-metrics


Date           Changes
Sept 8, 2020   - Added alternative section: “Collect Moment Sketch Variables
                 Instead of Bucket Counts” (recommend not pursuing, due to
                 opposing trade-offs and significant implementation/maintenance
                 challenges, but it may be worth pursuing in a future
                 MonitoringInfo type).
               - Added distribution variables: min, max, sum, count.
               - Added alternative section: “Update all distribution metrics to
                 be Histograms” (recommend not pursuing; update to
                 histogramDistribution on a case-by-case basis, due to
                 performance concerns).
May 15, 2020   - Completed review with beam dev list.
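
For readers who have not opened the doc, here is a rough sketch of what "bucket counts plus distribution variables (min, max, sum, count)" could look like in code. It is only an illustration under my own assumptions: the class name `HistogramSketch`, its field names, and the fixed-bounds bucketing are hypothetical and are not taken from the design doc or from any Beam SDK.

```python
import bisect
import math
from dataclasses import dataclass, field


@dataclass
class HistogramSketch:
    """Hypothetical histogram: bucket counts plus min, max, sum and count."""

    # Sorted upper bounds of the finite buckets; one overflow bucket is added.
    bounds: list
    counts: list = field(default_factory=list)
    count: int = 0
    total: float = 0.0
    min_value: float = math.inf
    max_value: float = -math.inf

    def __post_init__(self):
        if not self.counts:
            self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value):
        # A value lands in the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value
        self.min_value = min(self.min_value, value)
        self.max_value = max(self.max_value, value)

    def merge(self, other):
        # Merging only needs element-wise sums plus min/max, so partial
        # results from different workers can be combined in any order.
        if self.bounds != other.bounds:
            raise ValueError("bucket bounds must match")
        return HistogramSketch(
            bounds=list(self.bounds),
            counts=[a + b for a, b in zip(self.counts, other.counts)],
            count=self.count + other.count,
            total=self.total + other.total,
            min_value=min(self.min_value, other.min_value),
            max_value=max(self.max_value, other.max_value),
        )
```

The point of carrying min, max, sum and count next to the buckets is that exact values for those four stay available even though the bucket counts only approximate the shape of the distribution.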


PTAL and LMK what you think :)

On Wed, May 6, 2020 at 9:58 AM Luke Cwik  wrote:

> Thanks Alex, I had some minor comments.
>
> On Mon, May 4, 2020 at 4:04 PM Alex Amato  wrote:
>
>> Thanks Ismaël :). Done
>>
>> On Mon, May 4, 2020 at 3:59 PM Ismaël Mejía  wrote:
>>
>>> Moving the short link to this thread
>>> https://s.apache.org/beam-histogram-metrics
>>>
>>> Alex can you add this link (and any other of your documents that may
>>> not be there) to
>>> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents
>>>
>>>
>>> On Tue, May 5, 2020 at 12:51 AM Pablo Estrada 
>>> wrote:
>>> >
>>> > FYI +Boyuan Zhang worked on implementing a histogram metric that was
>>> performance-optimized into outer space for Python : ) - I don't recall if
>>> she ended up getting it merged, but it's worth looking at the work. I also
>>> remember Scott Wegner wrote the metrics for Java.
>>> >
>>> > Best
>>> > -P.
>>> >
>>> > On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> I have created a proposal for Apache Beam FN API to support Histogram
>>> Style Metrics. Which defines a method to collect Histogram style metrics
>>> and pass them over the FN API.
>>> >>
>>> >> I would love to hear your feedback in order to improve this proposal,
>>> please let me know what you think. Thanks for taking a look :)
>>> >> Alex
>>>
>>


Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-15 Thread Alex Amato
Thanks everyone. I was able to collect a lot of good feedback from everyone
who contributed. I am going to wrap it up for now and label the design as
"Design Finalized (Unimplemented)".

I really believe we have made a much better design than I initially wrote
up. I couldn't have done it without the help of everyone who offered their
time, energy and viewpoints. :)

Thanks again, please let me know if you see any major issues with the
design still. I think I have enough information to begin some
implementation as soon as I have some time in the coming weeks.
Alex

https://s.apache.org/beam-gcp-debuggability
https://s.apache.org/beam-histogram-metrics

On Thu, May 14, 2020 at 5:22 PM Alex Amato  wrote:

> Thanks to all who have spent their time on this, there were many great
> suggestions, just another reminder that tomorrow I will be finalizing the
> documents, unless there are any major objections left. Please take a look
> at it if you are interested.
>
> I will still welcome feedback at any time :).
>
> But I believe we have gathered enough information to produce a good
> design, which I will start to work on soon.
> I will begin to build the necessary subset of the new features proposed to
> support the BigQueryIO metrics use case, proposed.
> I will likely start with the python SDK first.
>
> https://s.apache.org/beam-gcp-debuggability
> https://s.apache.org/beam-histogram-metrics
>
>
> On Wed, May 13, 2020 at 3:07 PM Alex Amato  wrote:
>
>> Thanks again for more feedback :). I have iterated on things again. I'll
>> report back at the end of the week. If there are no major disagreements
>> still, I'll close the discussion, believe it to be in a good enough state
>> to start some implementation. But welcome feedback.
>>
>> Latest changes are changing the exponential format to allow denser
>> buckets. Using only two MonitoringInfoSpec now for all of the IOs to use.
>> Requiring some labels, but allowing optional
>> ones for specific IOs to provide more contents.
>>
>> https://s.apache.org/beam-gcp-debuggability
>> https://s.apache.org/beam-histogram-metrics
>>
>> On Mon, May 11, 2020 at 4:24 PM Alex Amato  wrote:
>>
>>> Thanks for the great feedback so far :). I've included many new ideas,
>>> and made some revisions. Both docs have changed a fair bit since the
>>> initial mail out.
>>>
>>> https://s.apache.org/beam-gcp-debuggability
>>> https://s.apache.org/beam-histogram-metrics
>>>
>>> PTAL and let me know what you think, and hopefully we can resolve major
>>> issues by the end of the week. I'll try to finalize things by then, but of
>>> course always stay open to your great ideas. :)
>>>
>>> On Wed, May 6, 2020 at 6:19 PM Alex Amato  wrote:
>>>
>>>> Thanks everyone so far for taking a look so far :).
>>>>
>>>> I am hoping to have this finalize the two reviews by the end of next
>>>> week, May 15th.
>>>>
>>>> I'll continue to follow up on feedback and make changes, and I will add
>>>> some more mentions to the documents to draw attention
>>>>
>>>> https://s.apache.org/beam-gcp-debuggability
>>>> https://s.apache.org/beam-histogram-metrics
>>>>
>>>> On Wed, May 6, 2020 at 10:00 AM Luke Cwik  wrote:
>>>>
>>>>> Thanks, also took a look and left some comments.
>>>>>
>>>>> On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I created another design document. This time for GCP IO Debuggability
>>>>>> Metrics. Which defines some new metrics to collect in the GCP IO
>>>>>> libraries. This is for monitoring request counts and request latencies.
>>>>>>
>>>>>> Please take a look and let me know what you think:
>>>>>> https://s.apache.org/beam-gcp-debuggability
>>>>>>
>>>>>> I also sent out a separate design yesterday (
>>>>>> https://s.apache.org/beam-histogram-metrics) which is related as
>>>>>> this document uses a Histogram style metric :).
>>>>>>
>>>>>> I would love some feedback to make this feature the best possible :D,
>>>>>> Alex
>>>>>>
>>>>>


Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-14 Thread Alex Amato
Thanks to all who have spent their time on this; there were many great
suggestions. Just another reminder that tomorrow I will be finalizing the
documents, unless there are any major objections left. Please take a look
at them if you are interested.

I will still welcome feedback at any time :).

But I believe we have gathered enough information to produce a good design,
which I will start to work on soon.
I will begin by building the subset of the proposed new features needed to
support the BigQueryIO metrics use case.
I will likely start with the Python SDK first.

https://s.apache.org/beam-gcp-debuggability
https://s.apache.org/beam-histogram-metrics


On Wed, May 13, 2020 at 3:07 PM Alex Amato  wrote:

> Thanks again for more feedback :). I have iterated on things again. I'll
> report back at the end of the week. If there are no major disagreements
> still, I'll close the discussion, believe it to be in a good enough state
> to start some implementation. But welcome feedback.
>
> Latest changes are changing the exponential format to allow denser
> buckets. Using only two MonitoringInfoSpec now for all of the IOs to use.
> Requiring some labels, but allowing optional
> ones for specific IOs to provide more contents.
>
> https://s.apache.org/beam-gcp-debuggability
> https://s.apache.org/beam-histogram-metrics
>
> On Mon, May 11, 2020 at 4:24 PM Alex Amato  wrote:
>
>> Thanks for the great feedback so far :). I've included many new ideas,
>> and made some revisions. Both docs have changed a fair bit since the
>> initial mail out.
>>
>> https://s.apache.org/beam-gcp-debuggability
>> https://s.apache.org/beam-histogram-metrics
>>
>> PTAL and let me know what you think, and hopefully we can resolve major
>> issues by the end of the week. I'll try to finalize things by then, but of
>> course always stay open to your great ideas. :)
>>
>> On Wed, May 6, 2020 at 6:19 PM Alex Amato  wrote:
>>
>>> Thanks everyone so far for taking a look so far :).
>>>
>>> I am hoping to have this finalize the two reviews by the end of next
>>> week, May 15th.
>>>
>>> I'll continue to follow up on feedback and make changes, and I will add
>>> some more mentions to the documents to draw attention
>>>
>>> https://s.apache.org/beam-gcp-debuggability
>>>  https://s.apache.org/beam-histogram-metrics
>>>
>>> On Wed, May 6, 2020 at 10:00 AM Luke Cwik  wrote:
>>>
>>>> Thanks, also took a look and left some comments.
>>>>
>>>> On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I created another design document. This time for GCP IO Debuggability
>>>>> Metrics. Which defines some new metrics to collect in the GCP IO
>>>>> libraries. This is for monitoring request counts and request latencies.
>>>>>
>>>>> Please take a look and let me know what you think:
>>>>> https://s.apache.org/beam-gcp-debuggability
>>>>>
>>>>> I also sent out a separate design yesterday (
>>>>> https://s.apache.org/beam-histogram-metrics) which is related as this
>>>>> document uses a Histogram style metric :).
>>>>>
>>>>> I would love some feedback to make this feature the best possible :D,
>>>>> Alex
>>>>>



Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-13 Thread Alex Amato
Thanks again for more feedback :). I have iterated on things again. I'll
report back at the end of the week. If there are still no major
disagreements, I'll close the discussion, as I believe the design is in a
good enough state to start some implementation. But I still welcome feedback.

The latest changes:
   - Changed the exponential format to allow denser buckets.
   - Now using only two MonitoringInfoSpecs for all of the IOs.
   - Requiring some labels, but allowing optional ones so that specific IOs
     can provide additional information.
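
To give a feel for what "denser buckets" means for an exponential histogram, here is a small sketch. It is not the format from the doc: the base-2 scheme, the `scale` parameter (buckets per power of two) and both function names are assumptions made up for this illustration.

```python
import math


def bucket_index(value, scale):
    """Index i such that 2**(i/scale) <= value < 2**((i+1)/scale).

    Hypothetical scheme: `scale` is the number of buckets per power of two,
    so a larger `scale` gives narrower (denser) buckets.
    """
    if value <= 0:
        raise ValueError("value must be positive")
    return math.floor(math.log2(value) * scale)


def bucket_bounds(index, scale):
    """Lower (inclusive) and upper (exclusive) bound of bucket `index`."""
    return 2.0 ** (index / scale), 2.0 ** ((index + 1) / scale)
```

With `scale=1` a 10 ms latency falls in the bucket covering 8 to 16 ms; with `scale=4` it falls in a bucket covering roughly 9.5 to 11.3 ms, so the same value is located much more precisely at the cost of more buckets.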

https://s.apache.org/beam-gcp-debuggability
https://s.apache.org/beam-histogram-metrics

On Mon, May 11, 2020 at 4:24 PM Alex Amato  wrote:

> Thanks for the great feedback so far :). I've included many new ideas, and
> made some revisions. Both docs have changed a fair bit since the initial
> mail out.
>
> https://s.apache.org/beam-gcp-debuggability
> https://s.apache.org/beam-histogram-metrics
>
> PTAL and let me know what you think, and hopefully we can resolve major
> issues by the end of the week. I'll try to finalize things by then, but of
> course always stay open to your great ideas. :)
>
> On Wed, May 6, 2020 at 6:19 PM Alex Amato  wrote:
>
>> Thanks everyone so far for taking a look so far :).
>>
>> I am hoping to have this finalize the two reviews by the end of next
>> week, May 15th.
>>
>> I'll continue to follow up on feedback and make changes, and I will add
>> some more mentions to the documents to draw attention
>>
>> https://s.apache.org/beam-gcp-debuggability
>>  https://s.apache.org/beam-histogram-metrics
>>
>> On Wed, May 6, 2020 at 10:00 AM Luke Cwik  wrote:
>>
>>> Thanks, also took a look and left some comments.
>>>
>>> On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:
>>>
>>>> Hello,
>>>>
>>>> I created another design document. This time for GCP IO Debuggability
>>>> Metrics. Which defines some new metrics to collect in the GCP IO libraries.
>>>> This is for monitoring request counts and request latencies.
>>>>
>>>> Please take a look and let me know what you think:
>>>> https://s.apache.org/beam-gcp-debuggability
>>>>
>>>> I also sent out a separate design yesterday (
>>>> https://s.apache.org/beam-histogram-metrics) which is related as this
>>>> document uses a Histogram style metric :).
>>>>
>>>> I would love some feedback to make this feature the best possible :D,
>>>> Alex
>>>>
>>>


Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-11 Thread Alex Amato
Thanks for the great feedback so far :). I've included many new ideas, and
made some revisions. Both docs have changed a fair bit since the initial
mail out.

https://s.apache.org/beam-gcp-debuggability
https://s.apache.org/beam-histogram-metrics

PTAL and let me know what you think, and hopefully we can resolve major
issues by the end of the week. I'll try to finalize things by then, but of
course always stay open to your great ideas. :)

On Wed, May 6, 2020 at 6:19 PM Alex Amato  wrote:

> Thanks everyone so far for taking a look so far :).
>
> I am hoping to have this finalize the two reviews by the end of next week,
> May 15th.
>
> I'll continue to follow up on feedback and make changes, and I will add
> some more mentions to the documents to draw attention
>
> https://s.apache.org/beam-gcp-debuggability
>  https://s.apache.org/beam-histogram-metrics
>
> On Wed, May 6, 2020 at 10:00 AM Luke Cwik  wrote:
>
>> Thanks, also took a look and left some comments.
>>
>> On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:
>>
>>> Hello,
>>>
>>> I created another design document. This time for GCP IO Debuggability
>>> Metrics. Which defines some new metrics to collect in the GCP IO libraries.
>>> This is for monitoring request counts and request latencies.
>>>
>>> Please take a look and let me know what you think:
>>> https://s.apache.org/beam-gcp-debuggability
>>>
>>> I also sent out a separate design yesterday (
>>> https://s.apache.org/beam-histogram-metrics) which is related as this
>>> document uses a Histogram style metric :).
>>>
>>> I would love some feedback to make this feature the best possible :D,
>>> Alex
>>>
>>


Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-06 Thread Alex Amato
Thanks to everyone who has taken a look so far :).

I am hoping to finalize the two reviews by the end of next week,
May 15th.

I'll continue to follow up on feedback and make changes, and I will add
some more mentions to the documents to draw attention.

https://s.apache.org/beam-gcp-debuggability
https://s.apache.org/beam-histogram-metrics

On Wed, May 6, 2020 at 10:00 AM Luke Cwik  wrote:

> Thanks, also took a look and left some comments.
>
> On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:
>
>> Hello,
>>
>> I created another design document. This time for GCP IO Debuggability
>> Metrics. Which defines some new metrics to collect in the GCP IO libraries.
>> This is for monitoring request counts and request latencies.
>>
>> Please take a look and let me know what you think:
>> https://s.apache.org/beam-gcp-debuggability
>>
>> I also sent out a separate design yesterday (
>> https://s.apache.org/beam-histogram-metrics) which is related as this
>> document uses a Histogram style metric :).
>>
>> I would love some feedback to make this feature the best possible :D,
>> Alex
>>
>


Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-06 Thread Luke Cwik
Thanks, also took a look and left some comments.

On Tue, May 5, 2020 at 6:24 PM Alex Amato  wrote:

> Hello,
>
> I created another design document. This time for GCP IO Debuggability
> Metrics. Which defines some new metrics to collect in the GCP IO libraries.
> This is for monitoring request counts and request latencies.
>
> Please take a look and let me know what you think:
> https://s.apache.org/beam-gcp-debuggability
>
> I also sent out a separate design yesterday (
> https://s.apache.org/beam-histogram-metrics) which is related as this
> document uses a Histogram style metric :).
>
> I would love some feedback to make this feature the best possible :D,
> Alex
>


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-06 Thread Luke Cwik
Thanks Alex, I had some minor comments.

On Mon, May 4, 2020 at 4:04 PM Alex Amato  wrote:

> Thanks Ismaël :). Done
>
> On Mon, May 4, 2020 at 3:59 PM Ismaël Mejía  wrote:
>
>> Moving the short link to this thread
>> https://s.apache.org/beam-histogram-metrics
>>
>> Alex can you add this link (and any other of your documents that may
>> not be there) to
>> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents
>>
>>
>> On Tue, May 5, 2020 at 12:51 AM Pablo Estrada  wrote:
>> >
>> > FYI +Boyuan Zhang worked on implementing a histogram metric that was
>> performance-optimized into outer space for Python : ) - I don't recall if
>> she ended up getting it merged, but it's worth looking at the work. I also
>> remember Scott Wegner wrote the metrics for Java.
>> >
>> > Best
>> > -P.
>> >
>> > On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:
>> >>
>> >> Hello,
>> >>
>> >> I have created a proposal for Apache Beam FN API to support Histogram
>> Style Metrics. Which defines a method to collect Histogram style metrics
>> and pass them over the FN API.
>> >>
>> >> I would love to hear your feedback in order to improve this proposal,
>> please let me know what you think. Thanks for taking a look :)
>> >> Alex
>>
>


[Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-05 Thread Alex Amato
Hello,

I created another design document, this time for GCP IO Debuggability
Metrics. It defines some new metrics to collect in the GCP IO libraries,
for monitoring request counts and request latencies.
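
As a rough picture of the kind of instrumentation this is about, the sketch below wraps a request, counts it by outcome and records its latency. Everything in it is hypothetical: `RequestMetrics`, the `(service, method, status)` key and the in-memory storage are assumptions for illustration, not the labels or APIs proposed in the doc.

```python
import time
from collections import Counter, defaultdict


class RequestMetrics:
    """Hypothetical in-memory recorder; not a Beam API."""

    def __init__(self):
        # (service, method, status) -> number of requests
        self.request_counts = Counter()
        # (service, method) -> list of latencies in milliseconds
        self.latencies_ms = defaultdict(list)

    def call(self, service, method, fn, *args, **kwargs):
        """Run fn, recording one request count and one latency sample."""
        start = time.monotonic()
        status = "ok"
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Failures are counted under the exception type, then re-raised.
            status = type(exc).__name__
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            self.request_counts[(service, method, status)] += 1
            self.latencies_ms[(service, method)].append(elapsed_ms)
```

In a real IO the latency samples would feed a histogram style metric rather than a list, which is why this doc depends on the histogram proposal.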

Please take a look and let me know what you think:
https://s.apache.org/beam-gcp-debuggability

I also sent out a separate design yesterday (
https://s.apache.org/beam-histogram-metrics) which is related as this
document uses a Histogram style metric :).

I would love some feedback to make this feature the best possible :D,
Alex


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Alex Amato
Thanks Ismaël :). Done

On Mon, May 4, 2020 at 3:59 PM Ismaël Mejía  wrote:

> Moving the short link to this thread
> https://s.apache.org/beam-histogram-metrics
>
> Alex can you add this link (and any other of your documents that may
> not be there) to
> https://cwiki.apache.org/confluence/display/BEAM/Design+Documents
>
>
> On Tue, May 5, 2020 at 12:51 AM Pablo Estrada  wrote:
> >
> > FYI +Boyuan Zhang worked on implementing a histogram metric that was
> performance-optimized into outer space for Python : ) - I don't recall if
> she ended up getting it merged, but it's worth looking at the work. I also
> remember Scott Wegner wrote the metrics for Java.
> >
> > Best
> > -P.
> >
> > On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:
> >>
> >> Hello,
> >>
> >> I have created a proposal for Apache Beam FN API to support Histogram
> Style Metrics. Which defines a method to collect Histogram style metrics
> and pass them over the FN API.
> >>
> >> I would love to hear your feedback in order to improve this proposal,
> please let me know what you think. Thanks for taking a look :)
> >> Alex
>


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Ismaël Mejía
Moving the short link to this thread
https://s.apache.org/beam-histogram-metrics

Alex can you add this link (and any other of your documents that may
not be there) to
https://cwiki.apache.org/confluence/display/BEAM/Design+Documents


On Tue, May 5, 2020 at 12:51 AM Pablo Estrada  wrote:
>
> FYI +Boyuan Zhang worked on implementing a histogram metric that was 
> performance-optimized into outer space for Python : ) - I don't recall if she 
> ended up getting it merged, but it's worth looking at the work. I also 
> remember Scott Wegner wrote the metrics for Java.
>
> Best
> -P.
>
> On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:
>>
>> Hello,
>>
>> I have created a proposal for Apache Beam FN API to support Histogram Style 
>> Metrics. Which defines a method to collect Histogram style metrics and pass 
>> them over the FN API.
>>
>> I would love to hear your feedback in order to improve this proposal, please 
>> let me know what you think. Thanks for taking a look :)
>> Alex


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Pablo Estrada
FYI +Boyuan Zhang  worked on implementing a histogram
metric that was performance-optimized into outer space for Python : ) - I
don't recall if she ended up getting it merged, but it's worth looking at
the work. I also remember Scott Wegner wrote the metrics for Java.

Best
-P.

On Mon, May 4, 2020 at 3:33 PM Alex Amato  wrote:

> Hello,
>
> I have created a proposal for Apache Beam FN API to support Histogram
> Style Metrics
> <https://docs.google.com/document/d/1kiNG2BAR-51pRdBCK4-XFmc0WuIkSuBzeb__Zv8owbU/edit#>.
> Which defines a method to collect Histogram style metrics and pass them
> over the FN API.
>
> I would love to hear your feedback in order to improve this
> proposal, please let me know what you think. Thanks for taking a look :)
> Alex
>


[Proposal] Apache Beam Fn API - Histogram Style Metrics (Correct link this time)

2020-05-04 Thread Alex Amato
Hello,

I have created a proposal for the Apache Beam Fn API to support Histogram Style
Metrics
<https://docs.google.com/document/d/1kiNG2BAR-51pRdBCK4-XFmc0WuIkSuBzeb__Zv8owbU/edit#>,
which defines a method to collect Histogram style metrics and pass them
over the Fn API.

I would love to hear your feedback in order to improve this
proposal. Please let me know what you think. Thanks for taking a look :)
Alex


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics

2020-05-04 Thread Alex Amato
Sorry, wrong link. Let's close this thread and I'll send another...

On Mon, May 4, 2020 at 3:28 PM Pablo Estrada  wrote:

> Hi Alex!
> Thanks for the proposal. I've created
> https://s.apache.org/beam-histogram-metrics
>
> On Mon, May 4, 2020 at 2:44 PM Alex Amato  wrote:
>
>> Hello,
>>
>> I have created a proposal for Apache Beam FN API to support Histogram
>> Style Metrics
>> <https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit#heading=h.c6fjf0g6rsbc>.
>> Which defines a method to collect Histogram style metrics and pass them
>> over the FN API.
>>
>> Also, I would appreciate it if someone could generate an s.apache.org
>> link for this document? Unless there is some way for me to do it myself.
>>
>> I would love to hear your feedback in order to improve this
>> proposal, please let me know what you think. Thanks for taking a look :)
>> Alex
>>
>


Re: [Proposal] Apache Beam Fn API - Histogram Style Metrics

2020-05-04 Thread Pablo Estrada
Hi Alex!
Thanks for the proposal. I've created
https://s.apache.org/beam-histogram-metrics

On Mon, May 4, 2020 at 2:44 PM Alex Amato  wrote:

> Hello,
>
> I have created a proposal for Apache Beam FN API to support Histogram
> Style Metrics
> <https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit#heading=h.c6fjf0g6rsbc>.
> Which defines a method to collect Histogram style metrics and pass them
> over the FN API.
>
> Also, I would appreciate it if someone could generate an s.apache.org
> link for this document? Unless there is some way for me to do it myself.
>
> I would love to hear your feedback in order to improve this
> proposal, please let me know what you think. Thanks for taking a look :)
> Alex
>


[Proposal] Apache Beam Fn API - Histogram Style Metrics

2020-05-04 Thread Alex Amato
Hello,

I have created a proposal for the Apache Beam Fn API to support Histogram Style
Metrics
<https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit#heading=h.c6fjf0g6rsbc>,
which defines a method to collect Histogram style metrics and pass them
over the Fn API.

Also, I would appreciate it if someone could generate an s.apache.org link
for this document, unless there is some way for me to do it myself.

I would love to hear your feedback in order to improve this
proposal. Please let me know what you think. Thanks for taking a look :)
Alex


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-19 Thread Alex Amato
Hello,

I have rewritten most of the proposal. Though I think that there is some
more research that needs to be done to get the Metric specification
perfect. I plan to do more research, and would like to ask you all for more
help to make this proposal better. In particular, now that the metrics
format by default is designed to allow metrics to pass through to
monitoring collection systems such as Dropwizard and Stackdriver, they need
to be complete enough to be compatible with these systems.

I think some changes will be needed to fulfill this, but I wanted to send
out this document, which contains the general idea, and continue refining
it.

Please take a look and let me know what you think.
https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit

Major Revision: April 17, 2018

The design has been reworked, to use a metric format which resembles
Dropwizard and Stackdriver formats, allowing metrics to be passed through.

The generic bytes payload style of metrics is still available but is
reserved for complex use cases which do not fit into these typical metrics
collection systems.

Note: This document isn’t 100% complete. There are a few areas which need
to be improved; through our discussion and more research I want to complete
these details. Please share any thoughts that you have.

   1. The metric specification and Metric proto schemas may need revisions:
      1. The distribution format needs to be refined so that it's compatible
         with Stackdriver and Dropwizard. The current example format is. A second
         distribution format need.
      2. Annotations need to be examined in detail, to determine whether there
         are first class annotations which should be supported to pass through
         properly to Dropwizard and Stackdriver.
      3. Aggregation functions may need parameters. For example, Top(n) may
         need to be parameterized. How should this best be supported?
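
To illustrate the open question about parameterized aggregations, one possible shape (purely a sketch, not something the doc specifies) is a factory that takes the parameter and returns the combining function. The name `top_n` and the list-based representation of partial results are assumptions for this example.

```python
import heapq


def top_n(n):
    """Return a combiner that keeps the n largest values seen so far.

    Hypothetical helper: the parameter n is bound once, and the returned
    function merges any number of partial results.
    """
    if n <= 0:
        raise ValueError("n must be positive")

    def combine(*partials):
        merged = [value for partial in partials for value in partial]
        return heapq.nlargest(n, merged)

    return combine
```

Because combining already-combined partial results gives the same answer as combining the raw values, such a function can be applied per bundle and then again by the runner, as long as both sides agree on n.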






On Tue, Apr 17, 2018 at 11:10 AM Ben Chambers  wrote:

> That sounds like a very reasonable choice -- given the discussion seemed
> to be focusing on the differences between these two categories, separating
> them will allow the proposal (and implementation) to address each category
> in the best way possible without needing to make compromises.
>
> Looking forward to the updated proposal.
>
> On Tue, Apr 17, 2018 at 10:53 AM Alex Amato  wrote:
>
>> Hello,
>>
>> I just wanted to give an update.
>>
>> After some discussion, I've realized that it's best to break up the two
>> concepts, with two separate ways of reporting monitoring data. These two
>> categories are:
>>
>>    1. Metrics - Counters, Gauges, Distributions. These are well defined
>>    concepts for monitoring information and need to integrate with existing
>>    metrics collection systems such as Dropwizard and Stackdriver. Most
>>    metrics will go through this model, which will allow runners to process
>>    new metrics without adding extra code to support them, forwarding them
>>    to metric collection systems.
>>    2. Monitoring State - This supports general monitoring data which may
>>    not fit into the standard model for Metrics. For example an I/O source may
>>    provide a table of filenames+metadata, for files which are old and
>>    blocking the system. I will propose a general approach, similar to the
>>    URN+payload approach used in the doc right now.
>>
> One thing to keep in mind -- even though it makes sense to allow each I/O
> source to define their own monitoring state, this then shifts
> responsibility for collecting that information to each runner and
> displaying that information to every consumer. It would be reasonable to
> see if there could be a set of 10 or so that covered most of the cases that
> could become the "standard" set (eg., watermark information, performance
> information, etc.).
>
>
>> I will rewrite most of the doc and propose separating these two very
>> different use cases, one which optimizes for integration with existing
>> monitoring systems. The other which optimizes for flexibility, allowing
>> more complex and custom metrics formats for other debugging scenarios.
>>
>> I just wanted to give a brief update on the direction of this change,
>> before writing it up in full detail.
>>
>>
>> On Mon, Apr 16, 2018 at 10:36 AM Robert Bradshaw 
>> wrote:
>>
>>> I agree that the user/system dichotomy is false; the real question is
>>> how counters can be scoped to avoid accidental (or even intentional)
>>> interference. A system that entirely controls the interaction between the
>>> "user" (from its perspective) and the underlying system can do this by
>>> prefixing all requested "user" counters with a prefix it will not use
>>> itself. Of course this breaks down whenever the wrapping isn't complete
>>> (either on the production or consumption side), but may be worth doing for
>>> some components (like the SDKs that value being able to provide this
>>> isolation for better behavi

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-17 Thread Ben Chambers
That sounds like a very reasonable choice -- given the discussion seemed to
be focusing on the differences between these two categories, separating
them will allow the proposal (and implementation) to address each category
in the best way possible without needing to make compromises.

Looking forward to the updated proposal.

On Tue, Apr 17, 2018 at 10:53 AM Alex Amato  wrote:

> Hello,
>
> I just wanted to give an update.
>
> After some discussion, I've realized that it's best to break up the two
> concepts, with two separate ways of reporting monitoring data. These two
> categories are:
>
>    1. Metrics - Counters, Gauges, Distributions. These are well-defined
>    concepts for monitoring information and need to integrate with existing
>    metrics collection systems such as Dropwizard and Stackdriver. Most metrics
>    will go through this model, which will allow runners to process new metrics
>    without adding extra code to support them, forwarding them to metric
>    collection systems.
>    2. Monitoring State - This supports general monitoring data which may
>    not fit into the standard model for Metrics. For example, an I/O source may
>    provide a table of filenames+metadata for files which are old and blocking
>    the system. I will propose a general approach, similar to the URN+payload
>    approach used in the doc right now.
>
One thing to keep in mind -- even though it makes sense to allow each I/O
source to define its own monitoring state, this then shifts
responsibility for collecting that information to each runner and
displaying that information to every consumer. It would be reasonable to
see if there could be a set of 10 or so that covered most of the cases that
could become the "standard" set (e.g., watermark information, performance
information, etc.).


> I will rewrite most of the doc and propose separating these two very
> different use cases, one which optimizes for integration with existing
> monitoring systems. The other which optimizes for flexibility, allowing
> more complex and custom metrics formats for other debugging scenarios.
>
> I just wanted to give a brief update on the direction of this change,
> before writing it up in full detail.
>
>
> On Mon, Apr 16, 2018 at 10:36 AM Robert Bradshaw 
> wrote:
>
>> I agree that the user/system dichotomy is false; the real question is how
>> counters can be scoped to avoid accidental (or even intentional)
>> interference. A system that entirely controls the interaction between the
>> "user" (from its perspective) and the underlying system can do this by
>> prefixing all requested "user" counters with a prefix it will not use
>> itself. Of course this breaks down whenever the wrapping isn't complete
>> (either on the production or consumption side), but may be worth doing for
>> some components (like the SDKs that value being able to provide this
>> isolation for better behavior). Actual (human) end users are likely to be
>> much less careful about avoiding conflicts than library authors who in turn
>> are generally less careful than authors of the system itself.
>>
>> We could alternatively allow for specifying fully qualified URNs for
>> counter names in the SDK APIs, and letting "normal" user counters be in the
>> empty namespace rather than something like beam:metrics:{user,other,...},
>> perhaps with SDKs prohibiting certain conflicting prefixes (which is less
>> than ideal). A layer above the SDK that has similar absolute control over
>> its "users" would have a similar decision to make.
>>
>>
>> On Sat, Apr 14, 2018 at 4:00 PM Kenneth Knowles  wrote:
>>
>>> One reason I resist the user/system distinction is that Beam is a
>>> multi-party system with at least SDK, runner, and pipeline. Often there may
>>> be a DSL like SQL or Scio, or similarly someone may be building a platform
>>> for their company where there is no user authoring the pipeline. Should
>>> Scio, SQL, or MyCompanyFramework metrics end up in "user"? Who decides to
>>> tack on the prefix? It looks like it is the SDK harness? Are there just
>>> three namespaces "runner", "sdk", and "user"?  Most of what you'd think
>>> of as "user" versus "system" should simply be the difference between
>>> dynamically defined & typed metrics and fields in control plane protos. If
>>> that layer of the namespaces is not finite and limited, who can make
>>> a valid extension? Just some questions that I think would flesh out the
>>> meaning of the "user" prefix.
>>>
>>> Kenn
>>>
>>> On Fri, Apr 13, 2018 at 5:26 PM Andrea Foegler 
>>> wrote:
>>>


 On Fri, Apr 13, 2018 at 5:00 PM Robert Bradshaw 
 wrote:

> On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler 
> wrote:
>
>> Thanks, Robert!
>>
>> I think my lack of clarity is around the MetricSpec.  Maybe what's in
>> my head and what's being proposed are the same thing.  When I read that the
>> MetricSpec describes the proto structure, that sounds kind of complicated

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-17 Thread Alex Amato
Hello,

I just wanted to give an update.

After some discussion, I've realized that it's best to break up the two
concepts, with two separate ways of reporting monitoring data. These two
categories are:

   1. Metrics - Counters, Gauges, Distributions. These are well-defined
   concepts for monitoring information and need to integrate with existing
   metrics collection systems such as Dropwizard and Stackdriver. Most metrics
   will go through this model, which will allow runners to process new metrics
   without adding extra code to support them, forwarding them to metric
   collection systems.
   2. Monitoring State - This supports general monitoring data which may
   not fit into the standard model for Metrics. For example, an I/O source may
   provide a table of filenames+metadata for files which are old and blocking
   the system. I will propose a general approach, similar to the URN+payload
   approach used in the doc right now.
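
To make the split concrete, the two categories could be modeled roughly as
below. This is only a sketch under assumptions: the class names Metric and
MonitoringState, their fields, the forward() helper and the example URNs are
all made up for illustration and are not taken from the proposal doc.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Metric:
    """Category 1: a well-defined metric that a runner can forward generically."""
    urn: str                      # hypothetical, e.g. "beam:metric:element_count:v1"
    kind: str                     # "counter", "gauge" or "distribution"
    value: int
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class MonitoringState:
    """Category 2: free-form data identified by a URN plus an opaque payload."""
    urn: str
    payload: bytes


def forward(item: Union[Metric, MonitoringState]) -> str:
    """Illustrative runner logic: metrics need no per-URN code, state stays opaque."""
    if isinstance(item, Metric):
        return f"export {item.kind} {item.urn}={item.value}"
    return f"store {len(item.payload)} bytes for {item.urn}"
```

The point of the sketch is that forward() never inspects the URN of a Metric,
whereas a MonitoringState payload can only be interpreted by code that knows
its URN.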

I will rewrite most of the doc and propose separating these two very
different use cases: one which optimizes for integration with existing
monitoring systems, and the other which optimizes for flexibility, allowing
more complex and custom metrics formats for other debugging scenarios.

I just wanted to give a brief update on the direction of this change,
before writing it up in full detail.


On Mon, Apr 16, 2018 at 10:36 AM Robert Bradshaw 
wrote:

> I agree that the user/system dichotomy is false; the real question is how
> counters can be scoped to avoid accidental (or even intentional)
> interference. A system that entirely controls the interaction between the
> "user" (from its perspective) and the underlying system can do this by
> prefixing all requested "user" counters with a prefix it will not use
> itself. Of course this breaks down whenever the wrapping isn't complete
> (either on the production or consumption side), but may be worth doing for
> some components (like the SDKs that value being able to provide this
> isolation for better behavior). Actual (human) end users are likely to be
> much less careful about avoiding conflicts than library authors who in turn
> are generally less careful than authors of the system itself.
>
> We could alternatively allow for specifying fully qualified URNs for
> counter names in the SDK APIs, and letting "normal" user counters be in the
> empty namespace rather than something like beam:metrics:{user,other,...},
> perhaps with SDKs prohibiting certain conflicting prefixes (which is less
> than ideal). A layer above the SDK that has similar absolute control over
> its "users" would have a similar decision to make.
>
>
> On Sat, Apr 14, 2018 at 4:00 PM Kenneth Knowles  wrote:
>
>> One reason I resist the user/system distinction is that Beam is a
>> multi-party system with at least SDK, runner, and pipeline. Often there may
>> be a DSL like SQL or Scio, or similarly someone may be building a platform
>> for their company where there is no user authoring the pipeline. Should
>> Scio, SQL, or MyCompanyFramework metrics end up in "user"? Who decides to
>> tack on the prefix? It looks like it is the SDK harness? Are there just
>> three namespaces "runner", "sdk", and "user"?  Most of what you'd think
>> of as "user" versus "system" should simply be the difference between
>> dynamically defined & typed metrics and fields in control plane protos. If
>> that layer of the namespaces is not finite and limited, who can make
>> a valid extension? Just some questions that I think would flesh out the
>> meaning of the "user" prefix.
>>
>> Kenn
>>
>> On Fri, Apr 13, 2018 at 5:26 PM Andrea Foegler 
>> wrote:
>>
>>>
>>>
>>> On Fri, Apr 13, 2018 at 5:00 PM Robert Bradshaw 
>>> wrote:
>>>
 On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler 
 wrote:

> Thanks, Robert!
>
> I think my lack of clarity is around the MetricSpec.  Maybe what's in
> my head and what's being proposed are the same thing.  When I read that the
> MetricSpec describes the proto structure, that sounds kind of complicated to
> me.  But I may be misinterpreting it.  What I picture is something like a
> MetricSpec that looks like (note: my picture looks a lot like Stackdriver
> :):
>
> {
> name: "my_timer"
>

 name: "beam:metric:user:my_namespace:my_timer" (assuming we want to
 keep requiring namespaces). Or "beam:metric:[some non-user designation]"

>>>
>>> Sure. Looks good.
>>>
>>>

 labels: { "ptransform" }
>

 How does an SDK act on this information?

>>>
>>> The SDK is obligated to submit any metric values for that spec with a
>>> "ptransform" -> "transformName" in the labels field.  Autogenerating code
>>> from the spec to avoid typos should be easy.
>>>
>>>


> type: GAUGE
> value_type: int64
>

 I was lumping type and value_type into the same field, as a URN for
 possible extensibility, as they're tightly coupled (e.g. quantiles,
>

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-16 Thread Robert Bradshaw
I agree that the user/system dichotomy is false; the real question is how
counters can be scoped to avoid accidental (or even intentional)
interference. A system that entirely controls the interaction between the
"user" (from its perspective) and the underlying system can do this by
prefixing all requested "user" counters with a prefix it will not use
itself. Of course this breaks down whenever the wrapping isn't complete
(either on the production or consumption side), but may be worth doing for
some components (like the SDKs that value being able to provide this
isolation for better behavior). Actual (human) end users are likely to be
much less careful about avoiding conflicts than library authors who in turn
are generally less careful than authors of the system itself.

We could alternatively allow for specifying fully qualified URNs for
counter names in the SDK APIs, and letting "normal" user counters be in the
empty namespace rather than something like beam:metrics:{user,other,...},
perhaps with SDKs prohibiting certain conflicting prefixes (which is less
than ideal). A layer above the SDK that has similar absolute control over
its "users" would have a similar decision to make.
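
A minimal sketch of the prefixing idea, assuming the "beam:metric:user:"
prefix that appears in examples elsewhere in this thread. The helper names
user_counter_urn and is_user_counter are hypothetical, not an existing SDK
API.

```python
USER_PREFIX = "beam:metric:user:"  # assumed prefix; the system never uses it itself


def user_counter_urn(namespace: str, name: str) -> str:
    """Build the URN an SDK would register for a user-requested counter.

    Components may not be empty or contain ':' so that the resulting URN
    can always be split back into namespace and name unambiguously.
    """
    for part in (namespace, name):
        if not part or ":" in part:
            raise ValueError(f"invalid counter name component: {part!r}")
    return f"{USER_PREFIX}{namespace}:{name}"


def is_user_counter(urn: str) -> bool:
    """True if the URN lives in the reserved user prefix."""
    return urn.startswith(USER_PREFIX)
```

Because every user-requested name is wrapped this way, a user counter can
never collide with a system URN, as long as the wrapping is applied on every
path that creates counters.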


On Sat, Apr 14, 2018 at 4:00 PM Kenneth Knowles  wrote:

> One reason I resist the user/system distinction is that Beam is a
> multi-party system with at least SDK, runner, and pipeline. Often there may
> be a DSL like SQL or Scio, or similarly someone may be building a platform
> for their company where there is no user authoring the pipeline. Should
> Scio, SQL, or MyCompanyFramework metrics end up in "user"? Who decides to
> tack on the prefix? It looks like it is the SDK harness? Are there just
> three namespaces "runner", "sdk", and "user"?  Most of what you'd think
> of as "user" versus "system" should simply be the difference between
> dynamically defined & typed metrics and fields in control plane protos. If
> that layer of the namespaces is not finite and limited, who can make
> a valid extension? Just some questions that I think would flesh out the
> meaning of the "user" prefix.
>
> Kenn
>
> On Fri, Apr 13, 2018 at 5:26 PM Andrea Foegler  wrote:
>
>>
>>
>> On Fri, Apr 13, 2018 at 5:00 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler 
>>> wrote:
>>>
 Thanks, Robert!

 I think my lack of clarity is around the MetricSpec.  Maybe what's in
 my head and what's being proposed are the same thing.  When I read that the
 MetricSpec describes the proto structure, that sounds kind of complicated to
 me.  But I may be misinterpreting it.  What I picture is something like a
 MetricSpec that looks like (note: my picture looks a lot like Stackdriver
 :):

 {
 name: "my_timer"

>>>
>>> name: "beam:metric:user:my_namespace:my_timer" (assuming we want to keep
>>> requiring namespaces). Or "beam:metric:[some non-user designation]"
>>>
>>
>> Sure. Looks good.
>>
>>
>>>
>>> labels: { "ptransform" }

>>>
>>> How does an SDK act on this information?
>>>
>>
>> The SDK is obligated to submit any metric values for that spec with a
>> "ptransform" -> "transformName" in the labels field.  Autogenerating code
>> from the spec to avoid typos should be easy.
>>
>>
>>>
>>>
 type: GAUGE
 value_type: int64

>>>
>>> I was lumping type and value_type into the same field, as a URN for
>>> possible extensibility, as they're tightly coupled (e.g. quantiles,
>>> distributions).
>>>
>>
>> My inclination is that keeping this set relatively small and fixed to a
>> set that can be readily exported to external monitoring systems is more
>> useful than the added indirection to support extensibility.  Lumping
>> together seems reasonable.
>>
>>
>>>
>>>
 units: SECONDS
 description: "Times my stuff"

>>>
>>> Are both of these optional metadata, in the form of a key-value field, or
>>> flattened into the field itself (along with every other kind of metadata
>>> you may want to attach)?
>>>
>>
>> Optional metadata in the form of fixed fields.  Is there a use case for
>> arbitrary metadata?  What would you do with it when exporting?
>>
>>
>>>
>>>
 }

 Then metrics submitted would look like:
 {
 name: "my_timer"
 labels: {"ptransform": "MyTransform"}
 int_value: 100
 }

>>>
>>> Yes, or value could be a bytes field that is encoded according to
>>> [value_]type above, if we want that extensibility (e.g. if we want to
>>> bundle the pardo sub-timings together, we'd need a proto for the value, but
>>> that seems too specific to hard-code into the basic structure).
>>>
>>>
>> The simplicity comes from the fact that there's only one proto format
 for the spec and for the value.  The only thing that varies are the entries
 in the map and the value field set.  It's pretty easy to establish
 contracts around this type of spec and even generate protos for use in the
 SDK that make the expectations explicit.

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-14 Thread Kenneth Knowles
One reason I resist the user/system distinction is that Beam is a
multi-party system with at least SDK, runner, and pipeline. Often there may
be a DSL like SQL or Scio, or similarly someone may be building a platform
for their company where there is no user authoring the pipeline. Should
Scio, SQL, or MyCompanyFramework metrics end up in "user"? Who decides to
tack on the prefix? It looks like it is the SDK harness? Are there just
three namespaces "runner", "sdk", and "user"?  Most of what you'd think of
as "user" versus "system" should simply be the difference between
dynamically defined & typed metrics and fields in control plane protos. If
that layer of the namespaces is not finite and limited, who can make
a valid extension? Just some questions that I think would flesh out the
meaning of the "user" prefix.

Kenn

On Fri, Apr 13, 2018 at 5:26 PM Andrea Foegler  wrote:

>
>
> On Fri, Apr 13, 2018 at 5:00 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler 
>> wrote:
>>
>>> Thanks, Robert!
>>>
>>> I think my lack of clarity is around the MetricSpec.  Maybe what's in my
>>> head and what's being proposed are the same thing.  When I read that the
>>> MetricSpec describes the proto structure, that sounds kind of complicated to
>>> me.  But I may be misinterpreting it.  What I picture is something like a
>>> MetricSpec that looks like (note: my picture looks a lot like Stackdriver
>>> :):
>>>
>>> {
>>> name: "my_timer"
>>>
>>
>> name: "beam:metric:user:my_namespace:my_timer" (assuming we want to keep
>> requiring namespaces). Or "beam:metric:[some non-user designation]"
>>
>
> Sure. Looks good.
>
>
>>
>> labels: { "ptransform" }
>>>
>>
>> How does an SDK act on this information?
>>
>
> The SDK is obligated to submit any metric values for that spec with a
> "ptransform" -> "transformName" in the labels field.  Autogenerating code
> from the spec to avoid typos should be easy.
>
>
>>
>>
>>> type: GAUGE
>>> value_type: int64
>>>
>>
>> I was lumping type and value_type into the same field, as a URN for
>> possible extensibility, as they're tightly coupled (e.g. quantiles,
>> distributions).
>>
>
> My inclination is that keeping this set relatively small and fixed to a
> set that can be readily exported to external monitoring systems is more
> useful than the added indirection to support extensibility.  Lumping
> together seems reasonable.
>
>
>>
>>
>>> units: SECONDS
>>> description: "Times my stuff"
>>>
>>
>> Are both of these optional metadata, in the form of a key-value field, or
>> flattened into the field itself (along with every other kind of metadata
>> you may want to attach)?
>>
>
> Optional metadata in the form of fixed fields.  Is there a use case for
> arbitrary metadata?  What would you do with it when exporting?
>
>
>>
>>
>>> }
>>>
>>> Then metrics submitted would look like:
>>> {
>>> name: "my_timer"
>>> labels: {"ptransform": "MyTransform"}
>>> int_value: 100
>>> }
>>>
>>
>> Yes, or value could be a bytes field that is encoded according to
>> [value_]type above, if we want that extensibility (e.g. if we want to
>> bundle the pardo sub-timings together, we'd need a proto for the value, but
>> that seems too specific to hard-code into the basic structure).
>>
>>
>>> The simplicity comes from the fact that there's only one proto format for
>>> the spec and for the value.  The only thing that varies are the entries in
>>> the map and the value field set.  It's pretty easy to establish contracts
>>> around this type of spec and even generate protos for use in the SDK that
>>> make the expectations explicit.
>>>
>>>
>>> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw 
>>> wrote:
>>>
 On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles  wrote:

>
> Or just "beam:counter::" or even
> "beam:metric::" since metrics have a type separate from
> their name.
>

 I proposed keeping the "user" in there to avoid possible clashes with
 the system namespaces. (No preference on counter vs. metric, I wasn't
 trying to imply counter = SumInts)


 On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler 
 wrote:

> I like the generalization from entity -> labels.  I view the purpose
> of those fields to provide context.  And labels feel like they support a
> richer set of contexts.
>

 If we think such a generalization provides value, I'm fine with doing
 that now, as sets or key-value maps, if we have good enough examples to
 justify this.


> The URN concept gets a little tricky.  I totally agree that the
> context fields should not be embedded in the name.
> There's a "name" which is the identifier that can be used to
> communicate what context values are supported / allowed for metrics with
> that name (for example, element_count expects a ptransform ID).  But then
> there's the context.  In Stackdriver, this context is a map of key-value
> pairs; the type is consider

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Andrea Foegler
On Fri, Apr 13, 2018 at 5:00 PM Robert Bradshaw  wrote:

> On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler  wrote:
>
>> Thanks, Robert!
>>
>> I think my lack of clarity is around the MetricSpec.  Maybe what's in my
>> head and what's being proposed are the same thing.  When I read that the
>> MetricSpec describes the proto structure, that sounds kind of complicated to
>> me.  But I may be misinterpreting it.  What I picture is something like a
>> MetricSpec that looks like (note: my picture looks a lot like Stackdriver
>> :):
>>
>> {
>> name: "my_timer"
>>
>
> name: "beam:metric:user:my_namespace:my_timer" (assuming we want to keep
> requiring namespaces). Or "beam:metric:[some non-user designation]"
>

Sure. Looks good.


>
> labels: { "ptransform" }
>>
>
> How does an SDK act on this information?
>

The SDK is obligated to submit any metric values for that spec with a
"ptransform" -> "transformName" in the labels field.  Autogenerating code
from the spec to avoid typos should be easy.
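
As a sketch of what that obligation could look like in code, assuming a spec
shaped like the example quoted in this message: the MetricSpec class and the
check_labels helper below are hypothetical, and the field names simply mirror
the quoted example rather than any actual Beam proto.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class MetricSpec:
    name: str
    labels: FrozenSet[str]   # label keys every reported value must carry
    type: str                # e.g. "GAUGE"
    value_type: str          # e.g. "int64"


def check_labels(spec: MetricSpec, labels: Dict[str, str]) -> None:
    """Raise ValueError unless the reported labels match the spec exactly."""
    reported = set(labels)
    missing = spec.labels - reported
    extra = reported - spec.labels
    if missing or extra:
        raise ValueError(
            f"{spec.name}: missing={sorted(missing)} extra={sorted(extra)}"
        )
```

For example, a value for "my_timer" reported with
{"ptransform": "MyTransform"} passes the check, while a value reported with
no labels is rejected. Generated code would make this kind of check
unnecessary for SDK-defined metrics, but it is still useful for anything
defined at runtime.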


>
>
>> type: GAUGE
>> value_type: int64
>>
>
> I was lumping type and value_type into the same field, as a URN for
> possible extensibility, as they're tightly coupled (e.g. quantiles,
> distributions).
>

My inclination is that keeping this set relatively small and fixed to a set
that can be readily exported to external monitoring systems is more useful
than the added indirection to support extensibility.  Lumping together
seems reasonable.


>
>
>> units: SECONDS
>> description: "Times my stuff"
>>
>
> Are both of these optional metadata, in the form of a key-value field, or
> flattened into the field itself (along with every other kind of metadata
> you may want to attach)?
>

Optional metadata in the form of fixed fields.  Is there a use case for
arbitrary metadata?  What would you do with it when exporting?


>
>
>> }
>>
>> Then metrics submitted would look like:
>> {
>> name: "my_timer"
>> labels: {"ptransform": "MyTransform"}
>> int_value: 100
>> }
>>
>
> Yes, or value could be a bytes field that is encoded according to
> [value_]type above, if we want that extensibility (e.g. if we want to
> bundle the pardo sub-timings together, we'd need a proto for the value, but
> that seems too specific to hard-code into the basic structure).
>
>
>> The simplicity comes from the fact that there's only one proto format for
>> the spec and for the value.  The only thing that varies are the entries in
>> the map and the value field set.  It's pretty easy to establish contracts
>> around this type of spec and even generate protos for use in the SDK that
>> make the expectations explicit.
>>
>>
>> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles  wrote:
>>>

 Or just "beam:counter::" or even
 "beam:metric::" since metrics have a type separate from
 their name.

>>>
>>> I proposed keeping the "user" in there to avoid possible clashes with
>>> the system namespaces. (No preference on counter vs. metric, I wasn't
>>> trying to imply counter = SumInts)
>>>
>>>
>>> On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler 
>>> wrote:
>>>
 I like the generalization from entity -> labels.  I view the purpose of
 those fields to provide context.  And labels feel like they support a
 richer set of contexts.

>>>
>>> If we think such a generalization provides value, I'm fine with doing
>>> that now, as sets or key-value maps, if we have good enough examples to
>>> justify this.
>>>
>>>
 The URN concept gets a little tricky.  I totally agree that the context
 fields should not be embedded in the name.
 There's a "name" which is the identifier that can be used to
 communicate what context values are supported / allowed for metrics with
 that name (for example, element_count expects a ptransform ID).  But then
 there's the context.  In Stackdriver, this context is a map of key-value
 pairs; the type is considered metadata associated with the name, but not
 communicated with the value.

>>>
>>> I'm not quite following you here. If context contains a ptransform id,
>>> then it cannot be associated with a single name.
>>>
>>>
 Could the URN be "beam:namespace:name" and every metric have a map of
 key-value pairs for context?

>>>
>>> The URN is the name. Something like
>>> "beam:metric:ptransform_execution_times:v1."
>>>
>>>
 Not sure where this fits in the discussion or if this is handled
 somewhere, but allowing for a metric configuration that's provided
 independently of the value allows for configuring "type", "units", etc in a
 uniform way without having to encode them in the metric name / value.
 Stackdriver expects each metric type has been configured ahead of time with
 these annotations / metadata.  Then values are reported separately.  For
 system metrics, the definitions can be packaged with the SDK.  For user
 metrics, they'd be defined at runtime.

>>>
>>> This feels 

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2018 at 4:30 PM Alex Amato  wrote:

> There are a few more confusing concepts in this thread
> *Name*
>
>- Name can mean a *"string name"* used to refer to a metric in a
>metrics system such as Stackdriver, e.g. "ElementCount", "ExecutionTime".
>- Name can mean a set of *context* fields added to a counter, either
>embedded in a complex string, or in a structured name. These typically
>refer to *aggregation entities*, which define how the metric updates get
>aggregated into final metric values, i.e. all Metric updates with the same
>field are aggregated together.
>   - e.g. my_ptransform_id-ElementCount
>   - e.g. { name : 'ElementCount', 'ptransform_name' :
>   'my_ptransform_id' }
>- The *URN* of a Metric, which identifies a proto to use in a payload
>field for the Metric and MetricSpec. Note: the string name can literally
>be the URN value in most cases, except for metrics which can specify a
>separate name (i.e. user counters).
>
> @Robert,
> You have proposed that metrics should contain the following parts, I still
> don't fully understand what you mean by each one.
>
>- Name - Why is a name a URN + bytes payload? What type of name are
>you referring to, *string name*? *context*? *URN*? Or something else.
>
As you say above, the URN can literally be the string name. I see no
reason why this can't be the case for user counters as well (the user
counter name becoming part of the urn). The payload, should we decide to
keep it, is "part" of the name because it helps identify what exactly we're
counting. I.e. {urnX, payload1} would be distinct from {urnX, payload2}.
The only reason to have a payload is to avoid sticking stuff that would be
ugly to parse into the URN.
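
To illustrate the identity rule (a sketch only, with hypothetical helper
names and made-up URNs), the name can be treated as the pair of URN and
payload, so that updates are only summed together when both match:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MetricName = Tuple[str, bytes]  # (urn, payload); the payload may be empty


def sum_updates(updates: Iterable[Tuple[str, bytes, int]]) -> Dict[MetricName, int]:
    """Sum integer updates, keeping distinct (urn, payload) pairs apart."""
    totals: Dict[MetricName, int] = defaultdict(int)
    for urn, payload, value in updates:
        totals[(urn, payload)] += value
    return dict(totals)
```

With this keying, {urnX, payload1} and {urnX, payload2} accumulate
separately, while two updates for {urnX, payload1} are combined.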

>
>- Entity - This is how the metric is aggregated together. If I
>understand you correctly. And you correctly point out that a singular
>entity is not sufficient, a set of labels may be more appropriate.
>
Alternatively, the entity/labels specify possible sub-partitions of the
metric identified by its name (as above).

>
>- Value - *Are you saying this is just the metric value, not including
>any fields related to entity or name.*
>
Exactly. Like "5077." For some types it would be composite. The type also
indicates how it's encoded (e.g. as bytes, or which field of a oneof should
be populated).

>
>- Type - I am not clear at all on what this is or what it would look
>like. Are you referring to units, like milliseconds/seconds? Why it
>wouldn't be part of the value payload. Is this some sort of reason to
>separate it out from the value? What if the value has multiple fields for
>example.
>
Type would be "beam:metric_type:sum:ints" or
"beam:metric_type:distribution:doubles." We could separate "data type" from
"aggregation type" if desired, though of course the full cross-product
doesn't make sense. We could put the unit in the type (e.g. sum_durations
!= sum_ints), but, preferably, I'd put this as metadata on the counter
spec. It is often fully determined by the URN, but provided so one can
reason about the metric without having to interpret the URN. It also means
we don't have to have a separate URN for each user metric type. (In fact,
any metric the runner doesn't understand would be treated as a user metric,
and aggregated as such if it understands the type.)
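
A rough sketch of how a runner might aggregate by type alone, without
knowing the metric's URN. The two type URNs are the ones named above; the
AGGREGATORS table, the aggregate() helper and the shape of the distribution
result are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Aggregation functions keyed by type URN, not by metric URN.
AGGREGATORS: Dict[str, Callable[[List], object]] = {
    "beam:metric_type:sum:ints": lambda values: sum(values),
    "beam:metric_type:distribution:doubles": lambda values: {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    },
}


def aggregate(type_urn: str, values: List):
    """Combine the values of one metric using only its type URN."""
    try:
        combine = AGGREGATORS[type_urn]
    except KeyError:
        raise ValueError(f"unknown metric type: {type_urn}") from None
    return combine(values)
```

A metric with an unfamiliar URN but a known type can still be combined this
way; only an unknown type forces the runner to give up.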

> Some pros and cons as I see them
> Pros:
>
>- More separation and flexibility for an SDK to specify labels
>separately from the value/type. Though, maybe I don't understand enough,
>and I am not so sure this is a con over just having the URN payload contain
>everything in itself.
>
We can't interpret a URN payload unless we know the URN. Separating things
out allows us to act on metrics without interpreting the URN (both for
unknown URNs, and simplifying the logic by not having to do lookups on the
URN everywhere).


> Cons:
>
>- I think this means that the SDK must properly pick two separate
>payloads and populate them correctly. We can run into issues where.
>   - Having one URN which specifies all the fields you would need to
>   populate for a specific metric avoids this; this was a concern brought
>   up by Luke. The runner would then be responsible for packaging metrics
>   up to send to external monitoring systems.
>
I'm not following you here. We'd return exactly what Andrea suggested.


>
> @Andrea, please correct me if I misunderstand
> Thank you for the metric spec example in your last response, I think that
> makes the idea much more clear.
>
> Using your approach I see the following pros and cons
> Pros:
>
>- Runners have a cleaner more reusable codepath to forwarding metrics
>to external monitoring systems. This will mean less work on the runner side
>to support each metric (perhaps none in many cases).
>- SDKs may need less code as well to package up new metrics.
>- As long as

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw
On Fri, Apr 13, 2018 at 3:28 PM Andrea Foegler  wrote:

> Thanks, Robert!
>
> I think my lack of clarity is around the MetricSpec.  Maybe what's in my
> head and what's being proposed are the same thing.  When I read that the
> MetricSpec describes the proto structure, that sounds kind of complicated to
> me.  But I may be misinterpreting it.  What I picture is something like a
> MetricSpec that looks like (note: my picture looks a lot like Stackdriver
> :):
>
> {
> name: "my_timer"
>

name: "beam:metric:user:my_namespace:my_timer" (assuming we want to keep
requiring namespaces). Or "beam:metric:[some non-user designation]"

labels: { "ptransform" }
>

How does an SDK act on this information?


> type: GAUGE
> value_type: int64
>

I was lumping type and value_type into the same field, as a URN for
possible extensibility, as they're tightly coupled (e.g. quantiles,
distributions).


> units: SECONDS
> description: "Times my stuff"
>

Are both of these optional metadata, in the form of a key-value field, or
flattened into the field itself (along with every other kind of metadata
you may want to attach)?


> }
>
> Then metrics submitted would look like:
> {
> name: "my_timer"
> labels: {"ptransform": "MyTransform"}
> int_value: 100
> }
>

Yes, or value could be a bytes field that is encoded according to
[value_]type above, if we want that extensibility (e.g. if we want to
bundle the pardo sub-timings together, we'd need a proto for the value, but
that seems too specific to hard-code into the basic structure).
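
To make the bytes-valued variant concrete, here is a rough sketch. Everything in it is an assumption for illustration: the type URNs, the class and function names, and the big-endian int64 encodings are invented here and are not taken from the design doc.

```python
import struct
from dataclasses import dataclass

# Hypothetical type URNs, invented for this sketch.
SUM_INT64 = "beam:metric_type:sum_int64:v1"
DISTRIBUTION_INT64 = "beam:metric_type:distribution_int64:v1"

# One (encode, decode) pair per type URN; the encodings are assumptions.
CODECS = {
    # A single big-endian int64.
    SUM_INT64: (
        lambda v: struct.pack(">q", v),
        lambda b: struct.unpack(">q", b)[0],
    ),
    # (count, sum, min, max) as four big-endian int64s.
    DISTRIBUTION_INT64: (
        lambda v: struct.pack(">4q", *v),
        lambda b: struct.unpack(">4q", b),
    ),
}


@dataclass
class MetricSpec:
    name: str          # the metric URN
    type_urn: str      # value type and aggregation, lumped into one URN
    label_keys: tuple  # label keys an update must carry, e.g. ("ptransform",)


@dataclass
class MetricUpdate:
    name: str
    labels: dict
    value: bytes       # opaque unless the reader knows the spec's type_urn


def encode_update(spec, labels, value):
    """Check the labels against the spec and encode the value per its type."""
    if set(labels) != set(spec.label_keys):
        raise ValueError(f"{spec.name} expects labels {spec.label_keys}")
    encode, _ = CODECS[spec.type_urn]
    return MetricUpdate(spec.name, dict(labels), encode(value))


def decode_value(spec, update):
    """Decode an update's payload using the type URN from its spec."""
    _, decode = CODECS[spec.type_urn]
    return decode(update.value)
```

The update itself stays the same shape for every metric; only the spec says how to read the payload, which is the extensibility trade-off versus a fixed int_value field.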


> The simplicity coming from the fact that there's only one proto format for
> the spec and for the value.  The only thing that varies are the entries in
> the map and the value field set.  It's pretty easy to establish contracts
> around this type of spec and even generate protos for use in the SDK that
> make the expectations explicit.
>
>
> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles  wrote:
>>
>>>
>>> Or just "beam:counter::" or even
>>> "beam:metric::" since metrics have a type separate from
>>> their name.
>>>
>>
>> I proposed keeping the "user" in there to avoid possible clashes with the
>> system namespaces. (No preference on counter vs. metric, I wasn't trying to
>> imply counter = SumInts)
>>
>>
>> On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler 
>> wrote:
>>
>>> I like the generalization from entity -> labels.  I view the purpose of
>>> those fields to provide context.  And labels feel like they support a
>>> richer set of contexts.
>>>
>>
>> If we think such a generalization provides value, I'm fine with doing
>> that now, as sets or key-value maps, if we have good enough examples to
>> justify this.
>>
>>
>>> The URN concept gets a little tricky.  I totally agree that the context
>>> fields should not be embedded in the name.
>>> There's a "name" which is the identifier that can be used to communicate
>>> what context values are supported / allowed for metrics with that name (for
>>> example, element_count expects a ptransform ID).  But then there's the
>>> context.  In Stackdriver, this context is a map of key-value pairs; the
>>> type is considered metadata associated with the name, but not communicated
>>> with the value.
>>>
>>
>> I'm not quite following you here. If context contains a ptransform id,
>> then it cannot be associated with a single name.
>>
>>
>>> Could the URN be "beam:namespace:name" and every metric have a map of
>>> key-value pairs for context?
>>>
>>
>> The URN is the name. Something like
>> "beam:metric:ptransform_execution_times:v1."
>>
>>
>>> Not sure where this fits in the discussion or if this is handled
>>> somewhere, but allowing for a metric configuration that's provided
>>> independently of the value allows for configuring "type", "units", etc in a
>>> uniform way without having to encode them in the metric name / value.
>>> Stackdriver expects each metric type has been configured ahead of time with
>>> these annotations / metadata.  Then values are reported separately.  For
>>> system metrics, the definitions can be packaged with the SDK.  For user
>>> metrics, they'd be defined at runtime.
>>>
>>
>> This feels like the metrics spec, that specifies that the metric with
>> name/URN X has this type plus a bunch of other metadata (e.g. units, if
>> they're not implicit in the type? This gets into whether the type should be
>> Duration{Sum,Max,Distribution,...} vs. Int{Sum,Max,Distribution,...} +
>> units metadata).
>>
>
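
For what it's worth, here is a small sketch of why a type URN in the spec helps the runner: it can combine updates for metrics whose meaning it does not know, keyed by name plus labels. The type URNs and the function name are hypothetical, and plain ints stand in for encoded payloads.

```python
# Hypothetical type URNs mapped to the combiner a runner would apply.
COMBINERS = {
    "beam:metric_type:sum_int64:v1": lambda a, b: a + b,
    "beam:metric_type:max_int64:v1": max,
}


def aggregate(specs, updates):
    """Combine updates per (name, labels) using only the spec's type URN.

    specs:   dict mapping metric name -> type URN
    updates: iterable of (name, labels_dict, int_value)
    """
    totals = {}
    for name, labels, value in updates:
        combine = COMBINERS[specs[name]]
        # Sort label items so the key does not depend on dict ordering.
        key = (name, tuple(sorted(labels.items())))
        totals[key] = combine(totals[key], value) if key in totals else value
    return totals
```

Nothing in aggregate() depends on what "my_timer" or "element_count" means; supporting a new metric of an existing type needs no runner change.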


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Andrea Foegler
:25 AM Kenneth Knowles 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Agree with all of this. It echoes a thread on the doc that I
>>>>>>>>>>>>>> was going to bring here. Let's keep it simple and use concrete 
>>>>>>>>>>>>>> use cases to
>>>>>>>>>>>>>> drive additional abstraction if/when it becomes compelling.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 12, 2018 at 9:21 AM Ben Chambers <
>>>>>>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sounds perfect. Just wanted to make sure that "custom
>>>>>>>>>>>>>>> metrics of supported type" didn't include new ways of
>>>>>>>>>>>>>>> aggregating ints. As long as that means we have a fixed set
>>>>>>>>>>>>>>> of aggregations (that align with what users want and metrics
>>>>>>>>>>>>>>> backends support) it seems like we are doing user metrics
>>>>>>>>>>>>>>> right.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 11:30 PM Romain Manni-Bucau <
>>>>>>>>>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe leave it out until proven it is needed. ATM counters
>>>>>>>>>>>>>>>> are used a lot but others are less mainstream so being too 
>>>>>>>>>>>>>>>> fine from the
>>>>>>>>>>>>>>>> start can just add complexity and bugs in impls IMHO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Le 12 avr. 2018 08:06, "Robert Bradshaw" <
>>>>>>>>>>>>>>>> rober...@google.com> a écrit :
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> By "type" of metric, I mean both the data types (including
>>>>>>>>>>>>>>>>> their encoding) and accumulator strategy. So sumint would be 
>>>>>>>>>>>>>>>>> a type, as
>>>>>>>>>>>>>>>>> would double-distribution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers <
>>>>>>>>>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When you say type do you mean accumulator type, result
>>>>>>>>>>>>>>>>>> type, or accumulator strategy? Specifically, what is the 
>>>>>>>>>>>>>>>>>> "type" of sumint,
>>>>>>>>>>>>>>>>>> sumlong, meanlong, etc?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw <
>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Fully custom metric types is the "more speculative and
>>>>>>>>>>>>>>>>>>> difficult" feature that I was proposing we kick down the 
>>>>>>>>>>>>>>>>>>> road (and may
>>>>>>>>>>>>>>>>>>> never get to). What I'm suggesting is that we support 
>>>>>>>>>>>>>>>>>>> custom metrics of
>>>>>>>>>>>>>>>>>>> standard type.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The metric api is designed to prevent user defined
>>>>>>>>>>>>>>>>>>>> metric types based on the fact they just weren't used 
>>>>>>>>>>>>>>>>>>>> enough to justify
>>>>>>>>>>>>>>>>>>>> support.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set
>>>>>>>>>>>>>>>>>>>> plus any special system metrics?
>>>>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> One thing that has occurred to me is that we're
>>>>>>>>>>>>>>>>>>>>> conflating the idea of custom metrics and custom metric 
>>>>>>>>>>>>>>>>>>>>> types. I would
>>>>>>>>>>>>>>>>>>>>> propose the MetricSpec field be augmented with an 
>>>>>>>>>>>>>>>>>>>>> additional field "type"
>>>>>>>>>>>>>>>>>>>>> which is a urn specifying the type of metric it is (i.e. 
>>>>>>>>>>>>>>>>>>>>> the contents of
>>>>>>>>>>>>>>>>>>>>> its payload, as well as the form of aggregation). Summing 
>>>>>>>>>>>>>>>>>>>>> or maxing over
>>>>>>>>>>>>>>>>>>>>> ints would be a typical example. Though we could pursue 
>>>>>>>>>>>>>>>>>>>>> making this opaque
>>>>>>>>>>>>>>>>>>>>> to the runner in the long run, that's a more speculative 
>>>>>>>>>>>>>>>>>>>>> (and difficult)
>>>>>>>>>>>>>>>>>>>>> feature to tackle. This would allow the runner to at 
>>>>>>>>>>>>>>>>>>>>> least aggregate and
>>>>>>>>>>>>>>>>>>>>> report/return to the SDK metrics that it did not itself 
>>>>>>>>>>>>>>>>>>>>> understand the
>>>>>>>>>>>>>>>>>>>>> semantic meaning of. (It would probably simplify much of 
>>>>>>>>>>>>>>>>>>>>> the specialization
>>>>>>>>>>>>>>>>>>>>> in the runner itself for metrics that it *did* understand 
>>>>>>>>>>>>>>>>>>>>> as well.)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for
>>>>>>>>>>>>>>>>>>>>> every type X, one would have a single URN for UserMetric,
>>>>>>>>>>>>>>>>>>>>> and its spec would designate the type and its payload
>>>>>>>>>>>>>>>>>>>>> would designate the (qualified) name.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato <
>>>>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>>>>>>>>>>>>> I have made a revision today which is to make all
>>>>>>>>>>>>>>>>>>>>>> metrics refer to a primary entity, so I have 
>>>>>>>>>>>>>>>>>>>>>> restructured some of the
>>>>>>>>>>>>>>>>>>>>>> protos a little bit.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The point of this change was to futureproof the
>>>>>>>>>>>>>>>>>>>>>> possibility of allowing custom user metrics, with custom 
>>>>>>>>>>>>>>>>>>>>>> aggregation
>>>>>>>>>>>>>>>>>>>>>> functions for its metric updates.
>>>>>>>>>>>>>>>>>>>>>> Now that each metric has an aggregation_entity
>>>>>>>>>>>>>>>>>>>>>> associated with it (e.g. PCollection, PTransform), we 
>>>>>>>>>>>>>>>>>>>>>> can design an
>>>>>>>>>>>>>>>>>>>>>> approach which forwards the opaque bytes metric updates, 
>>>>>>>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>>>>>>> deserializing them. These are forwarded to user provided 
>>>>>>>>>>>>>>>>>>>>>> code which then
>>>>>>>>>>>>>>>>>>>>>> would deserialize the metric update payloads and perform 
>>>>>>>>>>>>>>>>>>>>>> the custom
>>>>>>>>>>>>>>>>>>>>>> aggregations.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I think it has also simplified some of the URN metric
>>>>>>>>>>>>>>>>>>>>>> protos, as they do not need to keep track of ptransform 
>>>>>>>>>>>>>>>>>>>>>> names inside
>>>>>>>>>>>>>>>>>>>>>> themselves now. The result is simpler structures for the
>>>>>>>>>>>>>>>>>>>>>> metrics, as the entities are pulled outside of the metric.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have mentioned this in the doc now, and wanted to
>>>>>>>>>>>>>>>>>>>>>> draw attention to this particular revision.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato <
>>>>>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've gathered a lot of feedback so far and want to
>>>>>>>>>>>>>>>>>>>>>>> make a decision by Friday, and begin working on related 
>>>>>>>>>>>>>>>>>>>>>>> PRs next week.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Please make sure that you provide your feedback
>>>>>>>>>>>>>>>>>>>>>>> before then and I will post the final decisions made to 
>>>>>>>>>>>>>>>>>>>>>>> this thread Friday
>>>>>>>>>>>>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía <
>>>>>>>>>>>>>>>>>>>>>>> ieme...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Nice, I created a short link so people can refer to
>>>>>>>>>>>>>>>>>>>>>>>> it easily in
>>>>>>>>>>>>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> > Thanks for the nice writeup. I added some
>>>>>>>>>>>>>>>>>>>>>>>> comments.
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on
>>>>>>>>>>>>>>>>>>>>>>>> this proposal so far. I
>>>>>>>>>>>>>>>>>>>>>>>> >> have made some revisions based on the feedback.
>>>>>>>>>>>>>>>>>>>>>>>> There were some larger
>>>>>>>>>>>>>>>>>>>>>>>> >> questions asking about alternatives. For each of
>>>>>>>>>>>>>>>>>>>>>>>> these I have added a
>>>>>>>>>>>>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed
>>>>>>>>>>>>>>>>>>>>>>>> my recommendation as well
>>>>>>>>>>>>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> I would appreciate more feedback on the revised
>>>>>>>>>>>>>>>>>>>>>>>> proposal. Please take
>>>>>>>>>>>>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could
>>>>>>>>>>>>>>>>>>>>>>>> please take another look after
>>>>>>>>>>>>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Alex Amato
>>>>>>> they'll be 100% redundant for all entities for a given metric
>>>>>>> convinces me that it's not worth creating and tracking an enum for the
>>>>>>> type alongside the id.
>>>>>>>
>>>>>>>
>>>>>>>> *}*
>>>>>>>>
>>>>>>>> On Fri, Apr 13, 2018 at 9:14 AM Robert Bradshaw <
>>>>>>>> rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Apr 13, 2018 at 8:31 AM Kenneth Knowles 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> To Robert's proto:
>>>>>>>>>>
>>>>>>>>>>  // A mapping of entities to (encoded) values.
>>>>>>>>>>>  map values;
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are the keys here the names of the metrics, aka what is used for
>>>>>>>>>> URNs in the doc?
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> They're the entities to which a metric is attached, e.g. a
>>>>>>>>> PTransform, a PCollection, or perhaps a process/worker.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>>

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Andrea Foegler


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Andrea Foegler
>>>>>>>>>>>>>> By "type" of metric, I mean both the data types (including
>>>>>>>>>>>>>> their encoding) and accumulator strategy. So sumint would be a 
>>>>>>>>>>>>>> type, as
>>>>>>>>>>>>>> would double-distribution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers <
>>>>>>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When you say type do you mean accumulator type, result type,
>>>>>>>>>>>>>>> or accumulator strategy? Specifically, what is the "type" of 
>>>>>>>>>>>>>>> sumint,
>>>>>>>>>>>>>>> sumlong, meanlong, etc?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw <
>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fully custom metric types is the "more speculative and
>>>>>>>>>>>>>>>> difficult" feature that I was proposing we kick down the road 
>>>>>>>>>>>>>>>> (and may
>>>>>>>>>>>>>>>> never get to). What I'm suggesting is that we support custom 
>>>>>>>>>>>>>>>> metrics of
>>>>>>>>>>>>>>>> standard type.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The metric api is designed to prevent user defined metric
>>>>>>>>>>>>>>>>> types based on the fact they just weren't used enough to 
>>>>>>>>>>>>>>>>> justify support.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus
>>>>>>>>>>>>>>>>> any special system metrics?
>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing that has occurred to me is that we're
>>>>>>>>>>>>>>>>>> conflating the idea of custom metrics and custom metric 
>>>>>>>>>>>>>>>>>> types. I would
>>>>>>>>>>>>>>>>>> propose the MetricSpec field be augmented with an additional 
>>>>>>>>>>>>>>>>>> field "type"
>>>>>>>>>>>>>>>>>> which is a urn specifying the type of metric it is (i.e. the 
>>>>>>>>>>>>>>>>>> contents of
>>>>>>>>>>>>>>>>>> its payload, as well as the form of aggregation). Summing or 
>>>>>>>>>>>>>>>>>> maxing over
>>>>>>>>>>>>>>>>>> ints would be a typical example. Though we could pursue 
>>>>>>>>>>>>>>>>>> making this opaque
>>>>>>>>>>>>>>>>>> to the runner in the long run, that's a more speculative 
>>>>>>>>>>>>>>>>>> (and difficult)
>>>>>>>>>>>>>>>>>> feature to tackle. This would allow the runner to at least 
>>>>>>>>>>>>>>>>>> aggregate and
>>>>>>>>>>>>>>>>>> report/return to the SDK metrics that it did not itself 
>>>>>>>>>>>>>>>>>> understand the
>>>>>>>>>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>>>>>>>>>> specialization
>>>>>>>>>>>>>>>>>> in the runner itself for metrics that it *did* understand as 
>>>>>>>>>>>>>>>>>> well.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for
>>>>>>>>>>>>>>>>>> every type X, one would have a single URN for UserMetric,
>>>>>>>>>>>>>>>>>> and its spec would designate the type and its payload would
>>>>>>>>>>>>>>>>>> designate the (qualified) name.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato <
>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>>>>>>>>>> I have made a revision today which is to make all
>>>>>>>>>>>>>>>>>>> metrics refer to a primary entity, so I have restructured 
>>>>>>>>>>>>>>>>>>> some of the
>>>>>>>>>>>>>>>>>>> protos a little bit.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The point of this change was to futureproof the
>>>>>>>>>>>>>>>>>>> possibility of allowing custom user metrics, with custom 
>>>>>>>>>>>>>>>>>>> aggregation
>>>>>>>>>>>>>>>>>>> functions for its metric updates.
>>>>>>>>>>>>>>>>>>> Now that each metric has an aggregation_entity
>>>>>>>>>>>>>>>>>>> associated with it (e.g. PCollection, PTransform), we can 
>>>>>>>>>>>>>>>>>>> design an
>>>>>>>>>>>>>>>>>>> approach which forwards the opaque bytes metric updates, 
>>>>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>>>> deserializing them. These are forwarded to user provided 
>>>>>>>>>>>>>>>>>>> code which then
>>>>>>>>>>>>>>>>>>> would deserialize the metric update payloads and perform 
>>>>>>>>>>>>>>>>>>> the custom
>>>>>>>>>>>>>>>>>>> aggregations.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think it has also simplified some of the URN metric
>>>>>>>>>>>>>>>>>>> protos, as they do not need to keep track of ptransform 
>>>>>>>>>>>>>>>>>>> names inside
>>>>>>>>>>>>>>>>>>> themselves now. The result is simpler structures, for the 
>>>>>>>>>>>>>>>>>>> metrics as the
>>>>>>>>>>>>>>>>>>> entities are pulled outside of the metric.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato <
>>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've gathered a lot of feedback so far and want to make
>>>>>>>>>>>>>>>>>>>> a decision by Friday, and begin working on related PRs 
>>>>>>>>>>>>>>>>>>>> next week.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Please make sure that you provide your feedback before
>>>>>>>>>>>>>>>>>>>> then and I will post the final decisions made to this 
>>>>>>>>>>>>>>>>>>>> thread Friday
>>>>>>>>>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía <
>>>>>>>>>>>>>>>>>>>> ieme...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Nice, I created a short link so people can refer to it
>>>>>>>>>>>>>>>>>>>>> easily in
>>>>>>>>>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on
>>>>>>>>>>>>>>>>>>>>> this proposal so far. I
>>>>>>>>>>>>>>>>>>>>> >> have made some revisions based on the feedback.
>>>>>>>>>>>>>>>>>>>>> There were some larger
>>>>>>>>>>>>>>>>>>>>> >> questions asking about alternatives. For each of
>>>>>>>>>>>>>>>>>>>>> these I have added a
>>>>>>>>>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> I would appreciate more feedback on the revised
>>>>>>>>>>>>>>>>>>>>> proposal. Please take
>>>>>>>>>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please
>>>>>>>>>>>>>>>>>>>>> take another look after
>>>>>>>>>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
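
Note on the quoted aggregation_entity revision: the idea of forwarding opaque metric update bytes to user-provided aggregation code can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the design doc. The function names, the entity string, the URN string, and the big-endian int64 payload encoding are all hypothetical.

```python
import struct
from collections import defaultdict


def forward_opaque(updates):
    """Runner side: group payload bytes by (entity, urn) without decoding them."""
    grouped = defaultdict(list)
    for entity, urn, payload in updates:
        grouped[(entity, urn)].append(payload)
    return dict(grouped)


def user_mean(payloads):
    """User side: decode the payloads (assumed big-endian int64) and average them."""
    values = [struct.unpack(">q", p)[0] for p in payloads]
    return sum(values) / len(values)


# Hypothetical entity and URN names, invented for this sketch.
updates = [
    ("ptransform:A", "example:custom_mean", struct.pack(">q", 4)),
    ("ptransform:A", "example:custom_mean", struct.pack(">q", 8)),
]
grouped = forward_opaque(updates)
mean_for_a = user_mean(grouped[("ptransform:A", "example:custom_mean")])
```

In this sketch the runner only needs the aggregation entity and the URN to route the bytes; the payload format is known only to the user-provided code.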


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Kenneth Knowles
 prevent user defined metric
>>>>>>>>>>>>>>>> types based on the fact they just weren't used enough to 
>>>>>>>>>>>>>>>> justify support.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus 
>>>>>>>>>>>>>>>> any special
>>>>>>>>>>>>>>>> system metrics?
>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One thing that has occurred to me is that we're conflating
>>>>>>>>>>>>>>>>> the idea of custom metrics and custom metric types. I would 
>>>>>>>>>>>>>>>>> propose
>>>>>>>>>>>>>>>>> the MetricSpec field be augmented with an additional field 
>>>>>>>>>>>>>>>>> "type" which is
>>>>>>>>>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents 
>>>>>>>>>>>>>>>>> of its
>>>>>>>>>>>>>>>>> payload, as well as the form of aggregation). Summing or 
>>>>>>>>>>>>>>>>> maxing over ints
>>>>>>>>>>>>>>>>> would be a typical example. Though we could pursue making 
>>>>>>>>>>>>>>>>> this opaque to
>>>>>>>>>>>>>>>>> the runner in the long run, that's a more speculative (and 
>>>>>>>>>>>>>>>>> difficult)
>>>>>>>>>>>>>>>>> feature to tackle. This would allow the runner to at least 
>>>>>>>>>>>>>>>>> aggregate and
>>>>>>>>>>>>>>>>> report/return to the SDK metrics that it did not itself 
>>>>>>>>>>>>>>>>> understand the
>>>>>>>>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>>>>>>>>> specialization
>>>>>>>>>>>>>>>>> in the runner itself for metrics that it *did* understand as 
>>>>>>>>>>>>>>>>> well.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for
>>>>>>>>>>>>>>>>> every type X one would have a single URN for UserMetric and 
>>>>>>>>>>>>>>>>> its spec would
>>>>>>>>>>>>>>>>> designate the type and payload designate the (qualified) name.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato <
>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>>>>>>>>> I have made a revision today which is to make all metrics
>>>>>>>>>>>>>>>>>> refer to a primary entity, so I have restructured some of 
>>>>>>>>>>>>>>>>>> the protos a
>>>>>>>>>>>>>>>>>> little bit.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The point of this change was to futureproof the
>>>>>>>>>>>>>>>>>> possibility of allowing custom user metrics, with custom 
>>>>>>>>>>>>>>>>>> aggregation
>>>>>>>>>>>>>>>>>> functions for its metric updates.
>>>>>>>>>>>>>>>>>> Now that each metric has an aggregation_entity associated
>>>>>>>>>>>>>>>>>> with it (e.g. PCollection, PTransform), we can design an 
>>>>>>>>>>>>>>>>>> approach which
>>>>>>>>>>>>>>>>>> forwards the opaque bytes metric updates, without 
>>>>>>>>>>>>>>>>>> deserializing them. These
>>>>>>>>>>>>>>>>>> are forwarded to user provided code which then would 
>>>>>>>>>>>>>>>>>> deserialize the metric
>>>>>>>>>>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think it has also simplified some of the URN metric
>>>>>>>>>>>>>>>>>> protos, as they do not need to keep track of ptransform 
>>>>>>>>>>>>>>>>>> names inside
>>>>>>>>>>>>>>>>>> themselves now. The result is simpler structures, for the 
>>>>>>>>>>>>>>>>>> metrics as the
>>>>>>>>>>>>>>>>>> entities are pulled outside of the metric.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato <
>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've gathered a lot of feedback so far and want to make
>>>>>>>>>>>>>>>>>>> a decision by Friday, and begin working on related PRs next 
>>>>>>>>>>>>>>>>>>> week.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please make sure that you provide your feedback before
>>>>>>>>>>>>>>>>>>> then and I will post the final decisions made to this 
>>>>>>>>>>>>>>>>>>> thread Friday
>>>>>>>>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía <
>>>>>>>>>>>>>>>>>>> ieme...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Nice, I created a short link so people can refer to it
>>>>>>>>>>>>>>>>>>>> easily in
>>>>>>>>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on this
>>>>>>>>>>>>>>>>>>>> proposal so far. I
>>>>>>>>>>>>>>>>>>>> >> have made some revisions based on the feedback.
>>>>>>>>>>>>>>>>>>>> There were some larger
>>>>>>>>>>>>>>>>>>>> >> questions asking about alternatives. For each of
>>>>>>>>>>>>>>>>>>>> these I have added a
>>>>>>>>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> I would appreciate more feedback on the revised
>>>>>>>>>>>>>>>>>>>> proposal. Please take
>>>>>>>>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please
>>>>>>>>>>>>>>>>>>>> take another look after
>>>>>>>>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
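
Note on the quoted suggestion to add a "type" URN to MetricSpec: a runner could then pick the aggregation from the type alone, without understanding what a given metric means. The sketch below is only an illustration under stated assumptions. The URN strings, the MetricUpdate class, and the already-decoded integer payload are hypothetical and are not taken from the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical type URNs, invented for this sketch.
SUM_INT = "example:metric_type:sum_int"
MAX_INT = "example:metric_type:max_int"


@dataclass(frozen=True)
class MetricUpdate:
    urn: str       # what the metric is, e.g. one URN shared by all user metrics
    type_urn: str  # how the payload is encoded and aggregated
    name: str      # qualified metric name
    payload: int   # already-decoded value, to keep the sketch short


# The runner only needs one combiner per standard type.
AGGREGATORS: Dict[str, Callable[[int, int], int]] = {
    SUM_INT: lambda a, b: a + b,
    MAX_INT: max,
}


def aggregate(updates: Iterable[MetricUpdate]) -> Dict[Tuple[str, str], int]:
    """Combine updates per (urn, name) using the combiner chosen by type_urn."""
    totals: Dict[Tuple[str, str], int] = {}
    for u in updates:
        key = (u.urn, u.name)
        combine = AGGREGATORS[u.type_urn]
        totals[key] = combine(totals[key], u.payload) if key in totals else u.payload
    return totals


totals = aggregate([
    MetricUpdate("example:user_metric", SUM_INT, "ns/elements", 2),
    MetricUpdate("example:user_metric", SUM_INT, "ns/elements", 3),
    MetricUpdate("example:user_metric", MAX_INT, "ns/largest", 7),
    MetricUpdate("example:user_metric", MAX_INT, "ns/largest", 5),
])
```

Here a new user metric of a standard type needs no runner changes, which is the "custom metrics of standard type" case discussed in the thread; a fully custom type would instead require a new entry in the combiner table.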


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw
>>>>>>>>>> supported type" didn't include new ways of aggregating ints. As long
>>>>>>>>>> as
>>>>>>>>>> that means we have a fixed set of aggregations (that align with what
>>>>>>>>>> users want and metrics back end support) it seems like we are doing 
>>>>>>>>>> user
>>>>>>>>>> metrics right.
>>>>>>>>>>
>>>>>>>>>> - Ben
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 11, 2018, 11:30 PM Romain Manni-Bucau <
>>>>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Maybe leave it out until proven it is needed. ATM counters are
>>>>>>>>>>> used a lot but others are less mainstream so being too fine from 
>>>>>>>>>>> the start
>>>>>>>>>>> can just add complexity and bugs in impls IMHO.
>>>>>>>>>>>
>>>>>>>>>>> Le 12 avr. 2018 08:06, "Robert Bradshaw" 
>>>>>>>>>>> a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> By "type" of metric, I mean both the data types (including
>>>>>>>>>>>> their encoding) and accumulator strategy. So sumint would be a 
>>>>>>>>>>>> type, as
>>>>>>>>>>>> would double-distribution.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers <
>>>>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> When you say type do you mean accumulator type, result type,
>>>>>>>>>>>>> or accumulator strategy? Specifically, what is the "type" of 
>>>>>>>>>>>>> sumint,
>>>>>>>>>>>>> sumlong, meanlong, etc?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw <
>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fully custom metric types is the "more speculative and
>>>>>>>>>>>>>> difficult" feature that I was proposing we kick down the road 
>>>>>>>>>>>>>> (and may
>>>>>>>>>>>>>> never get to). What I'm suggesting is that we support custom 
>>>>>>>>>>>>>> metrics of
>>>>>>>>>>>>>> standard type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The metric api is designed to prevent user defined metric
>>>>>>>>>>>>>>> types based on the fact they just weren't used enough to 
>>>>>>>>>>>>>>> justify support.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus 
>>>>>>>>>>>>>>> any special
>>>>>>>>>>>>>>> system metrics?
>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One thing that has occurred to me

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Kenneth Knowles
 want and metrics back end support) it seems like we are doing 
>>>>>>>>>> user
>>>>>>>>>> metrics right.
>>>>>>>>>>
>>>>>>>>>> - Ben
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 11, 2018, 11:30 PM Romain Manni-Bucau <
>>>>>>>>>> rmannibu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Maybe leave it out until proven it is needed. ATM counters are
>>>>>>>>>>> used a lot but others are less mainstream so being too fine from 
>>>>>>>>>>> the start
>>>>>>>>>>> can just add complexity and bugs in impls IMHO.
>>>>>>>>>>>
>>>>>>>>>>> Le 12 avr. 2018 08:06, "Robert Bradshaw" 
>>>>>>>>>>> a écrit :
>>>>>>>>>>>
>>>>>>>>>>>> By "type" of metric, I mean both the data types (including
>>>>>>>>>>>> their encoding) and accumulator strategy. So sumint would be a 
>>>>>>>>>>>> type, as
>>>>>>>>>>>> would double-distribution.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers <
>>>>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> When you say type do you mean accumulator type, result type,
>>>>>>>>>>>>> or accumulator strategy? Specifically, what is the "type" of 
>>>>>>>>>>>>> sumint,
>>>>>>>>>>>>> sumlong, meanlong, etc?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw <
>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fully custom metric types is the "more speculative and
>>>>>>>>>>>>>> difficult" feature that I was proposing we kick down the road 
>>>>>>>>>>>>>> (and may
>>>>>>>>>>>>>> never get to). What I'm suggesting is that we support custom 
>>>>>>>>>>>>>> metrics of
>>>>>>>>>>>>>> standard type.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The metric api is designed to prevent user defined metric
>>>>>>>>>>>>>>> types based on the fact they just weren't used enough to 
>>>>>>>>>>>>>>> justify support.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus 
>>>>>>>>>>>>>>> any special
>>>>>>>>>>>>>>> system metrics?
>>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One thing that has occurred to me is that we're conflating
>>>>>>>>>>>>>>>> the idea of custom metrics and custom metric types. I would 
>>>>>>>>>>>>>>>> propose
>>>>>>>>>>>>>>>> the MetricSpec field be augmented with an additional field 
>>>>>>>>

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw
om> wrote:
>>>>>>>>>
>>>>>>>>>> Maybe leave it out until proven it is needed. ATM counters are
>>>>>>>>>> used a lot but others are less mainstream so being too fine from the 
>>>>>>>>>> start
>>>>>>>>>> can just add complexity and bugs in impls IMHO.
>>>>>>>>>>
>>>>>>>>>> Le 12 avr. 2018 08:06, "Robert Bradshaw"  a
>>>>>>>>>> écrit :
>>>>>>>>>>
>>>>>>>>>>> By "type" of metric, I mean both the data types (including their
>>>>>>>>>>> encoding) and accumulator strategy. So sumint would be a type, as 
>>>>>>>>>>> would
>>>>>>>>>>> double-distribution.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers <
>>>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> When you say type do you mean accumulator type, result type, or
>>>>>>>>>>>> accumulator strategy? Specifically, what is the "type" of sumint, 
>>>>>>>>>>>> sumlong,
>>>>>>>>>>>> meanlong, etc?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw <
>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Fully custom metric types is the "more speculative and
>>>>>>>>>>>>> difficult" feature that I was proposing we kick down the road 
>>>>>>>>>>>>> (and may
>>>>>>>>>>>>> never get to). What I'm suggesting is that we support custom 
>>>>>>>>>>>>> metrics of
>>>>>>>>>>>>> standard type.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The metric api is designed to prevent user defined metric
>>>>>>>>>>>>>> types based on the fact they just weren't used enough to justify 
>>>>>>>>>>>>>> support.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus any 
>>>>>>>>>>>>>> special
>>>>>>>>>>>>>> system metrics?
>>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One thing that has occurred to me is that we're conflating
>>>>>>>>>>>>>>> the idea of custom metrics and custom metric types. I would 
>>>>>>>>>>>>>>> propose
>>>>>>>>>>>>>>> the MetricSpec field be augmented with an additional field 
>>>>>>>>>>>>>>> "type" which is
>>>>>>>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents of 
>>>>>>>>>>>>>>> its
>>>>>>>>>>>>>>> payload, as well as the form of aggregation). Summing or maxing 
>>>>>>>>>>>>>>> over ints
>>>>>>>>>>>>>>> would be a typical example. Though we could pursue making this 
>>>>>>>>>>>>>>> opaque to
>>>>>&

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Andrea Foegler
on the fact they just weren't used enough to justify 
>>>>>>>>>>>>> support.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus any 
>>>>>>>>>>>>> special
>>>>>>>>>>>>> system metrics?
>>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing that has occurred to me is that we're conflating
>>>>>>>>>>>>>> the idea of custom metrics and custom metric types. I would 
>>>>>>>>>>>>>> propose
>>>>>>>>>>>>>> the MetricSpec field be augmented with an additional field 
>>>>>>>>>>>>>> "type" which is
>>>>>>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents of 
>>>>>>>>>>>>>> its
>>>>>>>>>>>>>> payload, as well as the form of aggregation). Summing or maxing 
>>>>>>>>>>>>>> over ints
>>>>>>>>>>>>>> would be a typical example. Though we could pursue making this 
>>>>>>>>>>>>>> opaque to
>>>>>>>>>>>>>> the runner in the long run, that's a more speculative (and 
>>>>>>>>>>>>>> difficult)
>>>>>>>>>>>>>> feature to tackle. This would allow the runner to at least 
>>>>>>>>>>>>>> aggregate and
>>>>>>>>>>>>>> report/return to the SDK metrics that it did not itself 
>>>>>>>>>>>>>> understand the
>>>>>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>>>>>> specialization
>>>>>>>>>>>>>> in the runner itself for metrics that it *did* understand as 
>>>>>>>>>>>>>> well.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for every
>>>>>>>>>>>>>> type X one would have a single URN for UserMetric and it spec 
>>>>>>>>>>>>>> would
>>>>>>>>>>>>>> designate the type and payload designate the (qualified) name.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato <
>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>>>>>> I have made a revision today which is to make all metrics
>>>>>>>>>>>>>>> refer to a primary entity, so I have restructured some of the 
>>>>>>>>>>>>>>> protos a
>>>>>>>>>>>>>>> little bit.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The point of this change was to futureproof the possibility
>>>>>>>>>>>>>>> of allowing custom user metrics, with custom aggregation 
>>>>>>>>>>>>>>> functions for its
>>>>>>>>>>>>>>> metric updates.
>>>>>>>>>>>>>>> Now that each metric has an aggregation_entity associated
>>>>>>>>>>>>>>> with it (e.g. PCollection, PTransform), we can design an 
>>>>>>>>>>>>>>> approach which
>>>>>>>>>>>>>>> forwards the opaque bytes metric updates, without deserializing 
>>>>>>>>>>>>>>> them. These
>>>>>>>>>>>>>>> are forwarded to user provided code which then would 
>>>>>>>>>>>>>>> deserialize the metric
>>>>>>>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it has also simplified some of the URN metric
>>>>>>>>>>>>>>> protos, as they do not need to keep track of ptransform names 
>>>>>>>>>>>>>>> inside
>>>>>>>>>>>>>>> themselves now. The result is simpler structures, for the 
>>>>>>>>>>>>>>> metrics as the
>>>>>>>>>>>>>>> entities are pulled outside of the metric.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato <
>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've gathered a lot of feedback so far and want to make a
>>>>>>>>>>>>>>>> decision by Friday, and begin working on related PRs next week.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please make sure that you provide your feedback before then
>>>>>>>>>>>>>>>> and I will post the final decisions made to this thread Friday 
>>>>>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía <
>>>>>>>>>>>>>>>> ieme...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nice, I created a short link so people can refer to it
>>>>>>>>>>>>>>>>> easily in
>>>>>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on this
>>>>>>>>>>>>>>>>> proposal so far. I
>>>>>>>>>>>>>>>>> >> have made some revisions based on the feedback. There
>>>>>>>>>>>>>>>>> were some larger
>>>>>>>>>>>>>>>>> >> questions asking about alternatives. For each of these
>>>>>>>>>>>>>>>>> I have added a
>>>>>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> I would appreciate more feedback on the revised
>>>>>>>>>>>>>>>>> proposal. Please take
>>>>>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>>>>>>>>> another look after
>>>>>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw
>>>>>>>>> By "type" of metric, I mean both the data types (including their
>>>>>>>>> encoding) and accumulator strategy. So sumint would be a type, as 
>>>>>>>>> would
>>>>>>>>> double-distribution.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers <
>>>>>>>>> bjchamb...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> When you say type do you mean accumulator type, result type, or
>>>>>>>>>> accumulator strategy? Specifically, what is the "type" of sumint, 
>>>>>>>>>> sumlong,
>>>>>>>>>> meanlong, etc?
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw <
>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Fully custom metric types is the "more speculative and
>>>>>>>>>>> difficult" feature that I was proposing we kick down the road (and 
>>>>>>>>>>> may
>>>>>>>>>>> never get to). What I'm suggesting is that we support custom 
>>>>>>>>>>> metrics of
>>>>>>>>>>> standard type.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The metric api is designed to prevent user defined metric types
>>>>>>>>>>>> based on the fact they just weren't used enough to justify support.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus any 
>>>>>>>>>>>> special
>>>>>>>>>>>> system metrics?
>>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing that has occurred to me is that we're conflating the
>>>>>>>>>>>>> idea of custom metrics and custom metric types. I would propose
>>>>>>>>>>>>> the MetricSpec field be augmented with an additional field "type" 
>>>>>>>>>>>>> which is
>>>>>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents of 
>>>>>>>>>>>>> its
>>>>>>>>>>>>> payload, as well as the form of aggregation). Summing or maxing 
>>>>>>>>>>>>> over ints
>>>>>>>>>>>>> would be a typical example. Though we could pursue making this 
>>>>>>>>>>>>> opaque to
>>>>>>>>>>>>> the runner in the long run, that's a more speculative (and 
>>>>>>>>>>>>> difficult)
>>>>>>>>>>>>> feature to tackle. This would allow the runner to at least 
>>>>>>>>>>>>> aggregate and
>>>>>>>>>>>>> report/return to the SDK metrics that it did not itself 
>>>>>>>>>>>>> understand the
>>>>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>>>>> specialization
>>>>>>>>>>>>> in the runner itself for metrics that it *did* understand as 
>>>>>>>>>>>>> well.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for every
>>>>>>>>>>>>> type X one would have 

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Alex Amato
es is the "more speculative and difficult"
>>>>>>>>>> feature that I was proposing we kick down the road (and may never 
>>>>>>>>>> get to).
>>>>>>>>>> What I'm suggesting is that we support custom metrics of standard 
>>>>>>>>>> type.
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers <
>>>>>>>>>> bchamb...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> The metric api is designed to prevent user defined metric types
>>>>>>>>>>> based on the fact they just weren't used enough to justify support.
>>>>>>>>>>>
>>>>>>>>>>> Is there a reason we are bringing that complexity back?
>>>>>>>>>>> Shouldn't we just need the ability for the standard set plus any 
>>>>>>>>>>> special
>>>>>>>>>>> system metrics?
>>>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw <
>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>>>
>>>>>>>>>>>> One thing that has occurred to me is that we're conflating the
>>>>>>>>>>>> idea of custom metrics and custom metric types. I would propose
>>>>>>>>>>>> the MetricSpec field be augmented with an additional field "type" 
>>>>>>>>>>>> which is
>>>>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents of its
>>>>>>>>>>>> payload, as well as the form of aggregation). Summing or maxing 
>>>>>>>>>>>> over ints
>>>>>>>>>>>> would be a typical example. Though we could pursue making this 
>>>>>>>>>>>> opaque to
>>>>>>>>>>>> the runner in the long run, that's a more speculative (and 
>>>>>>>>>>>> difficult)
>>>>>>>>>>>> feature to tackle. This would allow the runner to at least 
>>>>>>>>>>>> aggregate and
>>>>>>>>>>>> report/return to the SDK metrics that it did not itself understand 
>>>>>>>>>>>> the
>>>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>>>> specialization
>>>>>>>>>>>> in the runner itself for metrics that it *did* understand as well.)
>>>>>>>>>>>>
>>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for every
>>>>>>>>>>>> type X one would have a single URN for UserMetric and its spec would
>>>>>>>>>>>> designate the type and payload designate the (qualified) name.
>>>>>>>>>>>>
>>>>>>>>>>>> - Robert
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>>>> I have made a revision today which is to make all metrics
>>>>>>>>>>>>> refer to a primary entity, so I have restructured some of the 
>>>>>>>>>>>>> protos a
>>>>>>>>>>>>> little bit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>>>>>>>> allowing custom user metrics, with custom aggregation functions 
>>>>>>>>>>>>> for its
>>>>>>>>>>>>> metric updates.
>>>>>>>>>>>>> Now that each metric has an aggregation_entity associated with
>>>>>>>>>>>>> it (e.g. PCollection, PTransform), we can design an approach 
>>>>>>>>>>>>> which forwards
>>>>>>>>>>>>> the opaque bytes metric updates, without deserializing them. 
>>>>>>>>>>>>> These are
>>>>>>>>>>>>> forwarded to user provided code which then would deserialize the 
>>>>>>>>>>>>> metric
>>>>>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it has also simplified some of the URN metric protos,
>>>>>>>>>>>>> as they do not need to keep track of ptransform names inside 
>>>>>>>>>>>>> themselves
>>>>>>>>>>>>> now. The result is simpler structures, for the metrics as the 
>>>>>>>>>>>>> entities are
>>>>>>>>>>>>> pulled outside of the metric.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've gathered a lot of feedback so far and want to make a
>>>>>>>>>>>>>> decision by Friday, and begin working on related PRs next week.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please make sure that you provide your feedback before then
>>>>>>>>>>>>>> and I will post the final decisions made to this thread Friday 
>>>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía <
>>>>>>>>>>>>>> ieme...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nice, I created a short link so people can refer to it
>>>>>>>>>>>>>>> easily in
>>>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on this
>>>>>>>>>>>>>>> proposal so far. I
>>>>>>>>>>>>>>> >> have made some revisions based on the feedback. There
>>>>>>>>>>>>>>> were some larger
>>>>>>>>>>>>>>> >> questions asking about alternatives. For each of these I
>>>>>>>>>>>>>>> have added a
>>>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> I would appreciate more feedback on the revised proposal.
>>>>>>>>>>>>>>> Please take
>>>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>>>>>>> another look after
>>>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Robert Bradshaw
axing 
>>>>>>>>>>> over ints
>>>>>>>>>>> would be a typical example. Though we could pursue making this 
>>>>>>>>>>> opaque to
>>>>>>>>>>> the runner in the long run, that's a more speculative (and 
>>>>>>>>>>> difficult)
>>>>>>>>>>> feature to tackle. This would allow the runner to at least 
>>>>>>>>>>> aggregate and
>>>>>>>>>>> report/return to the SDK metrics that it did not itself understand 
>>>>>>>>>>> the
>>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>>> specialization
>>>>>>>>>>> in the runner itself for metrics that it *did* understand as well.)
>>>>>>>>>>>
>>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for every type
>>>>>>>>>>> X one would have a single URN for UserMetric and its spec would 
>>>>>>>>>>> designate
>>>>>>>>>>> the type and payload designate the (qualified) name.
>>>>>>>>>>>
>>>>>>>>>>> - Robert
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>>> I have made a revision today which is to make all metrics refer
>>>>>>>>>>>> to a primary entity, so I have restructured some of the protos a 
>>>>>>>>>>>> little bit.
>>>>>>>>>>>>
>>>>>>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>>>>>>> allowing custom user metrics, with custom aggregation functions 
>>>>>>>>>>>> for its
>>>>>>>>>>>> metric updates.
>>>>>>>>>>>> Now that each metric has an aggregation_entity associated with
>>>>>>>>>>>> it (e.g. PCollection, PTransform), we can design an approach which 
>>>>>>>>>>>> forwards
>>>>>>>>>>>> the opaque bytes metric updates, without deserializing them. These 
>>>>>>>>>>>> are
>>>>>>>>>>>> forwarded to user provided code which then would deserialize the 
>>>>>>>>>>>> metric
>>>>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>>>>
>>>>>>>>>>>> I think it has also simplified some of the URN metric protos,
>>>>>>>>>>>> as they do not need to keep track of ptransform names inside 
>>>>>>>>>>>> themselves
>>>>>>>>>>>> now. The result is simpler structures, for the metrics as the 
>>>>>>>>>>>> entities are
>>>>>>>>>>>> pulled outside of the metric.
>>>>>>>>>>>>
>>>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I've gathered a lot of feedback so far and want to make a
>>>>>>>>>>>>> decision by Friday, and begin working on related PRs next week.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please make sure that you provide your feedback before then
>>>>>>>>>>>>> and I will post the final decisions made to this thread Friday 
>>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nice, I created a short link so people can refer to it easily
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on this
>>>>>>>>>>>>>> proposal so far. I
>>>>>>>>>>>>>> >> have made some revisions based on the feedback. There were
>>>>>>>>>>>>>> some larger
>>>>>>>>>>>>>> >> questions asking about alternatives. For each of these I
>>>>>>>>>>>>>> have added a
>>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I would appreciate more feedback on the revised proposal.
>>>>>>>>>>>>>> Please take
>>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>>>>>> another look after
>>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>
>>>>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-13 Thread Kenneth Knowles
inging that complexity back? Shouldn't
>>>>>>>>> we just need the ability for the standard set plus any special system
>>>>>>>>> metrics?
>>>>>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks. I think this has simplified things.
>>>>>>>>>>
>>>>>>>>>> One thing that has occurred to me is that we're conflating the
>>>>>>>>>> idea of custom metrics and custom metric types. I would propose
>>>>>>>>>> the MetricSpec field be augmented with an additional field "type" 
>>>>>>>>>> which is
>>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents of its
>>>>>>>>>> payload, as well as the form of aggregation). Summing or maxing over 
>>>>>>>>>> ints
>>>>>>>>>> would be a typical example. Though we could pursue making this 
>>>>>>>>>> opaque to
>>>>>>>>>> the runner in the long run, that's a more speculative (and difficult)
>>>>>>>>>> feature to tackle. This would allow the runner to at least aggregate 
>>>>>>>>>> and
>>>>>>>>>> report/return to the SDK metrics that it did not itself understand 
>>>>>>>>>> the
>>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>>> specialization
>>>>>>>>>> in the runner itself for metrics that it *did* understand as well.)
>>>>>>>>>>
>>>>>>>>>> In addition, rather than having UserMetricOfTypeX for every type
>>>>>>>>>> X one would have a single URN for UserMetric and its spec would 
>>>>>>>>>> designate
>>>>>>>>>> the type and payload designate the (qualified) name.
>>>>>>>>>>
>>>>>>>>>> - Robert
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>>> I have made a revision today which is to make all metrics refer
>>>>>>>>>>> to a primary entity, so I have restructured some of the protos a 
>>>>>>>>>>> little bit.
>>>>>>>>>>>
>>>>>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>>>>>> allowing custom user metrics, with custom aggregation functions for 
>>>>>>>>>>> its
>>>>>>>>>>> metric updates.
>>>>>>>>>>> Now that each metric has an aggregation_entity associated with
>>>>>>>>>>> it (e.g. PCollection, PTransform), we can design an approach which 
>>>>>>>>>>> forwards
>>>>>>>>>>> the opaque bytes metric updates, without deserializing them. These 
>>>>>>>>>>> are
>>>>>>>>>>> forwarded to user provided code which then would deserialize the 
>>>>>>>>>>> metric
>>>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>>>
>>>>>>>>>>> I think it has also simplified some of the URN metric protos, as
>>>>>>>>>>> they do not need to keep track of ptransform names inside 
>>>>>>>>>>> themselves now.
>>>>>>>>>>> The result is simpler structures, for the metrics as the entities 
>>>>>>>>>>> are
>>>>>>>>>>> pulled outside of the metric.
>>>>>>>>>>>
>>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I've gathered a lot of feedback so far and want to make a
>>>>>>>>>>>> decision by Friday, and begin working on related PRs next week.
>>>>>>>>>>>>
>>>>>>>>>>>> Please make sure that you provide your feedback before then and
>>>>>>>>>>>> I will post the final decisions made to this thread Friday 
>>>>>>>>>>>> afternoon.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Nice, I created a short link so people can refer to it easily
>>>>>>>>>>>>> in
>>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato <
>>>>>>>>>>>>> ajam...@google.com> wrote:
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Thank you everyone for your initial feedback on this
>>>>>>>>>>>>> proposal so far. I
>>>>>>>>>>>>> >> have made some revisions based on the feedback. There were
>>>>>>>>>>>>> some larger
>>>>>>>>>>>>> >> questions asking about alternatives. For each of these I
>>>>>>>>>>>>> have added a
>>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> I would appreciate more feedback on the revised proposal.
>>>>>>>>>>>>> Please take
>>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>>>>> another look after
>>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>>> >> Alex
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >
>>>>>>>>>>>>>
>>>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-12 Thread Robert Bradshaw
ed with an additional field "type" 
>>>>>>>>> which is
>>>>>>>>> a urn specifying the type of metric it is (i.e. the contents of its
>>>>>>>>> payload, as well as the form of aggregation). Summing or maxing over 
>>>>>>>>> ints
>>>>>>>>> would be a typical example. Though we could pursue making this opaque 
>>>>>>>>> to
>>>>>>>>> the runner in the long run, that's a more speculative (and difficult)
>>>>>>>>> feature to tackle. This would allow the runner to at least aggregate 
>>>>>>>>> and
>>>>>>>>> report/return to the SDK metrics that it did not itself understand the
>>>>>>>>> semantic meaning of. (It would probably simplify much of the 
>>>>>>>>> specialization
>>>>>>>>> in the runner itself for metrics that it *did* understand as well.)
>>>>>>>>>
>>>>>>>>> In addition, rather than having UserMetricOfTypeX for every type X
>>>>>>>>> one would have a single URN for UserMetric and its spec would 
>>>>>>>>> designate the
>>>>>>>>> type and payload designate the (qualified) name.
>>>>>>>>>
>>>>>>>>> - Robert
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you everyone for your feedback so far.
>>>>>>>>>> I have made a revision today which is to make all metrics refer
>>>>>>>>>> to a primary entity, so I have restructured some of the protos a 
>>>>>>>>>> little bit.
>>>>>>>>>>
>>>>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>>>>> allowing custom user metrics, with custom aggregation functions for 
>>>>>>>>>> its
>>>>>>>>>> metric updates.
>>>>>>>>>> Now that each metric has an aggregation_entity associated with it
>>>>>>>>>> (e.g. PCollection, PTransform), we can design an approach which 
>>>>>>>>>> forwards
>>>>>>>>>> the opaque bytes metric updates, without deserializing them. These 
>>>>>>>>>> are
>>>>>>>>>> forwarded to user provided code which then would deserialize the 
>>>>>>>>>> metric
>>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>>
>>>>>>>>>> I think it has also simplified some of the URN metric protos, as
>>>>>>>>>> they do not need to keep track of ptransform names inside themselves 
>>>>>>>>>> now.
>>>>>>>>>> The result is simpler structures, for the metrics as the entities are
>>>>>>>>>> pulled outside of the metric.
>>>>>>>>>>
>>>>>>>>>> I have mentioned this in the doc now, and wanted to draw
>>>>>>>>>> attention to this particular revision.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've gathered a lot of feedback so far and want to make a
>>>>>>>>>>> decision by Friday, and begin working on related PRs next week.
>>>>>>>>>>>
>>>>>>>>>>> Please make sure that you provide your feedback before then and
>>>>>>>>>>> I will post the final decisions made to this thread Friday 
>>>>>>>>>>> afternoon.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Nice, I created a short link so people can refer to it easily in
>>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>>
>>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Thank you everyone for your initial feedback on this
>>>>>>>>>>>> proposal so far. I
>>>>>>>>>>>> >> have made some revisions based on the feedback. There were
>>>>>>>>>>>> some larger
>>>>>>>>>>>> >> questions asking about alternatives. For each of these I
>>>>>>>>>>>> have added a
>>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>>> recommendation as well
>>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> I would appreciate more feedback on the revised proposal.
>>>>>>>>>>>> Please take
>>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>>> >>
>>>>>>>>>>>> >>
>>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>>>> another look after
>>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>>> >> Alex
>>>>>>>>>>>> >>
>>>>>>>>>>>> >
>>>>>>>>>>>>
>>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-12 Thread Alex Amato
it.
>>>>>>>>>
>>>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>>>> allowing custom user metrics, with custom aggregation functions for 
>>>>>>>>> its
>>>>>>>>> metric updates.
>>>>>>>>> Now that each metric has an aggregation_entity associated with it
>>>>>>>>> (e.g. PCollection, PTransform), we can design an approach which 
>>>>>>>>> forwards
>>>>>>>>> the opaque bytes metric updates, without deserializing them. These are
>>>>>>>>> forwarded to user provided code which then would deserialize the 
>>>>>>>>> metric
>>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>>
>>>>>>>>> I think it has also simplified some of the URN metric protos, as
>>>>>>>>> they do not need to keep track of ptransform names inside themselves 
>>>>>>>>> now.
>>>>>>>>> The result is simpler structures, for the metrics as the entities are
>>>>>>>>> pulled outside of the metric.
>>>>>>>>>
>>>>>>>>> I have mentioned this in the doc now, and wanted to draw attention
>>>>>>>>> to this particular revision.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I've gathered a lot of feedback so far and want to make a
>>>>>>>>>> decision by Friday, and begin working on related PRs next week.
>>>>>>>>>>
>>>>>>>>>> Please make sure that you provide your feedback before then and I
>>>>>>>>>> will post the final decisions made to this thread Friday afternoon.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Nice, I created a short link so people can refer to it easily in
>>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>>
>>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>>> >
>>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato 
>>>>>>>>>>> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> Hello beam community,
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thank you everyone for your initial feedback on this proposal
>>>>>>>>>>> so far. I
>>>>>>>>>>> >> have made some revisions based on the feedback. There were
>>>>>>>>>>> some larger
>>>>>>>>>>> >> questions asking about alternatives. For each of these I have
>>>>>>>>>>> added a
>>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>>> recommendation as well
>>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>>> >>
>>>>>>>>>>> >> I would appreciate more feedback on the revised proposal.
>>>>>>>>>>> Please take
>>>>>>>>>>> >> another look and let me know
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>>> >>
>>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>>> another look after
>>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thanks again,
>>>>>>>>>>> >> Alex
>>>>>>>>>>> >>
>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-12 Thread Kenneth Knowles
>> bit.
>>>>>>>>
>>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>>> allowing custom user metrics, with custom aggregation functions for its
>>>>>>>> metric updates.
>>>>>>>> Now that each metric has an aggregation_entity associated with it
>>>>>>>> (e.g. PCollection, PTransform), we can design an approach which 
>>>>>>>> forwards
>>>>>>>> the opaque bytes metric updates, without deserializing them. These are
>>>>>>>> forwarded to user provided code which then would deserialize the metric
>>>>>>>> update payloads and perform the custom aggregations.
>>>>>>>>
>>>>>>>> I think it has also simplified some of the URN metric protos, as
>>>>>>>> they do not need to keep track of ptransform names inside themselves 
>>>>>>>> now.
>>>>>>>> The result is simpler structures, for the metrics as the entities are
>>>>>>>> pulled outside of the metric.
>>>>>>>>
>>>>>>>> I have mentioned this in the doc now, and wanted to draw attention
>>>>>>>> to this particular revision.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I've gathered a lot of feedback so far and want to make a decision
>>>>>>>>> by Friday, and begin working on related PRs next week.
>>>>>>>>>
>>>>>>>>> Please make sure that you provide your feedback before then and I
>>>>>>>>> will post the final decisions made to this thread Friday afternoon.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Nice, I created a short link so people can refer to it easily in
>>>>>>>>>> future discussions, website, etc.
>>>>>>>>>>
>>>>>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw <
>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>>>>>> >
>>>>>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato 
>>>>>>>>>> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> Hello beam community,
>>>>>>>>>> >>
>>>>>>>>>> >> Thank you everyone for your initial feedback on this proposal
>>>>>>>>>> so far. I
>>>>>>>>>> >> have made some revisions based on the feedback. There were
>>>>>>>>>> some larger
>>>>>>>>>> >> questions asking about alternatives. For each of these I have
>>>>>>>>>> added a
>>>>>>>>>> >> section tagged with [Alternatives] and discussed my
>>>>>>>>>> recommendation as well
>>>>>>>>>> >> as a few other choices we considered.
>>>>>>>>>> >>
>>>>>>>>>> >> I would appreciate more feedback on the revised proposal.
>>>>>>>>>> Please take
>>>>>>>>>> >> another look and let me know
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>>>>>> >>
>>>>>>>>>> >> Etienne, I would appreciate it if you could please take
>>>>>>>>>> another look after
>>>>>>>>>> >> the revisions I have made as well.
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks again,
>>>>>>>>>> >> Alex
>>>>>>>>>> >>
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-12 Thread Ben Chambers
Sounds perfect. Just wanted to make sure that "custom metrics of supported
type" didn't include new ways of aggregating ints. As long as that means we
have a fixed set of aggregations (that align with what users want and what
metrics backends support), it seems like we are doing user metrics right.
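
For illustration only, a minimal Python sketch of what "custom metrics of
standard type" could look like on the runner side, assuming a fixed table of
aggregations keyed by a type URN. Nothing here is taken from the proposal doc:
the URN strings, MetricUpdate and RunnerAggregator are hypothetical names. A
user can introduce a new metric name, but not a new way of aggregating ints.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical URNs, invented for this sketch (not Beam's real ones).
USER_METRIC_URN = "example:metric:user"
SUM_INT64 = "example:metric_type:sum_int64"
MAX_INT64 = "example:metric_type:max_int64"

# The fixed set of aggregations the runner knows about. Users cannot extend it.
AGGREGATIONS: Dict[str, Callable[[int, int], int]] = {
    SUM_INT64: lambda a, b: a + b,
    MAX_INT64: max,
}


@dataclass(frozen=True)
class MetricUpdate:
    urn: str       # one URN shared by all user metrics
    type_urn: str  # which standard aggregation applies
    name: str      # (qualified) user-chosen name
    value: int


class RunnerAggregator:
    """Aggregates updates by type, without knowing what a metric means."""

    def __init__(self) -> None:
        self._state: Dict[Tuple[str, str], Tuple[str, int]] = {}

    def report(self, update: MetricUpdate) -> None:
        combine = AGGREGATIONS.get(update.type_urn)
        if combine is None:
            raise ValueError(f"unsupported metric type: {update.type_urn}")
        key = (update.urn, update.name)
        if key not in self._state:
            self._state[key] = (update.type_urn, update.value)
            return
        type_urn, current = self._state[key]
        if type_urn != update.type_urn:
            raise ValueError(f"metric {update.name} changed type")
        self._state[key] = (type_urn, combine(current, update.value))

    def value(self, name: str, urn: str = USER_METRIC_URN) -> int:
        return self._state[(urn, name)][1]
```

Reporting 3 and then 4 under a sum-typed name yields 7, while an update with
an unknown type URN is rejected rather than aggregated.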

- Ben

On Wed, Apr 11, 2018, 11:30 PM Romain Manni-Bucau 
wrote:

> Maybe leave it out until proven it is needed. ATM counters are used a lot
> but others are less mainstream so being too fine from the start can just
> add complexity and bugs in impls IMHO.
>
> On Apr 12, 2018 08:06, "Robert Bradshaw" wrote:
>
>> By "type" of metric, I mean both the data types (including their
>> encoding) and accumulator strategy. So sumint would be a type, as would
>> double-distribution.
>>
>> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers 
>> wrote:
>>
>>> When you say type do you mean accumulator type, result type, or
>>> accumulator strategy? Specifically, what is the "type" of sumint, sumlong,
>>> meanlong, etc?
>>>
>>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> Fully custom metric types is the "more speculative and difficult"
>>>> feature that I was proposing we kick down the road (and may never get to).
>>>> What I'm suggesting is that we support custom metrics of standard type.
>>>>
>>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers 
>>>> wrote:
>>>>
>>>>> The metric api is designed to prevent user defined metric types based
>>>>> on the fact they just weren't used enough to justify support.
>>>>>
>>>>> Is there a reason we are bringing that complexity back? Shouldn't we
>>>>> just need the ability for the standard set plus any special system 
>>>>> metrics?
>>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> Thanks. I think this has simplified things.
>>>>>>
>>>>>> One thing that has occurred to me is that we're conflating the idea
>>>>>> of custom metrics and custom metric types. I would propose the MetricSpec
>>>>>> field be augmented with an additional field "type" which is a urn
>>>>>> specifying the type of metric it is (i.e. the contents of its payload, as
>>>>>> well as the form of aggregation). Summing or maxing over ints would be a
>>>>>> typical example. Though we could pursue making this opaque to the runner 
>>>>>> in
>>>>>> the long run, that's a more speculative (and difficult) feature to 
>>>>>> tackle.
>>>>>> This would allow the runner to at least aggregate and report/return to 
>>>>>> the
>>>>>> SDK metrics that it did not itself understand the semantic meaning of. 
>>>>>> (It
>>>>>> would probably simplify much of the specialization in the runner itself 
>>>>>> for
>>>>>> metrics that it *did* understand as well.)
>>>>>>
>>>>>> In addition, rather than having UserMetricOfTypeX for every type X
>>>>>> one would have a single URN for UserMetric and its spec would designate 
>>>>>> the
>>>>>> type and payload designate the (qualified) name.
>>>>>>
>>>>>> - Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato 
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you everyone for your feedback so far.
>>>>>>> I have made a revision today which is to make all metrics refer to a
>>>>>>> primary entity, so I have restructured some of the protos a little bit.
>>>>>>>
>>>>>>> The point of this change was to futureproof the possibility of
>>>>>>> allowing custom user metrics, with custom aggregation functions for its
>>>>>>> metric updates.
>>>>>>> Now that each metric has an aggregation_entity associated with it
>>>>>>> (e.g. PCollection, PTransform), we can design an approach which forwards
>>>>>>> the opaque bytes metric updates, without deserializing them. These are
>>>>>>> forwarded to user provided code which then would deserialize the metric

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Romain Manni-Bucau
Maybe leave it out until it is proven to be needed. ATM counters are used a
lot but other metric types are less mainstream, so being too fine-grained from
the start can just add complexity and bugs in implementations, IMHO.

On Apr 12, 2018 08:06, "Robert Bradshaw" wrote:

> By "type" of metric, I mean both the data types (including their encoding)
> and accumulator strategy. So sumint would be a type, as would
> double-distribution.
>
> On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers 
> wrote:
>
>> When you say type do you mean accumulator type, result type, or
>> accumulator strategy? Specifically, what is the "type" of sumint, sumlong,
>> meanlong, etc?
>>
>> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw 
>> wrote:
>>
>>> Fully custom metric types is the "more speculative and difficult"
>>> feature that I was proposing we kick down the road (and may never get to).
>>> What I'm suggesting is that we support custom metrics of standard type.
>>>
>>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers 
>>> wrote:
>>>
>>>> The metric api is designed to prevent user defined metric types based
>>>> on the fact they just weren't used enough to justify support.
>>>>
>>>> Is there a reason we are bringing that complexity back? Shouldn't we
>>>> just need the ability for the standard set plus any special system metrics?
>>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> Thanks. I think this has simplified things.
>>>>>
>>>>> One thing that has occurred to me is that we're conflating the idea of
>>>>> custom metrics and custom metric types. I would propose the MetricSpec
>>>>> field be augmented with an additional field "type" which is a urn
>>>>> specifying the type of metric it is (i.e. the contents of its payload, as
>>>>> well as the form of aggregation). Summing or maxing over ints would be a
>>>>> typical example. Though we could pursue making this opaque to the runner 
>>>>> in
>>>>> the long run, that's a more speculative (and difficult) feature to tackle.
>>>>> This would allow the runner to at least aggregate and report/return to the
>>>>> SDK metrics that it did not itself understand the semantic meaning of. (It
>>>>> would probably simplify much of the specialization in the runner itself 
>>>>> for
>>>>> metrics that it *did* understand as well.)
>>>>>
>>>>> In addition, rather than having UserMetricOfTypeX for every type X one
>>>>> would have a single URN for UserMetric and its spec would designate the 
>>>>> type
>>>>> and payload designate the (qualified) name.
>>>>>
>>>>> - Robert
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato  wrote:
>>>>>
>>>>>> Thank you everyone for your feedback so far.
>>>>>> I have made a revision today which is to make all metrics refer to a
>>>>>> primary entity, so I have restructured some of the protos a little bit.
>>>>>>
>>>>>> The point of this change was to futureproof the possibility of
>>>>>> allowing custom user metrics, with custom aggregation functions for its
>>>>>> metric updates.
>>>>>> Now that each metric has an aggregation_entity associated with it
>>>>>> (e.g. PCollection, PTransform), we can design an approach which forwards
>>>>>> the opaque bytes metric updates, without deserializing them. These are
>>>>>> forwarded to user provided code which then would deserialize the metric
>>>>>> update payloads and perform the custom aggregations.
>>>>>>
>>>>>> I think it has also simplified some of the URN metric protos, as they
>>>>>> do not need to keep track of ptransform names inside themselves now. The
>>>>>> result is simpler structures, for the metrics as the entities are pulled
>>>>>> outside of the metric.
>>>>>>
>>>>>> I have mentioned this in the doc now, and wanted to draw attention to
>>>>>> this particular revision.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato 
>>>>>> wrote:
>>>>

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Robert Bradshaw
By "type" of metric, I mean both the data types (including their encoding)
and accumulator strategy. So sumint would be a type, as would
double-distribution.
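
As a rough illustration of that definition, here is a minimal Python sketch in
which a metric "type" bundles a payload encoding with an accumulator strategy.
It is not taken from the proposal doc: the MetricType class, the URN strings
and the byte layouts are assumptions made up for this example.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class MetricType:
    """Hypothetical: a metric type = payload encoding + accumulator strategy."""
    urn: str
    encode: Callable[[Any], bytes]
    decode: Callable[[bytes], Any]
    combine: Callable[[Any, Any], Any]


# "sumint": a big-endian signed 64-bit integer, accumulated by addition.
SUM_INT64 = MetricType(
    urn="example:metric_type:sum_int64",
    encode=lambda v: struct.pack(">q", v),
    decode=lambda b: struct.unpack(">q", b)[0],
    combine=lambda a, b: a + b,
)

# A distribution over doubles, stored as (count, sum, min, max).
Dist = Tuple[int, float, float, float]


def _combine_dist(a: Dist, b: Dist) -> Dist:
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


DOUBLE_DISTRIBUTION = MetricType(
    urn="example:metric_type:double_distribution",
    encode=lambda d: struct.pack(">qddd", *d),
    decode=lambda b: struct.unpack(">qddd", b),
    combine=_combine_dist,
)


def merge_payloads(metric_type: MetricType, left: bytes, right: bytes) -> bytes:
    """Merge two encoded updates knowing only the metric's type."""
    merged = metric_type.combine(metric_type.decode(left),
                                 metric_type.decode(right))
    return metric_type.encode(merged)
```

With this shape, anything that knows the type can merge two payloads without
knowing what the metric measures.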

On Wed, Apr 11, 2018 at 10:39 PM Ben Chambers  wrote:

> When you say type do you mean accumulator type, result type, or
> accumulator strategy? Specifically, what is the "type" of sumint, sumlong,
> meanlong, etc?
>
> On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw  wrote:
>
>> Fully custom metric types is the "more speculative and difficult" feature
>> that I was proposing we kick down the road (and may never get to). What I'm
>> suggesting is that we support custom metrics of standard type.
>>
>> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers 
>> wrote:
>>
>>> The metric api is designed to prevent user defined metric types based on
>>> the fact they just weren't used enough to justify support.
>>>
>>> Is there a reason we are bringing that complexity back? Shouldn't we
>>> just need the ability for the standard set plus any special system metrics?
>>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> Thanks. I think this has simplified things.
>>>>
>>>> One thing that has occurred to me is that we're conflating the idea of
>>>> custom metrics and custom metric types. I would propose the MetricSpec
>>>> field be augmented with an additional field "type" which is a urn
>>>> specifying the type of metric it is (i.e. the contents of its payload, as
>>>> well as the form of aggregation). Summing or maxing over ints would be a
>>>> typical example. Though we could pursue making this opaque to the runner in
>>>> the long run, that's a more speculative (and difficult) feature to tackle.
>>>> This would allow the runner to at least aggregate and report/return to the
>>>> SDK metrics that it did not itself understand the semantic meaning of. (It
>>>> would probably simplify much of the specialization in the runner itself for
>>>> metrics that it *did* understand as well.)
>>>>
>>>> In addition, rather than having UserMetricOfTypeX for every type X one
>>>> would have a single URN for UserMetric and its spec would designate the type
>>>> and payload designate the (qualified) name.
>>>>
>>>> - Robert
>>>>
>>>>
>>>>
>>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato  wrote:
>>>>
>>>>> Thank you everyone for your feedback so far.
>>>>> I have made a revision today which is to make all metrics refer to a
>>>>> primary entity, so I have restructured some of the protos a little bit.
>>>>>
>>>>> The point of this change was to futureproof the possibility of
>>>>> allowing custom user metrics, with custom aggregation functions for its
>>>>> metric updates.
>>>>> Now that each metric has an aggregation_entity associated with it
>>>>> (e.g. PCollection, PTransform), we can design an approach which forwards
>>>>> the opaque bytes metric updates, without deserializing them. These are
>>>>> forwarded to user provided code which then would deserialize the metric
>>>>> update payloads and perform the custom aggregations.
>>>>>
>>>>> I think it has also simplified some of the URN metric protos, as they
>>>>> do not need to keep track of ptransform names inside themselves now. The
>>>>> result is simpler structures, for the metrics as the entities are pulled
>>>>> outside of the metric.
>>>>>
>>>>> I have mentioned this in the doc now, and wanted to draw attention to
>>>>> this particular revision.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato  wrote:
>>>>>
>>>>>> I've gathered a lot of feedback so far and want to make a decision by
>>>>>> Friday, and begin working on related PRs next week.
>>>>>>
>>>>>> Please make sure that you provide your feedback before then and I
>>>>>> will post the final decisions made to this thread Friday afternoon.
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>>> Nice, I created a short link so people can refer to it eas

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Ben Chambers
When you say type do you mean accumulator type, result type, or accumulator
strategy? Specifically, what is the "type" of sumint, sumlong, meanlong,
etc?

On Wed, Apr 11, 2018, 9:38 PM Robert Bradshaw  wrote:

> Fully custom metric types is the "more speculative and difficult" feature
> that I was proposing we kick down the road (and may never get to). What I'm
> suggesting is that we support custom metrics of standard type.
>
> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers  wrote:
>
>> The metric api is designed to prevent user defined metric types based on
>> the fact they just weren't used enough to justify support.
>>
>> Is there a reason we are bringing that complexity back? Shouldn't we just
>> need the ability for the standard set plus any special system metrics?
>> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw 
>> wrote:
>>
>>> Thanks. I think this has simplified things.
>>>
>>> One thing that has occurred to me is that we're conflating the idea of
>>> custom metrics and custom metric types. I would propose the MetricSpec
>>> field be augmented with an additional field "type" which is a urn
>>> specifying the type of metric it is (i.e. the contents of its payload, as
>>> well as the form of aggregation). Summing or maxing over ints would be a
>>> typical example. Though we could pursue making this opaque to the runner in
>>> the long run, that's a more speculative (and difficult) feature to tackle.
>>> This would allow the runner to at least aggregate and report/return to the
>>> SDK metrics that it did not itself understand the semantic meaning of. (It
>>> would probably simplify much of the specialization in the runner itself for
>>> metrics that it *did* understand as well.)
>>>
>>> In addition, rather than having UserMetricOfTypeX for every type X one
>>> would have a single URN for UserMetric and its spec would designate the type
>>> and payload designate the (qualified) name.
>>>
>>> - Robert
>>>
>>>
>>>
>>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato  wrote:
>>>
>>>> Thank you everyone for your feedback so far.
>>>> I have made a revision today which is to make all metrics refer to a
>>>> primary entity, so I have restructured some of the protos a little bit.
>>>>
>>>> The point of this change was to futureproof the possibility of allowing
>>>> custom user metrics, with custom aggregation functions for its metric
>>>> updates.
>>>> Now that each metric has an aggregation_entity associated with it (e.g.
>>>> PCollection, PTransform), we can design an approach which forwards the
>>>> opaque bytes metric updates, without deserializing them. These are
>>>> forwarded to user provided code which then would deserialize the metric
>>>> update payloads and perform the custom aggregations.
>>>>
>>>> I think it has also simplified some of the URN metric protos, as they
>>>> do not need to keep track of ptransform names inside themselves now. The
>>>> result is simpler structures, for the metrics as the entities are pulled
>>>> outside of the metric.
>>>>
>>>> I have mentioned this in the doc now, and wanted to draw attention to
>>>> this particular revision.
>>>>
>>>>
>>>>
>>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato  wrote:
>>>>
>>>>> I've gathered a lot of feedback so far and want to make a decision by
>>>>> Friday, and begin working on related PRs next week.
>>>>>
>>>>> Please make sure that you provide your feedback before then and I will
>>>>> post the final decisions made to this thread Friday afternoon.
>>>>>
>>>>>
>>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía  wrote:
>>>>>
>>>>>> Nice, I created a short link so people can refer to it easily in
>>>>>> future discussions, website, etc.
>>>>>>
>>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>>
>>>>>> Thanks for sharing.
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw 
>>>>>> wrote:
>>>>>> > Thanks for the nice writeup. I added some comments.
>>>>>> >
>>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato 
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Hello beam community,
>>>>>> >>
>>>>>> >> Thank you everyone for your initial feedback on this proposal so far. I
>>>>>> >> have made some revisions based on the feedback. There were some larger
>>>>>> >> questions asking about alternatives. For each of these I have added a
>>>>>> >> section tagged with [Alternatives] and discussed my recommendation as well
>>>>>> >> as a few other choices we considered.
>>>>>> >>
>>>>>> >> I would appreciate more feedback on the revised proposal. Please take
>>>>>> >> another look and let me know
>>>>>> >>
>>>>>> >> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>>> >>
>>>>>> >> Etienne, I would appreciate it if you could please take another look after
>>>>>> >> the revisions I have made as well.
>>>>>> >>
>>>>>> >> Thanks again,
>>>>>> >> Alex
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Robert Bradshaw
Fully custom metric types is the "more speculative and difficult" feature
that I was proposing we kick down the road (and may never get to). What I'm
suggesting is that we support custom metrics of standard type.

On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers  wrote:

> The metric api is designed to prevent user defined metric types based on
> the fact they just weren't used enough to justify support.
>
> Is there a reason we are bringing that complexity back? Shouldn't we just
> need the ability for the standard set plus any special system metrics?
> On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw  wrote:
>
>> Thanks. I think this has simplified things.
>>
>> One thing that has occurred to me is that we're conflating the idea of
>> custom metrics and custom metric types. I would propose the MetricSpec
>> field be augmented with an additional field "type" which is a urn
>> specifying the type of metric it is (i.e. the contents of its payload, as
>> well as the form of aggregation). Summing or maxing over ints would be a
>> typical example. Though we could pursue making this opaque to the runner in
>> the long run, that's a more speculative (and difficult) feature to tackle.
>> This would allow the runner to at least aggregate and report/return to the
>> SDK metrics that it did not itself understand the semantic meaning of. (It
>> would probably simplify much of the specialization in the runner itself for
>> metrics that it *did* understand as well.)
>>
>> In addition, rather than having UserMetricOfTypeX for every type X one
>> would have a single URN for UserMetric and its spec would designate the type
>> and payload designate the (qualified) name.
>>
>> - Robert
>>
>>
>>
>> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato  wrote:
>>
>>> Thank you everyone for your feedback so far.
>>> I have made a revision today which is to make all metrics refer to a
>>> primary entity, so I have restructured some of the protos a little bit.
>>>
>>> The point of this change was to futureproof the possibility of allowing
>>> custom user metrics, with custom aggregation functions for its metric
>>> updates.
>>> Now that each metric has an aggregation_entity associated with it (e.g.
>>> PCollection, PTransform), we can design an approach which forwards the
>>> opaque bytes metric updates, without deserializing them. These are
>>> forwarded to user provided code which then would deserialize the metric
>>> update payloads and perform the custom aggregations.
>>>
>>> I think it has also simplified some of the URN metric protos, as they do
>>> not need to keep track of ptransform names inside themselves now. The
>>> result is simpler structures, for the metrics as the entities are pulled
>>> outside of the metric.
>>>
>>> I have mentioned this in the doc now, and wanted to draw attention to
>>> this particular revision.
>>>
>>>
>>>
>>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato  wrote:
>>>
>>>> I've gathered a lot of feedback so far and want to make a decision by
>>>> Friday, and begin working on related PRs next week.
>>>>
>>>> Please make sure that you provide your feedback before then and I will
>>>> post the final decisions made to this thread Friday afternoon.
>>>>
>>>>
>>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía  wrote:
>>>>
>>>>> Nice, I created a short link so people can refer to it easily in
>>>>> future discussions, website, etc.
>>>>>
>>>>> https://s.apache.org/beam-fn-api-metrics
>>>>>
>>>>> Thanks for sharing.
>>>>>
>>>>>
>>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw 
>>>>> wrote:
>>>>> > Thanks for the nice writeup. I added some comments.
>>>>> >
>>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato 
>>>>> wrote:
>>>>> >>
>>>>> >> Hello beam community,
>>>>> >>
>>>>> >> Thank you everyone for your initial feedback on this proposal so far. I
>>>>> >> have made some revisions based on the feedback. There were some larger
>>>>> >> questions asking about alternatives. For each of these I have added a
>>>>> >> section tagged with [Alternatives] and discussed my recommendation as well
>>>>> >> as a few other choices we considered.
>>>>> >>
>>>>> >> I would appreciate more feedback on the revised proposal. Please take
>>>>> >> another look and let me know
>>>>> >>
>>>>> >> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>>> >>
>>>>> >> Etienne, I would appreciate it if you could please take another look after
>>>>> >> the revisions I have made as well.
>>>>> >>
>>>>> >> Thanks again,
>>>>> >> Alex
>>>>> >>
>>>>> >
>>>>>
>>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Ben Chambers
The metric api is designed to prevent user defined metric types based on
the fact they just weren't used enough to justify support.

Is there a reason we are bringing that complexity back? Shouldn't we just
need the ability for the standard set plus any special system metrics?

On Wed, Apr 11, 2018, 5:43 PM Robert Bradshaw  wrote:

> Thanks. I think this has simplified things.
>
> One thing that has occurred to me is that we're conflating the idea of
> custom metrics and custom metric types. I would propose the MetricSpec
> field be augmented with an additional field "type" which is a urn
> specifying the type of metric it is (i.e. the contents of its payload, as
> well as the form of aggregation). Summing or maxing over ints would be a
> typical example. Though we could pursue making this opaque to the runner in
> the long run, that's a more speculative (and difficult) feature to tackle.
> This would allow the runner to at least aggregate and report/return to the
> SDK metrics that it did not itself understand the semantic meaning of. (It
> would probably simplify much of the specialization in the runner itself for
> metrics that it *did* understand as well.)
>
> In addition, rather than having UserMetricOfTypeX for every type X one
> would have a single URN for UserMetric and its spec would designate the type
> and payload designate the (qualified) name.
>
> - Robert
>
>
>
> On Wed, Apr 11, 2018 at 5:12 PM Alex Amato  wrote:
>
>> Thank you everyone for your feedback so far.
>> I have made a revision today which is to make all metrics refer to a
>> primary entity, so I have restructured some of the protos a little bit.
>>
>> The point of this change was to futureproof the possibility of allowing
>> custom user metrics, with custom aggregation functions for its metric
>> updates.
>> Now that each metric has an aggregation_entity associated with it (e.g.
>> PCollection, PTransform), we can design an approach which forwards the
>> opaque bytes metric updates, without deserializing them. These are
>> forwarded to user provided code which then would deserialize the metric
>> update payloads and perform the custom aggregations.
>>
>> I think it has also simplified some of the URN metric protos, as they do
>> not need to keep track of ptransform names inside themselves now. The
>> result is simpler structures, for the metrics as the entities are pulled
>> outside of the metric.
>>
>> I have mentioned this in the doc now, and wanted to draw attention to
>> this particular revision.
>>
>>
>>
>> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato  wrote:
>>
>>> I've gathered a lot of feedback so far and want to make a decision by
>>> Friday, and begin working on related PRs next week.
>>>
>>> Please make sure that you provide your feedback before then and I will
>>> post the final decisions made to this thread Friday afternoon.
>>>
>>>
>>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía  wrote:
>>>
>>>> Nice, I created a short link so people can refer to it easily in
>>>> future discussions, website, etc.
>>>>
>>>> https://s.apache.org/beam-fn-api-metrics
>>>>
>>>> Thanks for sharing.
>>>>
>>>>
>>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw 
>>>> wrote:
>>>> > Thanks for the nice writeup. I added some comments.
>>>> >
>>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato  wrote:
>>>> >>
>>>> >> Hello beam community,
>>>> >>
>>>> >> Thank you everyone for your initial feedback on this proposal so far. I
>>>> >> have made some revisions based on the feedback. There were some larger
>>>> >> questions asking about alternatives. For each of these I have added a
>>>> >> section tagged with [Alternatives] and discussed my recommendation as well
>>>> >> as a few other choices we considered.
>>>> >>
>>>> >> I would appreciate more feedback on the revised proposal. Please take
>>>> >> another look and let me know
>>>> >>
>>>> >> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>>> >>
>>>> >> Etienne, I would appreciate it if you could please take another look after
>>>> >> the revisions I have made as well.
>>>> >>
>>>> >> Thanks again,
>>>> >> Alex
>>>> >>
>>>> >
>>>>
>>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Robert Bradshaw
Thanks. I think this has simplified things.

One thing that has occurred to me is that we're conflating the idea of
custom metrics and custom metric types. I would propose the MetricSpec
field be augmented with an additional field "type" which is a urn
specifying the type of metric it is (i.e. the contents of its payload, as
well as the form of aggregation). Summing or maxing over ints would be a
typical example. Though we could pursue making this opaque to the runner in
the long run, that's a more speculative (and difficult) feature to tackle.
This would allow the runner to at least aggregate and report/return to the
SDK metrics that it did not itself understand the semantic meaning of. (It
would probably simplify much of the specialization in the runner itself for
metrics that it *did* understand as well.)

In addition, rather than having UserMetricOfTypeX for every type X one
would have a single URN for UserMetric and its spec would designate the type
and payload designate the (qualified) name.
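
To make the "type" URN idea above a bit more concrete, here is a minimal
sketch in Python. It is not the proposed proto: the URN strings, the
MetricSpec fields, and the COMBINERS table are hypothetical names chosen for
illustration, and payloads are shown as already-decoded ints to keep it
short. The only point is that a runner could pick a combiner from the type
URN and aggregate a metric without knowing what the metric means.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical type URNs; the real names would come from the design doc.
SUM_INT64 = "beam:metrics:sum_int_64"
MAX_INT64 = "beam:metrics:max_int_64"


@dataclass(frozen=True)
class MetricSpec:
    """Illustrative stand-in for the MetricSpec discussed in this thread."""
    urn: str   # what the metric is, e.g. one shared URN for all user metrics
    type: str  # URN naming the payload contents and the form of aggregation
    name: str  # (qualified) metric name


# Combiners the runner knows about, keyed only by type URN.
COMBINERS: Dict[str, Callable[[int, int], int]] = {
    SUM_INT64: lambda a, b: a + b,
    MAX_INT64: max,
}


def aggregate(updates: Iterable[Tuple[MetricSpec, int]]) -> Dict[MetricSpec, int]:
    """Combine updates per spec using the combiner named by spec.type.

    A spec whose type URN is not in COMBINERS raises KeyError in this sketch.
    """
    totals: Dict[MetricSpec, int] = {}
    for spec, value in updates:
        combine = COMBINERS[spec.type]
        totals[spec] = combine(totals[spec], value) if spec in totals else value
    return totals
```

With this shape, adding a new user metric of a standard type needs no new
runner code; only a new type URN would.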

- Robert



On Wed, Apr 11, 2018 at 5:12 PM Alex Amato  wrote:

> Thank you everyone for your feedback so far.
> I have made a revision today which is to make all metrics refer to a
> primary entity, so I have restructured some of the protos a little bit.
>
> The point of this change was to futureproof the possibility of allowing
> custom user metrics, with custom aggregation functions for its metric
> updates.
> Now that each metric has an aggregation_entity associated with it (e.g.
> PCollection, PTransform), we can design an approach which forwards the
> opaque bytes metric updates, without deserializing them. These are
> forwarded to user provided code which then would deserialize the metric
> update payloads and perform the custom aggregations.
>
> I think it has also simplified some of the URN metric protos, as they do
> not need to keep track of ptransform names inside themselves now. The
> result is simpler structures, for the metrics as the entities are pulled
> outside of the metric.
>
> I have mentioned this in the doc now, and wanted to draw attention to this
> particular revision.
>
>
>
> On Tue, Apr 10, 2018 at 9:53 AM Alex Amato  wrote:
>
>> I've gathered a lot of feedback so far and want to make a decision by
>> Friday, and begin working on related PRs next week.
>>
>> Please make sure that you provide your feedback before then and I will
>> post the final decisions made to this thread Friday afternoon.
>>
>>
>> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía  wrote:
>>
>>> Nice, I created a short link so people can refer to it easily in
>>> future discussions, website, etc.
>>>
>>> https://s.apache.org/beam-fn-api-metrics
>>>
>>> Thanks for sharing.
>>>
>>>
>>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw 
>>> wrote:
>>> > Thanks for the nice writeup. I added some comments.
>>> >
>>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato  wrote:
>>> >>
>>> >> Hello beam community,
>>> >>
>>> >> Thank you everyone for your initial feedback on this proposal so far. I
>>> >> have made some revisions based on the feedback. There were some larger
>>> >> questions asking about alternatives. For each of these I have added a
>>> >> section tagged with [Alternatives] and discussed my recommendation as well
>>> >> as a few other choices we considered.
>>> >>
>>> >> I would appreciate more feedback on the revised proposal. Please take
>>> >> another look and let me know
>>> >>
>>> >> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>> >>
>>> >> Etienne, I would appreciate it if you could please take another look after
>>> >> the revisions I have made as well.
>>> >>
>>> >> Thanks again,
>>> >> Alex
>>> >>
>>> >
>>>
>>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-11 Thread Alex Amato
Thank you everyone for your feedback so far.
I have made a revision today which is to make all metrics refer to a
primary entity, so I have restructured some of the protos a little bit.

The point of this change was to futureproof the possibility of allowing
custom user metrics, with custom aggregation functions for its metric
updates.
Now that each metric has an aggregation_entity associated with it (e.g.
PCollection, PTransform), we can design an approach which forwards the
opaque bytes metric updates, without deserializing them. These are
forwarded to user provided code which then would deserialize the metric
update payloads and perform the custom aggregations.
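
As a rough sketch of the forwarding approach described above (Python, under
stated assumptions): the runner-side function groups payloads by aggregation
entity and metric name without ever decoding them, and a user-provided
function does the decoding and the custom aggregation. The Entity tuple, the
function names, and the big-endian int64 encoding are all hypothetical and
are not taken from the doc.

```python
import struct
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# (entity kind, entity id), e.g. ("PTransform", "ParseLines"). Illustrative only.
Entity = Tuple[str, str]
Key = Tuple[Entity, str]


def group_updates(
    updates: Iterable[Tuple[Entity, str, bytes]]
) -> Dict[Key, List[bytes]]:
    """Runner side: bucket opaque payloads by (entity, metric name).

    The payload bytes are never inspected here.
    """
    grouped: Dict[Key, List[bytes]] = defaultdict(list)
    for entity, name, payload in updates:
        grouped[(entity, name)].append(payload)
    return dict(grouped)


def user_sum(payloads: List[bytes]) -> int:
    """User side: decode each payload (assumed big-endian int64) and sum."""
    return sum(struct.unpack(">q", p)[0] for p in payloads)


def aggregate_with(
    grouped: Dict[Key, List[bytes]],
    aggregator: Callable[[List[bytes]], int],
) -> Dict[Key, int]:
    """Hand each bucket of opaque payloads to the user-provided aggregator."""
    return {key: aggregator(payloads) for key, payloads in grouped.items()}
```

Because the key is (entity, name), the runner can route updates correctly
without the payload having to carry a ptransform name inside itself.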

I think it has also simplified some of the URN metric protos, as they do
not need to keep track of ptransform names inside themselves now. The
result is simpler structures, for the metrics as the entities are pulled
outside of the metric.

I have mentioned this in the doc now, and wanted to draw attention to this
particular revision.



On Tue, Apr 10, 2018 at 9:53 AM Alex Amato  wrote:

> I've gathered a lot of feedback so far and want to make a decision by
> Friday, and begin working on related PRs next week.
>
> Please make sure that you provide your feedback before then and I will
> post the final decisions made to this thread Friday afternoon.
>
>
> On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía  wrote:
>
>> Nice, I created a short link so people can refer to it easily in
>> future discussions, website, etc.
>>
>> https://s.apache.org/beam-fn-api-metrics
>>
>> Thanks for sharing.
>>
>>
>> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw 
>> wrote:
>> > Thanks for the nice writeup. I added some comments.
>> >
>> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato  wrote:
>> >>
>> >> Hello beam community,
>> >>
>> >> Thank you everyone for your initial feedback on this proposal so far. I
>> >> have made some revisions based on the feedback. There were some larger
>> >> questions asking about alternatives. For each of these I have added a
>> >> section tagged with [Alternatives] and discussed my recommendation as well
>> >> as a few other choices we considered.
>> >>
>> >> I would appreciate more feedback on the revised proposal. Please take
>> >> another look and let me know
>> >>
>> >> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>> >>
>> >> Etienne, I would appreciate it if you could please take another look after
>> >> the revisions I have made as well.
>> >>
>> >> Thanks again,
>> >> Alex
>> >>
>> >
>>
>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-10 Thread Alex Amato
I've gathered a lot of feedback so far and want to make a decision by
Friday, and begin working on related PRs next week.

Please make sure that you provide your feedback before then and I will post
the final decisions made to this thread Friday afternoon.

On Thu, Apr 5, 2018 at 1:38 AM Ismaël Mejía  wrote:

> Nice, I created a short link so people can refer to it easily in
> future discussions, website, etc.
>
> https://s.apache.org/beam-fn-api-metrics
>
> Thanks for sharing.
>
>
> On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw 
> wrote:
> > Thanks for the nice writeup. I added some comments.
> >
> > On Wed, Apr 4, 2018 at 1:53 PM Alex Amato  wrote:
> >>
> >> Hello beam community,
> >>
> >> Thank you everyone for your initial feedback on this proposal so far. I
> >> have made some revisions based on the feedback. There were some larger
> >> questions asking about alternatives. For each of these I have added a
> >> section tagged with [Alternatives] and discussed my recommendation as well
> >> as a few other choices we considered.
> >>
> >> I would appreciate more feedback on the revised proposal. Please take
> >> another look and let me know
> >>
> >> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
> >>
> >> Etienne, I would appreciate it if you could please take another look after
> >> the revisions I have made as well.
> >>
> >> Thanks again,
> >> Alex
> >>
> >
>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-05 Thread Ismaël Mejía
Nice, I created a short link so people can refer to it easily in
future discussions, website, etc.

https://s.apache.org/beam-fn-api-metrics

Thanks for sharing.


On Wed, Apr 4, 2018 at 11:28 PM, Robert Bradshaw  wrote:
> Thanks for the nice writeup. I added some comments.
>
> On Wed, Apr 4, 2018 at 1:53 PM Alex Amato  wrote:
>>
>> Hello beam community,
>>
>> Thank you everyone for your initial feedback on this proposal so far. I
>> have made some revisions based on the feedback. There were some larger
>> questions asking about alternatives. For each of these I have added a
>> section tagged with [Alternatives] and discussed my recommendation as well
>> as a few other choices we considered.
>>
>> I would appreciate more feedback on the revised proposal. Please take
>> another look and let me know
>>
>> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>>
>> Etienne, I would appreciate it if you could please take another look after
>> the revisions I have made as well.
>>
>> Thanks again,
>> Alex
>>
>


Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-04 Thread Robert Bradshaw
Thanks for the nice writeup. I added some comments.

On Wed, Apr 4, 2018 at 1:53 PM Alex Amato  wrote:

> Hello beam community,
>
> Thank you everyone for your initial feedback on this proposal so far. I
> have made some revisions based on the feedback. There were some larger
> questions asking about alternatives. For each of these I have added a
> section tagged with [Alternatives] and discussed my recommendation as well
> as a few other choices we considered.
>
> I would appreciate more feedback on the revised proposal. Please take
> another look and let me know
>
> https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit
>
> Etienne, I would appreciate it if you could please take another look after
> the revisions I have made as well.
>
> Thanks again,
> Alex
>
>


Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

2018-04-04 Thread Alex Amato
Hello beam community,

Thank you everyone for your initial feedback on this proposal so far. I
have made some revisions based on the feedback. There were some larger
questions asking about alternatives. For each of these I have added a
section tagged with [Alternatives] and discussed my recommendation as well
as a few other choices we considered.

I would appreciate more feedback on the revised proposal. Please take
another look and let me know
https://docs.google.com/document/d/1MtBZYV7NAcfbwyy9Op8STeFNBxtljxgy69FkHMvhTMA/edit

Etienne, I would appreciate it if you could please take another look after
the revisions I have made as well.

Thanks again,
Alex


Re: Beam Fn API

2017-05-31 Thread Robert Bradshaw
Thanks! Looks good. I've added some comments to the doc.

On Wed, May 31, 2017 at 7:00 AM, Etienne Chauchot 
wrote:

> Thanks for all these docs! They are exactly what was needed for new
> contributors as discussed in this thread
>
> https://lists.apache.org/thread.html/ac93d29424e19d57097373b78f3f5bcbc701e4b51385a52a6e27b7ed@%3Cdev.beam.apache.org%3E
>
> Etienne
>
>
> Le 31/05/2017 à 11:12, Aljoscha Krettek a écrit :
>
>> Thanks for banging these out Lukasz. I’ll try and read them all this week.
>>
>> We’re also planning to add support for the Fn API to the Flink Runner so
>> that we can execute Python programs. I’m sure we’ll get some valuable
>> feedback for you while doing that.
>>
>> On 26. May 2017, at 22:49, Lukasz Cwik  wrote:
>>>
>>> I would like to share another document about the Fn API. This document
>>> specifically discusses how to access side inputs, access remote references
>>> (e.g. large iterables for hot keys produced by a GBK), and support user
>>> state.
>>> https://s.apache.org/beam-fn-state-api-and-bundle-processing
>>>
>>> The document does require a strong foundation in the Apache Beam model and
>>> a good understanding of the prior shared docs:
>>> * How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
>>> * How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data
>>>
>>> I could really use the help of runner contributors to review the caching
>>> semantics within the SDK harness and whether they would work well for the
>>> runner they contribute to the most.
>>>
>>> On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:
>>>
>>> Manu, the goal is to share here initially, update the docs addressing
>>>> people's comments, and then publish them on the website once they are
>>>> stable enough.
>>>>
>>>> On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
>>>> wrote:
>>>>
>>>> Thanks Lukasz. The following two links were somehow incorrectly
>>>>> formatted
>>>>> in your mail.
>>>>>
>>>>> * How to process a bundle:
>>>>> https://s.apache.org/beam-fn-api-processing-a-bundle
>>>>> * How to send and receive data:
>>>>> https://s.apache.org/beam-fn-api-send-and-receive-data
>>>>>
>>>>> By the way, is there a way to find them from the Beam website ?
>>>>>
>>>>>
>>>>> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
>>>>> wrote:
>>>>>
>>>>>> Now that I'm back from vacation and the 2.0.0 release is not taking all my
>>>>>> time, I am focusing my attention on working on the Beam Portability
>>>>>> framework, specifically the Fn API so that we can get Python and other
>>>>>> language integrations to work with any runner.
>>>>>>
>>>>>> For newcomers, I would like to reshare the overview:
>>>>>> https://s.apache.org/beam-fn-api
>>>>>>
>>>>>> And for those of you who have been following this thread and contributors
>>>>>> focusing on Runner integration with Apache Beam:
>>>>>> * How to process a bundle: https://s.apache.org/beam-fn-a
>>>>>>
>>>>> pi-processing-a-
>>>>>
>>>>>> bundle
>>>>>> * How to send and receive data: https://s.apache.org/
>>>>>> beam-fn-api-send-and-receive-data
>>>>>>
>>>>>> If you want to dive deeper, you should look at:
>>>>>> * Runner API Protobuf: https://github.com/apache/beam
>>>>>> /blob/master/sdks/
>>>>>> common/runner-api/src/main/proto/beam_runner_api.proto
>>>>>> <https://github.com/apache/beam/blob/master/sdks/common/runn
>>>>>>
>>>>> er-api/src/main/proto/beam_runner_api.proto>
>>>>>
>>>>>> * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/
>>>>>> common/fn-api/src/main/proto/beam_fn_api.proto
>>>>>> <https://github.com/apache/beam/blob/master/sdks/common/fn-
>>>>>>
>>>>> api/src/main/proto/beam_fn_api.proto>
>>>>>
>

Re: Beam Fn API

2017-05-31 Thread Etienne Chauchot
Thanks for all these docs! They are exactly what was needed for new 
contributors as discussed in this thread


https://lists.apache.org/thread.html/ac93d29424e19d57097373b78f3f5bcbc701e4b51385a52a6e27b7ed@%3Cdev.beam.apache.org%3E

Etienne

Le 31/05/2017 à 11:12, Aljoscha Krettek a écrit :

Thanks for banging these out Lukasz. I’ll try and read them all this week.

We’re also planning to add support for the Fn API to the Flink Runner so that 
we can execute Python programs. I’m sure we’ll get some valuable feedback for 
you while doing that.


On 26. May 2017, at 22:49, Lukasz Cwik  wrote:

I would like to share another document about the Fn API. This document
specifically discusses how to access side inputs, access remote references
(e.g. large iterables for hot keys produced by a GBK), and support user
state.
https://s.apache.org/beam-fn-state-api-and-bundle-processing

The document does require a strong foundation in the Apache Beam model and
a good understanding of the prior shared docs:
* How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data

I could really use the help of runner contributors to review the caching
semantics within the SDK harness and whether they would work well for the
runner they contribute to the most.

On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:


Manu, the goal is to share here initially, update the docs addressing
people's comments, and then publish them on the website once they are
stable enough.

On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
wrote:


Thanks Lukasz. The following two links were somehow incorrectly formatted
in your mail.

* How to process a bundle:
https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data:
https://s.apache.org/beam-fn-api-send-and-receive-data

By the way, is there a way to find them from the Beam website ?


On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
wrote:


Now that I'm back from vacation and the 2.0.0 release is not taking all my
time, I am focusing my attention on working on the Beam Portability
framework, specifically the Fn API so that we can get Python and other
language integrations to work with any runner.

For newcomers, I would like to reshare the overview:
https://s.apache.org/beam-fn-api

And for those of you who have been following this thread and contributors
focusing on Runner integration with Apache Beam:
* How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data

If you want to dive deeper, you should look at:
* Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto
* Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/fn-api/src/main/proto/beam_fn_api.proto
* Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/java/harness
* Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/worker

Next I'm planning on talking about Beam Fn State API and will need help
from Runner contributors to talk about caching semantics and key spaces and
whether the integrations mesh well with current Runner implementations. The
State API is meant to support user state, side inputs, and re-iteration for
large values produced by GroupByKey.

On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:

Yes, I was using a Pipeline that was:
Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a batch
pipeline in the global window using the default trigger)

In Google Cloud Dataflow, the shuffle step uses the binary representation
to compare keys, so the above pipeline would normally be converted to the
following two stages:
Read -> GBK Writer
GBK Reader -> IdentityParDo

Note that the GBK Writer and GBK Reader need to use a coder to encode and
decode the value.

When using the Fn API, those two stages expanded because of the Fn API
crossings using a gRPC Write/Read pair:
Read -> gRPC Write -> gRPC Read -> GBK Writer
GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo

In my naive prototype implementation, the coder was used to encode
elements at the gRPC steps. This meant that the coder was
encoding/decoding/encoding in the first stage and
decoding/encoding/decoding in the second stage. This tripled the number of
times the coder was being invoked per element. This additional use of the
coder a

Re: Beam Fn API

2017-05-31 Thread Aljoscha Krettek
Thanks for banging these out Lukasz. I’ll try and read them all this week.

We’re also planning to add support for the Fn API to the Flink Runner so that 
we can execute Python programs. I’m sure we’ll get some valuable feedback for 
you while doing that.

> On 26. May 2017, at 22:49, Lukasz Cwik  wrote:
> 
> I would like to share another document about the Fn API. This document
> specifically discusses how to access side inputs, access remote references
> (e.g. large iterables for hot keys produced by a GBK), and support user
> state.
> https://s.apache.org/beam-fn-state-api-and-bundle-processing
> 
> The document does require a strong foundation in the Apache Beam model and
> a good understanding of the prior shared docs:
> * How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
> * How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data
> 
> I could really use the help of runner contributors to review the caching
> semantics within the SDK harness and whether they would work well for the
> runner they contribute to the most.
> 
> On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:
> 
>> Manu, the goal is to share here initially, update the docs addressing
>> people's comments, and then publish them on the website once they are
>> stable enough.
>> 
>> On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
>> wrote:
>> 
>>> Thanks Lukasz. The following two links were somehow incorrectly formatted
>>> in your mail.
>>> 
>>> * How to process a bundle:
>>> https://s.apache.org/beam-fn-api-processing-a-bundle
>>> * How to send and receive data:
>>> https://s.apache.org/beam-fn-api-send-and-receive-data
>>> 
>>> By the way, is there a way to find them from the Beam website ?
>>> 
>>> 
>>> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
>>> wrote:
>>> 
>>>> Now that I'm back from vacation and the 2.0.0 release is not taking all my
>>>> time, I am focusing my attention on working on the Beam Portability
>>>> framework, specifically the Fn API so that we can get Python and other
>>>> language integrations to work with any runner.
>>>>
>>>> For newcomers, I would like to reshare the overview:
>>>> https://s.apache.org/beam-fn-api
>>>>
>>>> And for those of you who have been following this thread and contributors
>>>> focusing on Runner integration with Apache Beam:
>>>> * How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
>>>> * How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data
>>>>
>>>> If you want to dive deeper, you should look at:
>>>> * Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto
>>>> * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/fn-api/src/main/proto/beam_fn_api.proto
>>>> * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/java/harness
>>>> * Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/worker
>>>>
>>>> Next I'm planning on talking about Beam Fn State API and will need help
>>>> from Runner contributors to talk about caching semantics and key spaces and
>>>> whether the integrations mesh well with current Runner implementations. The
>>>> State API is meant to support user state, side inputs, and re-iteration for
>>>> large values produced by GroupByKey.
>>>>
>>>> On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:
>>>>
>>>>> Yes, I was using a Pipeline that was:
>>>>> Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a batch
>>>>> pipeline in the global window using the default trigger)
>>>>>

Re: Beam Fn API

2017-05-26 Thread Lukasz Cwik
I would like to share another document about the Fn API. This document
specifically discusses how to access side inputs, access remote references
(e.g. large iterables for hot keys produced by a GBK), and support user
state.
https://s.apache.org/beam-fn-state-api-and-bundle-processing

The document does require a strong foundation in the Apache Beam model and
a good understanding of the prior shared docs:
* How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data

I could really use the help of runner contributors to review the caching
semantics within the SDK harness and whether they would work well for the
runner they contribute to the most.

On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:

> Manu, the goal is to share here initially, update the docs addressing
> people's comments, and then publish them on the website once they are
> stable enough.
>
> On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
> wrote:
>
>> Thanks Lukasz. The following two links were somehow incorrectly formatted
>> in your mail.
>>
>> * How to process a bundle:
>> https://s.apache.org/beam-fn-api-processing-a-bundle
>> * How to send and receive data:
>> https://s.apache.org/beam-fn-api-send-and-receive-data
>>
>> By the way, is there a way to find them from the Beam website ?
>>
>>
>> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
>> wrote:
>>
>> > Now that I'm back from vacation and the 2.0.0 release is not taking all my
>> > time, I am focusing my attention on working on the Beam Portability
>> > framework, specifically the Fn API so that we can get Python and other
>> > language integrations to work with any runner.
>> >
>> > For newcomers, I would like to reshare the overview:
>> > https://s.apache.org/beam-fn-api
>> >
>> > And for those of you who have been following this thread and contributors
>> > focusing on Runner integration with Apache Beam:
>> > * How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
>> > * How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data
>> >
>> > If you want to dive deeper, you should look at:
>> > * Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto
>> > * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/fn-api/src/main/proto/beam_fn_api.proto
>> > * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/java/harness
>> > * Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/worker
>> >
>> > Next I'm planning on talking about Beam Fn State API and will need help
>> > from Runner contributors to talk about caching semantics and key spaces and
>> > whether the integrations mesh well with current Runner implementations. The
>> > State API is meant to support user state, side inputs, and re-iteration for
>> > large values produced by GroupByKey.
>> >
>> > On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:
>> >
>> > > Yes, I was using a Pipeline that was:
>> > > Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a batch
>> > > pipeline in the global window using the default trigger)
>> > >
>> > > In Google Cloud Dataflow, the shuffle step uses the binary representation
>> > > to compare keys, so the above pipeline would normally be converted to the
>> > > following two stages:
>> > > Read -> GBK Writer
>> > > GBK Reader -> IdentityParDo
>> > >
>> > > Note that the GBK Writer and GBK Reader need to use a coder to encode and
>> > > decode the value.
>> > >
>> > > When using the Fn API, those two stages expanded because of the Fn API
>> > > crossings using a gRPC Write/Read pair:
>> > > Rea

Re: Beam Fn API

2017-05-21 Thread Lukasz Cwik
Manu, the goal is to share here initially, update the docs addressing
people's comments, and then publish them on the website once they are
stable enough.

On Sun, May 21, 2017 at 5:54 PM, Manu Zhang  wrote:

> Thanks Lukasz. The following two links were somehow incorrectly formatted
> in your mail.
>
> * How to process a bundle:
> https://s.apache.org/beam-fn-api-processing-a-bundle
> * How to send and receive data:
> https://s.apache.org/beam-fn-api-send-and-receive-data
>
> By the way, is there a way to find them from the Beam website ?
>
>
> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
> wrote:
>
> > Now that I'm back from vacation and the 2.0.0 release is not taking all my
> > time, I am focusing my attention on working on the Beam Portability
> > framework, specifically the Fn API so that we can get Python and other
> > language integrations to work with any runner.
> >
> > For newcomers, I would like to reshare the overview:
> > https://s.apache.org/beam-fn-api
> >
> > And for those of you who have been following this thread and contributors
> > focusing on Runner integration with Apache Beam:
> > * How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
> > * How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data
> >
> > If you want to dive deeper, you should look at:
> > * Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto
> > * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/fn-api/src/main/proto/beam_fn_api.proto
> > * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/java/harness
> > * Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/worker
> >
> > Next I'm planning on talking about Beam Fn State API and will need help
> > from Runner contributors to talk about caching semantics and key spaces and
> > whether the integrations mesh well with current Runner implementations. The
> > State API is meant to support user state, side inputs, and re-iteration for
> > large values produced by GroupByKey.
> >
> > On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:
> >
> > > Yes, I was using a Pipeline that was:
> > > Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a batch
> > > pipeline in the global window using the default trigger)
> > >
> > > In Google Cloud Dataflow, the shuffle step uses the binary representation
> > > to compare keys, so the above pipeline would normally be converted to the
> > > following two stages:
> > > Read -> GBK Writer
> > > GBK Reader -> IdentityParDo
> > >
> > > Note that the GBK Writer and GBK Reader need to use a coder to encode and
> > > decode the value.
> > >
> > > When using the Fn API, those two stages expanded because of the Fn API
> > > crossings using a gRPC Write/Read pair:
> > > Read -> gRPC Write -> gRPC Read -> GBK Writer
> > > GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo
> > >
> > > In my naive prototype implementation, the coder was used to encode
> > > elements at the gRPC steps. This meant that the coder was
> > > encoding/decoding/encoding in the first stage and
> > > decoding/encoding/decoding in the second stage. This tripled the number of
> > > times the coder was being invoked per element. This additional use of the
> > > coder accounted for ~12% (80% of the 15%) of the extra execution time. This
> > > implementation is quite inefficient and would benefit from merging the gRPC
> > > Read + GBK Writer into one actor and also the GBK Reader + gRPC Write into
> > > another actor allowing for the creation of a fast path that can skip parts
> > > of the decode/encode cycle through the coder. By using a byte array view
> > > over the logical stream, one can minimize the number of byte array copies
> > > which plague

Re: Beam Fn API

2017-05-21 Thread Manu Zhang
Thanks Lukasz. The following two links were somehow incorrectly formatted
in your mail.

* How to process a bundle:
https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data:
https://s.apache.org/beam-fn-api-send-and-receive-data

By the way, is there a way to find them from the Beam website ?


On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
wrote:

> Now that I'm back from vacation and the 2.0.0 release is not taking all my
> time, I am focusing my attention on working on the Beam Portability
> framework, specifically the Fn API so that we can get Python and other
> language integrations to work with any runner.
>
> For newcomers, I would like to reshare the overview:
> https://s.apache.org/beam-fn-api
>
> And for those of you who have been following this thread and contributors
> focusing on Runner integration with Apache Beam:
> * How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
> * How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data
>
> If you want to dive deeper, you should look at:
> * Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto
> * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/fn-api/src/main/proto/beam_fn_api.proto
> * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/java/harness
> * Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/worker
>
> Next I'm planning on talking about Beam Fn State API and will need help
> from Runner contributors to talk about caching semantics and key spaces and
> whether the integrations mesh well with current Runner implementations. The
> State API is meant to support user state, side inputs, and re-iteration for
> large values produced by GroupByKey.
>
> On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:
>
> > Yes, I was using a Pipeline that was:
> > Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a batch
> > pipeline in the global window using the default trigger)
> >
> > In Google Cloud Dataflow, the shuffle step uses the binary representation
> > to compare keys, so the above pipeline would normally be converted to the
> > following two stages:
> > Read -> GBK Writer
> > GBK Reader -> IdentityParDo
> >
> > Note that the GBK Writer and GBK Reader need to use a coder to encode and
> > decode the value.
> >
> > When using the Fn API, those two stages expanded because of the Fn API
> > crossings using a gRPC Write/Read pair:
> > Read -> gRPC Write -> gRPC Read -> GBK Writer
> > GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo
> >
> > In my naive prototype implementation, the coder was used to encode
> > elements at the gRPC steps. This meant that the coder was
> > encoding/decoding/encoding in the first stage and
> > decoding/encoding/decoding in the second stage. This tripled the number of
> > times the coder was being invoked per element. This additional use of the
> > coder accounted for ~12% (80% of the 15%) of the extra execution time. This
> > implementation is quite inefficient and would benefit from merging the gRPC
> > Read + GBK Writer into one actor and also the GBK Reader + gRPC Write into
> > another actor allowing for the creation of a fast path that can skip parts
> > of the decode/encode cycle through the coder. By using a byte array view
> > over the logical stream, one can minimize the number of byte array copies
> > which plagued my naive implementation. This can be done by only parsing the
> > element boundaries out of the stream to produce those logical byte array
> > views. I have a very rough estimate that performing this optimization would
> > reduce the 12% overhead to somewhere between 4% and 6%.
> >
> > The remaining 3% (15% - 12%) overhead went to many parts of gRPC:
> > marshalling/unmarshalling protos
> > handling/managing the socket
> > flow control
> > ...
> >
> > Finally, I did try experiments with different buffer sizes (10KB, 100KB,
> > 1000KB), flow control (separate thread[1] vs same thread with phaser[2]),
> > and channel type [

Re: Beam Fn API

2017-05-18 Thread Lukasz Cwik
Now that I'm back from vacation and the 2.0.0 release is not taking all my
time, I am focusing my attention on working on the Beam Portability
framework, specifically the Fn API so that we can get Python and other
language integrations to work with any runner.

For newcomers, I would like to reshare the overview:
https://s.apache.org/beam-fn-api

And for those of you who have been following this thread and contributors
focusing on Runner integration with Apache Beam:
* How to process a bundle: https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data: https://s.apache.org/beam-fn-api-send-and-receive-data

If you want to dive deeper, you should look at:
* Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/runner-api/src/main/proto/beam_runner_api.proto
* Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/common/fn-api/src/main/proto/beam_fn_api.proto
* Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/java/harness
* Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/runners/worker

Next I'm planning on talking about Beam Fn State API and will need help
from Runner contributors to talk about caching semantics and key spaces and
whether the integrations mesh well with current Runner implementations. The
State API is meant to support user state, side inputs, and re-iteration for
large values produced by GroupByKey.

On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:

> Yes, I was using a Pipeline that was:
> Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a batch
> pipeline in the global window using the default trigger)
>
> In Google Cloud Dataflow, the shuffle step uses the binary representation
> to compare keys, so the above pipeline would normally be converted to the
> following two stages:
> Read -> GBK Writer
> GBK Reader -> IdentityParDo
>
> Note that the GBK Writer and GBK Reader need to use a coder to encode and
> decode the value.
>
> When using the Fn API, those two stages expanded because of the Fn API
> crossings using a gRPC Write/Read pair:
> Read -> gRPC Write -> gRPC Read -> GBK Writer
> GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo
>
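Counting coder calls along the two stage layouts above makes the cost of the extra gRPC crossings visible. The following is only a minimal sketch, assuming a toy UTF-8 string coder and in-memory lists standing in for the shuffle and the gRPC channel; CountingCoder, run_without_fn_api and run_with_naive_fn_api are hypothetical names for illustration and are not part of the Beam code base.

```python
from dataclasses import dataclass


@dataclass
class CountingCoder:
    """Toy UTF-8 coder (an assumption) that counts every encode/decode call."""
    calls: int = 0

    def encode(self, value: str) -> bytes:
        self.calls += 1
        return value.encode("utf-8")

    def decode(self, data: bytes) -> str:
        self.calls += 1
        return data.decode("utf-8")


def run_without_fn_api(elements, coder):
    # Stage 1: Read -> GBK Writer (one encode per element).
    shuffled = [coder.encode(e) for e in elements]
    # Stage 2: GBK Reader -> IdentityParDo (one decode per element).
    return [coder.decode(b) for b in shuffled]


def run_with_naive_fn_api(elements, coder):
    # Stage 1: Read -> gRPC Write (encode) -> gRPC Read (decode)
    #          -> GBK Writer (encode).
    shuffled = []
    for e in elements:
        wire = coder.encode(e)
        value = coder.decode(wire)
        shuffled.append(coder.encode(value))
    # Stage 2: GBK Reader (decode) -> gRPC Write (encode)
    #          -> gRPC Read (decode) -> IdentityParDo.
    out = []
    for b in shuffled:
        value = coder.decode(b)
        wire = coder.encode(value)
        out.append(coder.decode(wire))
    return out


if __name__ == "__main__":
    data = ["k1", "k2", "k3"]
    plain, naive = CountingCoder(), CountingCoder()
    run_without_fn_api(data, plain)
    run_with_naive_fn_api(data, naive)
    print(plain.calls // len(data), naive.calls // len(data))
```

In this toy model each element costs two coder calls without the Fn API crossings and six with them, which is the same factor of three described in the quoted message.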
> In my naive prototype implementation, the coder was used to encode
> elements at the gRPC steps. This meant that the coder was
> encoding/decoding/encoding in the first stage and
> decoding/encoding/decoding in the second stage. This tripled the number of
> times the coder was being invoked per element. This additional use of the
> coder accounted for ~12% (80% of the 15%) of the extra execution time. This
> implementation is quite inefficient and would benefit from merging the gRPC
> Read + GBK Writer into one actor and also the GBK Reader + gRPC Write into
> another actor allowing for the creation of a fast path that can skip parts
> of the decode/encode cycle through the coder. By using a byte array view
> over the logical stream, one can minimize the number of byte array copies
> which plagued my naive implementation. This can be done by only parsing the
> element boundaries out of the stream to produce those logical byte array
> views. I have a very rough estimate that performing this optimization would
> reduce the 12% overhead to somewhere between 4% and 6%.
>
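The byte array view idea can be illustrated with a short sketch. It assumes, purely for illustration, that each element in the logical stream is preceded by a 4-byte big-endian length; the thread does not state the actual framing, and frame and element_views are hypothetical helper names. Only the element boundaries are parsed, and each element is handed on as a zero-copy view rather than being decoded and re-encoded.

```python
import struct


def frame(elements):
    """Concatenate elements, each prefixed with a 4-byte length (assumed framing)."""
    return b"".join(struct.pack(">I", len(e)) + e for e in elements)


def element_views(stream: bytes):
    """Yield one memoryview per element without copying the payload bytes."""
    buf = memoryview(stream)
    offset = 0
    while offset < len(buf):
        # Parse only the boundary (the length prefix), not the element itself.
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        yield buf[offset:offset + length]
        offset += length


if __name__ == "__main__":
    stream = frame([b"key-1", b"some value", b""])
    for view in element_views(stream):
        # A consumer such as a shuffle writer could forward the view as-is.
        print(len(view), bytes(view))
```

Slicing a memoryview does not copy the underlying buffer, so a stage that only forwards encoded elements never has to run the coder on them.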
> The remaining 3% (15% - 12%) overhead went to many parts of gRPC:
> marshalling/unmarshalling protos
> handling/managing the socket
> flow control
> ...
>
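The percentages quoted above fit together as simple arithmetic. The sketch below recomputes them with exact fractions; overhead_breakdown is a hypothetical helper, and the 7% to 9% total after optimization is an inference from the quoted 4% to 6% estimate plus the unchanged 3%, not a figure given in the thread.

```python
from fractions import Fraction


def overhead_breakdown(extra=Fraction(15, 100), coder_share=Fraction(80, 100)):
    """Split the extra execution time into coder overhead and everything else."""
    coder = extra * coder_share   # 80% of the 15%
    other = extra - coder         # remainder attributed to gRPC internals
    return coder, other


if __name__ == "__main__":
    coder, other = overhead_breakdown()
    print(f"coder: {float(coder):.0%}, gRPC and the rest: {float(other):.0%}")
    # Inferred totals if the coder overhead drops to the estimated 4% to 6%.
    for optimized in (Fraction(4, 100), Fraction(6, 100)):
        print(f"estimated total: {float(optimized + other):.0%}")
```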
> Finally, I did try experiments with different buffer sizes (10KB, 100KB,
> 1000KB), flow control (separate thread[1] vs same thread with phaser[2]),
> and channel type [3] (NIO, epoll, domain socket), but coder overhead easily
> dominated the differences in these other experiments.
>
> Further analysis would need to be done to more accurately distill this
> down.
>
> 1: https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/stream/BufferingStreamObserver.java
> 2: https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/stream/DirectStreamObserver.java
> 3: https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/channel/ManagedChannelFactory.java
>
>
> On Tue, Jan 24, 2017 at 8:04 AM, Ismaël Mejía  wrote:
>
>> Awesome job Lukasz, Excellent, I have to confess the first time I heard about
>> the Fn API idea I was a bit incredulous, but you are making it real, amazing!
>>
>> Just one question from your document, you said that 80% of the extra (15%) time
>> goes into encoding and decoding the data for your test case, ca

Re: Beam Fn API

2017-01-24 Thread Lukasz Cwik
 Java 8 functional interface extensions
> > > >
> > > >
> > > > On Fri, Jan 20, 2017 at 1:26 PM, Kenneth Knowles wrote:
> > > >
> > > > > This is awesome! Any chance you could roadmap the PR for us with some
> > > > > links into the most interesting bits?
> > > > >
> > > > > On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <
> > > > > rober...@google.com.invalid> wrote:
> > > > >
> > > > > > Also, note that we can still support the "simple" case. For example,
> > > > > > if the user supplies us with a jar file (as they do now) a runner
> > > > > > could launch it as a subprocess and communicate with it via this
> > > > > > same Fn API or install it in a fixed container itself--the user
> > > > > > doesn't *need* to know about docker or manually manage containers (and
> > > > > > indeed the Fn API could be used in-process, cross-process,
> > > > > > cross-container, and even cross-machine).
> > > > > >
> > > > > > However docker provides a nice cross-language way of specifying the
> > > > > > environment including all dependencies (especially for languages like
> > > > > > Python or C where the equivalent of a cross-platform, self-contained
> > > > > > jar isn't as easy to produce) and is strictly more powerful and
> > > > > > flexible (specifically it isolates the runtime environment and one can
> > > > > > even use it for local testing).
> > > > > >
> > > > > > Slicing a worker up like this without sacrificing performance is an
> > > > > > ambitious goal, but essential to the story of being able to mix and
> > > > > > match runners and SDKs arbitrarily, and I think this is a great start.
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik wrote:
> > > > > > > You're correct, a docker container is created that contains the execution
> > > > > > > environment the user wants or the user re-uses an existing one (allowing
> > > > > > > for a user to embed all their code/dependencies or use a container that can
> > > > > > > deploy code/dependencies on demand).
> > > > > > > A user creates a pipeline saying which docker container they want to use
> > > > > > > (this starts to allow for multiple container definitions within a single
> > > > > > > pipeline to support multiple languages, versioning, ...).
> > > > > > > A runner would then be responsible for launching one or more of these
> > > > > > > containers in a cluster manager of their choice (scaling up or down the
> > > > > > > number of instances depending on demand/load/...).
> > > > > > > A runner then interacts with the docker containers over the gRPC service
> > > > > > > definitions to delegate processing to.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Luke,
> > > > > > >>
> > > > > > >> that's really great and very promising !
> > > > > > >>
> > > > > > >> It's really ambitious but I like the idea. Just to clarify: the purpose of
> > > > > > >> using gRPC is once the docker container is running, then we can "interact"
> > > > > > >> with the container to spread and delegate processing to the docker
> > > > > > >> container, correct ?
> > > > > > >> The users/devops have to set up the docker containers as prerequisite.
> > > > > > >> Then, the "location" of the containers (kind of container registry) is set
> > > > > > >> via the pipeline options and used by gRPC ?
> > > > > > >>
> > > > > > >> Thanks Luke !
> > > > > > >>
> > > > > > >> Regards
> > > > > > >> JB
> > > > > > >>
> > > > > > >>
> > > > > > >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> > > > > > >>
> > > > > > >>> I have been prototyping several components towards the Beam technical
> > > > > > >>> vision of being able to execute an arbitrary language using an arbitrary
> > > > > > >>> runner.
> > > > > > >>>
> > > > > > >>> I would like to share this overview [1] of what I have been working
> > > > > > >>> towards. I also share this PR [2] with a proposed API, service definitions
> > > > > > >>> and partial implementation.
> > > > > > >>>
> > > > > > >>> 1: https://s.apache.org/beam-fn-api
> > > > > > >>> 2: https://github.com/apache/beam/pull/1801
> > > > > > >>>
> > > > > > >>> Please comment on the overview within this thread, and any specific code
> > > > > > >>> comments on the PR directly.
> > > > > > >>>
> > > > > > >>> Luke
> > > > > > >>>
> > > > > > >>>
> > > > > > >> --
> > > > > > >> Jean-Baptiste Onofré
> > > > > > >> jbono...@apache.org
> > > > > > >> http://blog.nanthrax.net
> > > > > > >> Talend - http://www.talend.com
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Beam Fn API

2017-01-24 Thread Ismaël Mejía
; > > support sources and gRPC endpoints.
> > >
> > > Unless your really interested in how domain sockets, epoll, nio channel
> > > factories or how stream readiness callbacks work in gRPC, I would avoid
> > the
> > > packages org.apache.beam.fn.harness.channel and
> > > org.apache.beam.fn.harness.stream. Similarly I would avoid
> > > org.apache.beam.fn.harness.fn and org.apache.beam.fn.harness.fake as
> > they
> > > don't add anything meaningful to the api.
> > >
> > > Code package descriptions:
> > >
> > > org.apache.beam.fn.harness.FnHarness: main entry point
> > > org.apache.beam.fn.harness.control: Control service client and
> > individual
> > > request handlers
> > > org.apache.beam.fn.harness.data: Data service client and logical
> stream
> > > multiplexing
> > > org.apache.beam.runners.core: Additional runners akin to the DoFnRunner
> > > found in runners-core to support sources and gRPC endpoints
> > > org.apache.beam.fn.harness.logging: Logging client implementation and
> > JUL
> > > logging handler adapter
> > > org.apache.beam.fn.harness.channel: gRPC channel management
> > > org.apache.beam.fn.harness.stream: gRPC stream management
> > > org.apache.beam.fn.harness.fn: Java 8 functional interface extensions
> > >
> > >
> > > On Fri, Jan 20, 2017 at 1:26 PM, Kenneth Knowles wrote:
> > >
> > > > This is awesome! Any chance you could roadmap the PR for us with some
> > > links
> > > > into the most interesting bits?
> > > >
> > > > On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <
> > > > rober...@google.com.invalid> wrote:
> > > >
> > > > > Also, note that we can still support the "simple" case. For
> example,
> > > > > if the user supplies us with a jar file (as they do now) a runner
> > > > > could launch it as a subprocesses and communicate with it via this
> > > > > same Fn API or install it in a fixed container itself--the user
> > > > > doesn't *need* to know about docker or manually manage containers
> > (and
> > > > > indeed the Fn API could be used in-process, cross-process,
> > > > > cross-container, and even cross-machine).
> > > > >
> > > > > However docker provides a nice cross-language way of specifying the
> > > > > environment including all dependencies (especially for languages
> like
> > > > > Python or C where the equivalent of a cross-platform,
> self-contained
> > > > > jar isn't as easy to produce) and is strictly more powerful and
> > > > > flexible (specifically it isolates the runtime environment and one
> > can
> > > > > even use it for local testing).
> > > > >
> > > > > Slicing a worker up like this without sacrificing performance is an
> > > > > ambitious goal, but essential to the story of being able to mix and
> > > > > match runners and SDKs arbitrarily, and I think this is a great
> > start.
> > > > >
> > > > >
> > > > > > On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik wrote:
> > > > > > Your correct, a docker container is created that contains the
> > > execution
> > > > > > environment the user wants or the user re-uses an existing one
> > > > (allowing
> > > > > > for a user to embed all their code/dependencies or use a
> container
> > > that
> > > > > can
> > > > > > deploy code/dependencies on demand).
> > > > > > A user creates a pipeline saying which docker container they want
> > to
> > > > use
> > > > > > (this starts to allow for multiple container definitions within a
> > > > single
> > > > > > pipeline to support multiple languages, versioning, ...).
> > > > > > A runner would then be responsible for launching one or more of
> > these
> > > > > > containers in a cluster manager of their choice (scaling up or
> down
> > > the
> > > > > > number of instances depending on demand/load/...).
> > > > > > A runner then interacts with the docker containers over the gRPC
> > > > service
> > > > > > definitions to delegate processing to.
> > > > >

Re: Beam Fn API

2017-01-23 Thread Lukasz Cwik
> > found in runners-core to support sources and gRPC endpoints
> > org.apache.beam.fn.harness.logging: Logging client implementation and
> JUL
> > logging handler adapter
> > org.apache.beam.fn.harness.channel: gRPC channel management
> > org.apache.beam.fn.harness.stream: gRPC stream management
> > org.apache.beam.fn.harness.fn: Java 8 functional interface extensions
> >
> >
> > On Fri, Jan 20, 2017 at 1:26 PM, Kenneth Knowles wrote:
> >
> > > This is awesome! Any chance you could roadmap the PR for us with some
> > links
> > > into the most interesting bits?
> > >
> > > On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <
> > > rober...@google.com.invalid> wrote:
> > >
> > > > Also, note that we can still support the "simple" case. For example,
> > > > if the user supplies us with a jar file (as they do now) a runner
> > > > could launch it as a subprocesses and communicate with it via this
> > > > same Fn API or install it in a fixed container itself--the user
> > > > doesn't *need* to know about docker or manually manage containers
> (and
> > > > indeed the Fn API could be used in-process, cross-process,
> > > > cross-container, and even cross-machine).
> > > >
> > > > However docker provides a nice cross-language way of specifying the
> > > > environment including all dependencies (especially for languages like
> > > > Python or C where the equivalent of a cross-platform, self-contained
> > > > jar isn't as easy to produce) and is strictly more powerful and
> > > > flexible (specifically it isolates the runtime environment and one
> can
> > > > even use it for local testing).
> > > >
> > > > Slicing a worker up like this without sacrificing performance is an
> > > > ambitious goal, but essential to the story of being able to mix and
> > > > match runners and SDKs arbitrarily, and I think this is a great
> start.
> > > >
> > > >
> > > > On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik wrote:
> > > > > Your correct, a docker container is created that contains the
> > execution
> > > > > environment the user wants or the user re-uses an existing one
> > > (allowing
> > > > > for a user to embed all their code/dependencies or use a container
> > that
> > > > can
> > > > > deploy code/dependencies on demand).
> > > > > A user creates a pipeline saying which docker container they want
> to
> > > use
> > > > > (this starts to allow for multiple container definitions within a
> > > single
> > > > > pipeline to support multiple languages, versioning, ...).
> > > > > A runner would then be responsible for launching one or more of
> these
> > > > > containers in a cluster manager of their choice (scaling up or down
> > the
> > > > > number of instances depending on demand/load/...).
> > > > > A runner then interacts with the docker containers over the gRPC
> > > service
> > > > > definitions to delegate processing to.
> > > > >
> > > > >
> > > > > On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:
> > > > >
> > > > >> Hi Luke,
> > > > >>
> > > > >> that's really great and very promising !
> > > > >>
> > > > >> It's really ambitious but I like the idea. Just to clarify: the
> > > purpose
> > > > of
> > > > >> using gRPC is once the docker container is running, then we can
> > > > "interact"
> > > > >> with the container to spread and delegate processing to the docker
> > > > >> container, correct ?
> > > > >> The users/devops have to setup the docker containers as
> > prerequisite.
> > > > >> Then, the "location" of the containers (kind of container
> registry)
> > is
> > > > set
> > > > >> via the pipeline options and used by gRPC ?
> > > > >>
> > > > >> Thanks Luke !
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >>
> > > > >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> > > > >>
> > > > >>> I have been prototyping several components towards the Beam
> > technical
> > > > >>> vision of being able to execute an arbitrary language using an
> > > > arbitrary
> > > > >>> runner.
> > > > >>>
> > > > >>> I would like to share this overview [1] of what I have been
> working
> > > > >>> towards. I also share this PR [2] with a proposed API, service
> > > > definitions
> > > > >>> and partial implementation.
> > > > >>>
> > > > >>> 1: https://s.apache.org/beam-fn-api
> > > > >>> 2: https://github.com/apache/beam/pull/1801
> > > > >>>
> > > > >>> Please comment on the overview within this thread, and any
> specific
> > > > code
> > > > >>> comments on the PR directly.
> > > > >>>
> > > > >>> Luke
> > > > >>>
> > > > >>>
> > > > >> --
> > > > >> Jean-Baptiste Onofré
> > > > >> jbono...@apache.org
> > > > >> http://blog.nanthrax.net
> > > > >> Talend - http://www.talend.com
> > > > >>
> > > >
> > >
> >
>


Re: Beam Fn API

2017-01-21 Thread Amit Sela
> > > > Slicing a worker up like this without sacrificing performance is an
> > > ambitious goal, but essential to the story of being able to mix and
> > > match runners and SDKs arbitrarily, and I think this is a great start.
> > >
> > >
> > > On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik wrote:
> > > > Your correct, a docker container is created that contains the
> execution
> > > > environment the user wants or the user re-uses an existing one
> > (allowing
> > > > for a user to embed all their code/dependencies or use a container
> that
> > > can
> > > > deploy code/dependencies on demand).
> > > > A user creates a pipeline saying which docker container they want to
> > use
> > > > (this starts to allow for multiple container definitions within a
> > single
> > > > pipeline to support multiple languages, versioning, ...).
> > > > A runner would then be responsible for launching one or more of these
> > > > containers in a cluster manager of their choice (scaling up or down
> the
> > > > number of instances depending on demand/load/...).
> > > > A runner then interacts with the docker containers over the gRPC
> > service
> > > > definitions to delegate processing to.
> > > >
> > > >
> > > > On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > >> Hi Luke,
> > > >>
> > > >> that's really great and very promising !
> > > >>
> > > >> It's really ambitious but I like the idea. Just to clarify: the
> > purpose
> > > of
> > > >> using gRPC is once the docker container is running, then we can
> > > "interact"
> > > >> with the container to spread and delegate processing to the docker
> > > >> container, correct ?
> > > >> The users/devops have to setup the docker containers as
> prerequisite.
> > > >> Then, the "location" of the containers (kind of container registry)
> is
> > > set
> > > >> via the pipeline options and used by gRPC ?
> > > >>
> > > >> Thanks Luke !
> > > >>
> > > >> Regards
> > > >> JB
> > > >>
> > > >>
> > > >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> > > >>
> > > >>> I have been prototyping several components towards the Beam
> technical
> > > >>> vision of being able to execute an arbitrary language using an
> > > arbitrary
> > > >>> runner.
> > > >>>
> > > >>> I would like to share this overview [1] of what I have been working
> > > >>> towards. I also share this PR [2] with a proposed API, service
> > > definitions
> > > >>> and partial implementation.
> > > >>>
> > > >>> 1: https://s.apache.org/beam-fn-api
> > > >>> 2: https://github.com/apache/beam/pull/1801
> > > >>>
> > > >>> Please comment on the overview within this thread, and any specific
> > > code
> > > >>> comments on the PR directly.
> > > >>>
> > > >>> Luke
> > > >>>
> > > >>>
> > > >> --
> > > >> Jean-Baptiste Onofré
> > > >> jbono...@apache.org
> > > >> http://blog.nanthrax.net
> > > >> Talend - http://www.talend.com
> > > >>
> > >
> >
>


Re: Beam Fn API

2017-01-20 Thread Lukasz Cwik
> > >> Hi Luke,
> > >>
> > >> that's really great and very promising !
> > >>
> > >> It's really ambitious but I like the idea. Just to clarify: the
> purpose
> > of
> > >> using gRPC is once the docker container is running, then we can
> > "interact"
> > >> with the container to spread and delegate processing to the docker
> > >> container, correct ?
> > >> The users/devops have to setup the docker containers as prerequisite.
> > >> Then, the "location" of the containers (kind of container registry) is
> > set
> > >> via the pipeline options and used by gRPC ?
> > >>
> > >> Thanks Luke !
> > >>
> > >> Regards
> > >> JB
> > >>
> > >>
> > >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> > >>
> > >>> I have been prototyping several components towards the Beam technical
> > >>> vision of being able to execute an arbitrary language using an
> > arbitrary
> > >>> runner.
> > >>>
> > >>> I would like to share this overview [1] of what I have been working
> > >>> towards. I also share this PR [2] with a proposed API, service
> > definitions
> > >>> and partial implementation.
> > >>>
> > >>> 1: https://s.apache.org/beam-fn-api
> > >>> 2: https://github.com/apache/beam/pull/1801
> > >>>
> > >>> Please comment on the overview within this thread, and any specific
> > code
> > >>> comments on the PR directly.
> > >>>
> > >>> Luke
> > >>>
> > >>>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> >
>


Re: Beam Fn API

2017-01-20 Thread Kenneth Knowles
This is awesome! Any chance you could roadmap the PR for us with some links
into the most interesting bits?

On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> Also, note that we can still support the "simple" case. For example,
> if the user supplies us with a jar file (as they do now) a runner
> could launch it as a subprocesses and communicate with it via this
> same Fn API or install it in a fixed container itself--the user
> doesn't *need* to know about docker or manually manage containers (and
> indeed the Fn API could be used in-process, cross-process,
> cross-container, and even cross-machine).
>
> However docker provides a nice cross-language way of specifying the
> environment including all dependencies (especially for languages like
> Python or C where the equivalent of a cross-platform, self-contained
> jar isn't as easy to produce) and is strictly more powerful and
> flexible (specifically it isolates the runtime environment and one can
> even use it for local testing).
>
> Slicing a worker up like this without sacrificing performance is an
> ambitious goal, but essential to the story of being able to mix and
> match runners and SDKs arbitrarily, and I think this is a great start.
>
>
> On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik 
> wrote:
> > Your correct, a docker container is created that contains the execution
> > environment the user wants or the user re-uses an existing one (allowing
> > for a user to embed all their code/dependencies or use a container that
> can
> > deploy code/dependencies on demand).
> > A user creates a pipeline saying which docker container they want to use
> > (this starts to allow for multiple container definitions within a single
> > pipeline to support multiple languages, versioning, ...).
> > A runner would then be responsible for launching one or more of these
> > containers in a cluster manager of their choice (scaling up or down the
> > number of instances depending on demand/load/...).
> > A runner then interacts with the docker containers over the gRPC service
> > definitions to delegate processing to.
> >
> >
> > On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> >> Hi Luke,
> >>
> >> that's really great and very promising !
> >>
> >> It's really ambitious but I like the idea. Just to clarify: the purpose
> of
> >> using gRPC is once the docker container is running, then we can
> "interact"
> >> with the container to spread and delegate processing to the docker
> >> container, correct ?
> >> The users/devops have to setup the docker containers as prerequisite.
> >> Then, the "location" of the containers (kind of container registry) is
> set
> >> via the pipeline options and used by gRPC ?
> >>
> >> Thanks Luke !
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> >>
> >>> I have been prototyping several components towards the Beam technical
> >>> vision of being able to execute an arbitrary language using an
> arbitrary
> >>> runner.
> >>>
> >>> I would like to share this overview [1] of what I have been working
> >>> towards. I also share this PR [2] with a proposed API, service
> definitions
> >>> and partial implementation.
> >>>
> >>> 1: https://s.apache.org/beam-fn-api
> >>> 2: https://github.com/apache/beam/pull/1801
> >>>
> >>> Please comment on the overview within this thread, and any specific
> code
> >>> comments on the PR directly.
> >>>
> >>> Luke
> >>>
> >>>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
>


Re: Beam Fn API

2017-01-20 Thread Robert Bradshaw
Also, note that we can still support the "simple" case. For example,
if the user supplies us with a jar file (as they do now) a runner
could launch it as a subprocess and communicate with it via this
same Fn API or install it in a fixed container itself--the user
doesn't *need* to know about docker or manually manage containers (and
indeed the Fn API could be used in-process, cross-process,
cross-container, and even cross-machine).
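
To make the "same API, different deployment" point concrete, here is a
minimal sketch. It is not the Fn API itself: FnClient, InProcessClient,
SubprocessClient and run_bundle are hypothetical names, and a line-based
pipe stands in for the gRPC services. The only point is that the runner
side codes against one interface while the user code lives in-process or
in a child process.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import List


class FnClient(ABC):
    """Hypothetical transport-agnostic interface a runner would code against."""

    @abstractmethod
    def process(self, element: str) -> str:
        """Send one element to the user code and return its output."""

    def close(self) -> None:
        """Release any resources held by the transport (no-op by default)."""


class InProcessClient(FnClient):
    """User code runs in the same process as the runner."""

    def process(self, element: str) -> str:
        return element.upper()


# Source of the child process. It applies the same user function
# (upper-casing) to every line it reads, flushing after each reply.
_CHILD_SOURCE = """
import sys
while True:
    line = sys.stdin.readline()
    if not line:
        break
    sys.stdout.write(line.upper())
    sys.stdout.flush()
"""


class SubprocessClient(FnClient):
    """User code runs in a child process; a pipe stands in for gRPC."""

    def __init__(self) -> None:
        self._proc = subprocess.Popen(
            [sys.executable, "-c", _CHILD_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def process(self, element: str) -> str:
        # Elements must not contain newlines in this toy framing.
        self._proc.stdin.write(element + "\n")
        self._proc.stdin.flush()
        return self._proc.stdout.readline().rstrip("\n")

    def close(self) -> None:
        self._proc.stdin.close()  # EOF tells the child to exit
        self._proc.wait()
        self._proc.stdout.close()


def run_bundle(client: FnClient, elements: List[str]) -> List[str]:
    """Runner-side logic: identical regardless of where the user code runs."""
    try:
        return [client.process(e) for e in elements]
    finally:
        client.close()
```

Calling run_bundle with either client gives the same results for the same
elements; a container-backed client would be one more implementation of the
same interface.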

However docker provides a nice cross-language way of specifying the
environment including all dependencies (especially for languages like
Python or C where the equivalent of a cross-platform, self-contained
jar isn't as easy to produce) and is strictly more powerful and
flexible (specifically it isolates the runtime environment and one can
even use it for local testing).

Slicing a worker up like this without sacrificing performance is an
ambitious goal, but essential to the story of being able to mix and
match runners and SDKs arbitrarily, and I think this is a great start.


On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik  wrote:
> Your correct, a docker container is created that contains the execution
> environment the user wants or the user re-uses an existing one (allowing
> for a user to embed all their code/dependencies or use a container that can
> deploy code/dependencies on demand).
> A user creates a pipeline saying which docker container they want to use
> (this starts to allow for multiple container definitions within a single
> pipeline to support multiple languages, versioning, ...).
> A runner would then be responsible for launching one or more of these
> containers in a cluster manager of their choice (scaling up or down the
> number of instances depending on demand/load/...).
> A runner then interacts with the docker containers over the gRPC service
> definitions to delegate processing to.
>
>
> On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Luke,
>>
>> that's really great and very promising !
>>
>> It's really ambitious but I like the idea. Just to clarify: the purpose of
>> using gRPC is once the docker container is running, then we can "interact"
>> with the container to spread and delegate processing to the docker
>> container, correct ?
>> The users/devops have to setup the docker containers as prerequisite.
>> Then, the "location" of the containers (kind of container registry) is set
>> via the pipeline options and used by gRPC ?
>>
>> Thanks Luke !
>>
>> Regards
>> JB
>>
>>
>> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
>>
>>> I have been prototyping several components towards the Beam technical
>>> vision of being able to execute an arbitrary language using an arbitrary
>>> runner.
>>>
>>> I would like to share this overview [1] of what I have been working
>>> towards. I also share this PR [2] with a proposed API, service definitions
>>> and partial implementation.
>>>
>>> 1: https://s.apache.org/beam-fn-api
>>> 2: https://github.com/apache/beam/pull/1801
>>>
>>> Please comment on the overview within this thread, and any specific code
>>> comments on the PR directly.
>>>
>>> Luke
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>


Re: Beam Fn API

2017-01-20 Thread Lukasz Cwik
You're correct: a Docker container is created that contains the execution
environment the user wants, or the user re-uses an existing one (allowing
a user to embed all their code/dependencies or use a container that can
deploy code/dependencies on demand).
A user creates a pipeline saying which Docker container they want to use
(this starts to allow for multiple container definitions within a single
pipeline to support multiple languages, versioning, ...).
A runner would then be responsible for launching one or more of these
containers in a cluster manager of its choice (scaling the number of
instances up or down depending on demand/load/...).
A runner then interacts with the Docker containers over the gRPC service
definitions to delegate processing to them.
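
As a rough illustration of that flow (a sketch under assumptions, not the
proposed API): Environment, Container and Runner below are hypothetical
names, the image strings are made up, and a plain method call stands in for
the gRPC services. It only shows a runner keeping a pool of containers per
environment, scaling the pool, and delegating elements to it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Environment:
    """Hypothetical description of the container image a step runs in."""
    image: str


@dataclass
class Container:
    """Stand-in for a launched container; process() plays the gRPC call."""
    environment: Environment
    handled: List[str] = field(default_factory=list)

    def process(self, element: str) -> str:
        self.handled.append(element)
        return f"{self.environment.image}:{element}"


class Runner:
    """Keeps one pool of containers per environment used by the pipeline."""

    def __init__(self) -> None:
        self.pools: Dict[Environment, List[Container]] = {}

    def scale(self, env: Environment, instances: int) -> None:
        """Grow or shrink the pool for env to exactly `instances` containers."""
        pool = self.pools.setdefault(env, [])
        while len(pool) < instances:
            pool.append(Container(env))
        del pool[instances:]

    def delegate(self, env: Environment, elements: List[str]) -> List[str]:
        """Spread elements round-robin over the containers for env."""
        pool = self.pools[env]
        return [pool[i % len(pool)].process(e) for i, e in enumerate(elements)]


# A single pipeline may reference several environments, one per language.
runner = Runner()
python_env = Environment("example/python-sdk")  # made-up image name
java_env = Environment("example/java-sdk")      # made-up image name
runner.scale(python_env, 2)
runner.scale(java_env, 1)
python_out = runner.delegate(python_env, ["a", "b", "c"])
java_out = runner.delegate(java_env, ["x"])
```

How a real runner decides when to scale, and which cluster manager it
uses, is left entirely to the runner.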


On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré 
wrote:

> Hi Luke,
>
> that's really great and very promising !
>
> It's really ambitious but I like the idea. Just to clarify: the purpose of
> using gRPC is once the docker container is running, then we can "interact"
> with the container to spread and delegate processing to the docker
> container, correct ?
> The users/devops have to setup the docker containers as prerequisite.
> Then, the "location" of the containers (kind of container registry) is set
> via the pipeline options and used by gRPC ?
>
> Thanks Luke !
>
> Regards
> JB
>
>
> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
>
>> I have been prototyping several components towards the Beam technical
>> vision of being able to execute an arbitrary language using an arbitrary
>> runner.
>>
>> I would like to share this overview [1] of what I have been working
>> towards. I also share this PR [2] with a proposed API, service definitions
>> and partial implementation.
>>
>> 1: https://s.apache.org/beam-fn-api
>> 2: https://github.com/apache/beam/pull/1801
>>
>> Please comment on the overview within this thread, and any specific code
>> comments on the PR directly.
>>
>> Luke
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Beam Fn API

2017-01-20 Thread Jean-Baptiste Onofré

Hi Luke,

that's really great and very promising !

It's really ambitious but I like the idea. Just to clarify: the purpose
of using gRPC is that, once the docker container is running, we can
"interact" with the container to spread and delegate processing to the
docker container, correct?
The users/devops have to set up the docker containers as a prerequisite.
Then, the "location" of the containers (kind of container registry) is
set via the pipeline options and used by gRPC?


Thanks Luke !

Regards
JB

On 01/19/2017 03:56 PM, Lukasz Cwik wrote:

I have been prototyping several components towards the Beam technical
vision of being able to execute an arbitrary language using an arbitrary
runner.

I would like to share this overview [1] of what I have been working
towards. I also share this PR [2] with a proposed API, service definitions
and partial implementation.

1: https://s.apache.org/beam-fn-api
2: https://github.com/apache/beam/pull/1801

Please comment on the overview within this thread, and any specific code
comments on the PR directly.

Luke



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam Fn API

2017-01-19 Thread Dan Halperin
"relatively little extra work" once the base APIs are implemented.

On Thu, Jan 19, 2017 at 11:26 PM, Dan Halperin  wrote:

> This is an extremely ambitious part of the technical vision. I think it's
> a lot of work, but well worth it -- Python-SDK-on-Java-runner with
> relatively extra work? I don't care what the overhead is, this is making
> the impossible possible.
>
> On Thu, Jan 19, 2017 at 3:56 PM, Lukasz Cwik 
> wrote:
>
>> I have been prototyping several components towards the Beam technical
>> vision of being able to execute an arbitrary language using an arbitrary
>> runner.
>>
>> I would like to share this overview [1] of what I have been working
>> towards. I also share this PR [2] with a proposed API, service definitions
>> and partial implementation.
>>
>> 1: https://s.apache.org/beam-fn-api
>> 2: https://github.com/apache/beam/pull/1801
>>
>> Please comment on the overview within this thread, and any specific code
>> comments on the PR directly.
>>
>> Luke
>>
>
>


Re: Beam Fn API

2017-01-19 Thread Dan Halperin
This is an extremely ambitious part of the technical vision. I think it's a
lot of work, but well worth it -- Python-SDK-on-Java-runner with relatively
extra work? I don't care what the overhead is, this is making the
impossible possible.

On Thu, Jan 19, 2017 at 3:56 PM, Lukasz Cwik 
wrote:

> I have been prototyping several components towards the Beam technical
> vision of being able to execute an arbitrary language using an arbitrary
> runner.
>
> I would like to share this overview [1] of what I have been working
> towards. I also share this PR [2] with a proposed API, service definitions
> and partial implementation.
>
> 1: https://s.apache.org/beam-fn-api
> 2: https://github.com/apache/beam/pull/1801
>
> Please comment on the overview within this thread, and any specific code
> comments on the PR directly.
>
> Luke
>


Beam Fn API

2017-01-19 Thread Lukasz Cwik
I have been prototyping several components towards the Beam technical
vision of being able to execute an arbitrary language using an arbitrary
runner.

I would like to share this overview [1] of what I have been working
towards. I also share this PR [2] with a proposed API, service definitions
and partial implementation.

1: https://s.apache.org/beam-fn-api
2: https://github.com/apache/beam/pull/1801

Please comment on the overview within this thread, and any specific code
comments on the PR directly.

Luke