Re: Documentation for Cross-Language Transforms

2020-11-20 Thread Chamikara Jayalath
PR went in and documentation is live now:
https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines

Thanks,
Cham

On Wed, Nov 18, 2020 at 10:05 AM Chamikara Jayalath wrote:

> This was mentioned in a separate thread but thought it would be good to
> highlight here in case more folks wish to take a look before the PR is
> merged.
>
> PR is https://github.com/apache/beam/pull/13317
>
> Thanks,
> Cham
>
> On Thu, Nov 12, 2020 at 1:17 PM Chamikara Jayalath wrote:
>
>> Seems like a good place to promote this PR that adds documentation for
>> cross-language transforms :)
>> https://github.com/apache/beam/pull/13317
>>
>> This covers the following for both Java and Python SDKs.
>> * Creating new cross-language transforms - primary audience will be
>> transform authors who wish to make existing Java/Python transforms
>> available to other SDKs (a minimal sketch follows this list).
>> * Using cross-language transforms - primary audience will be pipeline
>> authors who wish to use existing cross-language transforms with or without
>> language-specific wrappers.
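
For illustration, here is a rough sketch of the first item: making an existing
Java transform available to other SDKs by registering it with an expansion
service. ExternalTransformRegistrar and ExternalTransformBuilder are the Java
SDK hooks for this, though exact package names and signatures vary across Beam
versions; the URN, the configuration class, and the use of Create.of() as the
underlying transform are illustrative assumptions, not code from the PR.

    import com.google.auto.service.AutoService;
    import java.util.Collections;
    import java.util.Map;
    import org.apache.beam.sdk.expansion.ExternalTransformRegistrar;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.ExternalTransformBuilder;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PBegin;
    import org.apache.beam.sdk.values.PCollection;

    // Hypothetical configuration POJO; the expansion service populates it
    // from the parameters sent by the remote (e.g. Python) pipeline.
    class MyTransformConfiguration {
      private String prefix;
      public void setPrefix(String prefix) { this.prefix = prefix; }
      public String getPrefix() { return prefix; }
    }

    // AutoService makes the registrar discoverable by the expansion service.
    @AutoService(ExternalTransformRegistrar.class)
    public class MyTransformRegistrar implements ExternalTransformRegistrar {
      // Hypothetical URN under which other SDKs request this transform.
      static final String URN = "beam:external:java:my_transform:v1";

      @Override
      public Map<String, Class<? extends ExternalTransformBuilder<?, ?, ?>>> knownBuilders() {
        return Collections.<String, Class<? extends ExternalTransformBuilder<?, ?, ?>>>singletonMap(
            URN, Builder.class);
      }

      public static class Builder
          implements ExternalTransformBuilder<MyTransformConfiguration, PBegin, PCollection<String>> {
        @Override
        public PTransform<PBegin, PCollection<String>> buildExternal(
            MyTransformConfiguration config) {
          // Stand-in for the real Java transform being exposed.
          return Create.of(config.getPrefix());
        }
      }
    }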
>>
>> This also introduces the term "Multi-Language Pipelines" to denote
>> pipelines that use cross-language transforms (and hence utilize more than
>> one SDK language).
>>
>> Thanks +Dave Wrede  for working on this.
>>
>> - Cham
>>
>> On Thu, Nov 12, 2020 at 4:56 AM Ismaël Mejía  wrote:
>>
>>> I was not aware of these examples Brian, thanks for sharing. Maybe we
>>> should
>>> make these examples more discoverable on the website or as part of Beam's
>>> programming guide.
>>>
>>> It would be nice to have an example of the opposite too, calling a Python
>>> transform from Java.
>>>
>>> Additionally, Java users who want to integrate Python might be lost
>>> because External is NOT part of Beam's Java SDK (the transform is hidden
>>> inside a different module, core-construction-java), so it does not even
>>> appear in the website SDK javadoc.
>>> https://issues.apache.org/jira/browse/BEAM-8546
>>>
>>>
>>> On Wed, Nov 11, 2020 at 8:41 PM Brian Hulette wrote:
>>> >
>>> > Hi Ke,
>>> >
>>> > A cross-language pipeline looks a lot like a pipeline written natively
>>> in one of the Beam SDKs; the difference is that some of the transforms in
>>> the pipeline may be "external transforms" that actually have
>>> implementations in a different language. There are a few examples in the
>>> beam repo that use Java transforms from Python pipelines:
>>> > - kafkataxi [1]: Uses Java's KafkaIO from Python
>>> > - wordcount_xlang_sql [2] and sql_taxi [3]: Use Java's SqlTransform
>>> from Python
>>> >
>>> > To create your own cross-language pipeline, you'll need to decide
>>> which SDK you want to use primarily, and then create an expansion service
>>> to expose the transforms you want to use from the other SDK (if one doesn't
>>> exist already).
>>> >
>>> > [1]
>>> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/kafkataxi
>>> > [2]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang_sql.py
>>> > [3]
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/sql_taxi.py
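
To make the expansion-service step concrete, here is a minimal sketch of
starting a Java expansion service in-process so that a pipeline in another
SDK can request Java transform expansions from it. ExpansionService lives in
Beam's Java expansion-service module; the port (8097) and the wrapper class
name are arbitrary choices, and the exact entry point may differ between Beam
versions.

    import org.apache.beam.sdk.expansion.service.ExpansionService;

    public class RunExpansionService {
      public static void main(String[] args) throws Exception {
        // Serves transform expansion requests from other SDKs until killed.
        ExpansionService.main(new String[] {"8097"});
      }
    }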
>>> >
>>> > On Wed, Nov 11, 2020 at 11:07 AM Ke Wu  wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> Is there an example demonstrating what a cross-language pipeline looks
>>> like? E.g. a pipeline composed of Java and Python code/transforms.
>>> >>
>>> >> Best,
>>> >> Ke
>>>
>>


Re: PTransform Annotations Proposal

2020-11-20 Thread Mirac Vuslat Basaran
Thanks, everyone, for your input and for the insightful discussion.

Not being knowledgeable about Beam's internals, I have to say I am a bit lost 
on the PTransform vs. environment discussion.

I do agree with Burke's notion that merge rules are very annotation-dependent;
I don't think we can find a one-size-fits-all solution for that. So this might
actually be an argument in favour of having annotations on PTransforms,
since it avoids the conflation with environments.

Also in general, I feel that having annotations per single transform (rather 
than composite) and on PTransforms could lead to a simpler design.

Seeing as there are valuable arguments in favour of both (PTransform and 
environments) with no clear(?) "best solution", I would propose moving forward 
with the initial (PTransform) design to ship the feature and unblock teams 
asking for it. If it turns out that there was indeed a need to have annotations 
in environments, we could always refactor it.

On 2020/11/17 19:07:22, Robert Bradshaw  wrote: 
> So far we have two distinct use cases for annotations: resource hints
> and privacy directives, and I've been trying to figure out how to
> reconcile them, but they seem to have very different characteristics.
> (It would be nice to come up with other uses as well to see if we're
> really coming up with a generally useful model--I think display data
> could fit into this as a new kind of annotation rather than being a
> top-level property, and it could make sense on both leaf and composite
> transforms.)
> 
> To me, resource hints like GPU are inextricably tied to the
> environment. A transform tagged with GPU should reference a Fn that
> invokes GPU-accelerated code that lives in a particular environment.
> Something like high-mem is a bit squishier. Some DoFns take a lot of
> memory, but on the other hand one could imagine labeling a CoGBK as
> high-mem due to knowing that, in this particular usage, there will be
> lots of values with the same key. Ideally runners would be intelligent
> enough to automatically learn memory usage, but even in this case it
> may be a good hint to try and learn the requirements for DoFn A and
> DoFn B separately (which is difficult if they are always colocated,
> but valuable if, e.g. A takes a huge amount of memory and B takes a
> huge amount of wall time).
> 
> Note that tying things to the environment does not preclude using them
> in non-portable runners as they'll still have an SDK-level
> representation (though I don't think we should have an explicit goal
> of feature parity for non-portable runners, e.g. multi-language isn't
> happening, and hope that non-portable runners go away soon anyway).
> 
> Now let's consider privacy annotations. To make things very concrete,
> imagine a transform AverageSpendPerZipCode which takes as input (user,
> zip, spend), all users unique, and returns (zip, avg(spend)). In
> Python, this is GroupBy('zip').aggregate_field('spend',
> MeanCombineFn()). This is not very privacy preserving to those users
> who are the only (or one of a few) in a zip code. So we could define a
> transform PrivacyPreservingAverageSpendPerZipCode as
> 
> @ptransform_fn
> def PrivacyPreservingAverageSpendPerZipCode(spend_per_user, threshold):
>   counts_per_zip = spend_per_user | GroupBy('zip').aggregate_field(
>       'user', CountCombineFn())
>   spend_per_zip = spend_per_user | GroupBy('zip').aggregate_field(
>       'spend', MeanCombineFn())
>   filtered = spend_per_zip | beam.Filter(
>       lambda x, counts: counts[x.zip] > threshold,
>       counts=AsMap(counts_per_zip))
>   return filtered
> 
> We now have a composite that has privacy preserving properties (i.e.
> the input may be quite sensitive, but the output is not, depending on
> the value of threshold). What is interesting here is that it is only
> the composite that has this property--no individual sub-transform is
> itself privacy preserving. Furthermore, an optimizer may notice we're
> doing aggregation on the same key twice and rewrite this using
> (logically)
> 
> GroupBy('zip').aggregate_field('user', CountCombineFn())
>     .aggregate_field('spend', MeanCombineFn())
> 
> and then applying the filter, which is semantically equivalent and
> satisfies the privacy annotations (and notably that does not even
> require the optimizer to interpret the annotations, just pass them
> on). To me, this implies that these annotations belong on the
> composites, and *not* on the leaf nodes (where they would be
> incorrect).
> 
> I'll leave aside most questions of API until we figure out the model
> semantics, but wanted to throw one possible idea out (though I am
> ambivalent about it). Instead of attaching things to transforms, we
> can just wrap transforms in composites that have no role other than
> declaring information about their contents. E.g. we could have a
> composite transform whose payload is simply an assertion of the
> privacy (or resource?) properties of its inner structure. 

Re: [REMOTE WORKSHOPS] Introduction to Apache Beam - remote workshops Dec 3rd and Dec 10th

2020-11-20 Thread Kamil Wasilewski
Yes, that's correct. Our plan is to also prepare some exercises to be done
by participants. I hope that will be interesting!
Thanks for sharing this.

Kamil

On Tue, Nov 17, 2020 at 8:41 PM Pablo Estrada  wrote:

> +dev  so everyone will know.
> This is cool. Thanks Karolina! Will these be an introduction to basic Beam
> concepts?
> Thanks!
> -P.
>
> On Mon, Nov 16, 2020 at 11:52 AM Karolina Rosół <
> karolina.ro...@polidea.com> wrote:
>
>> Hello everyone,
>>
>> You may not know me but I'm Karolina Rosół, Head of Cloud & OSS at
>> Polidea and I'm working with great Apache Beam committers Michał Walenia &
>> Kamil Wasilewski, who will be carrying out the introductory remote workshops
>> on Apache Beam on *Dec 3rd* and *Dec 10th*.
>>
>> If you're interested in taking part in the workshop, feel free to have a
>> look at the Warsaw Beam Meetup page or enroll directly ->
>> bit.ly/BeamWorkshops
>>
>> Thanks,
>>
>> Karolina Rosół
>> Polidea | Head of Cloud & OSS
>>
>> M: +48 606 630 236
>> E: karolina.ro...@polidea.com
>>
>


Developing I/O connectors for Java

2020-11-20 Thread Filip Krakowski

Hi,

I followed the "Developing I/O connectors for Java" guide 
(https://beam.apache.org/documentation/io/developing-io-java) and 
implemented what I think is the simplest unbounded source, which only 
emits increasing long values (without deduplication).


You can find the code within the following GitHub gist.

   https://gist.github.com/krakowski/e289a0057bf65b08a0c09ae32c52c3e7
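
For reference, a rough sketch of such a counting source against the
UnboundedSource API (this is not the code from the gist). Timestamps and the
watermark are simply processing time, split() returns a single source, and
close() prints so the reader lifecycle can be observed; the method names
follow the UnboundedSource/UnboundedReader interfaces, though details may
differ by Beam version.

    import java.io.IOException;
    import java.io.Serializable;
    import java.util.Collections;
    import java.util.List;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.io.UnboundedSource;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.joda.time.Instant;

    public class CountingSource
        extends UnboundedSource<Long, CountingSource.Checkpoint> {

      // The checkpoint remembers the last value emitted so a restarted
      // reader can resume from where the previous one stopped.
      public static class Checkpoint implements CheckpointMark, Serializable {
        final long lastEmitted;
        Checkpoint(long lastEmitted) { this.lastEmitted = lastEmitted; }
        @Override public void finalizeCheckpoint() {}
      }

      @Override
      public List<CountingSource> split(int desiredNumSplits, PipelineOptions options) {
        return Collections.singletonList(this); // no parallelism in this sketch
      }

      @Override
      public UnboundedReader<Long> createReader(
          PipelineOptions options, Checkpoint checkpoint) {
        return new Reader(checkpoint == null ? -1L : checkpoint.lastEmitted);
      }

      @Override
      public Coder<Checkpoint> getCheckpointMarkCoder() {
        return SerializableCoder.of(Checkpoint.class);
      }

      @Override
      public Coder<Long> getOutputCoder() {
        return VarLongCoder.of();
      }

      private class Reader extends UnboundedReader<Long> {
        private long current;

        Reader(long start) { this.current = start; }

        @Override public boolean start() { return advance(); }
        @Override public boolean advance() { current++; return true; }
        @Override public Long getCurrent() { return current; }
        @Override public Instant getCurrentTimestamp() { return Instant.now(); }
        @Override public Instant getWatermark() { return Instant.now(); }
        @Override public CheckpointMark getCheckpointMark() { return new Checkpoint(current); }
        @Override public UnboundedSource<Long, ?> getCurrentSource() { return CountingSource.this; }

        @Override
        public void close() throws IOException {
          // Print here to observe how often the runner closes readers.
          System.out.println("Reader closed at " + current);
        }
      }
    }

A pipeline would consume it with Read.from(new CountingSource()). Note that
the UnboundedSource contract allows a runner to close a reader after a bundle
and resume later from its CheckpointMark, so frequent close() calls by
themselves do not necessarily indicate a bug.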

What I notice here is that the reader is closed continuously (every few
milliseconds; I inserted a print statement inside the close() method), and
the elements of the stream are not written to the output file. Am I missing
something here?


Best regards
Filip


Re: beam flink-runner distribution implementation

2020-11-20 Thread Maximilian Michels

Hi Richard,

The rationale was to preserve Beam's DistributionResult through the use
of Flink Gauges. Whoever implemented this wasn't fully aware that Flink
Histograms would be a better fit.


Feel free to open a PR. You can mention us here for a review.

Thanks,
Max
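
For what it's worth, one possible direction (a rough sketch, not the actual
FlinkMetricContainer code): wrap a Beam DistributionResult in Flink's
Histogram interface. Beam distributions only track min/sum/count/max, so only
some of Flink's statistics queries can be answered; the class below and its
mapping choices are illustrative assumptions.

    import org.apache.beam.sdk.metrics.DistributionResult;
    import org.apache.flink.metrics.Histogram;
    import org.apache.flink.metrics.HistogramStatistics;

    public class BeamDistributionHistogram implements Histogram {
      private volatile DistributionResult result = DistributionResult.IDENTITY_ELEMENT;

      // Called by the runner whenever a new DistributionResult snapshot arrives.
      public void update(DistributionResult result) {
        this.result = result;
      }

      @Override
      public void update(long value) {
        // Beam updates distributions through its own metrics path, so
        // single-value updates are not expected from Flink's side.
        throw new UnsupportedOperationException();
      }

      @Override
      public long getCount() {
        return result.getCount();
      }

      @Override
      public HistogramStatistics getStatistics() {
        final DistributionResult snapshot = result;
        return new HistogramStatistics() {
          @Override public double getQuantile(double quantile) {
            // True quantiles are unavailable; fall back to the mean.
            return snapshot.getMean();
          }
          @Override public long[] getValues() { return new long[0]; }
          @Override public int size() { return (int) snapshot.getCount(); }
          @Override public double getMean() { return snapshot.getMean(); }
          @Override public double getStdDev() { return Double.NaN; }
          @Override public long getMax() { return snapshot.getMax(); }
          @Override public long getMin() { return snapshot.getMin(); }
        };
      }
    }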

On 20.11.20 03:06, Alex Amato wrote:
Are you referring to a "Flink Gauge" or a "Beam Gauge"? Are you
suggesting packaging it as a "Flink Histogram" (i.e. a Flink-runner-specific
concept of histograms)? If so, that seems fine and I have no comment
here.


FWIW,
I proposed a "Beam Histogram" metric (bucket counts):
https://s.apache.org/beam-histogram-metrics

(No runner implements this, and most likely I will not be pursuing it
further, due to a change of priority/interest around the metric I was
interested in using it for.)
I was intending to use it for a specific set of metrics (no plans to
provide a user-defined Histogram metric API):
https://s.apache.org/beam-gcp-debuggability



I don't think we should pursue any plans to package "Beam Distributions"
as "Beam Histograms", since a "Beam Histogram" is essentially several
counters (one for each bucket). Changing all usage of beam.distribution
to "Beam Histograms" would have performance implications and is not
advised. If at some point "Beam Histograms" are implemented, migrating
the usage of Metrics.distribution to histograms should be done on an
individual basis.






On Thu, Nov 19, 2020 at 5:47 PM Robert Bradshaw wrote:


Gauge certainly seems wrong for DistributionResult. Yes, using a
Histogram would be a welcome PR.

On Thu, Nov 19, 2020 at 12:58 PM Kyle Weaver wrote:
 >
 > What are the advantages of using a Histogram instead of a Gauge?
 >
 > Also, check out this design doc for adding histogram metrics to
Beam if you haven't already: http://s.apache.org/beam-metrics-api
(Not sure what the current status is.)
 >
 > On Wed, Nov 18, 2020 at 1:37 PM Richard Moorhead wrote:
 >>
 >> Beam's DistributionResult is implemented as a Gauge within the
Flink runner. Can someone explain the rationale behind this? Would a
PR to utilize a Histogram be acceptable?