Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-23 Thread Ahmet Altay
+1 (binding).

I validated python quickstarts. Thank you Pablo.

On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre 
wrote:

> +1 (binding)
>
> Regards
> JB
>
> On 23 Dec 2020, at 06:46, Pablo Estrada wrote:
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 2.27.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1
>  if no issues are found.
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint
> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.27.0-RC1" [5],
> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.27.0 release to help with validation
> [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours, but given the holidays, we
> will likely extend for a few more days. The release will be adopted by
> majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> -P.
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
>
> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1145/
> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
> [6] https://github.com/apache/beam/pull/13602
> [7] https://github.com/apache/beam-site/pull/610
> [8] https://github.com/apache/beam/pull/13603
> [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
>
> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>
>
>


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-23 Thread Kyle Weaver
+1 (non-binding) Validated wordcount with Python source + Flink and Spark
job server jars. Also checked that the ...:sql:udf jar was added and
includes our cherry-picks. Thanks Pablo :)

On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:

> +1 (binding).
>
> I validated python quickstarts. Thank you Pablo.
>
> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre 
> wrote:
>
>> +1 (binding)
>>
>> Regards
>> JB
>>
>> On 23 Dec 2020, at 06:46, Pablo Estrada wrote:
>>
>> Hi everyone,
>> Please review and vote on the release candidate #1 for the version 2.27.0,
>> as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> Reviewers are encouraged to test their own use cases with the release
>> candidate, and vote +1
>>  if no issues are found.
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org [2],
>> which is signed with the key with fingerprint
>> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.27.0-RC1" [5],
>> * website pull request listing the release [6], publishing the API
>> reference manual [7], and the blog post [8].
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>> * Validation sheet with a tab for 2.27.0 release to help with validation
>> [9].
>> * Docker images published to Docker Hub [10].
>>
>> The vote will be open for at least 72 hours, but given the holidays, we
>> will likely extend for a few more days. The release will be adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> -P.
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
>>
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1145/
>> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
>> [6] https://github.com/apache/beam/pull/13602
>> [7] https://github.com/apache/beam-site/pull/610
>> [8] https://github.com/apache/beam/pull/13603
>> [9]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
>>
>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>
>>
>>


Triggering 4 different process from same pipeline

2020-12-23 Thread Sofia’s World
Hi all,
I was wondering how it is possible to force Beam to run 4 separate
processes for this pipeline.

Currently I have this setup:

import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    source = (p | 'Startup' >> beam.Create([1, 2, 3, 4]))
    lines = run_my_pipeline(source)


Now, as per the setup above, run_my_pipeline(source) runs in only one process,
which performs the run_my_pipeline processing of [1,2,3,4] sequentially.

What I want instead is to kick off run_my_pipeline(1),
run_my_pipeline(2), run_my_pipeline(3), and run_my_pipeline(4) as separate
processes.

Is that achievable? Which tweaks do I need to make to accomplish this?

Kind regards,
Marco


Re: Combine with multiple outputs case Sample and the rest

2020-12-23 Thread Ismaël Mejía
Thanks for the answer Robert. Producing a combiner with two lists as
outputs was one idea I was considering too, but I was afraid of
OutOfMemory issues. I had not thought much about the consequences on
combining state, thanks for pointing that out. For the particular sampling
use case it might not be an issue, or am I missing something?

I am still curious if for Sampling there could be another approach to
achieve the same goal of producing the same result (uniform sample +
the rest) but without the issues of combining.

On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw  wrote:
>
> There are two ways to emit multiple outputs: either to multiple distinct 
> PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a 
> single PCollection (the difference between Map and FlatMap). In full 
> generality, one can always have a CombineFn that outputs lists (say  result>*) followed by a DoFn that emits to multiple places based on this 
> result.
>
> One other con of emitting multiple values from a CombineFn is that CombineFns
> are used in other contexts as well, e.g. combining state, and trying to make
> sense of a multi-outputting CombineFn in that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn because we throw 
> most of the data away. If we kept most of the data, it likely wouldn't fit 
> into one machine to do the final sampling. The idea of using a side input to 
> filter after the fact should work well (unless there's duplicate elements, in 
> which case you'd have to uniquify them somehow to filter out only the "right" 
> copies).
>
> - Robert
>
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía  wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function to produce a
>> uniform sample of size n of a PCollection). They wanted to obtain also
>> the rest of the PCollection as an output (the non sampled elements).
>>
>> My suggestion was to use the sample (since it was little) as a side
>> input and then reprocess the collection to filter its elements,
>> however I wonder if this is the ‘best’ solution.
>>
>> I was also wondering: if Combine is essentially GbK + ParDo, why don't we
>> have a Combine function with multiple outputs (maybe an evolution of
>> CombineWithContext)? I know this sounds weird, and I have probably not
>> thought enough about the issues or the performance of the translation, but I
>> wanted to see what others think. Does this make sense? Do you see
>> any pros/cons, or have other ideas?
>>
>> Thanks,
>> Ismaël
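
For readers of this thread, the "sample as a side input, then filter" idea,
including the uniquifying step Robert mentions for duplicate elements, can be
sketched in plain Python. This is only a model of the logic, not Beam code and
not how Beam's Sample transform is implemented: the function names are
hypothetical, and reservoir sampling is swapped in here as one simple way to
draw a uniform sample. In a pipeline, the set of sampled ids would play the
role of the small side input, and the final list comprehension would be the
filtering pass over the full collection.

```python
import random


def tag_with_ids(elements):
    """Pair every element with a unique id so duplicates stay distinguishable."""
    return list(enumerate(elements))


def uniform_sample(tagged, n, seed=None):
    """Draw a uniform sample of up to n items in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(tagged):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir


def split_sample_and_rest(elements, n, seed=None):
    """Return (sample, rest), where rest holds every element not sampled."""
    tagged = tag_with_ids(elements)
    sample = uniform_sample(tagged, n, seed)
    # The sampled ids are small (at most n), like a side input would be.
    sampled_ids = {uid for uid, _ in sample}
    # Second pass over the data: keep only the copies that were not sampled.
    rest = [value for uid, value in tagged if uid not in sampled_ids]
    return [value for _, value in sample], rest


if __name__ == "__main__":
    sample, rest = split_sample_and_rest(["a", "a", "b", "c", "c", "c"], 2, seed=7)
    print("sample:", sample)
    print("rest:", rest)
```

Filtering on the id rather than on the value is what keeps the "right" copies
of duplicated elements: without it, sampling one "a" would drop both.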


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-23 Thread Brian Hulette
-1 (non-binding)
Good news: I validated a dataframe pipeline on Dataflow which looked good
(with expected performance improvements!)
Bad news: I also tried to run the sql_taxi example pipeline (streaming SQL
in python) on Dataflow and ran into PubSub IO related issues. The example
fails in the same way with 2.26.0, but it works in 2.25.0. It's possible
this is a Dataflow bug and not a Beam one, but I'd like to investigate
further to make sure.

On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver  wrote:

> +1 (non-binding) Validated wordcount with Python source + Flink and Spark
> job server jars. Also checked that the ...:sql:udf jar was added and
> includes our cherry-picks. Thanks Pablo :)
>
> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:
>
>> +1 (binding).
>>
>> I validated python quickstarts. Thank you Pablo.
>>
>> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Regards
>>> JB
>>>
>>> On 23 Dec 2020, at 06:46, Pablo Estrada wrote:
>>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #1 for the version
>>> 2.27.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1
>>>  if no issues are found.
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org [2],
>>> which is signed with the key with fingerprint
>>> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.27.0-RC1" [5],
>>> * website pull request listing the release [6], publishing the API
>>> reference manual [7], and the blog post [8].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>> * Validation sheet with a tab for 2.27.0 release to help with
>>> validation [9].
>>> * Docker images published to Docker Hub [10].
>>>
>>> The vote will be open for at least 72 hours, but given the holidays, we
>>> will likely extend for a few more days. The release will be adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> -P.
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
>>>
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1145/
>>> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
>>> [6] https://github.com/apache/beam/pull/13602
>>> [7] https://github.com/apache/beam-site/pull/610
>>> [8] https://github.com/apache/beam/pull/13603
>>> [9]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
>>>
>>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>>
>>>
>>>


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-23 Thread Valentyn Tymofieiev
We discovered a regression in CombineFn.from_callable() that started in 2.26.0.
Even though it's not a regression in 2.27.0, I strongly prefer we fix it in
2.27.0 as it leads to buggy behavior, so I vote -1.

The fix to release branch is in flight:
https://github.com/apache/beam/pull/13613.



On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette  wrote:

> -1 (non-binding)
> Good news: I validated a dataframe pipeline on Dataflow which looked good
> (with expected performance improvements!)
> Bad news: I also tried to run the sql_taxi example pipeline (streaming SQL
> in python) on Dataflow and ran into PubSub IO related issues. The example
> fails in the same way with 2.26.0, but it works in 2.25.0. It's possible
> this is a Dataflow bug and not a Beam one, but I'd like to investigate
> further to make sure.
>
> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver  wrote:
>
>> +1 (non-binding) Validated wordcount with Python source + Flink and Spark
>> job server jars. Also checked that the ...:sql:udf jar was added and
>> includes our cherry-picks. Thanks Pablo :)
>>
>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:
>>
>>> +1 (binding).
>>>
>>> I validated python quickstarts. Thank you Pablo.
>>>
>>> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre 
>>> wrote:
>>>
>>>> +1 (binding)
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 23 Dec 2020, at 06:46, Pablo Estrada wrote:
>>>>
>>>> Hi everyone,
>>>> Please review and vote on the release candidate #1 for the version
>>>> 2.27.0, as follows:
>>>> [ ] +1, Approve the release
>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>>
>>>>
>>>> Reviewers are encouraged to test their own use cases with the release
>>>> candidate, and vote +1
>>>>  if no issues are found.
>>>>
>>>> The complete staging area is available for your review, which includes:
>>>> * JIRA release notes [1],
>>>> * the official Apache source release to be deployed to dist.apache.org [2],
>>>> which is signed with the key with fingerprint
>>>> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>> * source code tag "v2.27.0-RC1" [5],
>>>> * website pull request listing the release [6], publishing the API
>>>> reference manual [7], and the blog post [8].
>>>> * Python artifacts are deployed along with the source release to the
>>>> dist.apache.org [2].
>>>> * Validation sheet with a tab for 2.27.0 release to help with
>>>> validation [9].
>>>> * Docker images published to Docker Hub [10].
>>>>
>>>> The vote will be open for at least 72 hours, but given the holidays,
>>>> we will likely extend for a few more days. The release will be adopted by
>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>
>>>> Thanks,
>>>> -P.
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349380
>>>>
>>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
>>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>> [4]
>>>> https://repository.apache.org/content/repositories/orgapachebeam-1145/
>>>> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
>>>> [6] https://github.com/apache/beam/pull/13602
>>>> [7] https://github.com/apache/beam-site/pull/610
>>>> [8] https://github.com/apache/beam/pull/13603
>>>> [9]
>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
>>>>
>>>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image