Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Daniel Collins
Looking. This is a very surprising test to be failing; the underlying class
just isn't doing that much.

On Mon, May 24, 2021 at 9:21 PM Reuven Lax  wrote:

> Hmmm... I'm seeing this fail during Java PreCommit.
>
> On Mon, May 24, 2021 at 6:19 PM Evan Galpin  wrote:
>
>> It did, yes :-)
>>
>> On Mon, May 24, 2021 at 21:17 Reuven Lax  wrote:
>>
>>> Did Java PreCommit pass on your PR?
>>>
>>> On Mon, May 24, 2021 at 5:42 PM Evan Galpin 
>>> wrote:
>>>
 I’m not certain that it’s related based on a quick scan of the test
 output that you linked, but I do know that I recently made a change[1] to
 Reshuffle.AssignToShard which @Daniel Collins mentioned was used by PubSub
 Lite[2].

 Given that the change is recent and the test is failing remotely but
 not locally, I thought maybe the remote test env might be using the
 AssignToShard change and your local code might not? (I'm not certain whether
 the remote tests use the exact SHA or possibly rebase the code on master
 first.)

 [1] https://github.com/apache/beam/pull/14720

 [2]
 https://lists.apache.org/x/thread.html/r62b191e8318413739520b67dd3a4dfa788cbbc7b8d91ad9a80720dc6@%3Cdev.beam.apache.org%3E


 On Mon, May 24, 2021 at 20:27 Reuven Lax  wrote:

> This test keeps failing on my PR. For example:
>
>
> https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/17841/testReport/junit/org.apache.beam.sdk.io.gcp.pubsublite/SubscriptionPartitionLoaderTest/addedResults/
>
> I haven't changed anything related to this test, and it passes for me
> locally. Is anyone else seeing this test fail?
>
> Reuven
>



Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Reuven Lax
Hmmm... I'm seeing this fail during Java PreCommit.

On Mon, May 24, 2021 at 6:19 PM Evan Galpin  wrote:

> It did, yes :-)
>
> On Mon, May 24, 2021 at 21:17 Reuven Lax  wrote:
>
>> Did Java PreCommit pass on your PR?
>>
>> On Mon, May 24, 2021 at 5:42 PM Evan Galpin 
>> wrote:
>>
>>> I’m not certain that it’s related based on a quick scan of the test
>>> output that you linked, but I do know that I recently made a change[1] to
>>> Reshuffle.AssignToShard which @Daniel Collins mentioned was used by PubSub
>>> Lite[2].
>>>
>>> Given that the change is recent and the test is failing remotely but
>>> not locally, I thought maybe the remote test env might be using the
>>> AssignToShard change and your local code might not? (I'm not certain
>>> whether the remote tests use the exact SHA or possibly rebase the code on
>>> master first.)
>>>
>>> [1] https://github.com/apache/beam/pull/14720
>>>
>>> [2]
>>> https://lists.apache.org/x/thread.html/r62b191e8318413739520b67dd3a4dfa788cbbc7b8d91ad9a80720dc6@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> On Mon, May 24, 2021 at 20:27 Reuven Lax  wrote:
>>>
 This test keeps failing on my PR. For example:


 https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/17841/testReport/junit/org.apache.beam.sdk.io.gcp.pubsublite/SubscriptionPartitionLoaderTest/addedResults/

 I haven't changed anything related to this test, and it passes for me
 locally. Is anyone else seeing this test fail?

 Reuven

>>>


Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Evan Galpin
It did, yes :-)

On Mon, May 24, 2021 at 21:17 Reuven Lax  wrote:

> Did Java PreCommit pass on your PR?
>
> On Mon, May 24, 2021 at 5:42 PM Evan Galpin  wrote:
>
>> I’m not certain that it’s related based on a quick scan of the test
>> output that you linked, but I do know that I recently made a change[1] to
>> Reshuffle.AssignToShard which @Daniel Collins mentioned was used by PubSub
>> Lite[2].
>>
>> Given that the change is recent and the test is failing remotely but not
>> locally, I thought maybe the remote test env might be using the
>> AssignToShard change and your local code might not? (I'm not certain
>> whether the remote tests use the exact SHA or possibly rebase the code on
>> master first.)
>>
>> [1] https://github.com/apache/beam/pull/14720
>>
>> [2]
>> https://lists.apache.org/x/thread.html/r62b191e8318413739520b67dd3a4dfa788cbbc7b8d91ad9a80720dc6@%3Cdev.beam.apache.org%3E
>>
>>
>> On Mon, May 24, 2021 at 20:27 Reuven Lax  wrote:
>>
>>> This test keeps failing on my PR. For example:
>>>
>>>
>>> https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/17841/testReport/junit/org.apache.beam.sdk.io.gcp.pubsublite/SubscriptionPartitionLoaderTest/addedResults/
>>>
>>> I haven't changed anything related to this test, and it passes for me
>>> locally. Is anyone else seeing this test fail?
>>>
>>> Reuven
>>>
>>


Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Evan Galpin
I’m not certain that it’s related based on a quick scan of the test output
that you linked, but I do know that I recently made a change[1] to
Reshuffle.AssignToShard which @Daniel Collins mentioned was used by PubSub
Lite[2].

Given that the change is recent and the test is failing remotely but not
locally, I thought maybe the remote test env might be using the
AssignToShard change and your local code might not? (I'm not certain whether
the remote tests use the exact SHA or possibly rebase the code on master
first.)

[1] https://github.com/apache/beam/pull/14720

[2]
https://lists.apache.org/x/thread.html/r62b191e8318413739520b67dd3a4dfa788cbbc7b8d91ad9a80720dc6@%3Cdev.beam.apache.org%3E


On Mon, May 24, 2021 at 20:27 Reuven Lax  wrote:

> This test keeps failing on my PR. For example:
>
>
> https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/17841/testReport/junit/org.apache.beam.sdk.io.gcp.pubsublite/SubscriptionPartitionLoaderTest/addedResults/
>
> I haven't changed anything related to this test, and it passes for me
> locally. Is anyone else seeing this test fail?
>
> Reuven
>


sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Reuven Lax
This test keeps failing on my PR. For example:

https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/17841/testReport/junit/org.apache.beam.sdk.io.gcp.pubsublite/SubscriptionPartitionLoaderTest/addedResults/

I haven't changed anything related to this test, and it passes for me
locally. Is anyone else seeing this test fail?

Reuven


Re: Apply a Beam PTransform per key

2021-05-24 Thread Stephan Hoyer
Happy to give a concrete example; I even have open source code I can share
in this case :)
https://github.com/google/xarray-beam/tree/9728970aa18abddafec22a23cad92b5d4a1e11e5/examples
https://github.com/google/xarray-beam/blob/9728970aa18abddafec22a23cad92b5d4a1e11e5/examples/era5_rechunk.py

This particular example reads and writes a 25 TB weather dataset stored in
Google Cloud Storage. The dataset consists of 19 variables, each of which
is logically a 3D array of shape (350640, 721, 1440), stored in blocks of
shape (31, 721, 1440) via Zarr.
Now I want to "rechunk" them into blocks of shape (350640, 5, 5), which is
more convenient for queries like "Return the past 40 years of weather for
this particular location". To be clear, this particular use-case is
synthetic, but it reflects a common pattern for large-scale processing of
weather and climate datasets.

I originally wrote my pipeline to process all 19 variables at once, but it
looks like it would be more efficient to process them separately. So now I
want to essentially re-run my original pipeline 19 times in parallel.

For this particular codebase, I think the right call probably *is* to
rewrite all my underlying transforms to handle an expanded key, including
the variable name. This will pay other dividends. But if I didn't want to
do that refactor, I would need to duplicate or Partition the PCollection
into 19 parts, which seems like a lot. My xarray_beam.Rechunk() transform
includes a few GroupByKey transforms inside and definitely cannot operate
in-memory.
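
For reference, a minimal sketch of the expanded-key refactor described above,
in plain Beam Python. The variable names and the (chunk_key, chunk) element
shape are hypothetical simplifications (real xarray-beam keys carry array
offsets), but the shape of the change is the same: fold the variable name into
each element's key, so one set of transforms serves all 19 variables in a
single keyspace.

import apache_beam as beam

def expand_key(element, variable):
    # Prefix an existing (chunk_key, chunk) pair with the variable name.
    chunk_key, chunk = element
    return ((variable, chunk_key), chunk)

with beam.Pipeline() as p:
    per_variable = []
    for variable in ['temperature', 'pressure']:  # hypothetical variable names
        per_variable.append(
            p
            | f'read_{variable}' >> beam.Create([('chunk-0', [1.0, 2.0])])
            | f'key_{variable}' >> beam.Map(expand_key, variable))
    merged = per_variable | beam.Flatten()
    # Downstream GroupByKeys now group by (variable, chunk_key), so the
    # original single-dataset transforms run independently per variable.
    grouped = merged | beam.GroupByKey()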



On Mon, May 24, 2021 at 4:12 PM Reuven Lax  wrote:

> Can you explain a bit more? Where are these data sets coming from?
>
> On Mon, May 24, 2021 at 3:55 PM Stephan Hoyer  wrote:
>
>> I'm not concerned with key-dependent topologies, which I didn't even
>> think was possible to express in Beam.
>>
>> It's more that I already wrote a PTransform for processing a *single* 1
>> TB dataset. Now I want to write a single PTransform that effectively runs
>> the original PTransform in groups over ~20 such datasets (ideally without
>> needing to know that number 20 ahead of time).
>>
>> On Mon, May 24, 2021 at 3:30 PM Reuven Lax  wrote:
>>
>>> Is the issue that you have a different topology depending on the key?
>>>
>>> On Mon, May 24, 2021 at 2:49 PM Stephan Hoyer  wrote:
>>>
 Exactly, my use-case has another nested GroupByKey to apply per key.
 But even if it could be done in a streaming fashion, it's way too much data
 (1 TB) to process on a single worker in a reasonable amount of time.

 On Mon, May 24, 2021 at 2:46 PM Kenneth Knowles 
 wrote:

> I was thinking there was some non-trivial topology (such as further
> GBKs) within the logic to be applied to each key group.
>
> Kenn
>
> On Mon, May 24, 2021 at 2:38 PM Brian Hulette 
> wrote:
>
>> Isn't it possible to read the grouped values produced by a GBK from
>> an Iterable and yield results as you go, without needing to collect all 
>> of
>> each input into memory? Perhaps I'm misunderstanding your use-case.
>>
>> Brian
>>
>> On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles 
>> wrote:
>>
>>> I'm just pinging this thread because I think it is an interesting
>>> problem and don't want it to slip by.
>>>
>>> I bet a lot of users have gone through the tedious conversion you
>>> describe. Of course, it may often not be possible if you are using a
>>> library transform. There are a number of aspects of the Beam model that 
>>> are
>>> designed a specific way explicitly *because* we need to assume that a 
>>> large
>>> number of composites in your pipeline are not modifiable by you. Most
>>> closely related: this is why windowing is something carried along
>>> implicitly rather than just a parameter to GBK - that would require all
>>> transforms to expose how they use GBK under the hood and they would all
>>> have to plumb this extra key/WindowFn through every API. Instead, we 
>>> have
>>> this way to implicitly add a second key to any transform :-)
>>>
>>> So in addition to being tedious for you, it would be good to have a
>>> better solution.
>>>
>>> Kenn
>>>
>>> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer 
>>> wrote:
>>>
 I'd like to write a Beam PTransform that applies an *existing* Beam
 transform to each set of grouped values, separately, and combines the
 result. Is anything like this possible with Beam using the Python SDK?

 Here are the closest things I've come up with:
 1. If each set of *inputs* to my transform fit into memory, I
 could use GroupByKey followed by FlatMap.
 2. If each set of *outputs* from my transform fit into memory, I
 could use CombinePerKey.
 3. If I knew the static number of groups ahead of time, I could use
 Partition, followed by applying my transform multiple times, followed by
 Flatten.

Re: Apache Beam Access Request for Anthony Zhu

2021-05-24 Thread Anthony Zhu
Thank you!

On Mon, May 24, 2021 at 3:40 PM Pablo Estrada  wrote:

> Added! Welcome to Beam!
>
>
> On Mon, May 24, 2021 at 1:39 PM Anthony Zhu  wrote:
>
>> Hi,
>>
>> This is Anthony Zhu from Google again; I am working closely with Rui Wang
>> on the Java SDK harness. Can someone add me as a contributor for the Beam
>> Jira issue tracker?
>> I would like to assign myself the ticket that I will be working on this
>> summer. My Jira username is aqzhu.
>>
>> Thanks,
>> Anthony
>>
>> On Thu, May 20, 2021 at 11:16 AM Anthony Zhu  wrote:
>>
>>> Hi,
>>>
>>> I am requesting permission to access the Apache Beam JIRA for Anthony
>>> Zhu, username aqzhu for my work with Rui Wang.
>>>
>>> Thanks,
>>> Anthony
>>>
>>> --
>>> *Anthony Zhu*
>>> University of Michigan Class of 2022
>>> College of Engineering | CSE
>>> 978.496.4537 | aq...@umich.edu
>>>
>>>
>>
>> --
>> *Anthony Zhu*
>> University of Michigan Class of 2022
>> College of Engineering | CSE
>> 978.496.4537 | aq...@umich.edu
>>
>>

-- 
*Anthony Zhu*
University of Michigan Class of 2022
College of Engineering | CSE
978.496.4537 | aq...@umich.edu


BEAM-12068: Run Dataflow performance benchmarks on Java 11

2021-05-24 Thread Benjamin Gonzalez Delgado
Hi Team,
I'm planning to work on BEAM-12068 [1], but I have some related questions.

   - Is it required to create a new Jenkins job to run the configuration
   with Dataflow V2 and Java 11, or do I need to append a new configuration to
   an existing one (e.g., the GBK load tests [2])?
   - To execute a test with Dataflow V2, do I just need to pass
   "experiments=beam_fn_api,use_unified_worker,use_runner_v2,shuffle_mode=service"
   as args (see the sketch after the links below), or does it require some
   extra configuration?
   - I followed [3] to configure the metrics local development environment,
   but I had issues with Grafana access credentials, so I recreated it with
   docker-compose and a custom password in docker-compose.yml [4], but I'm
   still having the same access troubles. Is this the correct setup to work
   with this kind of change? Does someone know how to configure this setup
   locally?

Any guidance on this would be appreciated.
Thanks!

[1] https://issues.apache.org/jira/browse/BEAM-12068
[2]
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Java.groovy
[3] https://github.com/apache/beam/tree/master/.test-infra/metrics
[4]
https://github.com/apache/beam/blob/master/.test-infra/metrics/docker-compose.yml#L64
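
For reference, the experiment flags above are passed as ordinary pipeline
options rather than extra configuration. A minimal sketch, assuming a Python
pipeline submitted to Dataflow (the load tests in [2] are Java, so this only
illustrates the mechanism; the project, region, and bucket values are
placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Each --experiments flag appends to the pipeline's experiment list.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                # placeholder
    '--region=us-central1',                # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--experiments=beam_fn_api',
    '--experiments=use_unified_worker',
    '--experiments=use_runner_v2',
    '--experiments=shuffle_mode=service',
])

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)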



Re: Inaccessible CI-built python SDK artifact on GCS staging bucket (individual wheels are accessible)

2021-05-24 Thread Ahmet Altay
Moving to dev list.

This looks like a bug in the build workflow, but I cannot pinpoint it. In
the GCS UI, all objects have the "Public Access: Not Authorized" property,
which explains why you cannot download them. However, the workflow uses
"gsutil cp -r -a public-read ..." to copy the objects to GCS, and "-a
public-read" is meant to make objects readable to all consumers. I am not
sure why that is not working as expected.

Someone with more experience on GCS ACLs might be able to help.

On Fri, May 21, 2021 at 5:37 AM Aris Tritas  wrote:

> Hi,
>
> In the process of installing a nightly version of the python SDK, I am
> denied access to the SDK artifact archive on the GCS staging bucket.
> Individual files from the CI run are accessible (incl. wheels).
>
> The location of the GCS blob was taken from the latest CI run report at
>
> -
> https://github.com/apache/beam/runs/2636176987?check_suite_focus=true#step:3:7
> resolving to
> -
> https://storage.googleapis.com/beam-wheels-staging/master/40326dd0a2a1c9b5dcbbcd6486a43e3875a64a43-862470639/apache_beam-2.31.0.dev0.zip
>
> Am I doing something wrong or is there a reason the SDK archive is
> inaccessible?
> Best,
>
> --
>
> Aris Tritas
>
> E-mail: a.tri...@gmail.com
> Mobile: +33(0)6 66 95 04 02
> Linkedin: https://www.linkedin.com/in/aris-tritas/
>
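
A quick way to reproduce the problem outside the GCS UI is an anonymous
download, which succeeds only if the object is publicly readable. A minimal
sketch, assuming the google-cloud-storage Python client; the bucket and object
names are copied from the links above:

from google.cloud import storage

# An anonymous client has no credentials, so it can only read public objects.
client = storage.Client.create_anonymous_client()
bucket = client.bucket('beam-wheels-staging')
blob = bucket.blob(
    'master/40326dd0a2a1c9b5dcbbcd6486a43e3875a64a43-862470639/'
    'apache_beam-2.31.0.dev0.zip')

try:
    blob.download_to_filename('apache_beam-2.31.0.dev0.zip')
    print('object is publicly readable')
except Exception as e:  # e.g. google.api_core.exceptions.Forbidden
    print(f'anonymous download failed: {e}')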


Re: Apply a Beam PTransform per key

2021-05-24 Thread Reuven Lax
Can you explain a bit more? Where are these data sets coming from?

On Mon, May 24, 2021 at 3:55 PM Stephan Hoyer  wrote:

> I'm not concerned with key-dependent topologies, which I didn't even think
> was possible to express in Beam.
>
> It's more that I already wrote a PTransform for processing a *single* 1
> TB dataset. Now I want to write a single PTransform that effectively runs
> the original PTransform in groups over ~20 such datasets (ideally without
> needing to know that number 20 ahead of time).
>
> On Mon, May 24, 2021 at 3:30 PM Reuven Lax  wrote:
>
>> Is the issue that you have a different topology depending on the key?
>>
>> On Mon, May 24, 2021 at 2:49 PM Stephan Hoyer  wrote:
>>
>>> Exactly, my use-case has another nested GroupByKey to apply per key. But
>>> even if it could be done in a streaming fashion, it's way too much data (1
>>> TB) to process on a single worker in a reasonable amount of time.
>>>
>>> On Mon, May 24, 2021 at 2:46 PM Kenneth Knowles  wrote:
>>>
 I was thinking there was some non-trivial topology (such as further
 GBKs) within the logic to be applied to each key group.

 Kenn

 On Mon, May 24, 2021 at 2:38 PM Brian Hulette 
 wrote:

> Isn't it possible to read the grouped values produced by a GBK from an
> Iterable and yield results as you go, without needing to collect all of
> each input into memory? Perhaps I'm misunderstanding your use-case.
>
> Brian
>
> On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles 
> wrote:
>
>> I'm just pinging this thread because I think it is an interesting
>> problem and don't want it to slip by.
>>
>> I bet a lot of users have gone through the tedious conversion you
>> describe. Of course, it may often not be possible if you are using a
>> library transform. There are a number of aspects of the Beam model that 
>> are
>> designed a specific way explicitly *because* we need to assume that a 
>> large
>> number of composites in your pipeline are not modifiable by you. Most
>> closely related: this is why windowing is something carried along
>> implicitly rather than just a parameter to GBK - that would require all
>> transforms to expose how they use GBK under the hood and they would all
>> have to plumb this extra key/WindowFn through every API. Instead, we have
>> this way to implicitly add a second key to any transform :-)
>>
>> So in addition to being tedious for you, it would be good to have a
>> better solution.
>>
>> Kenn
>>
>> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer 
>> wrote:
>>
>>> I'd like to write a Beam PTransform that applies an *existing* Beam
>>> transform to each set of grouped values, separately, and combines the
>>> result. Is anything like this possible with Beam using the Python SDK?
>>>
>>> Here are the closest things I've come up with:
>>> 1. If each set of *inputs* to my transform fit into memory, I could
>>> use GroupByKey followed by FlatMap.
>>> 2. If each set of *outputs* from my transform fit into memory, I
>>> could use CombinePerKey.
>>> 3. If I knew the static number of groups ahead of time, I could use
>>> Partition, followed by applying my transform multiple times, followed by
>>> Flatten.
>>>
>>> In my scenario, none of these holds true. For example, currently I
>>> have ~20 groups of values, with each group holding ~1 TB of data. My 
>>> custom
>>> transform simply shuffles this TB of data around, so each set of 
>>> outputs is
>>> also 1TB in size.
>>>
>>> In my particular case, it seems my options are to either relax these
>>> constraints, or to manually convert each step of my existing transform 
>>> to
>>> apply per key. This conversion process is tedious, but very
>>> straightforward, e.g., the GroupByKey and ParDo that my transform is 
>>> built
>>> out of just need to deal with an expanded key.
>>>
>>> I wonder, could this be something built into Beam itself, e.g., as
>>> TransformPerKey? The PTransforms that result from combining other Beam
>>> transforms (e.g., _ChainPTransform in Python) are private, so this seems
>>> like something that would need to exist in Beam itself, if it could 
>>> exist
>>> at all.
>>>
>>> Cheers,
>>> Stephan
>>>
>>


Re: Apply a Beam PTransform per key

2021-05-24 Thread Stephan Hoyer
I'm not concerned with key-dependent topologies, which I didn't even think
was possible to express in Beam.

It's more that I already wrote a PTransform for processing a *single* 1 TB
dataset. Now I want to write a single PTransform that effectively runs the
original PTransform in groups over ~20 such datasets (ideally without
needing to know that number 20 ahead of time).

On Mon, May 24, 2021 at 3:30 PM Reuven Lax  wrote:

> Is the issue that you have a different topology depending on the key?
>
> On Mon, May 24, 2021 at 2:49 PM Stephan Hoyer  wrote:
>
>> Exactly, my use-case has another nested GroupByKey to apply per key. But
>> even if it could be done in a streaming fashion, it's way too much data (1
>> TB) to process on a single worker in a reasonable amount of time.
>>
>> On Mon, May 24, 2021 at 2:46 PM Kenneth Knowles  wrote:
>>
>>> I was thinking there was some non-trivial topology (such as further
>>> GBKs) within the logic to be applied to each key group.
>>>
>>> Kenn
>>>
>>> On Mon, May 24, 2021 at 2:38 PM Brian Hulette 
>>> wrote:
>>>
 Isn't it possible to read the grouped values produced by a GBK from an
 Iterable and yield results as you go, without needing to collect all of
 each input into memory? Perhaps I'm misunderstanding your use-case.

 Brian

 On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles 
 wrote:

> I'm just pinging this thread because I think it is an interesting
> problem and don't want it to slip by.
>
> I bet a lot of users have gone through the tedious conversion you
> describe. Of course, it may often not be possible if you are using a
> library transform. There are a number of aspects of the Beam model that 
> are
> designed a specific way explicitly *because* we need to assume that a 
> large
> number of composites in your pipeline are not modifiable by you. Most
> closely related: this is why windowing is something carried along
> implicitly rather than just a parameter to GBK - that would require all
> transforms to expose how they use GBK under the hood and they would all
> have to plumb this extra key/WindowFn through every API. Instead, we have
> this way to implicitly add a second key to any transform :-)
>
> So in addition to being tedious for you, it would be good to have a
> better solution.
>
> Kenn
>
> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer 
> wrote:
>
>> I'd like to write a Beam PTransform that applies an *existing* Beam
>> transform to each set of grouped values, separately, and combines the
>> result. Is anything like this possible with Beam using the Python SDK?
>>
>> Here are the closest things I've come up with:
>> 1. If each set of *inputs* to my transform fit into memory, I could
>> use GroupByKey followed by FlatMap.
>> 2. If each set of *outputs* from my transform fit into memory, I
>> could use CombinePerKey.
>> 3. If I knew the static number of groups ahead of time, I could use
>> Partition, followed by applying my transform multiple times, followed by
>> Flatten.
>>
>> In my scenario, none of these holds true. For example, currently I
>> have ~20 groups of values, with each group holding ~1 TB of data. My 
>> custom
>> transform simply shuffles this TB of data around, so each set of outputs 
>> is
>> also 1TB in size.
>>
>> In my particular case, it seems my options are to either relax these
>> constraints, or to manually convert each step of my existing transform to
>> apply per key. This conversion process is tedious, but very
>> straightforward, e.g., the GroupByKey and ParDo that my transform is 
>> built
>> out of just need to deal with an expanded key.
>>
>> I wonder, could this be something built into Beam itself, e.g., as
>> TransformPerKey? The PTransforms that result from combining other Beam
>> transforms (e.g., _ChainPTransform in Python) are private, so this seems
>> like something that would need to exist in Beam itself, if it could exist
>> at all.
>>
>> Cheers,
>> Stephan
>>
>


Re: Apache Beam Access Request for Anthony Zhu

2021-05-24 Thread Pablo Estrada
Added! Welcome to Beam!


On Mon, May 24, 2021 at 1:39 PM Anthony Zhu  wrote:

> Hi,
>
> This is Anthony Zhu from Google again; I am working closely with Rui Wang
> on the Java SDK harness. Can someone add me as a contributor for the Beam
> Jira issue tracker?
> I would like to assign myself the ticket that I will be working on this
> summer. My Jira username is aqzhu.
>
> Thanks,
> Anthony
>
> On Thu, May 20, 2021 at 11:16 AM Anthony Zhu  wrote:
>
>> Hi,
>>
>> I am requesting permission to access the Apache Beam JIRA for Anthony
>> Zhu, username aqzhu for my work with Rui Wang.
>>
>> Thanks,
>> Anthony
>>
>> --
>> *Anthony Zhu*
>> University of Michigan Class of 2022
>> College of Engineering | CSE
>> 978.496.4537 | aq...@umich.edu
>>
>>
>
> --
> *Anthony Zhu*
> University of Michigan Class of 2022
> College of Engineering | CSE
> 978.496.4537 | aq...@umich.edu
>
>


Re: Apply a Beam PTransform per key

2021-05-24 Thread Reuven Lax
Is the issue that you have a different topology depending on the key?

On Mon, May 24, 2021 at 2:49 PM Stephan Hoyer  wrote:

> Exactly, my use-case has another nested GroupByKey to apply per key. But
> even if it could be done in a streaming fashion, it's way too much data (1
> TB) to process on a single worker in a reasonable amount of time.
>
> On Mon, May 24, 2021 at 2:46 PM Kenneth Knowles  wrote:
>
>> I was thinking there was some non-trivial topology (such as further GBKs)
>> within the logic to be applied to each key group.
>>
>> Kenn
>>
>> On Mon, May 24, 2021 at 2:38 PM Brian Hulette 
>> wrote:
>>
>>> Isn't it possible to read the grouped values produced by a GBK from an
>>> Iterable and yield results as you go, without needing to collect all of
>>> each input into memory? Perhaps I'm misunderstanding your use-case.
>>>
>>> Brian
>>>
>>> On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles 
>>> wrote:
>>>
 I'm just pinging this thread because I think it is an interesting
 problem and don't want it to slip by.

 I bet a lot of users have gone through the tedious conversion you
 describe. Of course, it may often not be possible if you are using a
 library transform. There are a number of aspects of the Beam model that are
 designed a specific way explicitly *because* we need to assume that a large
 number of composites in your pipeline are not modifiable by you. Most
 closely related: this is why windowing is something carried along
 implicitly rather than just a parameter to GBK - that would require all
 transforms to expose how they use GBK under the hood and they would all
 have to plumb this extra key/WindowFn through every API. Instead, we have
 this way to implicitly add a second key to any transform :-)

 So in addition to being tedious for you, it would be good to have a
 better solution.

 Kenn

 On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer 
 wrote:

> I'd like to write a Beam PTransform that applies an *existing* Beam
> transform to each set of grouped values, separately, and combines the
> result. Is anything like this possible with Beam using the Python SDK?
>
> Here are the closest things I've come up with:
> 1. If each set of *inputs* to my transform fit into memory, I could
> use GroupByKey followed by FlatMap.
> 2. If each set of *outputs* from my transform fit into memory, I
> could use CombinePerKey.
> 3. If I knew the static number of groups ahead of time, I could use
> Partition, followed by applying my transform multiple times, followed by
> Flatten.
>
> In my scenario, none of these holds true. For example, currently I
> have ~20 groups of values, with each group holding ~1 TB of data. My 
> custom
> transform simply shuffles this TB of data around, so each set of outputs 
> is
> also 1TB in size.
>
> In my particular case, it seems my options are to either relax these
> constraints, or to manually convert each step of my existing transform to
> apply per key. This conversion process is tedious, but very
> straightforward, e.g., the GroupByKey and ParDo that my transform is built
> out of just need to deal with an expanded key.
>
> I wonder, could this be something built into Beam itself, e.g., as
> TransformPerKey? The PTransforms that result from combining other Beam
> transforms (e.g., _ChainPTransform in Python) are private, so this seems
> like something that would need to exist in Beam itself, if it could exist
> at all.
>
> Cheers,
> Stephan
>



Re: Apply a Beam PTransform per key

2021-05-24 Thread Stephan Hoyer
Exactly, my use-case has another nested GroupByKey to apply per key. But
even if it could be done in a streaming fashion, it's way too much data (1
TB) to process on a single worker in a reasonable amount of time.

On Mon, May 24, 2021 at 2:46 PM Kenneth Knowles  wrote:

> I was thinking there was some non-trivial topology (such as further GBKs)
> within the logic to be applied to each key group.
>
> Kenn
>
> On Mon, May 24, 2021 at 2:38 PM Brian Hulette  wrote:
>
>> Isn't it possible to read the grouped values produced by a GBK from an
>> Iterable and yield results as you go, without needing to collect all of
>> each input into memory? Perhaps I'm misunderstanding your use-case.
>>
>> Brian
>>
>> On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles  wrote:
>>
>>> I'm just pinging this thread because I think it is an interesting
>>> problem and don't want it to slip by.
>>>
>>> I bet a lot of users have gone through the tedious conversion you
>>> describe. Of course, it may often not be possible if you are using a
>>> library transform. There are a number of aspects of the Beam model that are
>>> designed a specific way explicitly *because* we need to assume that a large
>>> number of composites in your pipeline are not modifiable by you. Most
>>> closely related: this is why windowing is something carried along
>>> implicitly rather than just a parameter to GBK - that would require all
>>> transforms to expose how they use GBK under the hood and they would all
>>> have to plumb this extra key/WindowFn through every API. Instead, we have
>>> this way to implicitly add a second key to any transform :-)
>>>
>>> So in addition to being tedious for you, it would be good to have a
>>> better solution.
>>>
>>> Kenn
>>>
>>> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer  wrote:
>>>
 I'd like to write a Beam PTransform that applies an *existing* Beam
 transform to each set of grouped values, separately, and combines the
 result. Is anything like this possible with Beam using the Python SDK?

 Here are the closest things I've come up with:
 1. If each set of *inputs* to my transform fit into memory, I could
 use GroupByKey followed by FlatMap.
 2. If each set of *outputs* from my transform fit into memory, I could
 use CombinePerKey.
 3. If I knew the static number of groups ahead of time, I could use
 Partition, followed by applying my transform multiple times, followed by
 Flatten.

 In my scenario, none of these holds true. For example, currently I have
 ~20 groups of values, with each group holding ~1 TB of data. My custom
 transform simply shuffles this TB of data around, so each set of outputs is
 also 1TB in size.

 In my particular case, it seems my options are to either relax these
 constraints, or to manually convert each step of my existing transform to
 apply per key. This conversion process is tedious, but very
 straightforward, e.g., the GroupByKey and ParDo that my transform is built
 out of just need to deal with an expanded key.

 I wonder, could this be something built into Beam itself, e.g., as
 TransformPerKey? The PTransforms that result from combining other Beam
 transforms (e.g., _ChainPTransform in Python) are private, so this seems
 like something that would need to exist in Beam itself, if it could exist
 at all.

 Cheers,
 Stephan

>>>


Re: Apply a Beam PTransform per key

2021-05-24 Thread Kenneth Knowles
I was thinking there was some non-trivial topology (such as further GBKs)
within the logic to be applied to each key group.

Kenn

On Mon, May 24, 2021 at 2:38 PM Brian Hulette  wrote:

> Isn't it possible to read the grouped values produced by a GBK from an
> Iterable and yield results as you go, without needing to collect all of
> each input into memory? Perhaps I'm misunderstanding your use-case.
>
> Brian
>
> On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles  wrote:
>
>> I'm just pinging this thread because I think it is an interesting problem
>> and don't want it to slip by.
>>
>> I bet a lot of users have gone through the tedious conversion you
>> describe. Of course, it may often not be possible if you are using a
>> library transform. There are a number of aspects of the Beam model that are
>> designed a specific way explicitly *because* we need to assume that a large
>> number of composites in your pipeline are not modifiable by you. Most
>> closely related: this is why windowing is something carried along
>> implicitly rather than just a parameter to GBK - that would require all
>> transforms to expose how they use GBK under the hood and they would all
>> have to plumb this extra key/WindowFn through every API. Instead, we have
>> this way to implicitly add a second key to any transform :-)
>>
>> So in addition to being tedious for you, it would be good to have a
>> better solution.
>>
>> Kenn
>>
>> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer  wrote:
>>
>>> I'd like to write a Beam PTransform that applies an *existing* Beam
>>> transform to each set of grouped values, separately, and combines the
>>> result. Is anything like this possible with Beam using the Python SDK?
>>>
>>> Here are the closest things I've come up with:
>>> 1. If each set of *inputs* to my transform fit into memory, I could use
>>> GroupByKey followed by FlatMap.
>>> 2. If each set of *outputs* from my transform fit into memory, I could
>>> use CombinePerKey.
>>> 3. If I knew the static number of groups ahead of time, I could use
>>> Partition, followed by applying my transform multiple times, followed by
>>> Flatten.
>>>
>>> In my scenario, none of these holds true. For example, currently I have
>>> ~20 groups of values, with each group holding ~1 TB of data. My custom
>>> transform simply shuffles this TB of data around, so each set of outputs is
>>> also 1TB in size.
>>>
>>> In my particular case, it seems my options are to either relax these
>>> constraints, or to manually convert each step of my existing transform to
>>> apply per key. This conversion process is tedious, but very
>>> straightforward, e.g., the GroupByKey and ParDo that my transform is built
>>> out of just need to deal with an expanded key.
>>>
>>> I wonder, could this be something built into Beam itself, e.g., as
>>> TransformPerKey? The PTransforms that result from combining other Beam
>>> transforms (e.g., _ChainPTransform in Python) are private, so this seems
>>> like something that would need to exist in Beam itself, if it could exist
>>> at all.
>>>
>>> Cheers,
>>> Stephan
>>>
>>


Re: Apply a Beam PTransform per key

2021-05-24 Thread Brian Hulette
Isn't it possible to read the grouped values produced by a GBK from an
Iterable and yield results as you go, without needing to collect all of
each input into memory? Perhaps I'm misunderstanding your use-case.

Brian

On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles  wrote:

> I'm just pinging this thread because I think it is an interesting problem
> and don't want it to slip by.
>
> I bet a lot of users have gone through the tedious conversion you
> describe. Of course, it may often not be possible if you are using a
> library transform. There are a number of aspects of the Beam model that are
> designed a specific way explicitly *because* we need to assume that a large
> number of composites in your pipeline are not modifiable by you. Most
> closely related: this is why windowing is something carried along
> implicitly rather than just a parameter to GBK - that would require all
> transforms to expose how they use GBK under the hood and they would all
> have to plumb this extra key/WindowFn through every API. Instead, we have
> this way to implicitly add a second key to any transform :-)
>
> So in addition to being tedious for you, it would be good to have a better
> solution.
>
> Kenn
>
> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer  wrote:
>
>> I'd like to write a Beam PTransform that applies an *existing* Beam
>> transform to each set of grouped values, separately, and combines the
>> result. Is anything like this possible with Beam using the Python SDK?
>>
>> Here are the closest things I've come up with:
>> 1. If each set of *inputs* to my transform fit into memory, I could use
>> GroupByKey followed by FlatMap.
>> 2. If each set of *outputs* from my transform fit into memory, I could
>> use CombinePerKey.
>> 3. If I knew the static number of groups ahead of time, I could use
>> Partition, followed by applying my transform multiple times, followed by
>> Flatten.
>>
>> In my scenario, none of these holds true. For example, currently I have
>> ~20 groups of values, with each group holding ~1 TB of data. My custom
>> transform simply shuffles this TB of data around, so each set of outputs is
>> also 1TB in size.
>>
>> In my particular case, it seems my options are to either relax these
>> constraints, or to manually convert each step of my existing transform to
>> apply per key. This conversion process is tedious, but very
>> straightforward, e.g., the GroupByKey and ParDo that my transform is built
>> out of just need to deal with an expanded key.
>>
>> I wonder, could this be something built into Beam itself, e.g., as
>> TransformPerKey? The PTransforms that result from combining other Beam
>> transforms (e.g., _ChainPTransform in Python) are private, so this seems
>> like something that would need to exist in Beam itself, if it could exist
>> at all.
>>
>> Cheers,
>> Stephan
>>
>


Re: Proposal: Generalize S3FileSystem

2021-05-24 Thread Matt Rudary
Thanks for the comments all. I forgot to subscribe to dev before I sent out the 
email, so this response isn't threaded properly.



My proposed design is to do the following (for both the aws and aws2 packages):

1. Add a public class, S3FileSystemConfiguration, that mostly maps to
S3Options, plus a Scheme field.

2. Add a public interface, S3FileSystemSchemeRegistrar, designed for use with
AutoService. It will have a method that takes a PipelineOptions and returns an
Iterable of S3FileSystemConfiguration. This will be the way that users
register their S3 URI schemes with the system.

3. Add an implementation of S3FileSystemSchemeRegistrar for the s3 scheme that
uses the S3Options from PipelineOptions to populate its
S3FileSystemConfiguration, maintaining the current behavior by default.

4. Modify S3FileSystem's constructor to take an S3FileSystemConfiguration
object instead of an S3Options, and make the relevant changes.

5. Modify S3FileSystemRegistrar to load all the AutoService'd file system
configurations, raising an exception if multiple scheme registrars attempt to
register the same scheme.

I considered alternative methods of configuration, in particular using a
configuration file as in HadoopFileSystemOptions. In the end, I decided that
the AutoService approach was better. First, it seems more common to do things
this way within Beam. Second, unlike with Hadoop, there is no commonly used
configuration format for these types of file systems already in use, and it's
not clear what the best choice would be (YAML? JSON? Java properties? XML?).
Finally, I think the story for composing multiple registrars is better than
the story for composing multiple configuration files; for example, this use
case may make sense if you are dealing with multiple storage vendors.

Matt



On 2021/05/19 13:27:16, Matt Rudary  wrote:

> Hi,
>
> This is a quick sketch of a proposal - I wanted to get a sense of whether
> there's general support for this idea before fleshing it out further,
> getting internal approvals, etc.
>
> I'm working with multiple storage systems that speak the S3 api. I would
> like to support FileIO operations for these storage systems, but
> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
> schemes) and it is in any case impossible to instantiate more than one in
> the current design.
>
> I'd like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and maybe
> ...aws.options) somewhat to enable this use-case. I haven't worked out the
> details yet, but it will take some thought to make this work in a non-hacky
> way.
>
> Thanks
> Matt Rudary


Re: Apache Beam Access Request for Anthony Zhu

2021-05-24 Thread Anthony Zhu
Hi,

This is Anthony Zhu from Google again; I am working closely with Rui Wang
on the Java SDK harness. Can someone add me as a contributor for the Beam
Jira issue tracker?
I would like to assign myself the ticket that I will be working on this
summer. My Jira username is aqzhu.

Thanks,
Anthony

On Thu, May 20, 2021 at 11:16 AM Anthony Zhu  wrote:

> Hi,
>
> I am requesting permission to access the Apache Beam JIRA for Anthony Zhu,
> username aqzhu for my work with Rui Wang.
>
> Thanks,
> Anthony
>
> --
> *Anthony Zhu*
> University of Michigan Class of 2022
> College of Engineering | CSE
> 978.496.4537 | aq...@umich.edu
>
>

-- 
*Anthony Zhu*
University of Michigan Class of 2022
College of Engineering | CSE
978.496.4537 | aq...@umich.edu


Re: beam dev community

2021-05-24 Thread Boyuan Zhang
Welcome! Please check out the contribution guide if you plan to contribute
to Beam: https://beam.apache.org/contribute/

On Mon, May 24, 2021 at 11:30 AM shahnawaz aziz 
wrote:

> Please add me to this community, as we are using Beam for our big data
> project.
>
> Thanks
>


Flaky test issue report (40)

2021-05-24 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-12322: 
FnApiRunnerTestWithGrpcAndMultiWorkers flaky (py precommit) (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12309: 
PubSubIntegrationTest.test_streaming_data_only flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12307: 
PubSubBigQueryIT.test_file_loads flake (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12303: Flake in 
PubSubIntegrationTest.test_streaming_with_attributes (created 2021-05-06)
https://issues.apache.org/jira/browse/BEAM-12293: 
FlinkSavepointTest.testSavepointRestoreLegacy flakes due to 
FlinkJobNotFoundException (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (created 2021-03-18)
https://issues.apache.org/jira/browse/BEAM-11792: Python precommit failed 
(flaked?) installing package  (created 2021-02-10)
https://issues.apache.org/jira/browse/BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (created 2021-01-20)
https://issues.apache.org/jira/browse/BEAM-11662: elasticsearch tests 
failing (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11645: beam_PostCommit_XVR_Flink 
failing (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11541: 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner. 
(created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-11540: Linter sometimes flakes 
on apache_beam.dataframe.frames_test (created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-10995: Java + Universal Local 
Runner: WindowingTest.testWindowPreservation fails (created 2020-09-30)
https://issues.apache.org/jira/browse/BEAM-10987: 
stager_test.py::StagerTest::test_with_main_session flaky on windows py3.6,3.7 
(created 2020-09-29)
https://issues.apache.org/jira/browse/BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (created 2020-09-25)
https://issues.apache.org/jira/browse/BEAM-10955: Flink Java Runner test 
flake: Could not find Flink job  (created 2020-09-23)
https://issues.apache.org/jira/browse/BEAM-10866: 
PortableRunnerTestWithSubprocesses.test_register_finalizations flaky on macOS 
(created 2020-09-09)
https://issues.apache.org/jira/browse/BEAM-10504: Failure / flake in 
ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn (created 
2020-07-15)
https://issues.apache.org/jira/browse/BEAM-10501: 
CheckGrafanaStalenessAlerts and PingGrafanaHttpApi fail with Connection refused 
(created 2020-07-15)
https://issues.apache.org/jira/browse/BEAM-10485: Failure / flake: 
ElasticsearchIOTest > testWriteWithIndexFn (created 2020-07-14)
https://issues.apache.org/jira/browse/BEAM-9649: 
beam_python_mongoio_load_test started failing due to mismatched results 
(created 2020-03-31)
https://issues.apache.org/jira/browse/BEAM-9392: TestStream tests are all 
flaky (created 2020-02-27)
https://issues.apache.org/jira/browse/BEAM-9232: 
BigQueryWriteIntegrationTests is flaky coercing to Unicode (created 2020-01-31)
https://issues.apache.org/jira/browse/BEAM-9119: 
apache_beam.runners.portability.fn_api_runner_test.FnApiRunnerTest[...].test_large_elements
 is flaky (created 2020-01-14)
https://issues.apache.org/jira/browse/BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (created 2019-08-27)
https://issues.apache.org/jira/browse/BEAM-8035: 
[beam_PreCommit_Java_Phrase] [WatchTest.testMultiplePollsWithManyResults]  
Flake: Outputs must be in timestamp order (created 2019-08-22)
https://issues.apache.org/jira/browse/BEAM-7992: Unhandled type_constraint 
in 
apache_b

beam dev community

2021-05-24 Thread shahnawaz aziz
Please add me to this community, as we are using Beam for our big data
project.

Thanks


P1 issues report (38)

2021-05-24 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-12389: 
beam_PostCommit_XVR_Dataflow flaky: Expand method not found (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12387: beam_PostCommit_Python* 
timing out (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12386: 
beam_PostCommit_Py_VR_Dataflow(_V2) failing metrics tests (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12380: Go SDK Kafka IO Transform 
implemented via XLang (created 2021-05-21)
https://issues.apache.org/jira/browse/BEAM-12374: Spark postcommit failing 
ResumeFromCheckpointStreamingTest (created 2021-05-20)
https://issues.apache.org/jira/browse/BEAM-12316: LGPL in bundled 
dependencies (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12310: 
beam_PostCommit_Java_DataflowV2 failing (created 2021-05-07)
https://issues.apache.org/jira/browse/BEAM-12279: Implement 
destination-dependent sharding in FileIO.writeDynamic (created 2021-05-04)
https://issues.apache.org/jira/browse/BEAM-12258: SQL postcommit timing out 
(created 2021-04-30)
https://issues.apache.org/jira/browse/BEAM-12256: 
PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some 
Avro logical types (created 2021-04-29)
https://issues.apache.org/jira/browse/BEAM-12231: 
beam_PostRelease_NightlySnapshot failing (created 2021-04-27)
https://issues.apache.org/jira/browse/BEAM-11959: Python Beam SDK Harness 
hangs when installing pip packages (created 2021-03-11)
https://issues.apache.org/jira/browse/BEAM-11906: No trigger early 
repeatedly for session windows (created 2021-03-01)
https://issues.apache.org/jira/browse/BEAM-11875: XmlIO.Read does not 
handle XML encoding per spec (created 2021-02-26)
https://issues.apache.org/jira/browse/BEAM-11828: JmsIO is not 
acknowledging messages correctly (created 2021-02-17)
https://issues.apache.org/jira/browse/BEAM-11755: Cross-language 
consistency (RequiresStableInputs) is quietly broken (at least on portable 
flink runner) (created 2021-02-05)
https://issues.apache.org/jira/browse/BEAM-11578: `dataflow_metrics` 
(python) fails with TypeError (when int overflowing?) (created 2021-01-06)
https://issues.apache.org/jira/browse/BEAM-11434: Expose Spanner 
admin/batch clients in Spanner Accessor (created 2020-12-10)
https://issues.apache.org/jira/browse/BEAM-11227: Upgrade 
beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216 (created 2020-11-10)
https://issues.apache.org/jira/browse/BEAM-11148: Kafka 
commitOffsetsInFinalize OOM on Flink (created 2020-10-28)
https://issues.apache.org/jira/browse/BEAM-11017: Timer with dataflow 
runner can be set multiple times (dataflow runner) (created 2020-10-05)
https://issues.apache.org/jira/browse/BEAM-10670: Make non-portable 
Splittable DoFn the only option when executing Java "Read" transforms (created 
2020-08-10)
https://issues.apache.org/jira/browse/BEAM-10617: python 
CombineGlobally().with_fanout() cause duplicate combine results for sliding 
windows (created 2020-07-31)
https://issues.apache.org/jira/browse/BEAM-10569: SpannerIO tests don't 
actually assert anything. (created 2020-07-23)
https://issues.apache.org/jira/browse/BEAM-10288: Quickstart documents are 
out of date (created 2020-06-19)
https://issues.apache.org/jira/browse/BEAM-10244: Populate requirements 
cache fails on poetry-based packages (created 2020-06-11)
https://issues.apache.org/jira/browse/BEAM-10100: FileIO writeDynamic with 
AvroIO.sink not writing all data (created 2020-05-27)
https://issues.apache.org/jira/browse/BEAM-9564: Remove insecure ssl 
options from MongoDBIO (created 2020-03-20)
https://issues.apache.org/jira/browse/BEAM-9455: Environment-sensitive 
provisioning for Dataflow (created 2020-03-05)
https://issues.apache.org/jira/browse/BEAM-9293: Python direct runner 
doesn't emit empty pane when it should (created 2020-02-11)
https://issues.apache.org/jira/browse/BEAM-8986: SortValues may not work 
correct for numerical types (created 2019-12-17)
https://issues.apache.org/jira/browse/BEAM-8985: SortValues should fail if 
SecondaryKey coder is not deterministic (created 2019-12-17)
https://issues.apache.org/jira/browse/BEAM-8407: [SQL] Some Hive tests 
throw NullPointerException, but get marked as passing (Direct Runner) (created 
2019-10-15)
https://issues.apache.org/jira/browse/BEAM-7717: PubsubIO watermark 
tracking hovers near start of epoch (created 2019-07-10)
https://issues.apache.org/jira/browse/BEAM-7716: PubsubIO 

Re: Apply a Beam PTransform per key

2021-05-24 Thread Kenneth Knowles
I'm just pinging this thread because I think it is an interesting problem
and don't want it to slip by.

I bet a lot of users have gone through the tedious conversion you describe.
Of course, it may often not be possible if you are using a library
transform. There are a number of aspects of the Beam model that are
designed a specific way explicitly *because* we need to assume that a large
number of composites in your pipeline are not modifiable by you. Most
closely related: this is why windowing is something carried along
implicitly rather than just a parameter to GBK - that would require all
transforms to expose how they use GBK under the hood and they would all
have to plumb this extra key/WindowFn through every API. Instead, we have
this way to implicitly add a second key to any transform :-)

So in addition to being tedious for you, it would be good to have a better
solution.

Kenn

On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer  wrote:

> I'd like to write a Beam PTransform that applies an *existing* Beam
> transform to each set of grouped values, separately, and combines the
> result. Is anything like this possible with Beam using the Python SDK?
>
> Here are the closest things I've come up with:
> 1. If each set of *inputs* to my transform fit into memory, I could use
> GroupByKey followed by FlatMap.
> 2. If each set of *outputs* from my transform fit into memory, I could
> use CombinePerKey.
> 3. If I knew the static number of groups ahead of time, I could use
> Partition, followed by applying my transform multiple times, followed by
> Flatten.
>
> In my scenario, none of these holds true. For example, currently I have
> ~20 groups of values, with each group holding ~1 TB of data. My custom
> transform simply shuffles this TB of data around, so each set of outputs is
> also 1TB in size.
>
> In my particular case, it seems my options are to either relax these
> constraints, or to manually convert each step of my existing transform to
> apply per key. This conversion process is tedious, but very
> straightforward, e.g., the GroupByKey and ParDo that my transform is built
> out of just need to deal with an expanded key.
>
> I wonder, could this be something built into Beam itself, e.g., as
> TransformPerKey? The PTransforms that result from combining other Beam
> transforms (e.g., _ChainPTransform in Python) are private, so this seems
> like something that would need to exist in Beam itself, if it could exist
> at all.
>
> Cheers,
> Stephan
>
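
For concreteness, a minimal sketch of option 3 from the quoted list, assuming
the set of groups is known when the pipeline is built: Partition the input,
apply the existing transform to each part, and Flatten the results.
MyExistingTransform and the keys are hypothetical stand-ins for the real
per-dataset transform and group names.

import apache_beam as beam

KEYS = ['a', 'b', 'c']  # the static set of groups, known ahead of time

class MyExistingTransform(beam.PTransform):
    # Hypothetical stand-in for the transform written for a single dataset;
    # the real one would contain its own GroupByKeys, ParDos, etc.
    def expand(self, pcoll):
        return pcoll | beam.Map(lambda kv: (kv[0], kv[1] * 2))

with beam.Pipeline() as p:
    elements = p | beam.Create([('a', 1), ('b', 2), ('c', 3), ('a', 4)])
    # Route each element to the partition for its group.
    parts = elements | beam.Partition(
        lambda kv, num_parts: KEYS.index(kv[0]), len(KEYS))
    # Apply the existing transform once per group, then merge the results.
    results = [
        part | f'transform_{key}' >> MyExistingTransform()
        for key, part in zip(KEYS, parts)
    ]
    merged = results | beam.Flatten()

This also makes the limitation concrete: the keys are baked into the pipeline
shape at construction time, which is exactly what a built-in TransformPerKey
would avoid.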


Spark Structured Streaming Runner Roadmap

2021-05-24 Thread Yu Zhang
Hi Beam Community, 

Is there any roadmap for the Spark Structured Streaming runner to support
streaming and the Splittable DoFn API, such as a specific timeline or release
version?

Thanks, 
Yu

Re: GCS API call testing

2021-05-24 Thread Miguel Hernández Sandoval
Sorry, forgot to include the ticket
[1] https://issues.apache.org/jira/browse/BEAM-11984

On Mon, May 24, 2021 at 7:54 AM Miguel Hernández Sandoval <
rogelio.hernan...@wizeline.com> wrote:

> Hi team,
> I've been working on this ticket[1] where some API requests to GCS are
> counted and kept as process-wide metrics.
>
> I was hoping you could give me some advice on how to test this kind of
> feature using a real response from GCS instead of mocked responses, and
> how to test it in a pipeline.
>
> Thank you all for your help
> -Mike
>


-- 

Miguel Hernández Sandoval | WIZELINE

Software Engineer

rogelio.hernan...@wizeline.com

Amado Nervo 2200, Esfera P6, Col. Jardines del Sol, 45050 Zapopan, Jal.



GCS API call testing

2021-05-24 Thread Miguel Hernández Sandoval
Hi team,
I've been working on this ticket[1] where some API requests to GCS are
counted and kept as process-wide metrics.

I was hoping you could give me some advice on how to test this kind of
feature using a real response from GCS instead of mocked responses, and
how to test it in a pipeline.

Thank you all for your help
-Mike
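
For the "test it using a pipeline" half of the question, one common pattern is
to run a small pipeline and assert on the metrics its result reports. A
minimal sketch with a hypothetical counter (the ticket's metrics are
process-wide, which may need a different query path, so this only shows the
general pattern):

import apache_beam as beam
from apache_beam.metrics.metric import Metrics, MetricsFilter

class CountingDoFn(beam.DoFn):
    # Hypothetical DoFn standing in for code that counts GCS API requests.
    def __init__(self):
        self.requests = Metrics.counter('gcsio', 'api_requests')

    def process(self, element):
        self.requests.inc()  # the real change increments once per API call
        yield element

p = beam.Pipeline()
_ = p | beam.Create([1, 2, 3]) | beam.ParDo(CountingDoFn())
result = p.run()
result.wait_until_finish()

# Query the counter from the pipeline result and assert on it.
counters = result.metrics().query(
    MetricsFilter().with_name('api_requests'))['counters']
assert counters[0].result == 3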



Beam Dependency Check Report (2021-05-24)

2021-05-24 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:


Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
chromedriver-binary | 88.0.4324.96.0 | 91.0.4472.19.0 | 2021-01-25 | 2021-04-26 | BEAM-10426
dill | 0.3.1.1 | 0.3.3 | 2019-10-07 | 2020-11-02 | BEAM-11167
google-cloud-bigtable | 1.7.0 | 2.2.0 | 2021-04-12 | 2021-05-03 | BEAM-8127
google-cloud-datastore | 1.15.3 | 2.1.2 | 2020-11-16 | 2021-05-10 | BEAM-8443
google-cloud-dlp | 1.0.0 | 3.0.1 | 2020-06-29 | 2021-02-01 | BEAM-10344
google-cloud-language | 1.3.0 | 2.0.0 | 2020-10-26 | 2020-10-26 | BEAM-8
google-cloud-pubsub | 1.7.0 | 2.5.0 | 2020-07-20 | 2021-05-24 | BEAM-5539
google-cloud-spanner | 1.19.1 | 3.4.0 | 2020-11-16 | 2021-05-03 | BEAM-10345
google-cloud-videointelligence | 1.16.1 | 2.1.0 | 2020-11-23 | 2021-04-05 | BEAM-11319
google-cloud-vision | 1.0.0 | 2.3.1 | 2020-03-24 | 2021-04-19 | BEAM-9581
idna | 2.10 | 3.1 | 2021-01-04 | 2021-01-11 | BEAM-9328
mock | 2.0.0 | 4.0.3 | 2019-05-20 | 2020-12-14 | BEAM-7369
mypy-protobuf | 1.18 | 2.4 | 2020-03-24 | 2021-02-08 | BEAM-10346
nbconvert | 5.6.1 | 6.0.7 | 2020-10-05 | 2020-10-05 | BEAM-11007
Pillow | 7.2.0 | 8.2.0 | 2020-10-19 | 2021-04-05 | BEAM-11071
PyHamcrest | 1.10.1 | 2.0.2 | 2020-01-20 | 2020-07-08 | BEAM-9155
pytest | 4.6.11 | 6.2.4 | 2020-07-08 | 2021-05-10 | BEAM-8606
pytest-xdist | 1.34.0 | 2.2.1 | 2020-08-17 | 2021-02-15 | BEAM-10713
setuptools | 56.0.0 | 57.0.0 | 2021-04-12 | 2021-05-24 | BEAM-10714
tenacity | 5.1.5 | 7.0.0 | 2019-11-11 | 2021-03-08 | BEAM-8607
typing-extensions | 3.7.4.3 | 3.10.0.0 | 2021-05-03 | 2021-05-03 | BEAM-12267

High Priority Dependency Updates Of Beam Java SDK:

Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
com.azure:azure-core | 1.6.0 | 1.16.0 | 2020-07-02 | 2021-05-07 | BEAM-11888
com.azure:azure-identity | 1.0.8 | 1.3.0 | 2020-07-07 | 2021-05-12 | BEAM-11814
com.azure:azure-storage-common | 12.8.0 | 12.12.0-beta.1 | 2020-08-13 | 2021-05-13 | BEAM-11889
com.datastax.cassandra:cassandra-driver-core | 3.10.2 | 4.0.0 | 2020-08-26 | 2019-03-18 | BEAM-8674
com.esotericsoftware:kryo | 4.0.2 | 5.1.1 | 2018-03-20 | 2021-05-02 | BEAM-5809
com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.33.0 | 0.38.0 | 2020-09-14 | 2021-03-08 | BEAM-6645
com.google.api.grpc:proto-google-cloud-bigquerystorage-v1 | 1.17.0 | 1.21.1 | 2021-03-30 | 2021-05-19 | BEAM-11890
com.google.api.grpc:proto-google-cloud-bigquerystorage-v1beta2 | 0.117.0 | 0.121.1 | 2021-03-30 | 2021-05-19 | BEAM-11891
com.google.api.grpc:proto-google-cloud-bigtable-admin-v2 | 1.22.0 | 1.25.0 | None | 2021-05-24 | BEAM-12152
com.google.api.grpc:proto-google-cloud-bigtable-v2 | 1.22.0 | 1.25.0 | 2021-04-07 | 2021-05-19 | BEAM-8679
com.google.api.grpc:proto-google-cloud-dlp-v2 | 1.1.4 | 2.3.4 | 2020-05-04 | 2021-05-14 | BEAM-11892
com.google.api.grpc:proto-google-cloud-video-intelligence-v1 | 1.2.0 | 1.6.5 | 2020-03-10 | 2021-05-14 | BEAM-11894
com.google.api.grpc:proto-google-cloud-vision-v1 | 1.81.3 | 1.102.3 | 2020-04-07 | 2021-05-14 | BEAM-11895
com.google.apis:google-api-services-bigquery | v2-rev20210410-1.31.0 | v2-rev20210430-1.31.0 | 2021-04-16 | 2021-05-07 | BEAM-8684
com.google.apis:google-api-services-cloudresourcemanager | v1-rev20210331-1.31.0 | v3-rev20210411-1.31.0 | 2021-04-09 | 2021-04-20 | BEAM-8751
com.google.apis:google-api-services-dataflow | v1b3-rev20210408-1.31.0 | v1beta3-rev12-1.20.0 | 2021-04-17 | 2015-04-29 | BEAM-8752
com.google.apis:google-api-services-healthcare | v1beta1-rev20210407-1.31.0 | v1-rev20210507-1.31.0 |  |  |

KafkaIO SSL issue

2021-05-24 Thread Ilya Kozyrev
Hi community,

We have an issue with KafkaIO when using a secure connection (SASL SSL) to
Confluent Kafka 5.5.1. When we try to configure the Kafka consumer using
consumerFactoryFn, we hit an intermittent issue related to reading
certificates from the file system. Intermittent means that different Dataflow
jobs with the same parameters and certs may either fail or succeed. The store
types for the keystore and truststore are specified explicitly in the
consumer config; in our case, it's JKS for both certs.

Stack trace:
Caused by: org.apache.kafka.common.KafkaException: Failed to load SSL keystore 
/tmp/kafka.truststore.jks of type JKS
  at 
org.apache.kafka.common.security.ssl.SslEngineBuilder$SecurityStore.load(SslEngineBuilder.java:289)
  at 
org.apache.kafka.common.security.ssl.SslEngineBuilder.createSSLContext(SslEngineBuilder.java:153)
  ... 23 more
Caused by: java.security.cert.CertificateException: Unable to initialize, 
java.io.IOException: DerInputStream.getLength(): lengthTag=65, too big.
  at sun.security.x509.X509CertImpl.<init>(X509CertImpl.java:198)
  at 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:102)
  at 
java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
  at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:755)
  at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:56)
  at 
sun.security.provider.KeyStoreDelegator.engineLoad(KeyStoreDelegator.java:224)
  at 
sun.security.provider.JavaKeyStore$DualFormatJKS.engineLoad(JavaKeyStore.java:70)
  at java.security.KeyStore.load(KeyStore.java:1445)
  at 
org.apache.kafka.common.security.ssl.SslEngineBuilder$SecurityStore.load(SslEngineBuilder.java:286)
  ... 24 more

/tmp/kafka.truststore.jks is the path used in consumerFactoryFn to download
certs from GCP to the worker's local file system.

Does anyone have any ideas on how to fix this issue?


Thank you,
Ilya