Re: [ANNOUNCE] New committer: Yi Hu

2022-11-09 Thread Brian Hulette via dev
Well deserved! Congratulations Yi

On Wed, Nov 9, 2022 at 11:25 AM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> I am with the Beam PMC on this, congratulations and very well deserved, Yi!
>
> On Wed, Nov 9, 2022 at 11:08 AM Byron Ellis via dev 
> wrote:
>
>> Congratulations!
>>
>> On Wed, Nov 9, 2022 at 11:00 AM Pablo Estrada via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 thanks Yi : D
>>>
>>> On Wed, Nov 9, 2022 at 10:47 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Yi! I've really appreciated the ways you've consistently taken
 responsibility for improving our team's infra and working through sharp
 edges in the codebase that others have ignored. This is definitely well
 deserved!

 Thanks,
 Danny

 On Wed, Nov 9, 2022 at 1:37 PM Anand Inguva via dev <
 dev@beam.apache.org> wrote:

> Congratulations Yi!
>
> On Wed, Nov 9, 2022 at 1:35 PM Ritesh Ghorse via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Yi!
>>
>> On Wed, Nov 9, 2022 at 1:34 PM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Yi!
>>>
>>> On Wed, Nov 9, 2022 at 1:33 PM Sachin Agarwal via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations Yi!

 On Wed, Nov 9, 2022 at 10:32 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Yi Hu (y...@apache.org)
>
> Yi started contributing to Beam in early 2022. Yi's contributions
> are very diverse! I/Os, performance tests, Jenkins, support for Schema
> logical types. Not only code but a very large amount of code review. Yi is
> also noted for picking up smaller issues that normally would be left on the
> backburner and filing issues that he finds rather than ignoring them.
>
> Considering their contributions to the project over this
> timeframe, the Beam PMC trusts Yi with the responsibilities of a Beam
> committer. [1]
>
> Thank you Yi! And we are looking forward to seeing more of your contributions!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>



Re: Beam starter projects dependency updates

2022-11-07 Thread Brian Hulette via dev
These have all been addressed. I went through and merged all of them,
except for the slf4j-jdk14 dependency in Java and Kotlin. After consulting
with Luke [1] I told dependabot to ignore this dependency.

[1]
https://github.com/apache/beam-starter-java/pull/26#issuecomment-1302639941


On Wed, Nov 2, 2022 at 10:58 AM David Cavazos  wrote:

> Hi, I just opened some PRs to auto-assign dependabot PRs.
>
> Java: https://github.com/apache/beam-starter-java/pull/29
> Python: https://github.com/apache/beam-starter-python/pull/11
> Go: https://github.com/apache/beam-starter-go/pull/7
> Kotlin: https://github.com/apache/beam-starter-kotlin/pull/9
>
> For the existing dependabot PRs, can someone help me batch merge them? All
> tests are passing on all of them, so they should all be safe to merge.
>
> On Thu, Oct 27, 2022 at 1:56 PM Brian Hulette  wrote:
>
>> Could we just use the same set of reviewers as pr-bot in the main repo
>> [1]? I don't think that we could avoid duplicating the data though.
>>
>> [1]
>> https://github.com/apache/beam/blob/728e8ecc8a40d3d578ada7773b77eca2b3c68d03/.github/REVIEWERS.yml
>>
>> On Thu, Oct 27, 2022 at 12:20 PM David Cavazos via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone!
>>>
>>> We want to make sure the Beam starter projects always come with the
>>> latest (compatible) versions for every dependency. I enabled Dependabot on
>>> all of them to automate this as much as possible, and we have automated
>>> tests to make sure everything works as expected.
>>>
>>> However, we still need someone to merge Dependabot's PRs. The good news
>>> is that since the starter projects are so simple, if tests pass they're
>>> most likely safe to merge, and tests only take a couple minutes to run.
>>>
>>> We could either batch update all dependencies as part of the release
>>> process, or have people check them periodically (like an owner per
>>> language).
>>>
>>> These are all the repos we have to keep an eye on:
>>>
>>>- https://github.com/apache/beam-starter-java -- 9 updates, all
>>>tests passing
>>>- https://github.com/apache/beam-starter-python -- 2 updates, all
>>>tests passing
>>>- https://github.com/apache/beam-starter-go -- 0 updates
>>>- https://github.com/apache/beam-starter-kotlin -- 3 updates, all
>>>tests passing
>>>- https://github.com/apache/beam-starter-scala -- not done yet, but
>>>keep an eye on it
>>>
>>>


[Python][Bikeshed] typehint vs. type-hint vs. "type hint"

2022-11-07 Thread Brian Hulette via dev
Hi everyone,

In a recent code review we noticed that we are not consistent when
describing python type hints in documentation. Depending on who wrote the
patch, we switch between typehint, type-hint, and "type hint" [1].

I think we should standardize on "type hint" as this is what Guido used in
PEP 484 [2]. Please comment on the issue in the next few days if you
disagree with this approach.

Note this is orthogonal to how we refer to type hints in _code_, in our
public APIs. In general we use "type" in that context (e.g.
`with_input_types`), and there doesn't seem to be a consistency issue.

[1] https://github.com/apache/beam/issues/23950
[2] https://peps.python.org/pep-0484/


Re: Beam Website Feedback

2022-10-27 Thread Brian Hulette via dev
I proposed https://github.com/apache/beam/pull/23877 to address this.

On Thu, Oct 27, 2022 at 2:12 PM Sachin Agarwal  wrote:

> No objections here.  The latter (the surviving one) is the one linked in
> the top navigation bar and has the x-lang details that help.
>
> On Thu, Oct 27, 2022 at 2:09 PM Brian Hulette  wrote:
>
>> Hm, it seems like we need to drop
>> https://beam.apache.org/documentation/io/built-in/ as it's been
>> superseded by https://beam.apache.org/documentation/io/connectors/
>>
>> Would there be any objections to that?
>>
>> On Thu, Oct 27, 2022 at 2:04 PM Sachin Agarwal via dev <
>> dev@beam.apache.org> wrote:
>>
>>> JDBCIO is available as a Java-based IO.  It is also listed on
>>> https://beam.apache.org/documentation/io/connectors/
>>>
>>> On Thu, Oct 27, 2022 at 2:01 PM Charles Kangai <
>>> char...@charleskangai.co.uk> wrote:
>>>
 What about jdbc?
 I want to use Beam to read/write to/from a relational database, e.g.
 Oracle or Microsoft SQL Server.
 I don’t see a connector on your page:
 https://beam.apache.org/documentation/io/built-in

 Thanks,
 Charles Kangai



>>>


Re: Beam Website Feedback

2022-10-27 Thread Brian Hulette via dev
Hm, it seems like we need to drop
https://beam.apache.org/documentation/io/built-in/ as it's been superseded
by https://beam.apache.org/documentation/io/connectors/

Would there be any objections to that?

On Thu, Oct 27, 2022 at 2:04 PM Sachin Agarwal via dev 
wrote:

> JDBCIO is available as a Java-based IO.  It is also listed on
> https://beam.apache.org/documentation/io/connectors/
>
> On Thu, Oct 27, 2022 at 2:01 PM Charles Kangai <
> char...@charleskangai.co.uk> wrote:
>
>> What about jdbc?
>> I want to use Beam to read/write to/from a relational database, e.g.
>> Oracle or Microsoft SQL Server.
>> I don’t see a connector on your page:
>> https://beam.apache.org/documentation/io/built-in
>>
>> Thanks,
>> Charles Kangai
>>
>>
>>
>


Re: Beam starter projects dependency updates

2022-10-27 Thread Brian Hulette via dev
Could we just use the same set of reviewers as pr-bot in the main repo [1]?
I don't think that we could avoid duplicating the data though.

[1]
https://github.com/apache/beam/blob/728e8ecc8a40d3d578ada7773b77eca2b3c68d03/.github/REVIEWERS.yml

On Thu, Oct 27, 2022 at 12:20 PM David Cavazos via dev 
wrote:

> Hi everyone!
>
> We want to make sure the Beam starter projects always come with the latest
> (compatible) versions for every dependency. I enabled Dependabot on all of
> them to automate this as much as possible, and we have automated tests to
> make sure everything works as expected.
>
> However, we still need someone to merge Dependabot's PRs. The good news is
> that since the starter projects are so simple, if tests pass they're most
> likely safe to merge, and tests only take a couple minutes to run.
>
> We could either batch update all dependencies as part of the release
> process, or have people check them periodically (like an owner per
> language).
>
> These are all the repos we have to keep an eye on:
>
>- https://github.com/apache/beam-starter-java -- 9 updates, all tests
>passing
>- https://github.com/apache/beam-starter-python -- 2 updates, all
>tests passing
>- https://github.com/apache/beam-starter-go -- 0 updates
>- https://github.com/apache/beam-starter-kotlin -- 3 updates, all
>tests passing
>- https://github.com/apache/beam-starter-scala -- not done yet, but
>keep an eye on it
>
>


Re: Beam Website Feedback

2022-10-04 Thread Brian Hulette via dev
On Tue, Oct 4, 2022 at 8:58 AM Alexey Romanenko 
wrote:

> Thanks for your feedback.
>
> At the time, using a Google website search was the simplest solution
> since, before, we didn’t have a search at all. I agree that it could be
> frustrating to have ad links before the actual results (not sure that we
> can avoid them there) but "it is what it is” and it's still possible to
> reach the correct links further down, which is better than nothing.
>
> The Beam community always welcomes suggestions and, especially,
> contributions to improve the project in any possible way. I’d be happy to
> assist on this topic if someone decides to improve the Beam website search.
>

+1, PRs welcome :)
I put some specific suggestions for a replacement in the issue, based on
recommendations from the hugo docs [1].

[1] https://gohugo.io/tools/search/


>
> —
> Alexey
>
> On 3 Oct 2022, at 23:21, Borris  wrote:
>
> This is my experience of trying the search capability.
>
>- I know I want to read about dataframes (I was reading this 10
>minutes ago but browsing history didn't take me back to where I wanted)
>- I search for "dataframes"
>- I am presented with a whole load of pages that are elsewhere (other
>sites) - maybe what I want is some pages below, but I stop at this point as
>I think it's a fundamental failure of what I expect from the search dialogue
>- If I enter "beam.apache.org: dataframe" to the search dialogue then
>the sensible relevant page is now visible, only 5 links down
>- I know this may be a penalty of getting a "free" search service from
>your viewpoint
>- But from my viewpoint this is a failure. Your search capability
>fails to understand that by searching for something on your site, rather
>than generically through a search engine, I am massively predisposed to the
>pages on your site, whereas the search results are more predisposed to
>offering advertising opportunities.
>- It is very frustrating that something as simple as, on the Beam
>site, going to the page about Beam Dataframes takes such a level of hoop
>jumping
>
> That is my feedback offering. Thank you for taking the time to read it.
>
>
>
>
>


Re: Beam Website Feedback

2022-10-03 Thread Brian Hulette via dev
Thanks Borris, that is helpful feedback. I filed an issue [1] to track
improving this.

[1] https://github.com/apache/beam/issues/23472

On Mon, Oct 3, 2022 at 2:32 PM Borris  wrote:

> This is my experience of trying the search capability.
>
>- I know I want to read about dataframes (I was reading this 10
>minutes ago but browsing history didn't take me back to where I wanted)
>- I search for "dataframes"
>- I am presented with a whole load of pages that are elsewhere (other
>sites) - maybe what I want is some pages below, but I stop at this point as
>I think it's a fundamental failure of what I expect from the search dialogue
>- If I enter "beam.apache.org: dataframe" to the search dialogue then
>the sensible relevant page is now visible, only 5 links down
>- I know this may be a penalty of getting a "free" search service from
>your viewpoint
>- But from my viewpoint this is a failure. Your search capability
>fails to understand that by searching for something on your site, rather
>than generically through a search engine, I am massively predisposed to the
>pages on your site, whereas the search results are more predisposed to
>offering advertising opportunities.
>- It is very frustrating that something as simple as, on the Beam
>site, going to the page about Beam Dataframes takes such a level of hoop
>jumping
>
> That is my feedback offering. Thank you for taking the time to read it.
>
>
>
>


Re: Out of band pickling in Python (pickle5)

2022-09-19 Thread Brian Hulette via dev
I got to thinking about this again and ran some benchmarks. The result is
documented in the GitHub issue [1].

tl;dr: we can't realize a huge benefit since we don't actually have an
out-of-band path for exchanging the buffers. However, pickle 5 can yield
improved in-band performance as well [2], and I think we can take advantage
of this with some relatively simple adjustments to PickleCoder and
OutputStream.

[1] https://github.com/apache/beam/issues/20900#issuecomment-1251658001
[2] https://peps.python.org/pep-0574/#improved-in-band-performance
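
To make the mechanism concrete, here is a minimal stdlib-only sketch of
protocol-5 out-of-band pickling (this is not Beam's PickleCoder; a plain
bytearray stands in for a NumPy array, and the names are illustrative):

```python
import pickle

# A writable buffer; NumPy arrays take the same out-of-band path via the
# buffer protocol. pickle.PickleBuffer marks it as out-of-band capable.
data = bytearray(b"x" * 1_000_000)

# Serialize: with protocol 5, the raw bytes are handed to buffer_callback
# instead of being copied into the pickle payload.
buffers = []
payload = pickle.dumps(pickle.PickleBuffer(data), protocol=5,
                       buffer_callback=buffers.append)

# The in-band payload holds only metadata; the megabyte lives in `buffers`.
assert len(payload) < 100 and len(buffers) == 1

# Deserialize: the consumer must supply the same buffers, in order.
restored = pickle.loads(payload, buffers=buffers)
assert bytes(restored) == bytes(data)
```

Threading the `buffer_callback`/`buffers` arguments through a coder's
encode/decode paths is essentially the PickleCoder change discussed in the
quoted thread below.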

On Thu, May 27, 2021 at 5:15 PM Stephan Hoyer  wrote:

> I'm unlikely to have bandwidth to take this one on, but I do think it
> would be quite valuable!
>
> On Thu, May 27, 2021 at 4:42 PM Brian Hulette  wrote:
>
>> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
>> you have any interest in taking it on?
>>
>> On Tue, May 25, 2021 at 3:09 PM Brian Hulette 
>> wrote:
>>
>>> Hm this would definitely be of interest for the DataFrame API, which is
>>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>>> that pandas supports out-of-band pickling since DataFrames are mostly just
>>> collections of numpy arrays.
>>>
>>> Brian
>>>
>>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>>
>>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:
>>>
 Beam's PickleCoder would need to be updated to pass the
 "buffer_callback" argument into pickle.dumps() and the "buffers" argument
 into pickle.loads(). I expect this would be relatively straightforward.

 Then it should "just work", assuming that data is stored in objects
 (like NumPy arrays or wrappers of NumPy arrays) that implement the
 out-of-band Pickle protocol.


 On Tue, May 25, 2021 at 2:50 PM Brian Hulette 
 wrote:

> I'm not aware of anyone looking at it.
>
> Will out-of-band pickling "just work" in Beam for types that implement
> the correct interface in Python 3.8?
>
> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
> wrote:
>
>> +1
>>
>> FWIW I recently ran into the exact case you described (high
>> serialization cost). The solution was to implement some not-so-intuitive
>> alternative transforms in my case, but I would have very much appreciated
>> faster serialization performance.
>>
>> Thanks,
>> Evan
>>
>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer 
>> wrote:
>>
>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>> i.e., Pickle protocol version 5?
>>> https://www.python.org/dev/peps/pep-0574/
>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>
>>> For Beam pipelines passing around NumPy arrays (or collections of
>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization 
>>> costs
>>> can be significant. Beam seems to currently incur at least one (maybe
>>> two) unnecessary memory copies.
>>>
>>> Pickle protocol version 5 exists for solving exactly this problem.
>>> You can serialize collections of arbitrary Python objects in a fully
>>> streaming fashion using memory buffers. This is a Python 3.8 feature, 
>>> but
>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>>> been supported by NumPy since version 1.16, released in January 2019.
>>>
>>> Cheers,
>>> Stephan
>>>
>>


Re: Cartesian product of PCollections

2022-09-19 Thread Brian Hulette via dev
In SQL we just don't support cross joins currently [1]. I'm not aware of an
existing implementation of a cross join/cartesian product.

> My team has an internal implementation of a CartesianProduct transform,
based on using hashing to split a pcollection into a finite number of
groups and CoGroupByKey.

Could this be contributed to Beam?

> On the other hand, if any of the input pcollections are small, using side
inputs would probably be the way to go to avoid the need for a shuffle.

We run into this problem frequently in Beam SQL. Our optimizer could be
much more effective with accurate size estimates, but we rarely have
them, and they may never be good enough for us to select a side input
implementation over CoGroupByKey. I've had some offline discussions in this
space, the best solution we've come up with is to allow hints in SQL (or
just arguments in join transforms) that allow users to select a side input
implementation. We could also add some logging when a pipeline uses a
CoGroupByKey on PCollections small enough to be handled by a side input
implementation, to nudge users that way for future runs.

Brian

[1] https://beam.apache.org/documentation/dsls/sql/extensions/joins/
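
As a rough illustration of the hashing idea (plain Python only; this is a
sketch of the shape such a pipeline takes, not the internal CartesianProduct
transform mentioned above):

```python
from itertools import product

def cartesian_product(left, right, num_buckets=4):
    """Cross product via hash-partitioning, mimicking a CoGroupByKey plan."""
    # 1) Hash-partition `right` into a finite number of groups.
    groups = {b: [] for b in range(num_buckets)}
    for r in right:
        groups[hash(r) % num_buckets].append(r)
    # 2) In Beam, every element of `left` would be emitted once per bucket
    #    key, so that CoGroupByKey brings it together with one bucket of
    #    `right`; each bucket is then an independently-processed group.
    out = []
    for b in range(num_buckets):
        out.extend(product(left, groups[b]))
    return out

pairs = cartesian_product([1, 2], ["a", "b", "c"])
assert sorted(pairs) == sorted(product([1, 2], ["a", "b", "c"]))
```

The replication of `left` to every bucket is what lets the work spread
across workers, and also what makes this expensive when both sides are
large.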

On Mon, Sep 19, 2022 at 8:01 AM Stephan Hoyer via dev 
wrote:

> I'm wondering if it would make sense to have a built-in Beam
> transformation for calculating the Cartesian product of PCollections.
>
> Just this past week, I've encountered two separate cases where calculating
> a Cartesian product was a bottleneck. The in-memory option of using
> something like Python's itertools.product() is convenient, but it only
> scales to a single node.
>
> Unfortunately, implementing a scalable Cartesian product seems to be
> somewhat non-trivial. I found two version of this question on
> StackOverflow, but neither contains a code solution:
>
> https://stackoverflow.com/questions/35008721/how-to-get-the-cartesian-product-of-two-pcollections
>
> https://stackoverflow.com/questions/41050477/how-to-do-a-cartesian-product-of-two-pcollections-in-dataflow/
>
> There's a fair amount of nuance in an efficient and scalable
> implementation. My team has an internal implementation of a
> CartesianProduct transform, based on using hashing to split a pcollection
> into a finite number of groups and CoGroupByKey. On the other hand, if any
> of the input pcollections are small, using side inputs would probably be
> the way to go to avoid the need for a shuffle.
>
> Any thoughts?
>
> Cheers,
> Stephan
>


Re: [Infrastructure] Periodically run Java microbenchmarks on Jenkins

2022-09-15 Thread Brian Hulette via dev
Is there somewhere we could document this?

On Thu, Sep 15, 2022 at 6:45 AM Moritz Mack  wrote:

> Thank you, Andrew!
>
> Exactly what I was looking for, that’s awesome!
>
>
>
> On 15.09.22, 06:37, "Alexey Romanenko"  wrote:
>
>
>
>
>
> Ahh, great! I didn’t know that the 'beam-perf' label is used for that.
> Thanks!
>
> > On 14 Sep 2022, at 17:47, Andrew Pilloud  wrote:
> >
> > We do have a dedicated machine for benchmarks. This is a single
> > machine limited to running one test at a time. Set the
> > jenkinsExecutorLabel for the job to 'beam-perf' to use it. For
> > example:
> >
> https://github.com/apache/beam/blob/66bbee84ed477d86008905646e68b100591b6f78/.test-infra/jenkins/job_PostCommit_Java_Nexmark_Direct.groovy#L36
>
> >
> > Andrew
> >
> > On Wed, Sep 14, 2022 at 8:28 AM Alexey Romanenko
> >  wrote:
> >>
> >> I think it depends on the goal why to run that benchmarks. In ideal
> case, we need to run them on the same dedicated machine(s) and with the
> same configuration all the time but I’m not sure that it can be achieved in
> current infrastructure reality.
> >>
> >> On the other hand, IIRC, the initial goal of benchmarks, like Nexmark,
> was to detect fast any major regressions, especially between releases, that
> are not so sensitive to ideal conditions. And here we a field for
> improvements.
> >>
> >> —
> >> Alexey
> >>
> >> On 13 Sep 2022, at 22:57, Kenneth Knowles  wrote:
> >>
> >> Good idea. I'm curious about our current benchmarks. Some of them run
> on clusters, but I think some of them are running locally and just being
> noisy. Perhaps this could improve that. (or if they are running on local
> Spark/Flink then maybe the results are not really meaningful anyhow)
> >>
> >> On Tue, Sep 13, 2022 at 2:54 AM Moritz Mack  wrote:
> >>>
> >>> Hi team,
> >>>
> >>>
> >>>
> >>> I’m looking for some help to setup infrastructure to periodically run
> Java microbenchmarks (JMH).
> >>>
> >>> Results of these runs will be added to our community metrics
> (InfluxDB) to help us track performance, see [1].
> >>>
> >>>
> >>>
> >>> To prevent noisy runs this would require a dedicated Jenkins machine
> that runs at most one job (benchmark) at a time. Benchmark runs take quite
> some time, but on the other hand they don’t have to run very frequently
> (once a week should be fine initially).
> >>>
> >>>
> >>>
> >>> Thanks so much,
> >>>
> >>> Moritz
> >>>
> >>>
> >>>
> >>> [1]
> https://github.com/apache/beam/pull/23041
>
> >>>
> >>> As a recipient of an email from Talend, your contact personal data
> will be on our systems. Please see our privacy notice.
> >>>
> >>>
> >>>
> >>
>
>
>
>


Re: What to do about issues that track flaky tests?

2022-09-15 Thread Brian Hulette via dev
I agree with Austin on this one, it makes sense to be realistic, but I'm
concerned about just blanket reducing the priority on all flakes. Two
classes of issues that could certainly be dropped to P2:
- Issues tracking flakes that have not been sickbayed yet (e.g.
https://github.com/apache/beam/issues/21266). These tests are still
providing signal (we should notice if it goes perma-red), and clearly the
flakes aren't so painful that someone felt the need to sickbay it.
- A sickbayed test, iff a breakage in the functionality it's testing would
be P2. This is admittedly difficult to identify.

It looks like we don't have a way to label sickbayed tests (or the inverse,
currently-failing), maybe we should have one?

Another thing to note: this email is reporting _unassigned_ P1 issues,
another way to remove issues from the search results would be to ensure
each flake has an owner (somehow). Maybe that's just shifting the problem,
but it could avoid the tragedy of the commons. To Manu's point, maybe those
new owners will happily discover their flake is no longer a problem.

Brian

On Wed, Sep 14, 2022 at 5:58 PM Manu Zhang  wrote:

> Agreed. I also mentioned in a previous email that some issues have been
> open for a long time (before being migrated to GitHub) and it's possible
> that those tests can pass constantly now.
> We may double check and close them since reopening is just one click.
>
> Manu
>
> On Thu, Sep 15, 2022 at 6:58 AM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> +1 to being realistic -- proper labels are worthwhile.  Though, some
>> flaky tests probably should be P1, and just because one isn't addressed in
>> a timely manner doesn't mean it isn't a P1 - though, it does mean it wasn't
>> addressed.
>>
>>
>>
>> On Wed, Sep 14, 2022 at 1:19 PM Kenneth Knowles  wrote:
>>
>>> I would like to make this alert email actionable.
>>>
>>> I went through most of these issues. About half are P1 "flake" issues. I
>>> don't think magically expecting them to be deflaked is helpful. So I have a
>>> couple ideas:
>>>
>>> 1. Exclude "flake" P1s from this email. This is what we used to do. But
>>> then... are they really P1s?
>>> 2. Make "flake" bugs P2 if they are not currently impacting our test
>>> signal. But then... we may have a gap in test coverage that could cause
>>> severe problems. But anyhow something that is P1 for a long time is not
>>> *really* P1, so it is just being realistic.
>>>
>>> What do you all think?
>>>
>>> Kenn
>>>
>>> On Wed, Sep 14, 2022 at 3:03 AM  wrote:
>>>
 This is your daily summary of Beam's current high priority issues that
 may need attention.

 See https://beam.apache.org/contribute/issue-priorities for the
 meaning and expectations around issue priorities.

 Unassigned P1 Issues:

 https://github.com/apache/beam/issues/23227 [Bug]: Python SDK
 installation cannot generate proto with protobuf 3.20.2
 https://github.com/apache/beam/issues/23179 [Bug]: Parquet size
 exploded for no apparent reason
 https://github.com/apache/beam/issues/22913 [Bug]:
 beam_PostCommit_Java_ValidatesRunner_Flink is flakey
 https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka
 SDF and fix known and discovered issues
 https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze
 at getConnection() in WriteFn
 https://github.com/apache/beam/issues/21794 Dataflow runner creates a
 new timer whenever the output timestamp is change
 https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't
 get output to Failed Inserts PCollection
 https://github.com/apache/beam/issues/21704
 beam_PostCommit_Java_DataflowV2 failures parent bug
 https://github.com/apache/beam/issues/21701
 beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors
 https://github.com/apache/beam/issues/21700
 --dataflowServiceOptions=use_runner_v2 is broken
 https://github.com/apache/beam/issues/21696 Flink Tests failure :
 java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.beam.runners.core.construction.SerializablePipelineOptions
 https://github.com/apache/beam/issues/21695 DataflowPipelineResult
 does not raise exception for unsuccessful states.
 https://github.com/apache/beam/issues/21694 BigQuery Storage API
 insert with writeResult retry and write to error table
 https://github.com/apache/beam/issues/21480 flake:
 FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
 https://github.com/apache/beam/issues/21472 Dataflow streaming tests
 failing new AfterSynchronizedProcessingTime test
 https://github.com/apache/beam/issues/21471 Flakes: Failed to load
 cache entry
 https://github.com/apache/beam/issues/21470 Test flake:
 test_split_half_sdf
 https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink
 flaky: Connection refused
 

Re: Cannot find beam in project list on jira when I create issue

2022-09-07 Thread Brian Hulette via dev
Thank you Moritz for updating the docs!

On Wed, Sep 7, 2022 at 3:06 AM Moritz Mack  wrote:

> Sorry for the confusion. Beam migrated to using Github issues just
> recently and the confluence docs haven’t been updated yet.
>
>
>
> Please create a new issue under https://github.com/apache/beam/issues and
> then reference it in your commit message using the issue id, e.g.
>
> git commit -am “Description of change (closes #12345)”
>
>
>
> Regards,
>
> Moritz
>
>
>
> On 07.09.22, 11:20, "张涛"  wrote:
>
>
>
> Hi, I followed the steps to create a pull request:
>
> https://cwiki.apache.org/confluence/display/BEAM/Git+Tips
> 
>
> in step 8, need a jira issue:
>
> but I cannot find Beam in the project list on Jira when I create an issue:
>
>
>
> I was confused about what I should do. I'm looking forward to getting
> your help, thanks very much!
>
>
>
>


Re: A lesson about DoFn retries

2022-09-01 Thread Brian Hulette via dev
Thanks for sharing the learnings Ahmed!

> The solution lies in keeping the retry of each step separate. A good
> example of this is in how steps 2 and 3 are implemented [3]. They are
> separated into different DoFns and step 3 can start only after step 2
> completes successfully. This way, any failure in step 3 does not go back to
> affect step 2.

Is it enough just that they're in different DoFns? I thought the key was
that the DoFns are separated by a GroupByKey, so they will be in different
fused stages, which are retried independently.

Brian
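
To make the failure mode concrete, here is a toy simulation in plain Python
(not Beam or BigQuery code; `FakeBQ` and both functions are invented for
illustration):

```python
class FakeBQ:
    """Toy stand-in for the BigQuery service in this scenario."""
    def __init__(self, temp_tables):
        self.tables = set(temp_tables)  # temp tables that currently exist
        self.copied = []                # copies made into the final table

    def copy(self, table):
        if table not in self.tables:
            raise RuntimeError(f"source {table} not found")
        self.copied.append(table)

    def delete(self, table):
        self.tables.discard(table)

def copy_then_delete_bundle(bq, temp_tables, fail_at=None):
    # The faulty pattern: copies (processElement) and deletes (finishBundle)
    # live in one bundle, so a failure anywhere retries *both* loops.
    for t in temp_tables:
        bq.copy(t)
    for i, t in enumerate(temp_tables):
        if i == fail_at:
            raise RuntimeError("transient failure while deleting")
        bq.delete(t)

bq = FakeBQ(["tmp1", "tmp2"])
try:
    # First attempt: copies succeed, deletion fails after tmp1 is gone.
    copy_then_delete_bundle(bq, ["tmp1", "tmp2"], fail_at=1)
except RuntimeError:
    pass

try:
    # Bundle retry: the copy step itself now fails, because tmp1 is gone.
    copy_then_delete_bundle(bq, ["tmp1", "tmp2"])
    retry_succeeded = True
except RuntimeError:
    retry_succeeded = False
assert not retry_succeeded
```

Putting the deletes in a downstream DoFn behind a shuffle means a retry of
the delete stage never re-runs the copies, which is the fix described in the
thread below.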

On Thu, Sep 1, 2022 at 1:43 PM Ahmed Abualsaud via dev 
wrote:

> Hi all,
>
> TLDR: When writing IO connectors, be wary of how bundle retries can affect
> the work flow.
>
> A faulty implementation of a step in BigQuery batch loads was discovered
> recently. I raised an issue [1] but also wanted to mention it here as a
> potentially helpful lesson for those developing new/existing IO connectors.
>
> For those unfamiliar with BigQueryIO file loads, a write that is too large
> for a single load job [2] looks roughly something like this:
>
>
>    1. Take input rows and write them to temporary files.
>    2. Load temporary files to temporary BQ tables.
>    3. Delete temporary files.
>    4. Copy the contents of temporary tables over to the final table.
>    5. Delete temporary tables.
>
>
> The faulty part here is that steps 4 and 5 are done in the same DoFn (4 in
> processElement and 5 in finishBundle). In the case a bundle fails in the
> middle of table deletion, let’s say an error occurs when deleting the nth
> table, the whole bundle will retry and we will perform the copy again. But
> tables 1~n have already been deleted and so we get stuck trying to copy
> from non-existent sources.
>
> The solution lies in keeping the retry of each step separate. A good
> example of this is in how steps 2 and 3 are implemented [3]. They are
> separated into different DoFns and step 3 can start only after step 2
> completes successfully. This way, any failure in step 3 does not go back to
> affect step 2.
>
> That's all, thanks for your attention :)
>
> Ahmed
>
> [1] https://github.com/apache/beam/issues/22920
>
> [2]
> https://github.com/apache/beam/blob/f921a2f1996cf906d994a9d62aeb6978bab09dd5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L100-L105
>
>
> [3]
> https://github.com/apache/beam/blob/149ed074428ff9b5344169da7d54e8ee271aaba1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L437-L454
>
>
>


Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-25 Thread Brian Hulette via dev
Thanks for writing this up Valentyn!

I'm curious Jarek, does Airflow take any dependencies on popular libraries
like pandas, numpy, pyarrow, scipy, etc... which users are likely to have
their own dependency on? I think these dependencies are challenging in a
different way than the client libraries - ideally we would support a wide
version range so as not to require users to upgrade those libraries in
lockstep with Beam. However in some cases our dependency is pretty tight
(e.g. the DataFrame API's dependency on pandas), so we need to make sure to
explicitly test with multiple different versions. Does Airflow have any
similar issues?
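
For illustration, one way to encode "wide declared range, explicitly tested
at several pins" is a tox-style matrix (hypothetical env names and version
bounds, not Beam's actual configuration):

```ini
# Library declares a wide range, e.g. pandas>=1.3,<1.6, and CI runs the
# suite once per pinned minor to catch behavior differences early.
[tox]
envlist = py38-pandas{13,14,15}

[testenv]
deps =
    pandas13: pandas>=1.3,<1.4
    pandas14: pandas>=1.4,<1.5
    pandas15: pandas>=1.5,<1.6
commands = pytest tests/
```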

Thanks!
Brian

On Thu, Aug 25, 2022 at 5:36 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> Hi Jarek,
>
> Thanks a lot for detailed feedback and sharing the Airflow story, this is
> exactly what I was hoping to hear in response from the mailing list!
>
> 600+ dependencies is very impressive, so I'd be happy to chat more and
> learn from your experience.
>
> On Wed, Aug 24, 2022 at 5:50 AM Jarek Potiuk  wrote:
>
>> Comment (from a bit outsider)
>>
>> Fantastic document Valentyn.
>>
>> Very, very insightful and interesting. We feel a lot of the same pain in
>> Apache Airflow (actually even more, because we have not 20 but 620+
>> dependencies), but we are also a bit more advanced in how we manage
>> dependencies - some of the ideas you had there are already tested and tried
>> in Airflow, some of them are a bit different, but we can definitely share
>> "principles". We are also a little higher in the "supply chain" (i.e.
>> Apache Beam Python SDK is our dependency).
>>
>> I left some suggestions and some comments describing in detail what the
>> same problems look like in Airflow and how we addressed them (if we did),
>> and I am happy to participate in further discussions. I am "the dependency
>> guy" in Airflow and happy to share my experiences and help to work out some
>> problems - especially problems coming from using multiple google-client
>> libraries and diamond dependencies (we are just now dealing with a similar
>> issue, where we will likely have to do a massive update of several of our
>> clients - hopefully with the involvement of the Composer team). And I'd
>> love to be involved in a joint discussion with the google client team to
>> work out some common expectations that we can rely on when we define our
>> future upgrade strategy for google clients.
>>
>> I will watch it here and be happy to spend quite some time on helping to
>> hash it out.
>>
>> BTW. You can also watch the talk I gave last year at PyWaw about "Managing
>> Python dependencies at Scale"
>> https://www.youtube.com/watch?v=_SjMdQLP30s&t=2549s where I explain the
>> approach we took, the reasoning behind it, etc.
>>
>> J.
>>
>>
>> On Wed, Aug 24, 2022 at 2:45 AM Valentyn Tymofieiev via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> Recently, several issues [1-3] have highlighted outage risks and
>>> developer inconveniences caused by dependency management practices in
>>> Beam Python.
>>>
>>> Even with dependabot and other tooling integrated with Beam, one of the
>>> missing pieces seems to be a clear guideline for how we should specify
>>> requirements for our dependencies, and for when and how we should update
>>> them, so that the process is sustainable.
>>>
>>> As a conversation starter, I put together a retrospective [4] covering a
>>> recent incident, and would like to get community opinions on the open
>>> questions.
>>>
>>> In particular, if you have experience managing dependencies for other
>>> Python libraries with rich dependency chains, knowledge of available
>>> tooling or first hand experience dealing with other dependency issues in
>>> Beam, your input would be greatly appreciated.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> [1] https://github.com/apache/beam/issues/22218
>>> [2] https://github.com/apache/beam/pull/22550#issuecomment-1217348455
>>> [3] https://github.com/apache/beam/issues/22533
>>> [4]
>>> https://docs.google.com/document/d/1gxQF8mciRYgACNpCy1wlR7TBa8zN-Tl6PebW-U8QvBk/edit
>>>
>>


Re: Incomplete Beam Schema -> Avro Schema conversion

2022-08-22 Thread Brian Hulette via dev
I don't think there's a reason for this, it's just that these logical types
were defined after the Avro <-> Beam schema conversion. I think it would be
worthwhile to add support for them, but we'd also need to look at the
reverse (Avro to Beam) direction, which would map back to the catch-all
DATETIME primitive type [1]. Changing that could break backwards
compatibility.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L771-L776
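For illustration, here is a hedged Python sketch (Beam's actual implementation is the Java AvroUtils linked above) of the kind of identifier-to-Avro mapping the forward conversion would need for the missing logical types; the exact Avro representations chosen here are assumptions:

```python
# Sketch only: maps Beam logical-type identifiers to plausible Avro
# logical types. The concrete Avro representations (int/date,
# long/time-micros, long/timestamp-micros) are assumptions, not what
# Beam's Java AvroUtils would necessarily choose.
BEAM_TO_AVRO_LOGICAL = {
    "beam:logical_type:date:v1": {"type": "int", "logicalType": "date"},
    "beam:logical_type:time:v1": {"type": "long", "logicalType": "time-micros"},
    "beam:logical_type:datetime:v1": {"type": "long", "logicalType": "timestamp-micros"},
}

def to_avro_field_type(identifier):
    """Map known identifiers; raise on anything unhandled, mirroring the
    'Unhandled logical type' RuntimeException reported in this thread."""
    try:
        return BEAM_TO_AVRO_LOGICAL[identifier]
    except KeyError:
        raise RuntimeError(f"Unhandled logical type {identifier}")
```

As noted above, the harder part is the reverse direction, where these Avro logical types currently all map back to the catch-all DATETIME primitive.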

On Wed, Aug 17, 2022 at 2:53 PM Balázs Németh  wrote:

> java.lang.RuntimeException: Unhandled logical type
> beam:logical_type:date:v1
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:943)
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java
>
> In
> https://github.com/apache/beam/blob/7bb755906c350d77ba175e1bd990096fbeaf8e44/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L902-L944
> it seems to me there are some missing options.
>
> For example
> - FixedBytes.IDENTIFIER,
> - EnumerationType.IDENTIFIER,
> - OneOfType.IDENTIFIER
> is there, but:
> - org.apache.beam.sdk.schemas.logicaltypes.Date.IDENTIFIER
> ("beam:logical_type:date:v1")
> - org.apache.beam.sdk.schemas.logicaltypes.DateTime.IDENTIFIER
> ("beam:logical_type:datetime:v1")
> - org.apache.beam.sdk.schemas.logicaltypes.Time.IDENTIFIER
> ("beam:logical_type:time:v1")
> is missing.
>
> This is an example that fails:
>
>> import java.time.LocalDate;
>> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
>> import org.apache.beam.sdk.schemas.Schema;
>> import org.apache.beam.sdk.schemas.Schema.FieldType;
>> import org.apache.beam.sdk.schemas.logicaltypes.SqlTypes;
>> import org.apache.beam.sdk.schemas.utils.AvroUtils;
>> import org.apache.beam.sdk.values.Row;
>
> // ...
>
> final Schema schema =
>> Schema.builder()
>> .addField("ymd",
>> FieldType.logicalType(SqlTypes.DATE))
>> .build();
>>
>> final Row row =
>> Row.withSchema(schema)
>> .withFieldValue("ymd", LocalDate.now())
>> .build();
>>
>> System.out.println(BigQueryUtils.toTableSchema(schema)); // works
>> System.out.println(BigQueryUtils.toTableRow(row)); // works
>>
>> System.out.println(AvroUtils.toAvroSchema(schema)); // fails
>> System.out.println(AvroUtils.toGenericRecord(row)); // fails
>
>
> Am I missing a reason for this, or is it just not implemented yet? If it is
> the latter, am I right to assume that they should be represented in the
> Avro format like the already existing cases?
> "beam:logical_type:date:v1" vs "DATE"
> "beam:logical_type:time:v1" vs "TIME"
>
>
>


Re: Representation of logical type beam:logical_type:datetime:v1

2022-08-12 Thread Brian Hulette via dev
Ah sorry, I forgot that INT64 is encoded with VarIntCoder, so we can't
simulate TimestampCoder with a logical type.

I think the ideal end state would be to have a well-defined
beam:logical_type:millis_instant that we use for cross-language (when
appropriate), and never use DATETIME at cross-language boundaries. Would it
be possible to add millis_instant, and use that for JDBC read/write instead
of DATETIME?

Separately we could consider how to resolve the conflicting definitions of
beam:logical_type:datetime:v1. I'm not quite sure how/if we can do that
without breaking pipeline update.

Brian


On Fri, Aug 12, 2022 at 7:50 AM Yi Hu via dev  wrote:

> Hi Cham,
>
> Thanks for the comments.
>
>
>>
>>>
>>> ii. "beam:logical_type:instant:v1" is still backed by INT64, but in
>>> implementation it will use BigEndianLongCoder to encode/decode the stream.
>>>
>>>
>> Is this to be compatible with the current Java implementation ? And we
>> have to update other SDKs to use big endian coders when encoding/decoding
>> the "beam:logical_type:instant:v1" logical type ?
>>
>>
> Yes, and the proposal is aimed at keeping the Java SDK change minimal; we
> have to update the other SDKs to make it work. Currently the Python and Go
> SDKs do not implement "beam:logical_type:datetime:v1" (to be renamed
> "beam:logical_type:instant:v1") at all.
>
>
>>
>>
>>> For the second step ii, the problem is that there is a primitive type
>>> backed by a fixed length integer coder. Currently INT8, INT16, INT32,
>>> INT64... are all backed by VarInt (and there is ongoing work to use fixed
>>> size big endian to encode INT8, INT16 (
>>> https://github.com/apache/beam/issues/19815)). Ideally I would think
>>> (INT8, INT16, INT32, INT64) are all fixed and having a generic (INT)
>>> primitive type is backed by VarInt. But this may be a more substantial
>>> change for the current code base.
>>>
>>
>> I'm a bit confused by this. Did you mean that there's *no* primitive
>> type backed by a fixed length integer coder ? Also, by primitive, I'm
>> assuming you mean Beam Schema types here.
>>
>>
> Yes, I mean Beam Schema types here. The proto for the datetime (instant)
> logical type is constructed here:
> https://github.com/apache/beam/blob/cf9ea1f442636f781b9f449e953016bb39622781/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaTranslation.java#L202
> It is represented by an INT64 atomic type. In the cross-language case,
> another SDK receives the proto and decodes the stream according to it.
> Currently I do not see an atomic type that is decoded using a fixed-length
> BigEndianLong coder; INT8, ..., INT64 will all be decoded with VarInt.
>
> As a workaround in the PR (#22561), in Python's RowCoder I explicitly set
> the coder for "beam:logical_type:datetime:v1" (to be renamed
> "beam:logical_type:instant:v1") to be TimestampCoder. I have not found a
> way to keep the logic contained in the logical type implementation, e.g. in
> the to_language_type and to_representation_type methods. To do this I would
> need an atomic type that is decoded using the BigEndianLong coder.
> Please point out if I am wrong.
>
> Best,
> Yi
>
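For illustration, the coder distinction at the heart of this thread - VarIntCoder's variable-length bytes versus the fixed 8-byte big-endian encoding produced by Java's BigEndianLongCoder (and hence TimestampCoder) - can be sketched in a few lines. This is a simplified sketch: the varint below handles only non-negative values, while the real VarIntCoder also covers negatives.

```python
import struct

def encode_big_endian_long(millis):
    """Fixed 8-byte big-endian signed long, as Java's BigEndianLongCoder
    produces for millisecond timestamps."""
    return struct.pack(">q", millis)

def encode_varint(n):
    """Protobuf-style varint for non-negative ints: 7 data bits per byte,
    high bit set on all bytes except the last (simplified sketch)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit
        else:
            out.append(byte)
            return bytes(out)
```

The point of the thread is that the schema proto only advertises "INT64", so a receiving SDK cannot tell from the atomic type alone which of these two encodings to apply.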


Re: Design Doc for Controlling Batching in RunInference

2022-08-12 Thread Brian Hulette via dev
Hi Andy,

Thanks for writing this up! This seems like something that Batched DoFns
could help with. Could we make a BatchConverter [1] that represents the
necessary transformations here, and define RunInference as a Batched DoFn?
Note that the Numpy BatchConverter already enables users to specify a batch
dimension using a custom typehint, like NumpyArray[np.int64, (N, 10)] (the
N identifies the batch dimension) [2]. I think we could do something
similar, but with pytorch types. It's likely we'd need to define our own
typehints, though; I suspect pytorch typehints aren't already parameterized
by size.

Brian


[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
[2]
https://github.com/apache/beam/blob/3173b503beaf30c4d32a4a39c709fd81e8161907/sdks/python/apache_beam/typehints/batch_test.py#L42
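As a rough sketch of the design doc's "custom stacking function" idea (a pure-Python stand-in for the numpy/pytorch case; the function name and signature here are mine, not the doc's proposed API):

```python
# Sketch: a stacking function that respects an existing batch dimension.
# Elements that already carry a batch dimension are concatenated, so
# pre-batched inputs of different sizes still form one flat batch;
# unbatched elements are stacked, which adds the batch dimension.
def stack_batches(elements, already_batched):
    if already_batched:
        # Each element is a list of rows: concatenate along the batch axis.
        return [row for element in elements for row in element]
    # Each element is a single row: stacking creates the batch axis.
    return list(elements)
```

For example, two pre-batched elements of sizes 2 and 3 combine into one batch of 5 rows, which is the ragged-batch case the doc calls out.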

On Fri, Aug 12, 2022 at 12:36 PM Andy Ye via dev 
wrote:

> Hi everyone,
>
> I've written up a design doc [1]
> on controlling batching in RunInference. I'd appreciate any feedback.
> Thanks!
>
> Summary:
> Add a custom stacking function to RunInference to enable users to control
> how they want their data to be stacked. This addresses issues regarding
> data that have existing batching dimensions, or different sizes.
>
> Best,
> Andy
>
> [1]
> https://docs.google.com/document/d/1l40rOTOEqrQAkto3r_AYq8S_L06dDgoZu-4RLKAE6bo/edit#
>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Brian Hulette via dev
Thanks Cham! I really like the proposal; I left a few comments. I also had
one higher-level point I wanted to raise here:

> Pipeline SDKs can generate user-friendly stub-APIs based on transforms
> registered with an expansion service, eliminating the need to develop
> language-specific wrappers.
This would be great! I think one point to consider is whether we can do
this statically. We could package up these stubs with releases and include
them in API docs for each language, making them much more discoverable.
That could be an extension on top of your proposal (e.g. as part of its
build, each SDK spins up other known expansion services and generates code
based on the discovery responses), but maybe it could be cleaner if we
don't really need the dynamic version?
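A minimal sketch of what static stub generation from a discovery response might look like (all names below are assumptions for illustration, not Beam's actual API):

```python
# Sketch: given a transform's identifier and its configuration field
# names (as a hypothetical discovery response might report them), emit
# Python stub source that release tooling could package with the SDK.
def generate_stub(name, identifier, config_fields):
    params = ", ".join(f"{field}=None" for field in config_fields)
    return "\n".join([
        f"def {name}({params}):",
        f'    """Auto-generated stub for {identifier}."""',
        f"    return ('external_schematransform', {identifier!r}, locals())",
    ])
```

Running this against each known expansion service at build time would yield stubs that ship in releases and show up in API docs, which is the discoverability win described above.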

Brian


On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> Hi All,
>
> I believe we can make the multi-language pipelines offering [1] much
> easier to use by updating the expansion service to be fully aware of
> SchemaTransforms. Additionally this will make it easy to
> register/discover/use transforms defined in one SDK from all other SDKs.
> Specifically we could add the following features.
>
>- Expansion service can be used to easily initialize and expand
>transforms without need for additional code.
>- Expansion service can be used to easily discover already registered
>transforms.
>- Pipeline SDKs can generate user-friendly stub-APIs based on
>transforms registered with an expansion service, eliminating the need to
>develop language-specific wrappers.
>
> Please see here for my proposal: https://s.apache.org/easy-multi-language
>
> Lemme know if you have any comments/questions/suggestions :)
>
> Thanks,
> Cham
>
> [1]
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>
>


Re: Join a meeting to help coordinate implementing a Dask Runner for Beam

2022-08-03 Thread Brian Hulette via dev
I wanted to share that Ryan gave a presentation about his (and Charles')
work on Pangeo Forge at Scipy 2022 (in Austin just before Beam Summit!),
with a couple mentions of their transition to Beam [1]. There were also a
couple of other talks about Pangeo [2,3] with some Beam/xarray-beam
references in there.

[1]
https://www.youtube.com/watch?v=sY20UpYCAEE&list=PLYx7XA2nY5Gde0WF1yswQw5InhmSNED8o&index=9
[2]
https://www.youtube.com/watch?v=7niNfs3ZpfQ&list=PLYx7XA2nY5Gfb0tQyezb4Gsf1nVsy86zt&index=2
[3]
https://www.youtube.com/watch?v=ftlgOESINvo&list=PLYx7XA2nY5Gfb0tQyezb4Gsf1nVsy86zt&index=3

On Tue, Jun 21, 2022 at 9:29 AM Ahmet Altay  wrote:

> Were you able to meet? If yes, I would be very interested in a summary if
> someone would like to share that :)
>
> On Mon, Jun 13, 2022 at 9:16 AM Pablo Estrada  wrote:
>
>> Also added my availability... please do invite me as well : )
>> -P.
>>
>> On Mon, Jun 13, 2022 at 6:57 AM Kenneth Knowles  wrote:
>>
>>> I would love to try to join any meetings if you add me. My calendar is
>>> too chaotic to be useful on the when2meet :-) but I can often move things
>>> around.
>>>
>>> Kenn
>>>
>>> On Wed, Jun 8, 2022 at 2:50 PM Brian Hulette 
>>> wrote:
>>>
 Thanks for reaching out, Ryan - this sounds really cool. I added my
 availability to the calendar since I'm interested in this space, but I'm
 not sure I can offer much help - I don't have any experience building a
 runner; to date I've worked exclusively on the SDK side of Beam. So I hope
 some other folks can join as well :)

 @Pablo Estrada  might have some useful insight -
 he's been working on a spike to build a Ray runner.


 On Wed, Jun 8, 2022 at 12:53 PM Robert Bradshaw 
 wrote:

> This sounds like a great project. Unfortunately I wouldn't be able to
> meet next week, but would be happy to meet some other time and if that
> doesn't work answer questions over email, etc. Looking forward to a
> Dask runner.
>
> On Wed, Jun 8, 2022 at 9:04 AM Ryan Abernathey
>  wrote:
> >
> > Dear Beamer,
> >
> > Thank you for all of your work on this amazing project. I am new to
> Beam and am quite excited about its potential to help with some data
> processing challenges in my field of climate science.
> >
> > Our community is interested in running Beam on Dask Distributed
> clusters, which we already know how to deploy. This has been discussed at
> https://issues.apache.org/jira/browse/BEAM-5336 and
> https://github.com/apache/beam/issues/18962. It seems technically
> feasible.
> >
> > We are trying to organize a meeting next week to kickstart and
> coordinate this effort. It would be great if we could entrain some Beam
> maintainers into this meeting. If you have interest in this topic and are
> available next week, please share your availability here -
> https://www.when2meet.com/?15861604-jLnA4
> >
> > Alternatively, if you have any guidance or suggestions you wish to
> provide by email or GitHub discussion, we welcome your input.
> >
> > Thanks again for your open source work.
> >
> > Best,
> > Ryan Abernathey
> >
>