bhulette stepping back (for now)

2022-11-10 Thread Brian Hulette
Hi dev@beam,

I just wanted to let the community know that I will be stepping back from
Beam development for now. I'm switching to a different team within Google
next week - I will be working on BigQuery.

I'm removing myself from automated code review assignments [1], and won't
actively monitor the beam lists anymore. That being said, I'm happy to
contribute to discussions or code reviews when it would be particularly
helpful, e.g. for anything relating to DataFrames/Schemas/SQL. I can always
be reached at bhule...@apache.org, and @TheNeuralBit [2] on GitHub.

Brian

[1] https://github.com/apache/beam/pull/24108
[2] https://github.com/TheNeuralBit


Re: [ANNOUNCE] New committer: Yi Hu

2022-11-09 Thread Brian Hulette via dev
Well deserved! Congratulations Yi

On Wed, Nov 9, 2022 at 11:25 AM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> I am with the Beam PMC on this, congratulations and very well deserved, Yi!
>
> On Wed, Nov 9, 2022 at 11:08 AM Byron Ellis via dev 
> wrote:
>
>> Congratulations!
>>
>> On Wed, Nov 9, 2022 at 11:00 AM Pablo Estrada via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 thanks Yi : D
>>>
>>> On Wed, Nov 9, 2022 at 10:47 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Yi! I've really appreciated the ways you've consistently taken
 responsibility for improving our team's infra and working through sharp
 edges in the codebase that others have ignored. This is definitely well
 deserved!

 Thanks,
 Danny

 On Wed, Nov 9, 2022 at 1:37 PM Anand Inguva via dev <
 dev@beam.apache.org> wrote:

> Congratulations Yi!
>
> On Wed, Nov 9, 2022 at 1:35 PM Ritesh Ghorse via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Yi!
>>
>> On Wed, Nov 9, 2022 at 1:34 PM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Yi!
>>>
>>> On Wed, Nov 9, 2022 at 1:33 PM Sachin Agarwal via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations Yi!

 On Wed, Nov 9, 2022 at 10:32 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Yi Hu (y...@apache.org)
>
> Yi started contributing to Beam in early 2022. Yi's contributions
> are very diverse! I/Os, performance tests, Jenkins, support for Schema
> logical types. Not only code but a very large amount of code review. 
> Yi is
> also noted for picking up smaller issues that normally would be left 
> on the
> backburner and filing issues that he finds rather than ignoring them.
>
> Considering their contributions to the project over this
> timeframe, the Beam PMC trusts Yi with the responsibilities of a Beam
> committer. [1]
>
> Thank you Yi! And we are looking to see more of your contributions!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>



Re: Beam starter projects dependency updates

2022-11-07 Thread Brian Hulette via dev
These have all been addressed. I went through and merged all of them,
except for the slf4j-jdk14 dependency in Java and Kotlin. After consulting
with Luke [1] I told dependabot to ignore this dependency.

[1]
https://github.com/apache/beam-starter-java/pull/26#issuecomment-1302639941

On Wed, Nov 2, 2022 at 10:58 AM David Cavazos  wrote:

> Hi, I just opened some PRs to auto-assign dependabot PRs.
>
> Java: https://github.com/apache/beam-starter-java/pull/29
> Python: https://github.com/apache/beam-starter-python/pull/11
> Go: https://github.com/apache/beam-starter-go/pull/7
> Kotlin: https://github.com/apache/beam-starter-kotlin/pull/9
>
> For the existing dependabot PRs, can someone help me batch merge them? All
> tests are passing on all of them, so they should all be safe to merge.
>
> On Thu, Oct 27, 2022 at 1:56 PM Brian Hulette  wrote:
>
>> Could we just use the same set of reviewers as pr-bot in the main repo
>> [1]? I don't think that we could avoid duplicating the data though.
>>
>> [1]
>> https://github.com/apache/beam/blob/728e8ecc8a40d3d578ada7773b77eca2b3c68d03/.github/REVIEWERS.yml
>>
>> On Thu, Oct 27, 2022 at 12:20 PM David Cavazos via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone!
>>>
>>> We want to make sure the Beam starter projects always come with the
>>> latest (compatible) versions for every dependency. I enabled Dependabot on
>>> all of them to automate this as much as possible, and we have automated
>>> tests to make sure everything works as expected.
>>>
>>> However, we still need someone to merge Dependabot's PRs. The good news
>>> is that since the starter projects are so simple, if tests pass they're
>>> most likely safe to merge, and tests only take a couple minutes to run.
>>>
>>> We could either batch update all dependencies as part of the release
>>> process, or have people check them periodically (like an owner per
>>> language).
>>>
>>> These are all the repos we have to keep an eye on:
>>>
>>>- https://github.com/apache/beam-starter-java -- 9 updates, all
>>>tests passing
>>>- https://github.com/apache/beam-starter-python -- 2 updates, all
>>>tests passing
>>>- https://github.com/apache/beam-starter-go -- 0 updates
>>>- https://github.com/apache/beam-starter-kotlin -- 3 updates, all
>>>tests passing
>>>- https://github.com/apache/beam-starter-scala -- not done yet, but
>>>keep an eye
>>>
>>>


[Python][Bikeshed] typehint vs. type-hint vs. "type hint"

2022-11-07 Thread Brian Hulette via dev
Hi everyone,

In a recent code review we noticed that we are not consistent when
describing python type hints in documentation. Depending on who wrote the
patch, we switch between typehint, type-hint, and "type hint" [1].

I think we should standardize on "type hint" as this is what Guido used in
PEP 484 [2]. Please comment on the issue in the next few days if you
disagree with this approach.

Note this is orthogonal to how we refer to type hints in _code_, in our
public APIs. In general we use "type" in that context (e.g.
`with_input_types`), and there doesn't seem to be a consistency issue.

[1] https://github.com/apache/beam/issues/23950
[2] https://peps.python.org/pep-0484/
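
For anyone skimming, a small standalone illustration of the distinction above
(the element values are made up): PEP 484 "type hints" in user code versus the
"type" terminology Beam's public Python API uses:

```python
import apache_beam as beam

# PEP 484-style type hints on plain Python code:
def parse(line: str) -> int:
    return int(line)

with beam.Pipeline() as p:
    (p
     | beam.Create(['1', '2', '3'])
     # Beam's public API talks about "types" rather than "type hints":
     | beam.Map(parse).with_input_types(str).with_output_types(int)
     | beam.Map(print))
```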


Re: Beam Website Feedback

2022-10-27 Thread Brian Hulette via dev
I proposed https://github.com/apache/beam/pull/23877 to address this.

On Thu, Oct 27, 2022 at 2:12 PM Sachin Agarwal  wrote:

> No objections here.  The latter (the surviving) is the one linked in
> the top navigation bar and has the x-lang details that help.
>
> On Thu, Oct 27, 2022 at 2:09 PM Brian Hulette  wrote:
>
>> Hm, it seems like we need to drop
>> https://beam.apache.org/documentation/io/built-in/ as it's been
>> superseded by https://beam.apache.org/documentation/io/connectors/
>>
>> Would there be any objections to that?
>>
>> On Thu, Oct 27, 2022 at 2:04 PM Sachin Agarwal via dev <
>> dev@beam.apache.org> wrote:
>>
>>> JDBCIO is available as a Java-based IO.  It is also listed on
>>> https://beam.apache.org/documentation/io/connectors/
>>>
>>> On Thu, Oct 27, 2022 at 2:01 PM Charles Kangai <
>>> char...@charleskangai.co.uk> wrote:
>>>
>>>> What about jdbc?
>>>> I want to use Beam to read/write to/from a relational database, e.g.
>>>> Oracle or Microsoft SQL Server.
>>>> I don’t see a connector on your page:
>>>> https://beam.apache.org/documentation/io/built-in
>>>>
>>>> Thanks,
>>>> Charles Kangai
>>>>
>>>>
>>>>
>>>


Re: Beam Website Feedback

2022-10-27 Thread Brian Hulette via dev
Hm, it seems like we need to drop
https://beam.apache.org/documentation/io/built-in/ as it's been superseded
by https://beam.apache.org/documentation/io/connectors/

Would there be any objections to that?

On Thu, Oct 27, 2022 at 2:04 PM Sachin Agarwal via dev 
wrote:

> JDBCIO is available as a Java-based IO.  It is also listed on
> https://beam.apache.org/documentation/io/connectors/
>
> On Thu, Oct 27, 2022 at 2:01 PM Charles Kangai <
> char...@charleskangai.co.uk> wrote:
>
>> What about jdbc?
>> I want to use Beam to read/write to/from a relational database, e.g.
>> Oracle or Microsoft SQL Server.
>> I don’t see a connector on your page:
>> https://beam.apache.org/documentation/io/built-in
>>
>> Thanks,
>> Charles Kangai
>>
>>
>>
>


Re: Beam starter projects dependency updates

2022-10-27 Thread Brian Hulette via dev
Could we just use the same set of reviewers as pr-bot in the main repo [1]?
I don't think that we could avoid duplicating the data though.

[1]
https://github.com/apache/beam/blob/728e8ecc8a40d3d578ada7773b77eca2b3c68d03/.github/REVIEWERS.yml

On Thu, Oct 27, 2022 at 12:20 PM David Cavazos via dev 
wrote:

> Hi everyone!
>
> We want to make sure the Beam starter projects always come with the latest
> (compatible) versions for every dependency. I enabled Dependabot on all of
> them to automate this as much as possible, and we have automated tests to
> make sure everything works as expected.
>
> However, we still need someone to merge Dependabot's PRs. The good news is
> that since the starter projects are so simple, if tests pass they're most
> likely safe to merge, and tests only take a couple minutes to run.
>
> We could either batch update all dependencies as part of the release
> process, or have people check them periodically (like an owner per
> language).
>
> These are all the repos we have to keep an eye on:
>
>- https://github.com/apache/beam-starter-java -- 9 updates, all tests
>passing
>- https://github.com/apache/beam-starter-python -- 2 updates, all
>tests passing
>- https://github.com/apache/beam-starter-go -- 0 updates
>- https://github.com/apache/beam-starter-kotlin -- 3 updates, all
>tests passing
>- https://github.com/apache/beam-starter-scala -- not done yet, but
>keep an eye
>
>


Re: Beam Website Feedback

2022-10-04 Thread Brian Hulette via dev
On Tue, Oct 4, 2022 at 8:58 AM Alexey Romanenko 
wrote:

> Thanks for your feedback.
>
> At the time, using a Google website search was the simplest solution since,
> before, we didn’t have a search at all. I agree that it could be
> frustrating to have ad links before the actual results (not sure that we
> can avoid them there) but "it is what it is” and it's still possible to
> have the correct links further which is better than nothing.
>
> Beam community is always welcome for suggestions and, especially,
> contributions to improve the project in any possible way. I’d be happy to
> assist on this topic if someone will decide to improve Beam website search.
>

+1, PRs welcome :)
I put some specific suggestions for a replacement in the issue, based on
recommendations from the hugo docs [1].

[1] https://gohugo.io/tools/search/


>
> —
> Alexey
>
> On 3 Oct 2022, at 23:21, Borris  wrote:
>
> This is my experience of trying the search capability.
>
>- I know I want to read about dataframes (I was reading this 10
>minutes ago but browsing history didn't take me back to where I wanted)
>- I search for "dataframes"
>- I am presented with a whole load of pages that are elsewhere (other
>sites) - maybe what I want is some pages below, but I stop at this point as
>I think its a fundamental failure of what I expect from the search dialogue
>- If I enter "beam.apache.org: dataframe" to the search dialogue then
>the sensible relevant page is now visible, only 5 links down
>- I know this may be a penalty of getting a "free" search service from
>your viewpoint
>- But from my viewpoint this is a failure. Your search capability
>fails to understand that by searching for something on your site, rather
>than generically through a search engine, I am massively predisposed to the
>pages on your site, whereas the search results are more predisposed to
>offering advertising opportunities.
>- It is very frustrating that something as simple as, on the Beam
>site, going to the page about Beam Dataframes takes such a level of hoop
>jumping
>
> That is my feedback offering. Thank you for taking the time to read it.
>
>
>
>
>


Re: Beam Website Feedback

2022-10-03 Thread Brian Hulette via dev
Thanks Borris, that is helpful feedback. I filed an issue [1] to track
improving this.

[1] https://github.com/apache/beam/issues/23472

On Mon, Oct 3, 2022 at 2:32 PM Borris  wrote:

> This is my experience of trying the search capability.
>
>- I know I want to read about dataframes (I was reading this 10
>minutes ago but browsing history didn't take me back to where I wanted)
>- I search for "dataframes"
>- I am presented with a whole load of pages that are elsewhere (other
>sites) - maybe what I want is some pages below, but I stop at this point as
>I think its a fundamental failure of what I expect from the search dialogue
>- If I enter "beam.apache.org: dataframe" to the search dialogue then
>the sensible relevant page is now visible, only 5 links down
>- I know this may be a penalty of getting a "free" search service from
>your viewpoint
>- But from my viewpoint this is a failure. Your search capability
>fails to understand that by searching for something on your site, rather
>than generically through a search engine, I am massively predisposed to the
>pages on your site, whereas the search results are more predisposed to
>offering advertising opportunities.
>- It is very frustrating that something as simple as, on the Beam
>site, going to the page about Beam Dataframes takes such a level of hoop
>jumping
>
> That is my feedback offering. Thank you for taking the time to read it.
>
>
>
>


Re: Out of band pickling in Python (pickle5)

2022-09-19 Thread Brian Hulette via dev
I got to thinking about this again and ran some benchmarks. The result is
documented in the GitHub issue [1].

tl;dr: we can't realize a huge benefit since we don't actually have an
out-of-band path for exchanging the buffers. However, pickle 5 can yield
improved in-band performance as well [2], and I think we can take advantage of
this with some relatively simple adjustments to PickleCoder and
OutputStream.

[1] https://github.com/apache/beam/issues/20900#issuecomment-1251658001
[2] https://peps.python.org/pep-0574/#improved-in-band-performance
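
For anyone unfamiliar with the protocol, here is a minimal standalone sketch of
out-of-band pickling with protocol 5 (plain Python 3.8+ and NumPy, not
Beam-specific); the buffer_callback/buffers arguments are the hooks a
PickleCoder would need to plumb through:

```python
import pickle
import numpy as np

arr = np.arange(1_000_000)

# In-band (classic) pickling copies the array's bytes into the pickle stream.
in_band = pickle.dumps(arr, protocol=5)

# Out-of-band pickling hands large buffers to a callback instead of embedding
# them, so they can be transferred without extra copies.
buffers = []
out_of_band = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The same buffers must be supplied (in order) when unpickling.
restored = pickle.loads(out_of_band, buffers=buffers)
assert (restored == arr).all()
print(len(in_band), len(out_of_band), len(buffers))
```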

On Thu, May 27, 2021 at 5:15 PM Stephan Hoyer  wrote:

> I'm unlikely to have bandwidth to take this one on, but I do think it
> would be quite valuable!
>
> On Thu, May 27, 2021 at 4:42 PM Brian Hulette  wrote:
>
>> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
>> you have any interest in taking it on?
>>
>> On Tue, May 25, 2021 at 3:09 PM Brian Hulette 
>> wrote:
>>
>>> Hm this would definitely be of interest for the DataFrame API, which is
>>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>>> that pandas supports out-of-band pickling since DataFrames are mostly just
>>> collections of numpy arrays.
>>>
>>> Brian
>>>
>>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>>
>>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:
>>>
>>>> Beam's PickleCoder would need to be updated to pass the
>>>> "buffer_callback" argument into pickle.dumps() and the "buffers" argument
>>>> into pickle.loads(). I expect this would be relatively straightforward.
>>>>
>>>> Then it should "just work", assuming that data is stored in objects
>>>> (like NumPy arrays or wrappers of NumPy arrays) that implement the
>>>> out-of-band Pickle protocol.
>>>>
>>>>
>>>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette 
>>>> wrote:
>>>>
>>>>> I'm not aware of anyone looking at it.
>>>>>
>>>>> Will out-of-band pickling "just work" in Beam for types that implement
>>>>> the correct interface in Python 3.8?
>>>>>
>>>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> FWIW I recently ran into the exact case you described (high
>>>>>> serialization cost). The solution was to implement some not-so-intuitive
>>>>>> alternative transforms in my case, but I would have very much appreciated
>>>>>> faster serialization performance.
>>>>>>
>>>>>> Thanks,
>>>>>> Evan
>>>>>>
>>>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer 
>>>>>> wrote:
>>>>>>
>>>>>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>>>>>> i.e., Pickle protocol version 5?
>>>>>>> https://www.python.org/dev/peps/pep-0574/
>>>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>>>>
>>>>>>> For Beam pipelines passing around NumPy arrays (or collections of
>>>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization 
>>>>>>> costs
>>>>>>> can be significant. Beam seems to currently incur at least one (maybe
>>>>>>> two) unnecessary memory copies.
>>>>>>>
>>>>>>> Pickle protocol version 5 exists for solving exactly this problem.
>>>>>>> You can serialize collections of arbitrary Python objects in a fully
>>>>>>> streaming fashion using memory buffers. This is a Python 3.8 feature, 
>>>>>>> but
>>>>>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>>>>>>> been supported by NumPy since version 1.16, released in January 2019.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Stephan
>>>>>>>
>>>>>>


Re: Cartesian product of PCollections

2022-09-19 Thread Brian Hulette via dev
In SQL we just don't support cross joins currently [1]. I'm not aware of an
existing implementation of a cross join/cartesian product.

> My team has an internal implementation of a CartesianProduct transform,
based on using hashing to split a pcollection into a finite number of
groups and CoGroupByKey.

Could this be contributed to Beam?

> On the other hand, if any of the input pcollections are small, using side
inputs would probably be the way to go to avoid the need for a shuffle.

We run into this problem frequently in Beam SQL. Our optimizer could be
much more effective with accurate size estimates, but we rarely have
them, and they may never be good enough for us to select a side input
implementation over CoGroupByKey. I've had some offline discussions in this
space; the best solution we've come up with is to allow hints in SQL (or
just arguments in join transforms) that let users select a side input
implementation. We could also add some logging when a pipeline uses a
CoGroupByKey but the PCollection sizes could have been handled by a side input
implementation, to nudge users that way for future runs.

Brian

[1] https://beam.apache.org/documentation/dsls/sql/extensions/joins/
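
For reference, the side-input flavor mentioned above can be written in a few
lines of Python today. This is only a sketch (and only suitable when one
PCollection is small enough to broadcast), not the hashing + CoGroupByKey
CartesianProduct transform Stephan describes:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    big = p | 'Big' >> beam.Create(range(1000))
    small = p | 'Small' >> beam.Create(['a', 'b', 'c'])

    # Broadcast the small PCollection as a side input and pair every element
    # of the large PCollection with every element of the small one.
    product = big | 'Cross' >> beam.FlatMap(
        lambda left, rights: ((left, right) for right in rights),
        rights=beam.pvalue.AsIter(small))

    product | beam.Map(print)
```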

On Mon, Sep 19, 2022 at 8:01 AM Stephan Hoyer via dev 
wrote:

> I'm wondering if it would make sense to have a built-in Beam
> transformation for calculating the Cartesian product of PCollections.
>
> Just this past week, I've encountered two separate cases where calculating
> a Cartesian product was a bottleneck. The in-memory option of using
> something like Python's itertools.product() is convenient, but it only
> scales to a single node.
>
> Unfortunately, implementing a scalable Cartesian product seems to be
> somewhat non-trivial. I found two version of this question on
> StackOverflow, but neither contains a code solution:
>
> https://stackoverflow.com/questions/35008721/how-to-get-the-cartesian-product-of-two-pcollections
>
> https://stackoverflow.com/questions/41050477/how-to-do-a-cartesian-product-of-two-pcollections-in-dataflow/
>
> There's a fair amount of nuance in an efficient and scalable
> implementation. My team has an internal implementation of a
> CartesianProduct transform, based on using hashing to split a pcollection
> into a finite number of groups and CoGroupByKey. On the other hand, if any
> of the input pcollections are small, using side inputs would probably be
> the way to go to avoid the need for a shuffle.
>
> Any thoughts?
>
> Cheers,
> Stephan
>


Re: [Infrastructure] Periodically run Java microbenchmarks on Jenkins

2022-09-15 Thread Brian Hulette via dev
Is there somewhere we could document this?

On Thu, Sep 15, 2022 at 6:45 AM Moritz Mack  wrote:

> Thank you, Andrew!
>
> Exactly what I was looking for, that’s awesome!
>
>
>
> On 15.09.22, 06:37, "Alexey Romanenko"  wrote:
>
>
>
>
>
> Ahh, great! I didn’t know that 'beam-perf’ label is used for that.
> Thanks!
>
> > On 14 Sep 2022, at 17:47, Andrew Pilloud  wrote:
> >
> > We do have a dedicated machine for benchmarks. This is a single
> > machine limited to running one test at a time. Set the
> > jenkinsExecutorLabel for the job to 'beam-perf' to use it. For
> > example:
> >
> https://urldefense.com/v3/__https://github.com/apache/beam/blob/66bbee84ed477d86008905646e68b100591b6f78/.test-infra/jenkins/job_PostCommit_Java_Nexmark_Direct.groovy*L36__;Iw!!CiXD_PY!Qat2J4NAyHVo4Cc32PKMn50yw8LgWHmEOm4Ltb7aRV-7KCfNamu3tGOiSYKDUZhLHKu3zlqbBXzJNiX_f_Qteg$
> 
>
> >
> > Andrew
> >
> > On Wed, Sep 14, 2022 at 8:28 AM Alexey Romanenko
> >  wrote:
> >>
> >> I think it depends on the goal why to run that benchmarks. In ideal
> case, we need to run them on the same dedicated machine(s) and with the
> same configuration all the time but I’m not sure that it can be achieved in
> current infrastructure reality.
> >>
> >> On the other hand, IIRC, the initial goal of benchmarks, like Nexmark,
> was to detect fast any major regressions, especially between releases, that
> are not so sensitive to ideal conditions. And here we a field for
> improvements.
> >>
> >> —
> >> Alexey
> >>
> >> On 13 Sep 2022, at 22:57, Kenneth Knowles  wrote:
> >>
> >> Good idea. I'm curious about our current benchmarks. Some of them run
> on clusters, but I think some of them are running locally and just being
> noisy. Perhaps this could improve that. (or if they are running on local
> Spark/Flink then maybe the results are not really meaningful anyhow)
> >>
> >> On Tue, Sep 13, 2022 at 2:54 AM Moritz Mack  wrote:
> >>>
> >>> Hi team,
> >>>
> >>>
> >>>
> >>> I’m looking for some help to setup infrastructure to periodically run
> Java microbenchmarks (JMH).
> >>>
> >>> Results of these runs will be added to our community metrics
> (InfluxDB) to help us track performance, see [1].
> >>>
> >>>
> >>>
> >>> To prevent noisy runs this would require a dedicated Jenkins machine
> that runs at most one job (benchmark) at a time. Benchmark runs take quite
> some time, but on the other hand they don’t have to run very frequently
> (once a week should be fine initially).
> >>>
> >>>
> >>>
> >>> Thanks so much,
> >>>
> >>> Moritz
> >>>
> >>>
> >>>
> >>> [1]
> https://urldefense.com/v3/__https://github.com/apache/beam/pull/23041__;!!CiXD_PY!Qat2J4NAyHVo4Cc32PKMn50yw8LgWHmEOm4Ltb7aRV-7KCfNamu3tGOiSYKDUZhLHKu3zlqbBXzJNiUkaqlEKQ$
> 
>
> >>>
> >>>
> >>>
> >>>
> >>
>
>
>
>


Re: What to do about issues that track flaky tests?

2022-09-15 Thread Brian Hulette via dev
I agree with Austin on this one, it makes sense to be realistic, but I'm
concerned about just blanket reducing the priority on all flakes. Two
classes of issues that could certainly be dropped to P2:
- Issues tracking flakes that have not been sickbayed yet (e.g.
https://github.com/apache/beam/issues/21266). These tests are still
providing signal (we should notice if it goes perma-red), and clearly the
flakes aren't so painful that someone felt the need to sickbay it.
- A sickbayed test, iff a breakage in the functionality it's testing would
be P2. This is admittedly difficult to identify.

It looks like we don't have a way to label sickbayed tests (or the inverse,
currently-failing); maybe we should have one?

Another thing to note: this email is reporting _unassigned_ P1 issues,
another way to remove issues from the search results would be to ensure
each flake has an owner (somehow). Maybe that's just shifting the problem,
but it could avoid the tragedy of the commons. To Manu's point, maybe those
new owners will happily discover their flake is no longer a problem.

Brian

On Wed, Sep 14, 2022 at 5:58 PM Manu Zhang  wrote:

> Agreed. I also mentioned in a previous email that some issues have been
> open for a long time (before being migrated to GitHub) and it's possible
> that those tests can pass constantly now.
> We may double check and close them since reopening is just one click.
>
> Manu
>
> On Thu, Sep 15, 2022 at 6:58 AM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> +1 to being realistic -- proper labels are worthwhile.  Though, some
>> flaky tests probably should be P1, and just because isn't addressed in a
>> timely manner doesn't mean it isn't a P1 - though, it does mean it wasn't
>> addressed.
>>
>>
>>
>> On Wed, Sep 14, 2022 at 1:19 PM Kenneth Knowles  wrote:
>>
>>> I would like to make this alert email actionable.
>>>
>>> I went through most of these issues. About half are P1 "flake" issues. I
>>> don't think magically expecting them to be deflaked is helpful. So I have a
>>> couple ideas:
>>>
>>> 1. Exclude "flake" P1s from this email. This is what we used to do. But
>>> then... are they really P1s?
>>> 2. Make "flake" bugs P2 if they are not currently impacting our test
>>> signal. But then... we may have a gap in test coverage that could cause
>>> severe problems. But anyhow something that is P1 for a long time is not
>>> *really* P1, so it is just being realistic.
>>>
>>> What do you all think?
>>>
>>> Kenn
>>>
>>> On Wed, Sep 14, 2022 at 3:03 AM  wrote:
>>>
 This is your daily summary of Beam's current high priority issues that
 may need attention.

 See https://beam.apache.org/contribute/issue-priorities for the
 meaning and expectations around issue priorities.

 Unassigned P1 Issues:

 https://github.com/apache/beam/issues/23227 [Bug]: Python SDK
 installation cannot generate proto with protobuf 3.20.2
 https://github.com/apache/beam/issues/23179 [Bug]: Parquet size
 exploded for no apparent reason
 https://github.com/apache/beam/issues/22913 [Bug]:
 beam_PostCommit_Java_ValidatesRunner_Flink is flakey
 https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka
 SDF and fix known and discovered issues
 https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze
 at getConnection() in WriteFn
 https://github.com/apache/beam/issues/21794 Dataflow runner creates a
 new timer whenever the output timestamp is change
 https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't
 get output to Failed Inserts PCollection
 https://github.com/apache/beam/issues/21704
 beam_PostCommit_Java_DataflowV2 failures parent bug
 https://github.com/apache/beam/issues/21701
 beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors
 https://github.com/apache/beam/issues/21700
 --dataflowServiceOptions=use_runner_v2 is broken
 https://github.com/apache/beam/issues/21696 Flink Tests failure :
 java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.beam.runners.core.construction.SerializablePipelineOptions
 https://github.com/apache/beam/issues/21695 DataflowPipelineResult
 does not raise exception for unsuccessful states.
 https://github.com/apache/beam/issues/21694 BigQuery Storage API
 insert with writeResult retry and write to error table
 https://github.com/apache/beam/issues/21480 flake:
 FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
 https://github.com/apache/beam/issues/21472 Dataflow streaming tests
 failing new AfterSynchronizedProcessingTime test
 https://github.com/apache/beam/issues/21471 Flakes: Failed to load
 cache entry
 https://github.com/apache/beam/issues/21470 Test flake:
 test_split_half_sdf
 https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink
 flaky: Connection refused
 https://github.com/apache/beam/i

Re: Cannot find beam in project list on jira when I create issue

2022-09-07 Thread Brian Hulette via dev
Thank you Moritz for updating the docs!

On Wed, Sep 7, 2022 at 3:06 AM Moritz Mack  wrote:

> Sorry for the confusion. Beam migrated to using Github issues just
> recently and the confluence docs haven’t been updated yet.
>
>
>
> Please create a new issue under https://github.com/apache/beam/issues and
> then reference it in your commit message using the issue id, e.g.
>
> git commit -am “Description of change (closes #12345)”
>
>
>
> Regards,
>
> Moritz
>
>
>
> On 07.09.22, 11:20, "张涛"  wrote:
>
>
>
>
> Hi,I followed the step to create a pull request:
>
> https://cwiki.apache.org/confluence/display/BEAM/Git+Tips
> 
>
> in step 8, need a jira issue:
>
> but  I cannot find beam in  project list on jira when I create issue :
>
>
>
> I was confused about what I should do.  I'm looking forward to getting
> your help,thanks very much!
>
>
>
>


Re: A lesson about DoFn retries

2022-09-01 Thread Brian Hulette via dev
Thanks for sharing the learnings Ahmed!

> The solution lies in keeping the retry of each step separate. A good
example of this is in how steps 2 and 3 are implemented [3]. They are
separated into different DoFns and step 3 can start only after step 2
completes successfully. This way, any failure in step 3 does not go back to
affect step 2.

Is it enough just that they're in different DoFns? I thought
the key was that the DoFns are separated by a GroupByKey, so they will be
in different fused stages, which are retried independently.
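
To make the fused-stage point concrete, here is a rough Python sketch (the DoFn
names and table list are made up; BigQueryIO's actual implementation is in
Java). The shuffle between the two steps is what keeps their retries
independent:

```python
import apache_beam as beam

class CopyTempTable(beam.DoFn):
    def process(self, table):
        # ... copy `table` into the destination table ...
        yield table

class DeleteTempTable(beam.DoFn):
    def process(self, table):
        # ... delete the temp table ...
        yield table

with beam.Pipeline() as p:
    (p
     | beam.Create(['temp_1', 'temp_2', 'temp_3'])
     | 'Copy' >> beam.ParDo(CopyTempTable())
     # Reshuffle inserts a GroupByKey, so Copy and Delete end up in different
     # fused stages: a failure while deleting retries only the delete stage.
     | 'FusionBreak' >> beam.Reshuffle()
     | 'Delete' >> beam.ParDo(DeleteTempTable()))
```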

Brian

On Thu, Sep 1, 2022 at 1:43 PM Ahmed Abualsaud via dev 
wrote:

> Hi all,
>
> TLDR: When writing IO connectors, be wary of how bundle retries can affect
> the work flow.
>
> A faulty implementation of a step in BigQuery batch loads was discovered
> recently. I raised an issue [1] but also wanted to mention it here as a
> potentially helpful lesson for those developing new/existing IO connectors.
>
> For those unfamiliar with BigQueryIO file loads, a write that is too large
> for a single load job [2] looks roughly something like this:
>
>
>1.
>
>Take input rows and write them to temporary files.
>2.
>
>Load temporary files to temporary BQ tables.
>3.
>
>Delete temporary files.
>4.
>
>Copy the contents of temporary tables over to the final table.
>5.
>
>Delete temporary tables.
>
>
> The faulty part here is that steps 4 and 5 are done in the same DoFn (4 in
> processElement and 5 in finishBundle). In the case a bundle fails in the
> middle of table deletion, let’s say an error occurs when deleting the nth
> table, the whole bundle will retry and we will perform the copy again. But
> tables 1~n have already been deleted and so we get stuck trying to copy
> from non-existent sources.
>
> The solution lies in keeping the retry of each step separate. A good
> example of this is in how steps 2 and 3 are implemented [3]. They are
> separated into different DoFns and step 3 can start only after step 2
> completes successfully. This way, any failure in step 3 does not go back to
> affect step 2.
>
> That's all, thanks for your attention :)
>
> Ahmed
>
> [1] https://github.com/apache/beam/issues/22920
>
> [2]
> https://github.com/apache/beam/blob/f921a2f1996cf906d994a9d62aeb6978bab09dd5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L100-L105
>
>
> [3]
> https://github.com/apache/beam/blob/149ed074428ff9b5344169da7d54e8ee271aaba1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L437-L454
>
>
>


Re: [DISCUSS] Dependency management in Apache Beam Python SDK

2022-08-25 Thread Brian Hulette via dev
Thanks for writing this up Valentyn!

I'm curious Jarek, does Airflow take any dependencies on popular libraries
like pandas, numpy, pyarrow, scipy, etc... which users are likely to have
their own dependency on? I think these dependencies are challenging in a
different way than the client libraries - ideally we would support a wide
version range so as not to require users to upgrade those libraries in
lockstep with Beam. However in some cases our dependency is pretty tight
(e.g. the DataFrame API's dependency on pandas), so we need to make sure to
explicitly test with multiple different versions. Does Airflow have any
similar issues?

Thanks!
Brian

On Thu, Aug 25, 2022 at 5:36 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> Hi Jarek,
>
> Thanks a lot for detailed feedback and sharing the Airflow story, this is
> exactly what I was hoping to hear in response from the mailing list!
>
> 600+ dependencies is very impressive, so I'd be happy to chat more and
> learn from your experience.
>
> On Wed, Aug 24, 2022 at 5:50 AM Jarek Potiuk  wrote:
>
>> Comment (from a bit outsider)
>>
>> Fantastic document Valentyn.
>>
>> Very, very insightful and interesting. We feel a lot of the same pain in
>> Apache Airflow (actually even more because we have not 20 but 620+
>> dependencies) but we are also a bit more advanced in the way how we are
>> managing the dependencies - some of the ideas you had there are already
>> tested and tried in Airflow, some of them are a bit different but we can
>> definitely share "principles" and we are a little higher in the "supply
>> chain" (i.e. Apache Beam Python SDK is our dependency).
>>
>> I left some suggestions and some comments describing in detail how the
>> same problems look like in Airflow and how we addressed them (if we did)
>> and I am happy to participate in further discussions. I am "the dependency
>> guy" in Airflow and happy to share my experiences and help to work out some
>> problems - and especially help to solve problems coming from using multiple
>> google-client libraries and diamond dependencies (we are just now dealing
>> with similar issue - where likely we will have to do a massive update of
>> several of our clients - hopefully with the involvement of Composer team.
>> And I'd love to be involved in a joint discussion with the google client
>> team to work out some common and expectations that we can rely on when we
>> define our future upgrade strategy for google clients.
>>
>> I will watch it here and be happy to spend quite some time on helping to
>> hash it out.
>>
>> BTW. You can also watch my talk I gave last year at PyWaw about "Managing
>> Python dependencies at Scale"
>> https://www.youtube.com/watch?v=_SjMdQLP30s&t=2549s where I explain the
>> approach we took, reasoning behind it etc.
>>
>> J.
>>
>>
>> On Wed, Aug 24, 2022 at 2:45 AM Valentyn Tymofieiev via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> Recently, several issues [1-3]  have highlighted outage risks and
>>> developer inconveniences due to  dependency management practices in Beam
>>> Python.
>>>
>>> With dependabot and other tooling  that we have integrated with Beam,
>>> one of the missing pieces seems to be having a clear guideline of how we
>>> should be specifying requirements for our dependencies and when and how we
>>> should be updating them to have a sustainable process.
>>>
>>> As a conversation starter, I put together a retrospective
>>> [4]
>>> covering a recent incident and would like to get community opinions on the
>>> open questions.
>>>
>>> In particular, if you have experience managing dependencies for other
>>> Python libraries with rich dependency chains, knowledge of available
>>> tooling or first hand experience dealing with other dependency issues in
>>> Beam, your input would be greatly appreciated.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> [1] https://github.com/apache/beam/issues/22218
>>> [2] https://github.com/apache/beam/pull/22550#issuecomment-1217348455
>>> [3] https://github.com/apache/beam/issues/22533
>>> [4]
>>> https://docs.google.com/document/d/1gxQF8mciRYgACNpCy1wlR7TBa8zN-Tl6PebW-U8QvBk/edit
>>>
>>


Re: Incomplete Beam Schema -> Avro Schema conversion

2022-08-22 Thread Brian Hulette via dev
I don't think there's a reason for this; it's just that these logical types
were defined after the Avro <-> Beam schema conversion was implemented. I think it would be
worthwhile to add support for them, but we'd also need to look at the
reverse (avro to beam) direction, which would map back to the catch-all
DATETIME primitive type [1]. Changing that could break backwards
compatibility.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L771-L776

On Wed, Aug 17, 2022 at 2:53 PM Balázs Németh  wrote:

> java.lang.RuntimeException: Unhandled logical type
> beam:logical_type:date:v1
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.getFieldSchema(AvroUtils.java:943)
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroField(AvroUtils.java:306)
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java:341)
>   at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toAvroSchema(AvroUtils.java
>
> In
> https://github.com/apache/beam/blob/7bb755906c350d77ba175e1bd990096fbeaf8e44/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L902-L944
> it seems to me there are some missing options.
>
> For example
> - FixedBytes.IDENTIFIER,
> - EnumerationType.IDENTIFIER,
> - OneOfType.IDENTIFIER
> is there, but:
> - org.apache.beam.sdk.schemas.logicaltypes.Date.IDENTIFIER
> ("beam:logical_type:date:v1")
> - org.apache.beam.sdk.schemas.logicaltypes.DateTime.IDENTIFIER
> ("beam:logical_type:datetime:v1")
> - org.apache.beam.sdk.schemas.logicaltypes.Time.IDENTIFIER
> ("beam:logical_type:time:v1")
> is missing.
>
> This in an example that fails:
>
>> import java.time.LocalDate;
>> import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
>> import org.apache.beam.sdk.schemas.Schema;
>> import org.apache.beam.sdk.schemas.Schema.FieldType;
>> import org.apache.beam.sdk.schemas.logicaltypes.SqlTypes;
>> import org.apache.beam.sdk.schemas.utils.AvroUtils;
>> import org.apache.beam.sdk.values.Row;
>
> // ...
>
> final Schema schema =
>> Schema.builder()
>> .addField("ymd",
>> FieldType.logicalType(SqlTypes.DATE))
>> .build();
>>
>> final Row row =
>> Row.withSchema(schema)
>> .withFieldValue("ymd", LocalDate.now())
>> .build();
>>
>> System.out.println(BigQueryUtils.toTableSchema(schema)); // works
>> System.out.println(BigQueryUtils.toTableRow(row)); // works
>>
>> System.out.println(AvroUtils.toAvroSchema(schema)); // fails
>> System.out.println(AvroUtils.toGenericRecord(row)); // fails
>
>
> Am I missing a reason for that or is it just not done properly yet? If
> this is the case, am I right to assume that they should be represented in
> the Avro format as the already existing cases?
> "beam:logical_type:date:v1" vs "DATE"
> "beam:logical_type:time:v1" vs "TIME"
>
>
>


Re: Representation of logical type beam:logical_type:datetime:v1

2022-08-12 Thread Brian Hulette via dev
Ah sorry, I forgot that INT64 is encoded with VarIntCoder, so we can't
simulate TimestampCoder with a logical type.
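
A quick standalone illustration of why the two encodings are not
interchangeable (plain Python, with a simplified unsigned varint just to show
the shape of the problem):

```python
import struct

millis = 1_660_000_000_000  # an example millisecond timestamp

# Fixed-width big-endian int64, the layout a BigEndianLongCoder writes:
fixed = struct.pack('>q', millis)          # always 8 bytes

# A (simplified, unsigned) varint, the layout used for INT64 schema fields:
def varint(n):
    out = bytearray()
    while True:
        if n < 0x80:
            out.append(n)
            return bytes(out)
        out.append((n & 0x7F) | 0x80)
        n >>= 7

print(len(fixed), len(varint(millis)))     # 8 vs. 6 bytes for this value
```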

I think the ideal end state would be to have a well-defined
beam:logical_type:millis_instant that we use for cross-language (when
appropriate), and never use DATETIME at cross-language boundaries. Would it
be possible to add millis_instant, and use that for JDBC read/write instead
of DATETIME?

Separately we could consider how to resolve the conflicting definitions of
beam:logical_type:datetime:v1. I'm not quite sure how/if we can do that
without breaking pipeline update.

Brian


On Fri, Aug 12, 2022 at 7:50 AM Yi Hu via dev  wrote:

> Hi Cham,
>
> Thanks for the comments.
>
>
>>
>>>
>>> ii. "beam:logical_type:instant:v1" is still backed by INT64, but in
>>> implementation it will use BigEndianLongCoder to encode/decode the stream.
>>>
>>>
>> Is this to be compatible with the current Java implementation ? And we
>> have to update other SDKs to use big endian coders when encoding/decoding
>> the "beam:logical_type:instant:v1" logical type ?
>>
>>
> Yes, and the proposal is aimed to keep the Java SDK change minimal; we
> have to update other SDKs to make it work. Currently python and go sdk does
> not implement "beam:logical_type:datetime:v1" (will
> be "beam:logical_type:instant:v1") at all.
>
>
>>
>>
>>> For the second step ii, the problem is that there is a primitive type
>>> backed by a fixed length integer coder. Currently INT8, INT16, INT32,
>>> INT64... are all backed by VarInt (and there is ongoing work to use fixed
>>> size big endian to encode INT8, INT16 (
>>> https://github.com/apache/beam/issues/19815)). Ideally I would think
>>> (INT8, INT16, INT32, INT64) are all fixed and having a generic (INT)
>>> primitive type is backed by VarInt. But this may be a more substantial
>>> change for the current code base.
>>>
>>
>> I'm a bit confused by this. Did you mean that there's *no* primitive
>> type backed by a fixed length integer coder ? Also, by primitive, I'm
>> assuming you mean Beam Schema types here.
>>
>>
> Yes I mean Beam Schema types here. The proto for datetime(instant) logical
> type is constructed here:
> https://github.com/apache/beam/blob/cf9ea1f442636f781b9f449e953016bb39622781/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaTranslation.java#L202
> It is represented by an INT64 atomic type. In cross-language case, another
> SDK receives proto and decodes the stream according to the proto. Currently
> I do not see an atomic type that will be decoded using a fixed-length
> BigEndianLong coder. INT8, ..., INT64 will all be decoded with VarInt.
>
> As a workaround in the PR (#22561), in python's RowCoder I explicitly set
> the coder for "beam:logical_type:datetime:v1" (will
> be "beam:logical_type:instant:v1") to be TimestampCoder. I do not find a
> way to keep the logic contained in the logical type implementation, e.g. in
> to_language_type and to_representation_type method. To do this I will need
> an atomic type that is decoded using the BigEndianLong coder.
> Please point out if I was wrong.
>
> Best,
> Yi
>


Re: Design Doc for Controlling Batching in RunInference

2022-08-12 Thread Brian Hulette via dev
Hi Andy,

Thanks for writing this up! This seems like something that Batched DoFns
could help with. Could we make a BatchConverter [1] that represents the
necessary transformations here, and define RunInference as a Batched DoFn?
Note that the Numpy BatchConverter already enables users to specify a batch
dimension using a custom typehint, like NumpyArray[np.int64, (N, 10)] (the
N identifies the batch dimension) [2]. I think we could do something
similar, but with pytorch types. It's likely we'd need to define our own
typehints though; I suspect pytorch typehints aren't already parameterized
by size.
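
For context, the "existing batching dimensions, or different sizes" problem the
design doc describes is easy to see with plain numpy (a standalone
illustration, not the RunInference or BatchConverter API):

```python
import numpy as np

# Elements that already carry a batch dimension, with different sizes:
elements = [np.zeros((4, 10)), np.zeros((2, 10))]

# The default "stack along a new axis" approach fails on ragged inputs:
# np.stack(elements)  # -> ValueError: all input arrays must have the same shape

# A custom stacking function could instead concatenate along the existing axis:
batch = np.concatenate(elements, axis=0)
print(batch.shape)  # (6, 10)
```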

Brian


[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
[2]
https://github.com/apache/beam/blob/3173b503beaf30c4d32a4a39c709fd81e8161907/sdks/python/apache_beam/typehints/batch_test.py#L42

On Fri, Aug 12, 2022 at 12:36 PM Andy Ye via dev 
wrote:

> Hi everyone,
>
> I've written up a design doc
> 
>  [1]
> on controlling batching in RunInference. I'd appreciate any feedback.
> Thanks!
>
> Summary:
> Add a custom stacking function to RunInference to enable users to control
> how they want their data to be stacked. This addresses issues regarding
> data that have existing batching dimensions, or different sizes.
>
> Best,
> Andy
>
> [1]
> https://docs.google.com/document/d/1l40rOTOEqrQAkto3r_AYq8S_L06dDgoZu-4RLKAE6bo/edit#
>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Brian Hulette via dev
Thanks Cham! I really like the proposal; I left a few comments. I also had
one higher-level point I wanted to elevate here:

> Pipeline SDKs can generate user-friendly stub-APIs based on transforms
registered with an expansion service, eliminating the need to develop
language-specific wrappers.

This would be great! I think one point to consider is whether we can do
this statically. We could package up these stubs with releases and include
them in API docs for each language, making them much more discoverable.
That could be an extension on top of your proposal (e.g. as part of its
build, each SDK spins up other known expansion services and generates code
based on the discovery responses), but maybe it could be cleaner if we
don't really need the dynamic version?
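
As a rough illustration of what a statically generated stub might look like in
Python (everything here — the class name, URN, config field, and port — is made
up for the sketch; it just wraps the existing ExternalTransform machinery):

```python
import apache_beam as beam
from apache_beam.transforms.external import ExternalTransform, ImplicitSchemaPayloadBuilder

class SomeJavaTransform(beam.PTransform):
    """Hypothetical generated wrapper for a transform registered with an
    expansion service."""

    URN = 'beam:schematransform:org.example:some_transform:v1'  # made up

    def __init__(self, config_value, expansion_service='localhost:8097'):
        self._config_value = config_value
        self._expansion_service = expansion_service

    def expand(self, pcoll):
        # Delegate expansion to the external service at pipeline-construction time.
        return pcoll | ExternalTransform(
            self.URN,
            ImplicitSchemaPayloadBuilder({'config_value': self._config_value}),
            self._expansion_service)
```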

Brian


On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> Hi All,
>
> I believe we can make the multi-language pipelines offering [1] much
> easier to use by updating the expansion service to be fully aware of
> SchemaTransforms. Additionally this will make it easy to
> register/discover/use transforms defined in one SDK from all other SDKs.
> Specifically we could add the following features.
>
>- Expansion service can be used to easily initialize and expand
>transforms without need for additional code.
>- Expansion service can be used to easily discover already registered
>transforms.
>- Pipeline SDKs can generate user-friendly stub-APIs based on
>transforms registered with an expansion service, eliminating the need to
>develop language-specific wrappers.
>
> Please see here for my proposal: https://s.apache.org/easy-multi-language
>
> Lemme know if you have any comments/questions/suggestions :)
>
> Thanks,
> Cham
>
> [1]
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>
>


Re: Join a meeting to help coordinate implementing a Dask Runner for Beam

2022-08-03 Thread Brian Hulette via dev
I wanted to share that Ryan gave a presentation about his (and Charles')
work on Pangeo Forge at Scipy 2022 (in Austin just before Beam Summit!),
with a couple mentions of their transition to Beam [1]. There were also a
couple of other talks about Pangeo [2,3] with some Beam/xarray-beam
references in there.

[1]
https://www.youtube.com/watch?v=sY20UpYCAEE&list=PLYx7XA2nY5Gde0WF1yswQw5InhmSNED8o&index=9
[2]
https://www.youtube.com/watch?v=7niNfs3ZpfQ&list=PLYx7XA2nY5Gfb0tQyezb4Gsf1nVsy86zt&index=2
[3]
https://www.youtube.com/watch?v=ftlgOESINvo&list=PLYx7XA2nY5Gfb0tQyezb4Gsf1nVsy86zt&index=3

On Tue, Jun 21, 2022 at 9:29 AM Ahmet Altay  wrote:

> Were you able to meet? If yes, I would be very interested in a summary if
> someone would like to share that :)
>
> On Mon, Jun 13, 2022 at 9:16 AM Pablo Estrada  wrote:
>
>> Also added my availability... please do invite me as well : )
>> -P.
>>
>> On Mon, Jun 13, 2022 at 6:57 AM Kenneth Knowles  wrote:
>>
>>> I would love to try to join any meetings if you add me. My calendar is
>>> too chaotic to be useful on the when2meet :-) but I can often move things
>>> around.
>>>
>>> Kenn
>>>
>>> On Wed, Jun 8, 2022 at 2:50 PM Brian Hulette 
>>> wrote:
>>>
>>>> Thanks for reaching out, Ryan, this sounds really cool. I added my
>>>> availability to the calendar since I'm interested in this space, but I'm
>>>> not sure I can offer much help - I don't have any experience building a
>>>> runner, to date I've worked exclusively on the SDK side of Beam. So I hope
>>>> some other folks can join as well :)
>>>>
>>>> @Pablo Estrada  might have some useful insight -
>>>> he's been working on a spike to build a Ray runner.
>>>>
>>>>
>>>> On Wed, Jun 8, 2022 at 12:53 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> This sounds like a great project. Unfortunately I wouldn't be able to
>>>>> meet next week, but would be happy to meet some other time and if that
>>>>> doesn't work answer questions over email, etc. Looking forward to a
>>>>> Dask runner.
>>>>>
>>>>> On Wed, Jun 8, 2022 at 9:04 AM Ryan Abernathey
>>>>>  wrote:
>>>>> >
>>>>> > Dear Beamer,
>>>>> >
>>>>> > Thank you for all of your work on this amazing project. I am new to
>>>>> Beam and am quite excited about its potential to help with some data
>>>>> processing challenges in my field of climate science.
>>>>> >
>>>>> > Our community is interested in running Beam on Dask Distributed
>>>>> clusters, which we already know how to deploy. This has been discussed at
>>>>> https://issues.apache.org/jira/browse/BEAM-5336 and
>>>>> https://github.com/apache/beam/issues/18962. It seems technically
>>>>> feasible.
>>>>> >
>>>>> > We are trying to organize a meeting next week to kickstart and
>>>>> coordinate this effort. It would be great if we could entrain some Beam
>>>>> maintainers into this meeting. If you have interest in this topic and are
>>>>> available next week, please share your availability here -
>>>>> https://www.when2meet.com/?15861604-jLnA4
>>>>> >
>>>>> > Alternatively, if you have any guidance or suggestions you wish to
>>>>> provide by email or GitHub discussion, we welcome your input.
>>>>> >
>>>>> > Thanks again for your open source work.
>>>>> >
>>>>> > Best,
>>>>> > Ryan Abernathey
>>>>> >
>>>>>
>>>>


[ANNOUNCE] Apache Beam 2.37.0 Released

2022-03-09 Thread Brian Hulette
The Apache Beam team is pleased to announce the release of version 2.37.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam blog: https://beam.apache.org/blog/beam-2.37.0/

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.37.0.

-- Brian Hulette, on behalf of The Apache Beam team


Re: [VOTE] Release 2.33.0, release candidate 2

2021-10-07 Thread Brian Hulette
+1 (non-binding)

I ran a few pipelines with DataflowRunner, including a DataFrame API
(Python 3.8.6, pandas 1.2.4) pipeline and Java SDK wordcount pipelines with
Java 8 and Java 11.

On Thu, Oct 7, 2021 at 8:47 AM Jan Lukavský  wrote:

> +1 (non-binding)
>
> I verified build and tests of my use-cases based on Java SDK with
> DirectRunner and FlinkRunner.
>
>  Jan
> On 10/6/21 6:48 PM, Lukáš Drbal wrote:
>
> +1 (non-binding) validated by running our pipelines (Java SDK, Flink
> Runner, Flink 1.13.2)
>
> Thanks!
>
> On Mon, Oct 4, 2021 at 11:27 PM Udi Meiri  wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #2 for the version 2.33.0,
>> as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> Reviewers are encouraged to test their own use cases with the release
>> candidate, and vote +1 if
>> no issues are found.
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org [2],
>> which is signed with the key with fingerprint 587B049C36DAAFE6 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.33.0-RC2" [5],
>> * website pull request listing the release [6], the blog post [6], and
>> publishing the API reference manual [7].
>> * Java artifacts were built with Maven 3.6.3 and OpenJDK 1.8.0_181.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2] and pypi[8].
>> * Validation sheet with a tab for 2.33.0 release to help with validation
>> [9].
>> * Docker images published to Docker Hub [10].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> For guidelines on how to try the release in your projects, check out our
>> blog post at https://beam.apache.org/blog/validate-beam-release/.
>>
>> Thanks,
>> Release Manager
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12350404
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.33.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1235/
>> [5] https://github.com/apache/beam/tree/v2.33.0-RC2
>> [6] https://github.com/apache/beam/pull/15543
>> [7] https://github.com/apache/beam-site/pull/619
>> [8] https://pypi.org/project/apache-beam/2.33.0rc2/
>> [9]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1705275493
>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>
>>


Re: Will Apache Beam adopt a Pandas-like syntax to program in Python?

2021-09-29 Thread Brian Hulette
Hi David,

Yes! Apache Beam now has a DataFrame API [1], which provides similar
functionality. It exited experimental in Beam 2.32.0 [2]. You can see some
example pipelines that use it here [3].

Brian

[1] https://beam.apache.org/documentation/dsls/dataframes/overview/
[2] https://beam.apache.org/blog/beam-2.32.0/
[3]
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/dataframe

On Wed, Sep 29, 2021 at 12:14 PM David Ciudad Gomez <
david.ciudad.go...@gmail.com> wrote:

> Hi,
>
> Apache Spark is adopting a new Pandas-like syntax (
> https://github.com/databricks/koalas) for programming in Python. Will
> Apache Beam adopt a similar syntax in the future?
>
> Thanks and best regards.
>
> David Ciudad
>


Re: Java PreCommit hanging

2021-06-24 Thread Brian Hulette
All of the ci-beam workers are offline; I'm not sure why. I just raised
this with infra in slack [1]. It seems likely that it is related to this
issue.

[1] https://the-asf.slack.com/archives/CBX4TSBQ8/p1624554031043200

On Thu, Jun 24, 2021 at 9:59 AM Alexey Romanenko 
wrote:

> Hello,
>
> This job was supposed to fail because of checkstyle issue but it’s still
> running more than one day:
> https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/18127/
>
> Don’t we have any timeout for such cases?


[PROPOSAL] Add a "dsls-dataframe-api" component in jira

2021-06-21 Thread Brian Hulette
Hi all,
I'd like to propose that we add a new component in jira for the DataFrame
API [1]. I've just been using the label "dataframe-api" [2] to keep these
jiras organized, but I think it's time to graduate to a component, which
should be more discoverable for anyone else filing a bug or feature
request. I think we should use the name "dsls-dataframe-api" for the
component, to be consistent with SQL's "dsl-sql" component.

If there aren't any objections could a PMC member help with this?

Thanks!
Brian

PS If anyone is confused about the "DSL" designation I opted to document it
that way since it is analogous to SQL - it's possible to embed DataFrames
within a larger Python pipeline via DataframeTransform (similar to
SqlTransform), but it's also possible to describe a complete pipeline using
the DataFrame API because it includes its own IOs [3].

[1] https://beam.apache.org/documentation/dsls/dataframes/overview/
[2]
https://issues.apache.org/jira/issues/?jql=project=BEAM%20AND%20labels=dataframe-api
[3]
https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.dataframe.io.html


Re: Aliasing Pub/Sub Lite IO in external repo

2021-06-18 Thread Brian Hulette
How will this be communicated to the user? The idea is that they will
discover PubsubLiteIO through their IDE as you described, but that will get
them to the Beam one that's subject to the long release cycle. Will it just
be documented somewhere that users should prefer
com.google.cloud.pubsublite.beam.PubsubLiteIO if there's a recent fix they
need?

I wonder if a similar result could be achieved just by making Beam's
PubsubLiteIO a stub with no implementation that directs users to the
com.google.cloud one somehow?

junit's matcher interface comes to mind as a precedent here. I have been
warned many times by
Matcher._dont_implement_Matcher___instead_extend_BaseMatcher_ [1].

[1]
https://junit.org/junit4/javadoc/4.13/org/hamcrest/Matcher.html#_dont_implement_Matcher___instead_extend_BaseMatcher_()

Brian

On Thu, Jun 17, 2021 at 3:56 PM Daniel Collins  wrote:

> > Question 1: How are you going to approach testing/CI?
> The pull requests in the java-pubsublite repo do not trigger Beam repo's
> CI. You want to deliver things to your customers after they are tested as
> much as possible.
>
> I'd like to run the integration tests in both locations. They would only
> be meaningful in the beam setup when we went to validate a version bump on
> the I/O.
>
> > Question2 : in the code below, what is the purpose of keeping the
> PubsubLiteIO in the Beam repo?
>
> Visibility and autocomplete. It means the core class will be in the beam
> javadoc and if you type `import org.apache.beam.sdk.io.gcp.pubsu` in an IDE
> you'll see pubsublite and PubsubLiteIO.
>
> On Thu, Jun 17, 2021 at 5:35 PM Tomo Suzuki  wrote:
>
>> Hi Daniel,
>> (You helped me apply some change to this strange setup a few months back.
>> Thank you for working on rectifying the situation.)
>>
>> I like that idea overall.
>>
>> Question 1: How are you going to approach testing/CI?
>> The pull requests in the java-pubsublite repo do not trigger Beam repo's
>> CI. You want to deliver things to your customers after they are tested as
>> much as possible.
>>
>>
>> Question2 : in the code below, what is the purpose of keeping the
>> PubsubLiteIO in the Beam repo?
>>
>> ```
>> class PubsubLiteIO extends com.google.cloud.pubsublite.beam.PubsubLiteIO
>> {}
>> 
>>
>> The backward compatibility came to my mind but I thought you may have
>> more reasons.
>>
>>
>> My memo:
>> java-pubsublite repsitory has:
>> https://github.com/googleapis/java-pubsublite/blob/master/pubsublite-beam-io/src/main/java/com/google/cloud/pubsublite/beam/PubsubLiteIO.java
>> beam repo has:
>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsublite/PubsubLiteIO.java
>> (and other files in the same directory)
>> google-cloud-pubsublite is not part of the Libraries BOM (yet) because of
>> its pre-1.0 status.
>>
>>
>> On Thu, Jun 17, 2021 at 5:07 PM Daniel Collins 
>> wrote:
>>
>>> I don't know that the cycle would cause a problem- wouldn't it override
>>> and cause it to use beam-sdks-java-core:2.30.0 (at least until beam goes to
>>> 3.X.X)?
>>>
>>> Something we can do if this is an issue is mark pubsublite-beam-io's dep
>>> on beam-sdks-java-core as 'provided'. But I'd prefer to avoid this and just
>>> let overriding fix it if that works.
>>>
>>> On Thu, Jun 17, 2021 at 4:15 PM Andrew Pilloud 
>>> wrote:
>>>
 How do you plan to address the circular dependency? Won't this end up
 with Beam depending on older versions of itself?

 beam-sdks-java-io-google-cloud-platform:2.30.0 ->
 pubsublite-beam-io:0.16.0 -> beam-sdks-java-core:2.29.0

 On Thu, Jun 17, 2021 at 11:56 AM Daniel Collins 
 wrote:

> Hello beam developers,
>
> I'm the primary author of the Pub/Sub Lite I/O, and I'd like to get
> some feedback on a change to the model for hosting this I/O in beam. Our
> team has been frustrated by the fact that we have no way to release
> features or fixes for bugs to customers on time scales shorter than the 
> 1-2
> months of the beam release cycle, and that those fixes are necessarily
> coupled with a beam version upgrade. To work around this, I forked the I/O
> in beam to our own repo about 6 months ago and have been maintaining both
> copies in parallel.
>
> I'd like to retain our ability to quickly fix and improve the I/O
> while retaining end-user visibility within the beam repo. To do this, I'd
> like to remove all the implementation from the beam repo, and leave the 
> I/O
> there implemented as:
>
> ```
> class PubsubLiteIO extends
> com.google.cloud.pubsublite.beam.PubsubLiteIO {}
> 
> , and add a dependency on our beam artifact.
>
> This enables beam users who want to just use the
> beam-sdks-java-io-google-cloud-platform artifact to do so, but they can
> also track the canonical version separately in our repo to get fixes and
> improvements at a faster rate. Al

[PROPOSAL] Stable URL for "current" API Documentation

2021-06-17 Thread Brian Hulette
Hi everyone,
You may have noticed that our API Documentation could really use some SEO.
It's possible to search for Beam APIs (e.g. "beam dataframe read_csv" [1]
or "beam java ParquetIO" [2]) and you will be directed to some
documentation, but it almost always takes you to an old version. I think
this could be significantly improved if we just make one change: rather
than making https://beam.apache.org/releases/javadoc/current redirect to
the latest release, we should just always stage the latest documentation
there.

To be clear I'm not 100% sure this will help. I haven't talked to any
search engineers or SEO experts about it. I'm just looking at other
projects as a point of reference. I've found that I never have trouble
finding the latest pandas documentation (e.g. "pandas read_csv" [3]) since
it always directs to "pandas-docs/stable/" rather than a particular version
number.

We should also make sure the version number shows up in the page title, it
looks like this isn't the case for Python right now.

Would there be any objections to making this change?

Also are there thoughts on how to make the change? Presumably this is
something we'd have to update in the release process.

Thanks,
Brian

[1]
https://beam.apache.org/releases/pydoc/2.25.0/apache_beam.dataframe.io.html
[2]
https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
[3]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html


Re: SqlTransform on windows using direct runner

2021-06-16 Thread Brian Hulette
Ah, this looks like a bug in Beam on Windows. It looks like
send_signal(signal.SIGINT) is not a cross-platform way to close a process.
We should probably use terminate or kill [1] here instead. I opened
BEAM-12501 [2] for this issue.
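
For reference, a minimal sketch of what a platform-aware shutdown could look
like (illustrative only - the actual BEAM-12501 fix in subprocess_server.py
may end up looking different):

```
import signal
import subprocess
import sys


def stop_process(process, timeout=5.0):
  """Stop a child process, falling back to kill() if it doesn't exit."""
  if process.poll() is not None:
    return  # Already exited.
  if sys.platform == 'win32':
    # Popen.send_signal(signal.SIGINT) raises ValueError on Windows.
    process.terminate()
  else:
    process.send_signal(signal.SIGINT)
  try:
    process.wait(timeout=timeout)
  except subprocess.TimeoutExpired:
    process.kill()
```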

+dev  for awareness - I think this will affect most
external transforms in Python.

Thanks for letting us know about this Igor

[1]
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.terminate
[2] https://issues.apache.org/jira/browse/BEAM-12501

On Tue, Jun 15, 2021 at 5:19 PM Igor Gois  wrote:

> Hi Brian,
>
> Thank you for your clarification.
>
> Actually, I am only trying to run a simple batch pipeline using the Sql
> transform locally. [1]
>

> The Kafka error didn't happen to me. I only mentioned it because I found
> the same error message on google.
>
> Here is the full error:
> Traceback (most recent call last):
>   File "beam-sql.py", line 18, in 
> |'sql print' >> beam.Map(print)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\pvalue.py",
> line 142, in __or__
> return self.pipeline.apply(ptransform, self)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\pipeline.py",
> line 641, in apply
> transform.transform, pvalueish, label or transform.label)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\pipeline.py",
> line 651, in apply
> return self.apply(transform, pvalueish)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\pipeline.py",
> line 694, in apply
> pvalueish_result = self.runner.apply(transform, pvalueish,
> self._options)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\runners\runner.py",
> line 188, in apply
> return m(transform, input, options)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\runners\runner.py",
> line 218, in apply_PTransform
> return transform.expand(input)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\transforms\external.py",
> line 304, in expand
> pipeline.local_tempdir)
>   File
> "c:\users\XXX\appdata\local\programs\python\python37\lib\contextlib.py",
> line 119, in __exit__
> next(self.gen)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\transforms\external.py",
> line 351, in _service
> yield stub
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\transforms\external.py",
> line 503, in __exit__
> self._service_provider.__exit__(*args)
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\utils\subprocess_server.py",
> line 72, in __exit__
> self.stop()
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\utils\subprocess_server.py",
> line 131, in stop
> self.stop_process()
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\utils\subprocess_server.py",
> line 181, in stop_process
> return super(JavaJarServer, self).stop_process()
>   File
> "C:\Users\XXX\Documents\8-Git\apache_beam\venv\lib\site-packages\apache_beam\utils\subprocess_server.py",
> line 141, in stop_process
> self._process.send_signal(signal.SIGINT)
>   File
> "c:\users\XXX\appdata\local\programs\python\python37\lib\subprocess.py",
> line 1306, in send_signal
> raise ValueError("Unsupported signal: {}".format(sig))
> ValueError: Unsupported signal: 2
>
> Thank you again and congratulations for the youtube video. It's very nice!
>
> [1]
> https://stackoverflow.com/questions/67977704/apache-beam-with-sqltransform-in-direct-runner/67990831#67990831
>
> Att,
>
> Igor Gois
>
>
>
>
>
>
> On Tue, Jun 15, 2021 at 7:56 PM Brian Hulette <
> bhule...@google.com>:
>
>> Hi Igor,
>>
>> "Universal Local Runner" is a term we've used in the past for a runner
>> that executes your pipeline locally. It's similar to each SDK's
>> DirectRunner, except that by leveraging portability we should only need one
>> implementation, making it "universal". I don't think we've been using that
>> term recently, I'm sorry I mentioned it in that talk and confused things.
>>
>> The Python DirectRunner is basically the ULR since it is a portable
>> runner. Unfortunatel

Re: [DISCUSSION] Docker based development environment issue

2021-06-11 Thread Brian Hulette
FYI Fernando recently contributed a setup script for configuring a local
development environment [1] that is continuously verified on mac and ubuntu
[2].

[1] https://github.com/apache/beam/pull/14584
[2] https://github.com/apache/beam/actions/workflows/local_env_tests.yml

On Wed, May 26, 2021 at 2:17 PM Alexey Romanenko 
wrote:

> Thanks for the link!
>
> Well, going back to the initial problem - it seems that it’s a macOS-related
> issue with permissions on “/var/run/docker.sock”, and a quick workaround like
> “sudo chmod 666 /var/run/docker.sock” fixes the problem.
> Though, I’m not sure we need to do this when building a container (since
> it’s not good in terms of security). Perhaps there is another, proper way
> to fix it.
>
> —
> Alexey
>
> On 25 May 2021, at 19:27, Brian Hulette  wrote:
>
> Yes this came up before, see BEAM-12149 [1]. I'm not sure what causes it,
> I wasn't able to repro the issue.
>
> [1] https://issues.apache.org/jira/browse/BEAM-12149
>
> On Tue, May 25, 2021 at 10:21 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> I checked it again and I have failed Docker-dependent tests:
>>
>> > Task :sdks:java:io:clickhouse:test
>>
>> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
>> java.lang.IllegalStateException at
>> DockerClientProviderStrategy.java:215
>>
>> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
>> java.lang.NullPointerException at BaseClickHouseTest.java:134
>>
>> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
>> java.lang.IllegalStateException at
>> DockerClientProviderStrategy.java:109
>>
>> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
>> java.lang.NullPointerException at BaseClickHouseTest.java:134
>>
>> 29 tests completed, 4 failed
>>
>> > Task :sdks:java:io:clickhouse:test FAILED
>>
>>
>>
>> Btw, while running it for the first time, once a container was started, I
>> had an issue with file permissions that I fixed manually. Is it a known
>> issue?
>>
>> $ ./gradlew -p sdks/java/io/clickhouse/ check
>> Starting a Gradle Daemon, 1 incompatible and 2 stopped Daemons could not
>> be reused, use --status for details
>> Configuration on demand is an incubating feature.
>> > Task :model:pipeline:generateProto FAILED
>>
>> FAILURE: Build failed with an exception.
>>
>> * What went wrong:
>> Execution failed for task ':model:pipeline:generateProto'.
>> > protoc: stdout: . stderr:
>> /home/aromanenko/.gradle/caches/modules-2/files-2.1/io.grpc/protoc-gen-grpc-java/1.26.0/4f8bb54a74ab655cf2691e6eaa513fccc4c605d5/protoc-gen-grpc-java-1.26.0-linux-x86_64.exe:
>> program not found or is not executable
>>   Please specify a program using absolute path or make sure the program
>> is available in your PATH system variable
>>   --grpc_out: protoc-gen-grpc: Plugin failed with status code 1.
>>
>> $ ll
>> /home/aromanenko/.gradle/caches/modules-2/files-2.1/io.grpc/protoc-gen-grpc-java/1.26.0/4f8bb54a74ab655cf2691e6eaa513fccc4c605d5/protoc-gen-grpc-java-1.26.0-linux-x86_64.exe
>> -rw-r--r-- 1 aromanenko users 3009112 May 21 17:35
>> /home/aromanenko/.gradle/caches/modules-2/files-2.1/io.grpc/protoc-gen-grpc-java/1.26.0/4f8bb54a74ab655cf2691e6eaa513fccc4c605d5/protoc-gen-grpc-java-1.26.0-linux-x86_64.exe
>>
>>
>> —
>> Alexey
>>
>> On 21 May 2021, at 18:06, Brian Hulette  wrote:
>>
>> I think the build environment was set up with that configured:
>> https://github.com/apache/beam/blob/40326dd0a2a1c9b5dcbbcd6486a43e3875a64a43/start-build-env.sh#L110
>> Could there be something about your environment preventing that from
>> working?
>>
>> Brian
>>
>> On Fri, May 21, 2021 at 3:34 AM Gleb Kanterov  wrote:
>>
>>> Is it possible to mount the Docker socket inside the build-env Docker
>>> container? We run a lot of similar tests in CI, and it always worked:
>>>
>>> --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock
>>>
>>> On Fri, May 21, 2021 at 12:26 PM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Beam provides a very cool feature to run a local development
>>>> environment via Docker [1]. In the same time, some unit tests require to
>>>> run Docker containers to test against “real” instances (for example,
>>>> ClickHouseIOTest). So, it will end up with “docker-in-docker” issue and
>>>> such tests will fail.
>>>>
>>>> What would be a proper solution for that? Annotate these tests with a
>>>> specific “DockerImageRequired” annotation and skip them when running from
>>>> inside container or something else? Any ideas on this?
>>>>
>>>> Thanks,
>>>> Alexey
>>>>
>>>>
>>>> [1] https://github.com/apache/beam/blob/master/start-build-env.sh
>>>
>>>
>>
>


Re:

2021-06-09 Thread Brian Hulette
> And also the ticket and "// TODO: BEAM-10396 use writeRows() when it's
available" appeared later than this functionality was added to
"JdbcIO.Write".

Note that this TODO has been moved around through a few refactors. It was
initially added last summer [1].
You're right that JdbcIO.Write's statement generation functionality was
added about a year before that [2]. It's possible that the author of [1]
didn't realize [2] was done. Or maybe there's some reason why it doesn't
work there?

+1 for Alexey's requests:
- Identify cases where statement generation in JdbcIO.Write is
insufficient, if they exist (e.g. can we just use it where that TODO is
[3]? If not what goes wrong?).
- Update documentation to avoid this confusion in the future.

Brian

[1] https://github.com/apache/beam/pull/12145
[2] https://github.com/apache/beam/pull/8962
[3] https://github.com/apache/beam/pull/14954#discussion_r648456230

On Wed, Jun 9, 2021 at 7:49 AM Alexey Romanenko 
wrote:

> Hello Raphael,
>
> On 9 Jun 2021, at 09:31, Raphael Sanamyan 
> wrote:
>
> The "JdbcIO.Write" allows you to write rows without a statement or
> statement preparer, but not all functionality works without them.
>
>
> Could you show a use case when the current functionality is not enough?
>
>
> The method "WithResults" requires a statement and statement preparer. And
> also the ticket and the "// TODO: BEAM-10396 use writeRows() when it's
> available" comment appeared later than this functionality was added to
> "JdbcIO.Write". And without reading the code, just the documentation, it's
> not clear that the schema is enough.
>
>
> Agreed, but the documentation can be updated. On the other hand, it would be
> great to have some examples that show the need for WriteRows.
>
> Thanks,
> Alexey
>
> Thank you,
> Raphael.
>
>
>
> --
> *From:* Pablo Estrada
> *Sent:* June 7, 2021, 22:43:24
> *To:* dev; Reuven Lax
> *Cc:* Ilya Kozyrev
> *Subject:* Re:
>
> +Reuven Lax  do you know if this is already supported
> or not?
> I have been able to use `JdbcIO.write()` without specifying a statement
> nor a statement preparer. Is that not what's necessary? I've done this with
> a named class with schemas (i.e. not Row) - is this perhaps the difference?
> Best
> -P.
>
> On Fri, Jun 4, 2021 at 3:44 PM Robert Bradshaw 
> wrote:
>
>> That would be great! I don't know much about this particular issue,
>> but tips for getting started in general can be found at
>> https://beam.apache.org/contribute/
>>
>> On Thu, Jun 3, 2021 at 10:55 AM Raphael Sanamyan
>>  wrote:
>> >
>> > Hi, community,
>> >
>> > I would like to start work on this task, BEAM-10396, I hope nobody
>> minds?
>> > Also, if anyone has any details or developments on this task, I would
>> be glad if you could share them.
>> >
>> > Thank you,
>> > Raphael.
>> >
>> >
>
>
>


Re: PreCommit tests not running

2021-06-08 Thread Brian Hulette
The INFRA ticket indicates this should be resolved now, and things do seem
to be working (e.g. Java PreCommit has started on
https://github.com/apache/beam/pull/14949).

Brian

On Tue, Jun 8, 2021 at 7:51 AM Reuven Lax  wrote:

> Sigh. Hopefully infra addresses this soon.
>
> On Mon, Jun 7, 2021 at 9:34 PM Boyuan Zhang  wrote:
>
>> There is an infra issue ongoing:
>> https://issues.apache.org/jira/browse/INFRA-21976, which affects all
>> beam tests.
>>
>> On Mon, Jun 7, 2021 at 9:21 PM Reuven Lax  wrote:
>>
>>> https://github.com/apache/beam/pull/14949
>>>
>>> Java PreCommit has been pending for 2 days. Is something wrong
>>> with Jenkins?
>>>
>>> Reuven
>>>
>>


Re: Beam SNAPSHOTS not working since friday

2021-06-08 Thread Brian Hulette
You may have already made this connection, but this is likely the same
issue discussed in [1][2].

[1]
https://lists.apache.org/thread.html/r658cdfa643c44a3fa18c226238e537ad221c8f65337f0eab3ad6dad9%40%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/INFRA-21976

On Tue, Jun 8, 2021 at 12:53 AM Ismaël Mejía  wrote:

> While trying to check on the new 2.32.0-SNAPSHOTs this morning I noticed
> that the daily SNAPSHOTs have not been updating since last friday:
>
> https://ci-beam.apache.org/job/beam_Release_NightlySnapshot/
>
> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-sdks-java-core/
>
> Can someone please check what is going on and kick the daily generation.
> I am also not able to log in into the ci-beam server, not sure if related.
>
> Regards,
> Ismaël
>
>


Re: [Question] Best Practice of Handling Null Key for KafkaRecordCoder

2021-06-04 Thread Brian Hulette
Could you register a schema for KafkaRecord? Then you can use SchemaCoder
which handles the conversion to/from Row.

On Thu, Jun 3, 2021 at 2:39 PM Chamikara Jayalath 
wrote:

> I think for that and for any transform that produces a PCollection of a
> type that is not represented by a standard coder, we would have to add a
> cross-language builder class that returns a PCollection that can be
> supported at the cross-language boundary. For example, it can be a
> PCollection of Rows since RowCoder is already a standard coder. I haven't
> looked closely into fixing this particular Jira though.
>
> Thanks,
> Cham
>
> On Thu, Jun 3, 2021 at 2:31 PM Boyuan Zhang  wrote:
>
>> Considering the problem of populating KafkaRecord metadata (BEAM-12076)
>> together, what's the plan there? Are we going to make KafkaRecordCoder a
>> well-known coder as well? The reason why I ask is because it might be a
>> good chance to revisit the KafkaRecordCoder implementation.
>>
>> On Thu, Jun 3, 2021 at 2:17 PM Chamikara Jayalath 
>> wrote:
>>
>>>
>>>
>>> On Thu, Jun 3, 2021 at 2:06 PM Boyuan Zhang  wrote:
>>>
 Supporting the x-lang boundary is a good point. So you are suggesting
 that:

   1. We make NullableCoder a standard coder.
   2. KafkaIO wraps the keyCoder with NullableCoder directly if it
   requires it.

 Is that correct?

>>>
>>> Yeah.
>>>
>>>


 On Wed, Jun 2, 2021 at 6:47 PM Chamikara Jayalath 
 wrote:

> I think we should make NullableCoder a standard coder for Beam [1] and
> use a standard Nullablecoder(KeyCoder) for Kafka keys (where KeyCoder 
> might
> be the standard ByteArrayCoder for example)
> I think we have compatible Java and Python NullableCoder
> implementations already so implementing this should be relatively
> straightforward.
>
> Non-standard coders may not be supported by runners at the
> cross-language boundary.
>
> Thanks,
> Cham
>
> [1]
> https://github.com/apache/beam/blob/d0c3dd72874fada03cc601f30cde022a8dd6aa9c/model/pipeline/src/main/proto/beam_runner_api.proto#L784
>
> On Wed, Jun 2, 2021 at 6:25 PM Ahmet Altay  wrote:
>
>> /cc folks who commented on the issue: @Robin Qiu 
>>  @Chamikara Jayalath  @Alexey Romanenko
>>  @Daniel Collins 
>>
>> On Tue, Jun 1, 2021 at 2:03 PM Weiwen Xu  wrote:
>>
>>> Hello,
>>>
>>> I'm working on [this issue](
>>> https://issues.apache.org/jira/browse/BEAM-12008) with Boyuan. She
>>> was very helpful in identifying the issue which is that KafkaRecordCoder
>>> couldn't handle the case when key is null.
>>>
>>> We came up with two potential solutions. Yet both have their pros and
>>> cons, so I'm hoping to gather some suggestions/opinions or ideas of how to
>>> to
>>> handle this issue. For our solutions:
>>>
>>> 1. directly wrapping the keyCoder with Nullablecoder i.e.
>>> NullableCoder.of(keyCoder)
>>> cons: backwards compatibility problem
>>>
>>> 2. writing a completely new class named something like
>>> NullableKeyKafkaRecordCoder
>>> instead of using KVCoder and encode/decode KVs, we have KeyCoder
>>> and ValueCoder as fields and another BooleanCoder to encode/decode T/F 
>>> for
>>> present of null key. If key is null, KeyCoder will not encode/decode.
>>>
>>>   - [L63] encode(...){
>>>stringCoder.encode(topic, ...);
>>>intCoder.encode(partition, ...);
>>>longCoder.encode(offset, ...);
>>>longCoder.encode(timestamp, ...);
>>>intCoder.encode(timestamptype, ...);
>>>headerCoder.encode(...)
>>>if(Key!=null){
>>>   BooleanCoder.encode(false, ...);
>>>   KeyCoder.encode(key, ...);
>>>}else{
>>>   BooleanCoder.encode(true, ...);
>>>   // skips KeyCoder when key is null
>>>}
>>>   ValueCoder.encode(value, ...);
>>> }
>>>
>>>   - [L74] decode(...){
>>>   return new KafkaRecord<>(
>>>
>>> stringCoder.decode(inStream),
>>>
>>> intCoder.decode(inStream),
>>>
>>> longCoder.decode(inStream),
>>>
>>> longCoder.decode(inStream),
>>>
>>> KafkaTimestampType.forOrdinal(intCoder.decode(inStream)),
>>> (Headers)
>>> toHeaders(headerCoder.decode(inStream)),
>>>
>>> BooleanCoder.decode(inStream)? null:KeyCoder.decode(inStream),
>>>
>>> ValueCoder.decode(inStream)
>>> 

Re: Question about BEAM-10277

2021-06-04 Thread Brian Hulette
Hi Irwin,
Yes the Java part is complete. That should make verifying a Python
implementation a little easier since you'll want to add a test case
to standard_coders.yaml [1] that passes in the Java implementation and your
new Python one.

There's some general advice and pointers in the Jira. Do you have some
specific questions not covered there?

I'd recommend first studying the beam:coder:row:v1 spec [2] and
schema.proto [3] to get a handle on how the coder works.

Brian

[1]
https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
[2]
https://github.com/apache/beam/blob/1e60f383fb39b9ff8d44edcbe5357da4c1e52378/model/pipeline/src/main/proto/beam_runner_api.proto#L937-L990
[3]
https://github.com/apache/beam/blob/1e60f383fb39b9ff8d44edcbe5357da4c1e52378/model/pipeline/src/main/proto/schema.proto


On Fri, Jun 4, 2021 at 5:58 AM Irwin Alejandro Rodriguez Ramirez <
irwin.rodrig...@wizeline.com> wrote:

> Hi Team,
>
> I'm working on BEAM-10277, on the Python implementation. I would like to
> know if the Java part is already complete, as mentioned in the issue. Also,
> if you have any suggestions for the implementation in Python, that would be
> helpful.
>
> Thank you
> --
>
> Irwin Alejandro
>
>
>
>
>
>
>
>


Re: [PROPOSAL] Remove pylint format checks

2021-06-04 Thread Brian Hulette
> but I'd like to understand why yapf lets breakable lines go longer than
80 chars.

I filed an issue with yapf [1] to see if we can figure this out.

> Whatever the decision is, please update the instructions here :)

Will do!

Brian

[1] https://github.com/google/yapf/issues/927
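
PS - for the "single wrapper script" idea discussed in the quoted thread below,
a rough sketch of what that could look like (assumes yapf, docformatter, and
isort are installed; the target path and exact flags are illustrative and may
need adjusting for our repo layout):

```
# format.py - minimal sketch of a combined Python formatting wrapper.
import subprocess
import sys

TARGET = 'apache_beam'  # directory to format; adjust as needed

COMMANDS = [
    ['yapf', '--in-place', '--recursive', TARGET],
    ['docformatter', '--in-place', '--recursive', TARGET],
    ['isort', TARGET],
]

for cmd in COMMANDS:
  print('Running:', ' '.join(cmd))
  result = subprocess.run(cmd)
  if result.returncode != 0:
    sys.exit(result.returncode)
```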

On Fri, Apr 9, 2021 at 5:35 PM Alex Amato  wrote:

> Whatever the decision is, please update the instructions here :)
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips
>
> (And if possible let's have one simple, easy to remember command to run
> all python lint/formatting). Possibly using a wrapper script.
>
>
> On Fri, Apr 9, 2021 at 4:59 PM Robert Bradshaw 
> wrote:
>
>> I'd be happy with yapf + docformatter + isort, but I'd like to understand
>> why yapf lets breakable lines go longer than 80 chars.
>>
>> On Fri, Apr 9, 2021 at 4:19 PM Brian Hulette  wrote:
>>
>>> Currently we have two different format checks for the Python SDK. Most
>>> format checks are handled by yapf, which is nice since it is also capable
>>> of re-writing the code to make it pass the checks done in CI. However we
>>> *also* have some formatting checks still enabled in our .pylintrc [1], and
>>> pylint has no such capability.
>>>
>>> Generally yapf's output just passes these pylint format checks, but not
>>> always. For example yapf is lenient about lines over the column limit, and
>>> pylint is not. So things like [2] can happen even on a PR formatted by
>>> yapf. This is frustrating because it requires manual changes.
>>>
>>> I experimented with the yapf config to see if we can make it strict
>>> about the column limit, but it doesn't seem to be possible. So instead I'd
>>> like to propose that we just remove the pylint format checks, and rely on
>>> yapf's checks alone.
>>>
>>> There are a couple issues here:
>>> - we'd need to be ok with yapf deciding that some lines can be >80
>>> characters
>>> - yapf has no opinion whatsoever about docstrings [3], so the only thing
>>> checking them is pylint. We might work around this by setting up
>>> docformatter [4].
>>>
>>> Personally I'm ok with this if it means Python code formatting can be
>>> completely automated with a single script that runs yapf, docformatter, and
>>> isort.
>>>
>>> Brian
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/2408d0c11337b45e289736d4d7483868e717760c/sdks/python/.pylintrc#L165
>>> [2]
>>> https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/9088/console
>>> [3] https://github.com/google/yapf/issues/279
>>> [4] https://github.com/myint/docformatter
>>>
>>


Re: KafkaIO SSL issue

2021-06-03 Thread Brian Hulette
Oh I guess you are running the KafkaToPubsub example pipeline [1]? The code
you copied is here [2].
Based on that code I think Kafka is in control of creating the InputStream,
since you're just passing a file path in through the config object. So
either Kafka is creating a bad InputStream (seems unlikely), or there's
something wrong with /tmp/kafka.truststore.jks. Maybe it was cleaned up
while Kafka was reading it, or the file is somehow corrupt?

Looking at the code you copied, I suppose it's possible you're not writing
the full file to the local disk. The javadoc for transferFrom [3] states:

> Fewer than the requested number of bytes will be transferred if the
source channel has fewer than count bytes remaining, ** or if the source
channel is non-blocking and has fewer than count bytes immediately
available in its input buffer. **

Is it possible sometimes you're hitting this second case and the whole file
isn't being read? I don't know if readerChannel is blocking or not. You
might check by adding a log statement that prints the number of bytes that
are transferred to see if that correlates with the failure.

Someone else on this list may have advice on a more robust way to copy a
file from remote storage.

Brian

[1]
https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/complete/kafkatopubsub/
[2]
https://github.com/apache/beam/blob/f881a412fe85c64b1caf075160a6c0312995ea45/examples/java/src/main/java/org/apache/beam/examples/complete/kafkatopubsub/kafka/consumer/SslConsumerFactoryFn.java#L128
[3]
https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#transferFrom-java.nio.channels.ReadableByteChannel-long-long-

On Wed, Jun 2, 2021 at 9:00 AM Ilya Kozyrev 
wrote:

> Hi Brian,
>
>
>
> We’re using a consumerFactoryFn that reads certs from GCP and copies them
> to the local FS on each Dataflow worker.
>
> The exception is raised after consumerFactoryFn, when Kafka tries to read
> certs from the local fs using KeyStore.load(InputStream is, String pass).
>
>
>
> This is the code we use in consumerFactoryFn to read from GCP and write to
> the local fs:
>
> try (ReadableByteChannel readerChannel =
>
>
>   
> FileSystems.open(FileSystems.matchSingleFileSpec(gcsFilePath).resourceId()))
> {
>
> try (FileChannel writeChannel =
> FileChannel.open(Paths.get(outputFilePath), options)) {
>
>   writeChannel.transferFrom(readerChannel, 0, Long.MAX_VALUE);
>
> }
>
> }
>
>
>
> Thank you,
>
> Ilya
>
>
>
> *From: *Brian Hulette 
> *Reply to: *"dev@beam.apache.org" 
> *Date: *Wednesday, 26 May 2021, 21:32
> *To: *dev 
> *Cc: *Artur Khanin 
> *Subject: *Re: KafkaIO SSL issue
>
>
>
> I came across this relevant StackOverflow question:
> https://stackoverflow.com/questions/7399154/pkcs12-derinputstream-getlength-exception
>
> They say the error is from a call to `KeyStore.load(InputStream is,
> String pass);` (consistent with your stacktrace), and can occur whenever
> there's an issue with the InputStream passed to it. Who created the
> InputStream used in this case, is it Kafka code, Beam code, or your
> consumerFactoryFn?
>
>
>
> Brian
>
>
>
> On Mon, May 24, 2021 at 4:01 AM Ilya Kozyrev 
> wrote:
>
> Hi community,
>
>
> We have an issue with KafkaIO in the case of using a secure connection
> SASL SSL to the Confluent Kafka 5.5.1. When we try to configure the
> Kafka consumer using consumerFactoryFn, we have an irregular issue related
> to certificate reads from the file system. Irregular means that different
> Dataflow jobs with the same parameters and certs might fail or
> succeed. Store cert types for Keystore and Truststore are specified
> explicitly in consumer config. In our case, it's JKS for both certs.
>
>
> *Stacktrase*:
>
> Caused by: org.apache.kafka.common.KafkaException: Failed to load SSL
> keystore /tmp/kafka.truststore.jks of type JKS
>   at
> org.apache.kafka.common.security.ssl.SslEngineBuilder$SecurityStore.load(SslEngineBuilder.java:289)
>   at
> org.apache.kafka.common.security.ssl.SslEngineBuilder.createSSLContext(SslEngineBuilder.java:153)
>   ... 23 more
> Caused by: java.security.cert.CertificateException: Unable to initialize,
> java.io.IOException: DerInputStream.getLength(): lengthTag=65, too big.
>   at sun.security.x509.X509CertImpl.<init>(X509CertImpl.java:198)
>   at
> sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:102)
>   at
> java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
>   at
> sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Brian Hulette
> One thing that's been on the back burner for a long time is making
CoderProperties into a CoderTester like Guava's EqualityTester.

Reuven's point still applies here though. This issue is not due to a bug in
SchemaCoder, it's a problem with the Row we gave SchemaCoder to encode. I'm
assuming a CoderTester would require manually generating inputs right?
These input Rows represent an illegal state that we wouldn't test with.
(That being said I like the idea of a CoderTester in general)

Brian

On Wed, Jun 2, 2021 at 12:11 PM Kenneth Knowles  wrote:

> Mutability checking might catch that.
>
> I meant to suggest not putting the check in the pipeline, but offering a
> testing discipline that will catch such issues. One thing that's been on
> the back burner for a long time is making CoderProperties into a
> CoderTester like Guava's EqualityTester. Then it can run through all the
> properties without a user setting up test suites. Downside is that the test
> failure signal gets aggregated.
>
> Kenn
>
> On Wed, Jun 2, 2021 at 12:09 PM Brian Hulette  wrote:
>
>> Could the DirectRunner just do an equality check whenever it does an
>> encode/decode? It sounds like it's already effectively performing
>> a CoderProperties.coderDecodeEncodeEqual for every element, just omitting
>> the equality check.
>>
>> On Wed, Jun 2, 2021 at 12:04 PM Reuven Lax  wrote:
>>
>>> There is no bug in the Coder itself, so that wouldn't catch it. We could
>>> insert CoderProperties.coderDecodeEncodeEqual in a subsequent ParDo, but if
>>> the Direct runner already does an encode/decode before that ParDo, then
>>> that would have fixed the problem before we could see it.
>>>
>>> On Wed, Jun 2, 2021 at 11:53 AM Kenneth Knowles  wrote:
>>>
>>>> Would it be caught by CoderProperties?
>>>>
>>>> Kenn
>>>>
>>>> On Wed, Jun 2, 2021 at 8:16 AM Reuven Lax  wrote:
>>>>
>>>>> I don't think this bug is schema specific - we created a Java object
>>>>> that is inconsistent with its encoded form, which could happen to any
>>>>> transform.
>>>>>
>>>>> This does seem to be a gap in DirectRunner testing though. It also
>>>>> makes it hard to test using PAssert, as I believe that puts everything in 
>>>>> a
>>>>> side input, forcing an encoding/decoding.
>>>>>
>>>>> On Wed, Jun 2, 2021 at 8:12 AM Brian Hulette 
>>>>> wrote:
>>>>>
>>>>>> +dev 
>>>>>>
>>>>>> > I bet the DirectRunner is encoding and decoding in between, which
>>>>>> fixes the object.
>>>>>>
>>>>>> Do we need better testing of schema-aware (and potentially other
>>>>>> built-in) transforms in the face of fusion to root out issues like this?
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Wed, Jun 2, 2021 at 5:13 AM Matthew Ouyang <
>>>>>> matthew.ouy...@gmail.com> wrote:
>>>>>>
>>>>>>> I have some other work-related things I need to do this week, so I
>>>>>>> will likely report back on this over the weekend.  Thank you for the
>>>>>>> explanation.  It makes perfect sense now.
>>>>>>>
>>>>>>> On Tue, Jun 1, 2021 at 11:18 PM Reuven Lax  wrote:
>>>>>>>
>>>>>>>> Some more context - the problem is that RenameFields outputs (in
>>>>>>>> this case) Java Row objects that are inconsistent with the actual 
>>>>>>>> schema.
>>>>>>>> For example if you have the following schema:
>>>>>>>>
>>>>>>>> Row {
>>>>>>>>field1: Row {
>>>>>>>>   field2: string
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> And rename field1.field2 -> renamed, you'll get the following schema
>>>>>>>>
>>>>>>>> Row {
>>>>>>>>   field1: Row {
>>>>>>>>  renamed: string
>>>>>>>>}
>>>>>>>> }
>>>>>>>>
>>>>>>>> However the Java object for the _nested_ row will return the old
>>>>>>>> schema if getSchema() is called on it. This i

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Brian Hulette
Could the DirectRunner just do an equality check whenever it does an
encode/decode? It sounds like it's already effectively performing
a CoderProperties.coderDecodeEncodeEqual for every element, just omitting
the equality check.

On Wed, Jun 2, 2021 at 12:04 PM Reuven Lax  wrote:

> There is no bug in the Coder itself, so that wouldn't catch it. We could
> insert CoderProperties.coderDecodeEncodeEqual in a subsequent ParDo, but if
> the Direct runner already does an encode/decode before that ParDo, then
> that would have fixed the problem before we could see it.
>
> On Wed, Jun 2, 2021 at 11:53 AM Kenneth Knowles  wrote:
>
>> Would it be caught by CoderProperties?
>>
>> Kenn
>>
>> On Wed, Jun 2, 2021 at 8:16 AM Reuven Lax  wrote:
>>
>>> I don't think this bug is schema specific - we created a Java object
>>> that is inconsistent with its encoded form, which could happen to any
>>> transform.
>>>
>>> This does seem to be a gap in DirectRunner testing though. It also makes
>>> it hard to test using PAssert, as I believe that puts everything in a side
>>> input, forcing an encoding/decoding.
>>>
>>> On Wed, Jun 2, 2021 at 8:12 AM Brian Hulette 
>>> wrote:
>>>
>>>> +dev 
>>>>
>>>> > I bet the DirectRunner is encoding and decoding in between, which
>>>> fixes the object.
>>>>
>>>> Do we need better testing of schema-aware (and potentially other
>>>> built-in) transforms in the face of fusion to root out issues like this?
>>>>
>>>> Brian
>>>>
>>>> On Wed, Jun 2, 2021 at 5:13 AM Matthew Ouyang 
>>>> wrote:
>>>>
>>>>> I have some other work-related things I need to do this week, so I
>>>>> will likely report back on this over the weekend.  Thank you for the
>>>>> explanation.  It makes perfect sense now.
>>>>>
>>>>> On Tue, Jun 1, 2021 at 11:18 PM Reuven Lax  wrote:
>>>>>
>>>>>> Some more context - the problem is that RenameFields outputs (in this
>>>>>> case) Java Row objects that are inconsistent with the actual schema.
>>>>>> For example if you have the following schema:
>>>>>>
>>>>>> Row {
>>>>>>field1: Row {
>>>>>>   field2: string
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> And rename field1.field2 -> renamed, you'll get the following schema
>>>>>>
>>>>>> Row {
>>>>>>   field1: Row {
>>>>>>  renamed: string
>>>>>>}
>>>>>> }
>>>>>>
>>>>>> However the Java object for the _nested_ row will return the old
>>>>>> schema if getSchema() is called on it. This is because we only update the
>>>>>> schema on the top-level row.
>>>>>>
>>>>>> I think this explains why your test works in the direct runner. If
>>>>>> the row ever goes through an encode/decode path, it will come back 
>>>>>> correct.
>>>>>> The original incorrect Java objects are no longer around, and new
>>>>>> (consistent) objects are constructed from the raw data and the 
>>>>>> PCollection
>>>>>> schema. Dataflow tends to fuse ParDos together, so the following ParDo 
>>>>>> will
>>>>>> see the incorrect Row object. I bet the DirectRunner is encoding and
>>>>>> decoding in between, which fixes the object.
>>>>>>
>>>>>> You can validate this theory by forcing a shuffle after RenameFields
>>>>>> using Reshufflle. It should fix the issue If it does, let me know and 
>>>>>> I'll
>>>>>> work on a fix to RenameFields.
>>>>>>
>>>>>> On Tue, Jun 1, 2021 at 7:39 PM Reuven Lax  wrote:
>>>>>>
>>>>>>> Aha, yes this indeed another bug in the transform. The schema is set
>>>>>>> on the top-level Row but not on any nested rows.
>>>>>>>
>>>>>>> On Tue, Jun 1, 2021 at 6:37 PM Matthew Ouyang <
>>>>>>> matthew.ouy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you everyone for your input.  I believe it will be easiest to
>>>>>>>> respond to 

Re: RenameFields behaves differently in DirectRunner

2021-06-02 Thread Brian Hulette
>>>> On Tue, Jun 1, 2021 at 5:51 PM Reuven Lax  wrote:
>>>>
>>>>> This transform is the same across all runners. A few comments on the
>>>>> test:
>>>>>
>>>>>   - Using attachValues directly is error prone (per the comment on the
>>>>> method). I recommend using the withFieldValue builders instead.
>>>>>   - I recommend capturing the RenameFields PCollection into a local
>>>>> variable of type PCollection and printing out the schema (which you
>>>>> can get using the PCollection.getSchema method) to ensure that the output
>>>>> schema looks like you expect.
>>>>>- RenameFields doesn't flatten. So renaming field0_1.field1_0 - >
>>>>> nestedStringField results in field0_1.nestedStringField; if you wanted to
>>>>> flatten, then the better transform would be
>>>>> Select.fieldNameAs("field0_1.field1_0", nestedStringField).
>>>>>
>>>>> This all being said, eyeballing the implementation of RenameFields
>>>>> makes me think that it is buggy in the case where you specify a top-level
>>>>> field multiple times like you do. I think it is simply adding the 
>>>>> top-level
>>>>> field into the output schema multiple times, and the second time is with
>>>>> the field0_1 base name; I have no idea why your test doesn't catch this in
>>>>> the DirectRunner, as it's equally broken there. Could you file a JIRA 
>>>>> about
>>>>> this issue and assign it to me?
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Tue, Jun 1, 2021 at 12:47 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 1, 2021 at 12:42 PM Brian Hulette 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Matthew,
>>>>>>>
>>>>>>> > The unit tests also seem to be disabled for this as well and so I
>>>>>>> don’t know if the PTransform behaves as expected.
>>>>>>>
>>>>>>> The exclusion for NeedsRunner tests is just a quirk in our testing
>>>>>>> framework. NeedsRunner indicates that a test suite can't be executed 
>>>>>>> with
>>>>>>> the SDK alone, it needs a runner. So that exclusion just makes sure we
>>>>>>> don't run the test when we're verifying the SDK by itself in the
>>>>>>> :sdks:java:core:test task. The test is still run in other tasks where we
>>>>>>> have a runner, most notably in the Java PreCommit [1], where we run it 
>>>>>>> as
>>>>>>> part of the :runners:direct-java:test task.
>>>>>>>
>>>>>>> That being said, we may only run these tests continuously with the
>>>>>>> DirectRunner, I'm not sure if we test them on all the runners like we do
>>>>>>> with ValidatesRunner tests.
>>>>>>>
>>>>>>
>>>>>> That is correct. The tests are tests _of the transform_ so they run
>>>>>> only on the DirectRunner. They are not tests of the runner, which is only
>>>>>> responsible for correctly implementing Beam's primitives. The transform
>>>>>> should not behave differently on different runners, except for 
>>>>>> fundamental
>>>>>> differences in how they schedule work and checkpoint.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>
>>>>>>> > The error message I’m receiving, : Error while reading data,
>>>>>>> error message: JSON parsing error in row starting at position 0: No such
>>>>>>> field: nestedField.field1_0, suggests the BigQuery is trying to use
>>>>>>> the original name for the nested field and not the substitute name.
>>>>>>>
>>>>>>> Is there a stacktrace associated with this error? It would be
>>>>>>> helpful to see where the error is coming from.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/4101/testReport/org.apache.beam.sdk.schemas.transforms/Ren

Re: Out of band pickling in Python (pickle5)

2021-05-27 Thread Brian Hulette
I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
you have any interest in taking it on?
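
For whoever picks it up, here's a minimal sketch of the out-of-band protocol
itself, using plain pickle and NumPy (this is not Beam's PickleCoder - the
coder change would just need to thread "buffer_callback" and "buffers" through
its dumps()/loads() calls in a similar way, as Stephan describes below):

```
import pickle

import numpy as np

data = {'values': np.arange(1_000_000, dtype=np.float64)}

# Serialize: large buffers are handed to the callback instead of being
# copied into the pickle byte stream (requires protocol 5, i.e. Python 3.8+
# or the pickle5 backport).
buffers = []
payload = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)

# Deserialize: pass the buffers back alongside the (now small) payload.
restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(restored['values'], data['values'])
```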

On Tue, May 25, 2021 at 3:09 PM Brian Hulette  wrote:

> Hm this would definitely be of interest for the DataFrame API, which is
> shuffling pandas objects. This issue [1] confirms what you suggested above,
> that pandas supports out-of-band pickling since DataFrames are mostly just
> collections of numpy arrays.
>
> Brian
>
> [1] https://github.com/pandas-dev/pandas/issues/34244
>
> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:
>
>> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
>> argument into pickle.dumps() and the "buffers" argument into
>> pickle.loads(). I expect this would be relatively straightforward.
>>
>> Then it should "just work", assuming that data is stored in objects (like
>> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
>> Pickle protocol.
>>
>>
>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette 
>> wrote:
>>
>>> I'm not aware of anyone looking at it.
>>>
>>> Will out-of-band pickling "just work" in Beam for types that implement
>>> the correct interface in Python 3.8?
>>>
>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> FWIW I recently ran into the exact case you described (high
>>>> serialization cost). The solution was to implement some not-so-intuitive
>>>> alternative transforms in my case, but I would have very much appreciated
>>>> faster serialization performance.
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>>>>
>>>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>>>> i.e., Pickle protocol version 5?
>>>>> https://www.python.org/dev/peps/pep-0574/
>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>>
>>>>> For Beam pipelines passing around NumPy arrays (or collections of
>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization costs
>>>>> can be significant. Beam seems to currently incur at least one (maybe
>>>>> two) unnecessary memory copies.
>>>>>
>>>>> Pickle protocol version 5 exists for solving exactly this problem. You
>>>>> can serialize collections of arbitrary Python objects in a fully streaming
>>>>> fashion using memory buffers. This is a Python 3.8 feature, but the
>>>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>>>>> supported by NumPy since version 1.16, released in January 2019.
>>>>>
>>>>> Cheers,
>>>>> Stephan
>>>>>
>>>>


Re: KafkaIO SSL issue

2021-05-26 Thread Brian Hulette
I came across this relevant StackOverflow question:
https://stackoverflow.com/questions/7399154/pkcs12-derinputstream-getlength-exception

They say the error is from a call to `KeyStore.load(InputStream is, String
pass);` (consistent with your stacktrace), and can occur whenever there's
an issue with the InputStream passed to it. Who created the InputStream
used in this case, is it Kafka code, Beam code, or your consumerFactoryFn?

Brian

On Mon, May 24, 2021 at 4:01 AM Ilya Kozyrev 
wrote:

> Hi community,
>
>
> We have an issue with KafkaIO in the case of using a secure connection
> SASL SSL to the Confluent Kafka 5.5.1. When we try to configure the
> Kafka consumer using consumerFactoryFn, we have an irregular issue related
> to certificate reads from the file system. Irregular means that different
> Dataflow jobs with the same parameters and certs might fail or
> succeed. Store cert types for Keystore and Truststore are specified
> explicitly in consumer config. In our case, it's JKS for both certs.
>
>
> *Stacktrase*:
>
> Caused by: org.apache.kafka.common.KafkaException: Failed to load SSL
> keystore /tmp/kafka.truststore.jks of type JKS
>   at
> org.apache.kafka.common.security.ssl.SslEngineBuilder$SecurityStore.load(SslEngineBuilder.java:289)
>   at
> org.apache.kafka.common.security.ssl.SslEngineBuilder.createSSLContext(SslEngineBuilder.java:153)
>   ... 23 more
> Caused by: java.security.cert.CertificateException: Unable to initialize,
> java.io.IOException: DerInputStream.getLength(): lengthTag=65, too big.
>   at sun.security.x509.X509CertImpl.<init>(X509CertImpl.java:198)
>   at
> sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:102)
>   at
> java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
>   at
> sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:755)
>   at
> sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:56)
>   at
> sun.security.provider.KeyStoreDelegator.engineLoad(KeyStoreDelegator.java:224)
>   at
> sun.security.provider.JavaKeyStore$DualFormatJKS.engineLoad(JavaKeyStore.java:70)
>   at java.security.KeyStore.load(KeyStore.java:1445)
>   at
> org.apache.kafka.common.security.ssl.SslEngineBuilder$SecurityStore.load(SslEngineBuilder.java:286)
>   ... 24 more
>
>
>
> /tmp/kafka.truststore.jks is a path that’s used in consumerFactoryFn to
> load certs from GCP to the worker's local file system.
>
>
>
> Does anyone have any ideas on how to fix this issue?
>
>
>
>
>
> Thank you,
>
> Ilya
>


Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-25 Thread Brian Hulette
> Would someone be willing to review and merge
https://github.com/apache/beam/pull/14878 which should fix this?

Done! Thanks for writing the fix.

On Mon, May 24, 2021 at 8:22 PM Daniel Collins  wrote:

> Looks to me like this is flaky because the Watch class doesn't provide a
> way to ensure a number of iterations have occurred.
>
> Would someone be willing to review and merge
> https://github.com/apache/beam/pull/14878 which should fix this?
>
> On Mon, May 24, 2021 at 10:44 PM Reuven Lax  wrote:
>
>> It appears to be flaky. After several reruns, the test finally succeeded.
>>
>> On Mon, May 24, 2021 at 7:21 PM Daniel Collins 
>> wrote:
>>
>>> Looking. This is a very surprising test to be failing, the underlying
>>> class just isn't doing that much.
>>>
>>> On Mon, May 24, 2021 at 9:21 PM Reuven Lax  wrote:
>>>
 Hmmm... I'm seeing this fail during Java PreCommit.

 On Mon, May 24, 2021 at 6:19 PM Evan Galpin 
 wrote:

> It did, yes :-)
>
> On Mon, May 24, 2021 at 21:17 Reuven Lax  wrote:
>
>> Did Java PreCommit pass on your PR?
>>
>> On Mon, May 24, 2021 at 5:42 PM Evan Galpin 
>> wrote:
>>
>>> I’m not certain that it’s related based on a quick scan of the test
>>> output that you linked, but I do know that I recently made a change[1] 
>>> to
>>> Reshuffe.AssignToShard which @Daniel Collins mentioned was used by 
>>> PubSub
>>> Lite[2].
>>>
>>> Given that the change is recent and the test is failing on remote
>>> but not locally, I thought maybe the remote test env might be using the
>>> AssignToShard change and your local code might not? (not certain if the
>>> remote tests use exact SHA or rebase code on master first possibly?)
>>>
>>> [1] https://github.com/apache/beam/pull/14720
>>>
>>> [2]
>>> https://lists.apache.org/x/thread.html/r62b191e8318413739520b67dd3a4dfa788cbbc7b8d91ad9a80720dc6@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> On Mon, May 24, 2021 at 20:27 Reuven Lax  wrote:
>>>
 This test keeps failing on my PR. For example:


 https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/17841/testReport/junit/org.apache.beam.sdk.io.gcp.pubsublite/SubscriptionPartitionLoaderTest/addedResults/

 I haven't changed anything related to this test, and it passes for
 me locally. Is anyone else seeing this test fail?

 Reuven

>>>


Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Brian Hulette
Hm this would definitely be of interest for the DataFrame API, which is
shuffling pandas objects. This issue [1] confirms what you suggested above,
that pandas supports out-of-band pickling since DataFrames are mostly just
collections of numpy arrays.

Brian

[1] https://github.com/pandas-dev/pandas/issues/34244

On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer  wrote:

> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
> argument into pickle.dumps() and the "buffers" argument into
> pickle.loads(). I expect this would be relatively straightforward.
>
> Then it should "just work", assuming that data is stored in objects (like
> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
> Pickle protocol.
>
>
> On Tue, May 25, 2021 at 2:50 PM Brian Hulette  wrote:
>
>> I'm not aware of anyone looking at it.
>>
>> Will out-of-band pickling "just work" in Beam for types that implement
>> the correct interface in Python 3.8?
>>
>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin 
>> wrote:
>>
>>> +1
>>>
>>> FWIW I recently ran into the exact case you described (high
>>> serialization cost). The solution was to implement some not-so-intuitive
>>> alternative transforms in my case, but I would have very much appreciated
>>> faster serialization performance.
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>>>
>>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>>> i.e., Pickle protocol version 5?
>>>> https://www.python.org/dev/peps/pep-0574/
>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>
>>>> For Beam pipelines passing around NumPy arrays (or collections of NumPy
>>>> arrays, like pandas or Xarray) I've noticed that serialization costs can be
>>>> significant. Beam seems to currently incur at least one (maybe two)
>>>> unnecessary memory copies.
>>>>
>>>> Pickle protocol version 5 exists for solving exactly this problem. You
>>>> can serialize collections of arbitrary Python objects in a fully streaming
>>>> fashion using memory buffers. This is a Python 3.8 feature, but the
>>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>>>> supported by NumPy since version 1.16, released in January 2019.
>>>>
>>>> Cheers,
>>>> Stephan
>>>>
>>>


Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Brian Hulette
I'm not aware of anyone looking at it.

Will out-of-band pickling "just work" in Beam for types that implement the
correct interface in Python 3.8?

On Tue, May 25, 2021 at 2:43 PM Evan Galpin  wrote:

> +1
>
> FWIW I recently ran into the exact case you described (high serialization
> cost). The solution was to implement some not-so-intuitive alternative
> transforms in my case, but I would have very much appreciated faster
> serialization performance.
>
> Thanks,
> Evan
>
> On Tue, May 25, 2021 at 15:26 Stephan Hoyer  wrote:
>
>> Has anyone looked into out of band pickling for Beam's Python SDK, i.e.,
>> Pickle protocol version 5?
>> https://www.python.org/dev/peps/pep-0574/
>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>
>> For Beam pipelines passing around NumPy arrays (or collections of NumPy
>> arrays, like pandas or Xarray) I've noticed that serialization costs can be
>> significant. Beam seems to currently incur at least one (maybe two)
>> unnecessary memory copies.
>>
>> Pickle protocol version 5 exists for solving exactly this problem. You
>> can serialize collections of arbitrary Python objects in a fully streaming
>> fashion using memory buffers. This is a Python 3.8 feature, but the
>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>> supported by NumPy since version 1.16, released in January 2019.
>>
>> Cheers,
>> Stephan
>>
>


Re: Apply a Beam PTransform per key

2021-05-24 Thread Brian Hulette
Isn't it possible to read the grouped values produced by a GBK from an
Iterable and yield results as you go, without needing to collect all of
each input into memory? Perhaps I'm misunderstanding your use-case.
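
For example, something along these lines (a rough sketch - process_one is just
a placeholder for whatever per-element logic your transform applies, and the
grouped values are consumed lazily rather than materialized):

```
import apache_beam as beam


def process_one(value):
  return value  # placeholder for the real per-element logic


def process_group(element):
  key, values = element
  # "values" is an Iterable; results can be yielded as it is consumed,
  # without collecting the whole group into memory first.
  for value in values:
    yield key, process_one(value)


with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([('a', 1), ('a', 2), ('b', 3)])
      | beam.GroupByKey()
      | beam.FlatMap(process_group)
      | beam.Map(print))
```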

Brian

On Mon, May 24, 2021 at 10:41 AM Kenneth Knowles  wrote:

> I'm just pinging this thread because I think it is an interesting problem
> and don't want it to slip by.
>
> I bet a lot of users have gone through the tedious conversion you
> describe. Of course, it may often not be possible if you are using a
> library transform. There are a number of aspects of the Beam model that are
> designed a specific way explicitly *because* we need to assume that a large
> number of composites in your pipeline are not modifiable by you. Most
> closely related: this is why windowing is something carried along
> implicitly rather than just a parameter to GBK - that would require all
> transforms to expose how they use GBK under the hood and they would all
> have to plumb this extra key/WindowFn through every API. Instead, we have
> this way to implicitly add a second key to any transform :-)
>
> So in addition to being tedious for you, it would be good to have a better
> solution.
>
> Kenn
>
> On Fri, May 21, 2021 at 7:18 PM Stephan Hoyer  wrote:
>
>> I'd like to write a Beam PTransform that applies an *existing* Beam
>> transform to each set of grouped values, separately, and combines the
>> result. Is anything like this possible with Beam using the Python SDK?
>>
>> Here are the closest things I've come up with:
>> 1. If each set of *inputs* to my transform fit into memory, I could use
>> GroupByKey followed by FlatMap.
>> 2. If each set of *outputs* from my transform fit into memory, I could
>> use CombinePerKey.
>> 3. If I knew the static number of groups ahead of time, I could use
>> Partition, followed by applying my transform multiple times, followed by
>> Flatten.
>>
>> In my scenario, none of these holds true. For example, currently I have
>> ~20 groups of values, with each group holding ~1 TB of data. My custom
>> transform simply shuffles this TB of data around, so each set of outputs is
>> also 1TB in size.
>>
>> In my particular case, it seems my options are to either relax these
>> constraints, or to manually convert each step of my existing transform to
>> apply per key. This conversion process is tedious, but very
>> straightforward, e.g., the GroupByKey and ParDo that my transform is built
>> out of just need to deal with an expanded key.
>>
>> I wonder, could this be something built into Beam itself, e.g., as
>> TransformPerKey? The PTransforms that result from combining other Beam
>> transforms (e.g., _ChainPTransform in Python) are private, so this seems
>> like something that would need to exist in Beam itself, if it could exist
>> at all.
>>
>> Cheers,
>> Stephan
>>
>


Re: [DISCUSSION] Docker based development environment issue

2021-05-21 Thread Brian Hulette
I think the build environment was set up with that configured:
https://github.com/apache/beam/blob/40326dd0a2a1c9b5dcbbcd6486a43e3875a64a43/start-build-env.sh#L110
Could there be something about your environment preventing that from
working?

Brian

On Fri, May 21, 2021 at 3:34 AM Gleb Kanterov  wrote:

> Is it possible to mount the Docker socket inside the build-env Docker
> container? We run a lot of similar tests in CI, and it always worked:
>
> --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock
>
> On Fri, May 21, 2021 at 12:26 PM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Hello,
>>
>> Beam provides a very cool feature to run a local development environment
>> via Docker [1]. At the same time, some unit tests require running Docker
>> containers to test against “real” instances (for example,
>> ClickHouseIOTest). So, it will end up with a “docker-in-docker” issue and
>> such tests will fail.
>>
>> What would be a proper solution for that? Annotate these tests with a
>> specific “DockerImageRequired” annotation and skip them when running from
>> inside a container, or something else? Any ideas on this?
>>
>> Thanks,
>> Alexey
>>
>>
>> [1] https://github.com/apache/beam/blob/master/start-build-env.sh
>
>


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-05-20 Thread Brian Hulette
> So the question is how do we proceed? Do I contact INFRA to enable it for
the main repo?

No objections from me, is this something we should vote on though? Perhaps
we already have lazy consensus :)

> more concretely how do we deal with these PRs in a practical sense? Do we
rename them and create an associated JIRA for tracking?

Yes I wonder about this too. I only have more questions though:

Would we set reviewers [1] to make sure these PRs get attention? Who should
they go to? It would be nice if we could have reviewers per dependency,
similar to [2], but it looks like it has to be the same person or group for
all PRs.
What about our existing dependency tracking which sends the dependency
update message to dev@ and files JIRAs? Probably we could just leave that
on in parallel for now, and consider turning it down later if dependabot is
working well.

Brian

[1]
https://docs.github.com/en/code-security/supply-chain-security/keeping-your-dependencies-updated-automatically/configuration-options-for-dependency-updates#reviewers
[2]
https://github.com/apache/beam/tree/243128a8fc52798e1b58b0cf1a271d95ee7aa241/ownership

On Wed, May 12, 2021 at 5:43 AM Ismaël Mejía  wrote:

> My apologies, Brian, I had not seen your question:
>
> > - Will dependabot work ok with the version ranges that we specify? For
> example some Python dependencies have upper bounds for the next major
> version, some for the next minor version. Is dependabot smart enough to try
> bumping the appropriate version number?
>
> Yes, it does and we can also explicitly set it to ignore certain versions
> or all of them for each dependency if we don't want to have any PR upgrade for it.
>
> As a follow up on this I received an email from my Beam fork this morning
> reporting a CVE issue on one of the website dependencies (it is a moderate
> issue since this is a dep for the website generation, so it won't affect
> Beam users), but it is a clear example of the utility of dependabot.
>
> So the question is how do we proceed? Do I contact INFRA to enable it for
> the main repo? and more concretely how do we deal with these PRs in a
> practical sense? Do we rename them and create an associated JIRA for
> tracking?
>
> Other opinions?
>
> Ismaël
>
>
>
> On Fri, Apr 16, 2021 at 5:36 PM Brian Hulette  wrote:
>
>> Yeah I can see the advantage in tooling like this for easy upgrades. I
>> suspect many of the outdated Python dependencies fall under this category,
>> but the toil of creating a PR and verifying it passes tests is enough of a
>> barrier that we just haven't done it. Having a bot create the PR and
>> trigger CI to verify it would be helpful IMO.
>>
>> Some questions/concerns I have:
>> - I think many python upgrades will still require manual work:
>>   - We also have pinned versions for some Python dependencies in
>> base_image_requirements.txt [1]
>>   - We test with multiple major versions of pyarrow. We'd want to add a
>> new test environment [2] when bumping to the next major version
>> - Will dependabot work ok with the version ranges that we specify? For
>> example some Python dependencies have upper bounds for the next major
>> version, some for the next minor version. Is dependabot smart enough to try
>> bumping the appropriate version number?
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt
>>
>> [2]
>> https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/python/tox.ini#L231
>>
>> On Fri, Apr 16, 2021 at 7:16 AM Ismaël Mejía  wrote:
>>
>>> Oh forgot to mention one alternative that we do in the Avro project,
>>> it is that we don't create issues for the dependabot PRs and then we
>>> search all the commits authored by dependabot and include them in the
>>> release notes to track dependency upgrades.
>>>
>>> On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
>>> >
>>> > > Quite often, dependency upgrade to latest versions leads to either
>>> compilation errors or failed tests and it should be resolved manually or
>>> declined. Having this, maybe I miss something, but I don’t see what kind of
>>> advantages automatic upgrade will bring to us except that we don’t need to
>>> create a PR manually (which is not a big deal).
>>> >
>>> > The advantage is exactly that, that we don't have to create and track
>>> > dependency updates manually, it will be done by the bot and we will
>>> > only have to do the review and guarantee that no issues are
>>> > introduced. I forgot to mention but we can create

Re: Proposal: Generalize S3FileSystem

2021-05-20 Thread Brian Hulette
That's an interesting idea. What do you mean by its own project? A couple
of possibilities:
- Spinning off a new ASF project
- A separate Beam-governed repository (e.g. apache/beam-filesystems)
- More clearly separate it in the current build system and release
artifacts that allow it to be used independently

Personally I'd be resistant to the first two (I am a Google engineer and I
like monorepos after all), but I don't see a major problem with the last
one, except that it gives us another surface to maintain.

Brian

On Wed, May 19, 2021 at 8:38 PM Chad Dombrova  wrote:

> This is a random idea, but the whole file IO system inside Beam would
> actually be awesome to extract into its own project.  IIRC, it’s not
> particularly tied to Beam.
>
> I’m not saying this should be done now, but it’s be nice to keep it mind
> for a future goal.
>
> -chad
>
>
>
> On Wed, May 19, 2021 at 10:23 AM Pablo Estrada  wrote:
>
>> That would be great to add, Matt. Of course it's important to make this
>> backwards compatible, but other than that, the addition would be very
>> welcome.
>>
>> On Wed, May 19, 2021 at 9:41 AM Matt Rudary 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> This is a quick sketch of a proposal – I wanted to get a sense of
>>> whether there’s general support for this idea before fleshing it out
>>> further, getting internal approvals, etc.
>>>
>>>
>>>
>>> I’m working with multiple storage systems that speak the S3 api. I would
>>> like to support FileIO operations for these storage systems, but
>>> S3FileSystem hardcodes the s3 scheme (the various systems use different URI
>>> schemes) and it is in any case impossible to instantiate more than one in
>>> the current design.
>>>
>>>
>>>
>>> I’d like to refactor the code in org.apache.beam.sdk.io.aws.s3 (and
>>> maybe …aws.options) somewhat to enable this use-case. I haven’t worked out
>>> the details yet, but it will take some thought to make this work in a
>>> non-hacky way.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Matt Rudary
>>>
>>


Compatibility Check Badges need attention

2021-05-18 Thread Brian Hulette
Hi all,
I just noticed that the two "compatibility check" badges on
github.com/apache/beam are in a bad state [1,2]. Does anyone have context
on these? What do we need to do to fix them?

It looks like it may be a configuration issue. Some of the complaints are
related to python 2. It's also notable that in one of the Python 3 sections
it states "The package does not support this version of python."

Thanks,
Brian

[1]
https://python-compatibility-tools.appspot.com/one_badge_target?package=apache-beam%5Bgcp%5D
[2]
https://python-compatibility-tools.appspot.com/one_badge_target?package=git%2Bgit%3A//github.com/apache/beam.git%23subdirectory%3Dsdks/python


Re: A problem with nexmark build

2021-05-17 Thread Brian Hulette
Hm, it looks like there may be a bug in our gradle config; it doesn't seem
to make a shaded jar for use with Spark (see this comment on the PR that
added this to the website [1]). Maybe we need to add a shadowJar
configuration to :sdks:java:testing:nexmark?

+dev  does anyone have context on this?

[1]
https://github.com/apache/beam/commit/2ae7950328cd27330befe0e64688230c83755137#r29690967

On Wed, May 12, 2021 at 4:06 PM Tao Li  wrote:

> Hi Beam community,
>
>
>
> I have been following this nexmark doc:
> https://beam.apache.org/documentation/sdks/java/testing/nexmark/
>
>
>
> I ran into a problem with “Running query 0 on a Spark cluster with Apache
> Hadoop YARN” section.
>
>
>
> I was following the instructions by running the “./gradlew
> :sdks:java:testing:nexmark:assemble” command, but did not find the uber jar
> “build/libs/beam-sdks-java-nexmark-2.29.0-spark.jar” that was built locally
> (the nexmark doc is referencing that jar).
>
>
>
> Can someone provide some guidance and help? Thanks.
>
>
>
>
>


Re: Flake trends - better?

2021-05-10 Thread Brian Hulette
In addition to stale flake jiras, I think there are also many tracking
tests that were disabled years ago due to flakiness.

On Sat, May 8, 2021 at 1:39 PM Kenneth Knowles  wrote:

> Oh the second chart is not automatically associated with the board/filter.
> Here is the correct link:
> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=filter-12350547&periodName=daily&daysprevious=300&selectedProjectId=12319527&reportKey=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report&atl_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin&Next=Next
>
> On Sat, May 8, 2021 at 1:37 PM Kenneth Knowles  wrote:
>
>> The second chart is clearly bad and getting worse. Our flake bugs are not
>> getting addressed in a timely manner.
>>
>> Zooming in on the first chart for the last 3 months you can see a notable
>> change:
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464&projectKey=BEAM&view=reporting&chart=cumulativeFlowDiagram&swimlane=1174&swimlane=1175&column=2038&column=2039&column=2040&days=90.
>> This will not change the average in the second chart very quickly.
>>
>> It may be just cleanup. That seems likely. Anecdotally, I have done a lot
>> of triage recently of failures and I know of only two severe flakes (that
>> you can count on seeing in a day/week). If so, then more cleanup would be
>> valuable. This is why I ran the second report: I suspected that we had a
>> lot of very old stale flake bugs that noone is looking at.
>>
>> Kenn
>>
>> On Fri, May 7, 2021 at 4:37 PM Ahmet Altay  wrote:
>>
>>> Thank you for sharing the charts.
>>>
>>> I know you are the messenger here, but I disagree with the message that
>>> flakes are getting noticeably better. Number of open issues look quite
>>> large but at least stable. I will guess that some of those are stale and
>>> seemingly we did a clean up in July 2020. We can try that again. Second
>>> chart shows a bad picture IMO. Issues staying open for 500-600 days on
>>> average sounds like a really long time.
>>>
>>> On Fri, May 7, 2021 at 1:42 PM Kenneth Knowles  wrote:
>>>
>>>> Alright, I think it should be fixed. The underlying saved filter had
>>>> not been shared.
>>>>
>>>> Kenn
>>>>
>>>> On Fri, May 7, 2021 at 8:02 AM Brian Hulette 
>>>> wrote:
>>>>
>>>>> The first link doesn't work for me, I just see a blank page with some
>>>>> jira header and navbar. Do I need some additional permissions?
>>>>>
>>>>> If I click over to "Kanban Board" on the toggle at the top right I see
>>>>> a card with "Error: The requested board cannot be viewed because it either
>>>>> does not exist or you do not have permission to view it."
>>>>>
>>>>> Brian
>>>>>
>>>>> On Thu, May 6, 2021 at 5:56 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> I spoke too soon?
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=project-12319527&periodName=daily&daysprevious=300&selectedProjectId=12319527&reportKey=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report&atl_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin&Next=Next
>>>>>>
>>>>>> On Thu, May 6, 2021 at 5:54 PM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> I made a quick* Jira chart to see how we are doing at flakes:
>>>>>>>
>>>>>>>
>>>>>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464&projectKey=BEAM&view=reporting&chart=cumulativeFlowDiagram&swimlane=1174&swimlane=1175&column=2038&column=2039&column=2040
>>>>>>>
>>>>>>> Looking a lot better recently at resolving them! (whether these are
>>>>>>> new fixes or just resolving stale bugs, I love it)
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> *AFAICT you need to make a saved search, then an agile board based
>>>>>>> on the saved search, then you can look at reports
>>>>>>>
>>>>>>


Re: Jenkins master is down

2021-05-07 Thread Brian Hulette
Hey all, a quick update here. Infra helped clear out some disk space, so
the immediate issue seems to be resolved and Jenkins is now executing jobs.
You may need to comment "retest this please" on your PRs to get jobs
started.

*However*, Infra said they've also been planning on increasing the disk
size for the ci-beam jenkins master (up to 200GB), and would like to go
ahead and do that this afternoon (PDT).

Once they're ready they will direct Jenkins to finish in-flight jobs and
stop accepting new ones, then when that's done they'll switch out the new
disk and start it up again. I'm not sure exactly when this will happen, or
how long it will take. Apologies for the interruption, but I figured we may
as well get this upgrade while we have Infra's attention, and Friday
afternoon (PDT) is hopefully low-volume anyway.

If you have any concerns please reply here, or jump into #asfinfra at
the-asf.slack.com

Brian

On Fri, May 7, 2021 at 11:49 AM Pablo Estrada  wrote:

> Thanks for the update and for looking into it Kyle!
>
> On Fri, May 7, 2021 at 11:39 AM Kyle Weaver  wrote:
>
>> Just a quick PSA - Beam's Jenkins master is currently out of disk space,
>> preventing any Jenkins CI jobs from running. We are looking into the issue
>> ([1] for tracking). Thanks for your patience.
>>
>> Kyle
>>
>> [1] https://issues.apache.org/jira/browse/INFRA-21857
>>
>


Re: Potential Bug in Reshuffle.AssignToShard?

2021-05-07 Thread Brian Hulette
I suspect this was unintentional. It looks like @Daniel Collins
 added the numBuckets parameter in
https://github.com/apache/beam/pull/11919, maybe they can confirm.

Brian

On Mon, May 3, 2021 at 5:17 PM Evan Galpin  wrote:

> Hi all,
>
> While testing for a feature I’m implementing, I noticed that
> Reshuffle.AssignToShard[1] produces (N*2)-1 buckets, where N is the value
> of the user-defined numBuckets parameter. This is because the value of the
>  variable having the remainder operator applied, hashOfShard, can be
> negative.
>
> Is it intentional to produce (N*2)-1 buckets? If not I’ll submit a small
> patch.
>
> I only worry about the implications of changing it for use cases already
> employing AssignToShard. Until recently (28 days ago[2]) the class was
> private, plus Reshuffle is marked as deprecated, so I imagine the impact
> would be low? Thoughts?
>
> Thanks,
> Evan
>
> [1]
>
> https://github.com/apache/beam/blob/abbe14f721327d51cce02876324e7feba98581e2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L160
> [2]
>
> https://github.com/apache/beam/commit/abbe14f721327d51cce02876324e7feba98581e2
>
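
To spell out the arithmetic, here is a small Python emulation of Java's
truncating % operator (the helper name is made up for illustration):

def java_remainder(a, n):
    # Java's % truncates toward zero, so a negative dividend yields a
    # negative remainder; Python's %, like Java's Math.floorMod, does not.
    r = abs(a) % n
    return r if a >= 0 else -r

num_buckets = 5
hashes = range(-1000, 1000)
print(sorted({java_remainder(h, num_buckets) for h in hashes}))
# [-4, -3, -2, -1, 0, 1, 2, 3, 4]  -> 2*5 - 1 = 9 distinct shards
print(sorted({h % num_buckets for h in hashes}))
# [0, 1, 2, 3, 4]                  -> the intended 5 shards

So a fix along the lines of Math.floorMod (or hashing to a non-negative
value) would presumably restore the intended bucket count, with the caveat
Evan raises about pipelines that already rely on the current assignments.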


Re: Flake trends - better?

2021-05-07 Thread Brian Hulette
The first link doesn't work for me, I just see a blank page with some jira
header and navbar. Do I need some additional permissions?

If I click over to "Kanban Board" on the toggle at the top right I see a
card with "Error: The requested board cannot be viewed because it either
does not exist or you do not have permission to view it."

Brian

On Thu, May 6, 2021 at 5:56 PM Kenneth Knowles  wrote:

> I spoke too soon?
>
>
> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=project-12319527&periodName=daily&daysprevious=300&selectedProjectId=12319527&reportKey=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report&atl_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin&Next=Next
>
> On Thu, May 6, 2021 at 5:54 PM Kenneth Knowles  wrote:
>
>> I made a quick* Jira chart to see how we are doing at flakes:
>>
>>
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464&projectKey=BEAM&view=reporting&chart=cumulativeFlowDiagram&swimlane=1174&swimlane=1175&column=2038&column=2039&column=2040
>>
>> Looking a lot better recently at resolving them! (whether these are new
>> fixes or just resolving stale bugs, I love it)
>>
>> Kenn
>>
>> *AFAICT you need to make a saved search, then an agile board based on the
>> saved search, then you can look at reports
>>
>


Re: New Beam Glossary

2021-05-05 Thread Brian Hulette
+user@beam for visibility

Thanks David!

-- Forwarded message -
From: David Huntsperger 
Date: Wed, May 5, 2021 at 10:25 AM
Subject: New Beam Glossary
To: 


Hey all,

We published a new Apache Beam glossary, to help new users learn
the terminology of the programming model. Feedback welcome!

Thanks,

David


Re: suggestions to implement BEAM-12017

2021-04-30 Thread Brian Hulette
I think for now it's fine to just require Singleton partitioning for this.
In the future we could add a couple optimizations:
- Recognize elementwise np.ufunc implementations. I think we can do this by
looking at the signature [1].
- Allow the user to indicate their function is elementwise with a
beam-specific argument (as Robert suggested).

[1]
https://numpy.org/doc/stable/reference/generated/numpy.ufunc.signature.html#numpy.ufunc.signature
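
As a sketch of the first option above, np.ufunc exposes its generalized
signature, which is None for plain elementwise ufuncs (the helper name below
is made up):

import numpy as np

def looks_elementwise(fn):
    # A ufunc with no generalized signature maps inputs to outputs
    # element-by-element, so it could safely run per-partition.
    return isinstance(fn, np.ufunc) and fn.signature is None

print(looks_elementwise(np.add))     # True
print(looks_elementwise(np.matmul))  # False: signature '(n?,k),(k,m?)->(n?,m?)'
print(looks_elementwise(max))        # False: not a ufunc, so we can't tell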

On Fri, Apr 30, 2021 at 11:52 AM Robert Bradshaw 
wrote:

> On Fri, Apr 30, 2021 at 7:04 AM Irwin Alejandro Rodriguez Ramirez
>  wrote:
> >
> > Awesome, thanks! It helps me a lot,
>
> You're welcome. Looking forward to a PR :).
>
> > Now I don't know how to tell if the callable would act on a full column
> or will be pure elementwise; are there some examples of this?
>
> I don't think it's possible to figure this out in general, which is
> why we'd have to take it as explicit user input or use the Singleton
> partitioning (which brings everything to the same machine where it
> doesn't matter as the full columns would then be available).
>
> > On Wed, Apr 28, 2021 at 7:57 PM Robert Bradshaw 
> wrote:
> >>
> >> Hi Irwin,
> >>
> >> Looking forward to your first contribution!
> >>
> >> For combine_first, reading the documentation, it is completely elementwise.
> >> One could implement it as
> >>
> >>
> https://github.com/apache/beam/blob/release-2.28.0/sdks/python/apache_beam/dataframe/frames.py#L182
> >>
> >> and then update the tests to allow this
> >>
> >>
> https://github.com/apache/beam/blob/release-2.28.0/sdks/python/apache_beam/dataframe/pandas_doctests_test.py#L98
> >>
> >> The plain old combine has the unfortunate property that the passed
> >> callable may act on a full column, but in practice is often
> >> elementwise. It could be implemented similar to the non-pearson
> >> variant of corr:
> >>
> >>
> https://github.com/apache/beam/blob/release-2.29.0/sdks/python/apache_beam/dataframe/frames.py#L636
> >>
> >> requiring Singleton partitioning. One could consider adding an extra
> >> flag "elementwise" which would allow one to only require Index
> >> partitioning.
> >>
> >>
> >>
> >>
> >> On Wed, Apr 28, 2021 at 5:00 PM Irwin Alejandro Rodriguez Ramirez
> >>  wrote:
> >> >
> >> > Hi team,
> >> >
> >> > I'm a new contributor at Beam, and I'm trying to implement the
> methods combine and combine_first from BEAM-12017, I couldn't solve it yet,
> I was looking for some suggestions on how to implement these methods.
> >> > I would appreciate any help you can provide.
> >> >
> >> >
> >> > --
> >> >
> >> > Irwin Alejandro Rodríguez Ramírez | WIZELINE
> >> >
> >> > Software Engineer
> >> >
> >> > irwin.rodrig...@wizeline.com | +52 1(55) 6694 6649
> >> >
> >> > Paseo de la Reforma #296, Piso 32, Col. Juárez, Del. Cuauhtémoc,
> 06600 CDMX.
> >> >
>


Re: [DISCUSS] Warn when KafkaIO is used as a bounded source

2021-04-30 Thread Brian Hulette
I guess that is the question. [2] and [3] above make me think that this is
experimental and just not labeled as such.

It doesn't seem reasonable to have both an open feature request for bounded
KafkaIO (BEAM-2185), and a bug report regarding bounded KafkaIO (BEAM-6466).

On Fri, Apr 30, 2021 at 11:26 AM Pablo Estrada  wrote:

> Are they experimental? I suppose this is a valid use case, right? I am in
> favor of adding a warning, but I don't know if I would call them
> experimental.
>
> I suppose a repeated-batch use case may do this repeatedly (though then
> users would need to recover the latest offsets for each partition, which I
> guess is not possible at the moment?)
>
> On Thu, Apr 29, 2021 at 4:17 PM Brian Hulette  wrote:
>
>> Our oldest open P1 issue is BEAM-6466 - "KafkaIO doesn't commit offsets
>> while being used as bounded source" [1]. I'm not sure this is an actual
>> issue since KafkaIO doesn't seem to officially support this use-case. The
>> relevant parameters indicate they are "mainly used for tests and demo
>> applications" [2], and BEAM-2185 - "KafkaIO bounded source" [3] is still
>> open.
>>
>> I think we should close out BEAM-6466 by more clearly indicating that
>> withMaxReadTime() and withMaxRecords() are experimental, and/or logging a
>> warning when they are used.
>>
>> I'm happy to make such a change, but I wanted to check if there are any
>> objections to this first.
>>
>> Thanks,
>> Brian
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6466
>> [2]
>> https://github.com/apache/beam/blob/3d4db26cfa4ace0a0f2fbb602f422fe30670c35f/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L960
>> [3] https://issues.apache.org/jira/browse/BEAM-2185
>>
>


Re: Consider Cloudpickle instead of dill for Python pickling

2021-04-30 Thread Brian Hulette
> I think with cloudpickle we will not be able to have a tight range.

If cloudpickle is backwards compatible, we should be able to just keep an
upper bound in setup.py [1] synced up with a pinned version in
base_image_requirements.txt [2], right?

> We could solve this problem by passing the version of pickler used at job
submission

A bit of a digression, but it may be worth considering something more
general here, for a couple of reasons:
- I've had a similar concern for the Beam DataFrame API. Our goal is for it
to match the behavior of the pandas version used at construction time, but
we could get into some surprising edge cases if the version of pandas used
to compute partial results in the SDK harness is different.
- Occasionally we have Dataflow customers report NameErrors/AttributeErrors
that can be attributed to a dependency mismatch. It would be nice to
proactively warn about this.

That being said I imagine it would be hard to do something truly general
since every dependency will have different compatibility guarantees.

[1] https://github.com/apache/beam/blob/master/sdks/python/setup.py
[2]
https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt

On Fri, Apr 30, 2021 at 9:34 AM Valentyn Tymofieiev 
wrote:

> Hi Stephan,
>
> Thanks for reaching out. We first considered switching to cloudpickle when
> adding Python 3 support[1], and there is a tracking issue[2]. We were able
> to fix or work around missing Py3 features in dill, although some are still
> not working for us [3].
> I agree that Beam can and should support cloudpickle as a pickler.
> Practically, we can make cloudpickle the default pickler starting from a
> particular python version, for example we are planning to add Python 3.9
> support and we can try to make cloudpickle the default pickler for this
> version to avoid breaking users while ironing out rough edges.
>
> My main concern is client-server version range compatibility of the
> pickler. When SDK creates the job representation, it serializes the objects
> using the pickler used on the user's machine. When SDK deserializes the
> objects on the Runner side, it uses the pickler installed on the runner,
> for example it can be a dill version installed in the docker container
> provided by Beam or Dataflow. We have been burned in the past by having an
> open version bound for the pickler in Beam's requirements: client side
> would pick the newest version, but runner container would have a somewhat
> older version, either because the container did not have the new version,
> or because some pipeline dependency wanted to downgrade dill. Older version
> of pickler did not correctly deserialize new pickles. I suspect cloudpickle
> may have the same problem. A solution was to have a very tight version
> range for the pickler in SDK's requirements [4]. Given that dill is not a
> popular dependency, the tight range did not create much friction for Beam
> users. I think with cloudpickle we will not be able to have a tight range. We
> could solve this problem by passing the version of pickler used at job
> submission, and have a check on the runner to make sure that the client
> version is not newer than the runner's version. Additionally, we should
> make sure cloudpickle is backwards compatible (newer version can
> deserialize objects created by older version).
>
> [1]
> https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-8123
> [3] https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
> [4]
> https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143
>
> On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer  wrote:
>
>> cloudpickle [1] and dill [2] are two Python packages that implement
>> extensions of Python's pickle protocol for arbitrary objects. Beam
>> currently uses dill, but I'm wondering if we could consider additionally or
>> alternatively use cloudpickle instead.
>>
>> Overall, cloudpickle seems to be a more popular choice for extended
>> pickle support in distributed computing in Python, e.g., it's used by
>> Spark, Dask and joblib.
>>
>> One of the major differences between cloudpickle and dill is how they
>> handle pickling global variables (such as Python modules) that are referred
>> to by a function:
>> - Dill doesn't serialize globals. If you want to save globals, you need
>> to call dill.dump_session(). This is what the "save_main_session" flag does
>> in Beam.
>> - Cloudpickle takes a different approach. It introspects which global
>> variables are used by a function, and creates a closure around the
>> serialized function that only contains these variables.
>>
>> The cloudpickle approach results in larger serialized functions, but it's
>> also much more robust, because the required globals are included by
>> default. In contrast, with dill, one either needs to save *all *globals
>> or none. Th
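
A small sketch of that difference in practice (assuming cloudpickle is
installed and this is run as a script, so the function lives in __main__):

import pickle
import cloudpickle

SCALE = 10  # module-level global referenced by the function below

def scale(values):
    return [v * SCALE for v in values]

# cloudpickle serializes the function by value and captures the globals it
# references (here, SCALE), so the payload can be unpickled by a worker that
# never imported this module and without saving the whole main session.
payload = cloudpickle.dumps(scale)
restored = pickle.loads(payload)
print(restored([1, 2, 3]))  # [10, 20, 30]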

[DISCUSS] Warn when KafkaIO is used as a bounded source

2021-04-29 Thread Brian Hulette
Our oldest open P1 issue is BEAM-6466 - "KafkaIO doesn't commit offsets
while being used as bounded source" [1]. I'm not sure this is an actual
issue since KafkaIO doesn't seem to officially support this use-case. The
relevant parameters indicate they are "mainly used for tests and demo
applications" [2], and BEAM-2185 - "KafkaIO bounded source" [3] is still
open.

I think we should close out BEAM-6466 by more clearly indicating that
withMaxReadTime() and withMaxRecords() are experimental, and/or logging a
warning when they are used.

I'm happy to make such a change, but I wanted to check if there are any
objections to this first.

Thanks,
Brian

[1] https://issues.apache.org/jira/browse/BEAM-6466
[2]
https://github.com/apache/beam/blob/3d4db26cfa4ace0a0f2fbb602f422fe30670c35f/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L960
[3] https://issues.apache.org/jira/browse/BEAM-2185


Re: Please help with the 2.29.0 release announcement! (curate the bugs)

2021-04-28 Thread Brian Hulette
Here's a query to find 2.29.0 issues where you are the author or reporter
[1].

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.29.0%20AND%20(assignee%20in%20(currentUser())%20OR%20reporter%20in%20(currentUser()))

On Tue, Apr 27, 2021 at 9:22 PM Kenneth Knowles  wrote:

> Hi all,
>
> I was just self-reviewing the blog post [1] for the 2.29.0 release
> announcement.
>
> I have these requests, which I would appreciate if I were a user:
>
> * Can you browse the jiras [2] and change them between
> Bug/Feature/Improvement and make sure they are relevant? In particular:
>   -  "sub-task" is so generic; only use if it does not stand on its own
>   - "feature" should be something a user might care about (IMO)
>   - "bug" is something else a user might care about, if they are figuring
> out how far they have to upgrade
>   - "improvement" can be changes that aren't that interesting to users
>   - if a test was failing for a while and we fixed it, set Fix Version =
> Not Applicable since it is not really tied to a release
>
> * Can you take another pass and highlight things that should be on the
> blog? Leave a comment on https://github.com/apache/beam/pull/14562. You
> can highlight your own work or other people. A quick read makes me think a
> lot is missing. The initial blog post is autogenerated from CHANGES.md but
> we are not yet very good about making sure things are in that file.
>
> Kenn
>
> [1]
> http://apache-beam-website-pull-requests.storage.googleapis.com/14562/blog/beam-2.29.0/index.html
> [2]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349629
>


Order-sensitive operations in DataFrames and SQL (Was: [DISCUSSION] TPC-DS benchmark via Beam SQL, issues)

2021-04-27 Thread Brian Hulette
(Forking this since it's straying from the original topic)

Got it, thank you Rui. The DataFrame API could do something similar: in
that context we could require Singleton partitioning between
operations that enforce an ordering (e.g. sort_values) and operations that
depend on the ordering (e.g. head). We typically warn users about pipelines
that require Singleton partitioning, but this could be acceptable when done
within a window.

Implementing this could be a reasonable first step before adding a
distributed approach.
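
For comparison, the reason nlargest/nsmallest are tractable is that they
bundle the ordering and the limit into a single operation; in plain pandas
(with made-up data):

import pandas as pd

df = pd.DataFrame({'user': ['a', 'b', 'c', 'd'], 'score': [7, 42, 13, 5]})

# "ORDER BY score DESC LIMIT 2" expressed two ways:
ordered_then_limited = df.sort_values('score', ascending=False).head(2)
top_two = df.nlargest(2, 'score')

print(ordered_then_limited.equals(top_two))  # True

The sort_values-then-head form is the one that would need the Singleton
partitioning described above, while nlargest can be evaluated as a
distributed top-N.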

Brian

On Tue, Apr 27, 2021 at 11:32 AM Rui Wang  wrote:

> I gave it a try but didn't find the discussion in my email list. That might
> also be an offline discussion.
>
> I think for the SQL implementation, the closest thing we have so far is that we
> support an ordering for window analytics functions by sorting those values
> in memory (no partitioning, just sort all values for a key in memory).
>
> -Rui
>
> On Tue, Apr 27, 2021 at 11:25 AM Brian Hulette 
> wrote:
>
>>
>>
>> On Tue, Apr 27, 2021 at 10:25 AM Rui Wang  wrote:
>>
>>>
>>>
>>> On Tue, Apr 27, 2021 at 9:10 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I try to run a Beam implementation [1] of TPC-DS benchmark [2] and I
>>>> observe that most of the queries don’t pass because of different reasons
>>>> (see below). I run it with Spark Runner but the issues, I believe, are
>>>> mostly related to either query parsing or query planning, so we can expect
>>>> the same with other runners too. For now, only ~22% (23/103) of TPC-DS
>>>> queries passed successfully via Beam SQL / CalciteSQL.
>>>>
>>>> The most common issues are the following ones:
>>>>
>>>>1. *“Caused by: java.lang.UnsupportedOperationException: Non
>>>>equi-join is not supported”*
>>>>2. *“Caused by: java.lang.UnsupportedOperationException: ORDER BY
>>>>without a LIMIT is not supported!”*
>>>>3. *“Caused by: 
>>>> org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.plan.RelOptPlanner$CannotPlanException:
>>>>  There
>>>>are not enough rules to produce a node with desired
>>>>properties: convention=BEAM_LOGICAL. All the inputs have relevant nodes,
>>>>however the cost is still infinite.”*
>>>>4. *“Caused by: 
>>>> org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException:
>>>>  No
>>>>match found for function signature substr(, ,
>>>>)”*
>>>>
>>>> The full list of query statuses is available here [3]. The generated
>>>> TPC-DS SQL queries can be found there as well [4].
>>>>
>>>
>>> Not every query can be supported by BeamSQL easily. For example, supporting
>>> non-equi-joins (BEAM-2194). We had discussions for cause 2 to add the
>>> limitation that BeamSQL only supports ORDER BY LIMIT (LIMIT is required).
>>> Cause 3 needs a case by case investigation, some might be able to be fixed.
>>> Cause 4 looks like no such function found in the catalog.
>>>
>>
>> If (2) was discussed on the mailing list could you link that discussion?
>> We have the same issue in the DataFrame API - order-sensitive operations
>> are not supported since they don't map well to unordered PCollections, but
>> we do have support for nlargest/nsmallest, which are analogous to ORDER BY
>> with a LIMIT. We've discussed (offline) implementing these in the DataFrame
>> API by partitioning the PCollection based on the ordering (as opposed to
>> the typical hash of the index). I'm curious how SQL might implement ORDER
>> BY without a LIMIT and whether we could use a similar approach for
>> dataframes.
>>
>>
>>>> I’m not very familiar with a current status of ongoing work for Beam
>>>> SQL, so I’m sorry in advance if my questions will sound naive.
>>>>
>>>> Please, guide me on this:
>>>>
>>>> 1. Are there any chances that we can resolve, at least, partly the
>>>> current limitations of the query parsing/planning, mentioned above? Are
>>>> there any principal blockers among them?
>>>> 2. Are there any plans or ongoing work related to this?
>>>> 3. Are there any plans to upgrade vendored Calcite version to more
>>>> recent one? Should it reduce the number of current limitations or not?
>>>> 4. Do you think it could be valuable for Beam SQL to run TPC-DS
>>>> benchmark on a regular basis (as we do for Nexmark, for example) even if
>>>> not all queries can pass with Beam SQL?
>>>>
>>>
>>> This is definitely valuable for BeamSQL if we have enough resources to
>>> run such queries regularly.
>>>
>>>>
>>>> I’d appreciate any additional information/docs/details/opinions on this
>>>> topic.
>>>>
>>>> —
>>>> Alexey
>>>>
>>>> [1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
>>>> [2] http://www.tpc.org/tpcds/
>>>> [3]
>>>> https://docs.google.com/spreadsheets/d/1Gya9Xoa6uWwORHSrRqpkfSII4ajYvDpUTt0cNJCRHjE/edit?usp=sharing
>>>> [4]
>>>> https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds/src/main/resources/queries
>>>>
>>>


Re: [DISCUSSION] TPC-DS benchmark via Beam SQL, issues

2021-04-27 Thread Brian Hulette
On Tue, Apr 27, 2021 at 10:25 AM Rui Wang  wrote:

>
>
> On Tue, Apr 27, 2021 at 9:10 AM Alexey Romanenko 
> wrote:
>
>> Hello all,
>>
>> I try to run a Beam implementation [1] of TPC-DS benchmark [2] and I
>> observe that most of the queries don’t pass because of different reasons
>> (see below). I run it with Spark Runner but the issues, I believe, are
>> mostly related to either query parsing or query planning, so we can expect
>> the same with other runners too. For now, only ~22% (23/103) of TPC-DS
>> queries passed successfully via Beam SQL / CalciteSQL.
>>
>> The most common issues are the following ones:
>>
>>1. *“Caused by: java.lang.UnsupportedOperationException: Non
>>equi-join is not supported”*
>>2. *“Caused by: java.lang.UnsupportedOperationException: ORDER BY
>>without a LIMIT is not supported!”*
>>3. *“Caused by: 
>> org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.plan.RelOptPlanner$CannotPlanException:
>>  There
>>are not enough rules to produce a node with desired
>>properties: convention=BEAM_LOGICAL. All the inputs have relevant nodes,
>>however the cost is still infinite.”*
>>4. *“Caused by: 
>> org.apache.beam.vendor.calcite.v1_20_0.org.apache.calcite.sql.validate.SqlValidatorException:
>>  No
>>match found for function signature substr(, ,
>>)”*
>>
>> The full list of query statuses is available here [3]. The generated
>> TPC-DS SQL queries can be found there as well [4].
>>
>
> Not every query can be supported by BeamSQL easily. For example, supporting
> non-equi-joins (BEAM-2194). We had discussions for cause 2 to add the
> limitation that BeamSQL only supports ORDER BY LIMIT (LIMIT is required).
> Cause 3 needs a case by case investigation, some might be able to be fixed.
> Cause 4 looks like no such function found in the catalog.
>

If (2) was discussed on the mailing list could you link that discussion? We
have the same issue in the DataFrame API - order-sensitive operations are
not supported since they don't map well to unordered PCollections, but we
do have support for nlargest/nsmallest, which are analogous to ORDER BY
with a LIMIT. We've discussed (offline) implementing these in the DataFrame
API by partitioning the PCollection based on the ordering (as opposed to
the typical hash of the index). I'm curious how SQL might implement ORDER
BY without a LIMIT and whether we could use a similar approach for
dataframes.


>> I’m not very familiar with a current status of ongoing work for Beam SQL,
>> so I’m sorry in advance if my questions will sound naive.
>>
>> Please, guide me on this:
>>
>> 1. Are there any chances that we can resolve, at least, partly the
>> current limitations of the query parsing/planning, mentioned above? Are
>> there any principal blockers among them?
>> 2. Are there any plans or ongoing work related to this?
>> 3. Are there any plans to upgrade vendored Calcite version to more recent
>> one? Should it reduce the number of current limitations or not?
>> 4. Do you think it could be valuable for Beam SQL to run TPC-DS benchmark
>> on a regular basis (as we do for Nexmark, for example) even if not all
>> queries can pass with Beam SQL?
>>
>
> This is definitely valuable for BeamSQL if we have enough resources to run
> such queries regularly.
>
>>
>> I’d appreciate any additional information/docs/details/opinions on this
>> topic.
>>
>> —
>> Alexey
>>
>> [1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
>> [2] http://www.tpc.org/tpcds/
>> [3]
>> https://docs.google.com/spreadsheets/d/1Gya9Xoa6uWwORHSrRqpkfSII4ajYvDpUTt0cNJCRHjE/edit?usp=sharing
>> [4]
>> https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds/src/main/resources/queries
>>
>


Re: [ANNOUNCE] New committer: Yichi Zhang

2021-04-22 Thread Brian Hulette
Congratulations Yichi!

On Thu, Apr 22, 2021 at 8:05 AM Robert Burke  wrote:

> Congratulations Yichi!
>
> On Thu, Apr 22, 2021, 7:17 AM Alexey Romanenko 
> wrote:
>
>> Congratulations, well deserved!
>>
>> On 22 Apr 2021, at 10:03, Jan Lukavský  wrote:
>>
>> Congrats Yichi!
>> On 4/22/21 4:58 AM, Ahmet Altay wrote:
>>
>> Congratulations Yichi! 📣📣📣
>>
>> On Wed, Apr 21, 2021 at 6:48 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats Yichi!
>>>
>>> On Wed, Apr 21, 2021 at 6:14 PM Heejong Lee  wrote:
>>>
 Congratulations :)

 On Wed, Apr 21, 2021 at 5:20 PM Tomo Suzuki  wrote:

> Congratulations!
>
> On Wed, Apr 21, 2021 at 7:48 PM Tyson Hamilton 
> wrote:
>
>> Congrats!
>>
>> On Wed, Apr 21, 2021 at 4:37 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Well deserved and congrats, Yichi!
>>>
>>> On Wed, Apr 21, 2021 at 4:23 PM Pablo Estrada 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Yichi Zhang

 Yichi has been working in Beam for a while. He has contributed to
 various areas, including Nexmark tests, test health, Python's streaming
 capabilities, he has answered questions on StackOverflow, and helped 
 with
 release validations, among many other things that Yichi has 
 contributed to
 the Beam community.

 Considering these contributions, the Beam PMC trusts Yichi with the
 responsibilities of a Beam committer.[1]

 Thanks Yichi!
 -P.

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>
>
> --
> Regards,
> Tomo
>

>>


Re: [QUESTION] Dockerized Integration Tests with Java/Gradle

2021-04-22 Thread Brian Hulette
Welcome Evan!

Note we do have some examples in Beam of running IO integration tests
against testcontainers [1] that start up "fakes". We do this for Kafka [2],
Kinesis [3], and there's a test that does this for both Kafka and Pubsub
[4]. Is that the kind of thing you had in mind?

It looks like there is an Elasticsearch testcontainer you could use if
appropriate [5].

Brian

[1] https://www.testcontainers.org/
[2]
https://github.com/apache/beam/blob/93ecc1d3a4b997b2490c4439972ffaf09125299f/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOIT.java#L67
[3]
https://github.com/apache/beam/blob/93ecc1d3a4b997b2490c4439972ffaf09125299f/sdks/java/io/kinesis/src/test/java/org/apache/beam/sdk/io/kinesis/KinesisIOIT.java#L51
[4]
https://github.com/apache/beam/blob/93ecc1d3a4b997b2490c4439972ffaf09125299f/examples/java/src/test/java/org/apache/beam/examples/complete/kafkatopubsub/KafkaToPubsubE2ETest.java#L49-L50
[5] https://www.testcontainers.org/modules/elasticsearch/

On Thu, Apr 22, 2021 at 9:01 AM Alexey Romanenko 
wrote:

> Hi Evan,
>
> Great to hear that you are going to contribute to Beam. Welcome!
>
> For integration tests we mostly use k8s. Did you take a look at the current
> implementation of ITs for ElasticsearchIO (e.g. [1]) and how it runs on
> Jenkins [2]?
> Also, it's perhaps worth mentioning our very good guide about writing ITs for
> Beam [3]
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch-tests/elasticsearch-tests-7/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOIT.java
> [2]
> https://github.com/apache/beam/tree/master/.test-infra/kubernetes/elasticsearch
> [3]
> https://beam.apache.org/documentation/io/testing/#i-o-transform-integration-tests
>
> ---
> Alexey
>
>
>
> On 22 Apr 2021, at 15:54, Evan Galpin  wrote:
>
> Hi folks!
>
> I'm Evan, and I'm fairly new to developing the Beam SDK. I've been a user
> for a number of years and have done some private SDK customizations along
> the way for my day job, but have recently been given the green light to
> contribute back to the OSS repo 🙌 In particular, I've worked with
> ElasticsearchIO the most.
>
> I'm looking ahead at how to potentially revamp the ElasticsearchIO test
> suite. I wanted to double-check if there are any WIP efforts underway so as
> to not duplicate effort or step on toes.
>
> Barring that, I'm thinking that using docker containers to support
> integration testing could be really beneficial. Do we have examples of
> doing just that in the Java SDK at the moment? Do the Jenkins VMs support
> docker?  Looking around briefly outside of Beam it appears there are a few
> options for running Docker containers from gradle [1][2]. Or perhaps
> there's another alternative to Docker+gradle entirely.
>
> Thanks in advance for your advice,
> Evan
>
> [1] https://github.com/bmuschko/gradle-docker-plugin
> [2] https://github.com/palantir/gradle-docker
>
>
>


Re: [VOTE] Release 2.29.0, release candidate #1

2021-04-21 Thread Brian Hulette
+1 (non-binding)

I ran a python pipeline exercising the DataFrame API, and another
exercising SQLTransform in Python, both on Dataflow.

On Wed, Apr 21, 2021 at 12:55 PM Kenneth Knowles  wrote:

> Since the artifacts were changed about 26 hours ago, I intend to leave
> this vote open until 46 hours from now. Specifically, around noon my time
> (US Pacific) on Friday I will close the vote and finalize the release, if
> no problems are discovered.
>
> Kenn
>
> On Wed, Apr 21, 2021 at 12:52 PM Kenneth Knowles  wrote:
>
>> +1 (binding)
>>
>> I ran the script at
>> https://beam.apache.org/contribute/release-guide/#run-validations-using-run_rc_validationsh
>> except for the part that requires a GitHub PR, since Cham already did that
>> part.
>>
>> Kenn
>>
>> On Wed, Apr 21, 2021 at 12:11 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> +1, verified that my previous findings are fixed.
>>>
>>> On Wed, Apr 21, 2021 at 8:17 AM Chamikara Jayalath 
>>> wrote:
>>>
 +1 (binding)

 Ran some Python scenarios and updated the spreadsheet.

 Thanks,
 Cham

 On Tue, Apr 20, 2021 at 3:39 PM Kenneth Knowles 
 wrote:

>
>
> On Tue, Apr 20, 2021 at 3:24 PM Robert Bradshaw 
> wrote:
>
>> The artifacts and signatures look good to me. +1 (binding)
>>
>> (The release branch still has the .dev name, maybe you didn't push?
>> https://github.com/apache/beam/blob/release-2.29.0/sdks/python/apache_beam/version.py
>> )
>>
>
> Good point. I'll highlight that I finally implemented the branching
> changes from
> https://lists.apache.org/thread.html/205472bdaf3c2c5876533750d417c19b0d1078131a3dc04916082ce8%40%3Cdev.beam.apache.org%3E
>
> The new guide with diagram is here:
> https://beam.apache.org/contribute/release-guide/#tag-a-chosen-commit-for-the-rc
>
> TL;DR:
>  - the release branch continues to be dev/SNAPSHOT for 2.29.0 while
> the main branch is now dev/SNAPSHOT for 2.30.0
>  - the RC tag v2.29.0-RC1 no longer lies on the release branch. It is
> a single tagged commit that removes the dev/SNAPSHOT suffix
>
> Kenn
>
>
>> On Tue, Apr 20, 2021 at 10:36 AM Kenneth Knowles 
>> wrote:
>>
>>> Please take another look.
>>>
>>>  - I re-ran the RC creation script so the source release and wheels
>>> are new and built from the RC tag. I confirmed the source zip and wheels
>>> have version 2.29.0 (not .dev or -SNAPSHOT).
>>>  - I fixed and rebuilt Dataflow worker container images from exactly
>>> the RC commit, added dataclasses, with internal changes to get the 
>>> version
>>> to match.
>>>  - I confirmed that the staged jars already have version 2.29.0 (not
>>> -SNAPSHOT).
>>>  - I confirmed with `diff -r -q` that the source tarball matches the
>>> RC tag (minus the .git* files and directories and gradlew)
>>>
>>> Kenn
>>>
>>> On Mon, Apr 19, 2021 at 9:19 PM Kenneth Knowles 
>>> wrote:
>>>
 At this point, the release train has just about come around to
 2.30.0 which will pick up that change. I don't think it makes sense to
 cherry-pick anything more into 2.29.0 unless it is nonfunctional. As 
 it is,
 I think we have a good commit and just need to build the expected
 artifacts. Since it isn't all the artifacts, I was planning on just
 overwriting the RC1 artifacts in question and re-verify. I could also 
 roll
 a new RC2 from the same commit fairly easily.

 Kenn

 On Mon, Apr 19, 2021 at 8:57 PM Reuven Lax 
 wrote:

> Any chance we could include
> https://github.com/apache/beam/pull/14548?
>
> On Mon, Apr 19, 2021 at 8:54 PM Kenneth Knowles 
> wrote:
>
>> To clarify: I am running and fixing the release scripts on the
>> `master` branch. They work from fresh clones of the RC tag so this 
>> should
>> work in most cases. The exception is the GitHub Actions 
>> configuration,
>> which I cherrypicked
>> to the release branch.
>>
>> Kenn
>>
>> On Mon, Apr 19, 2021 at 8:34 PM Kenneth Knowles 
>> wrote:
>>
>>> OK it sounds like I need to re-roll the artifacts in question. I
>>> don't think anything raised here indicates a problem with the tagged
>>> commit, but with the state of the release scripts at the time I 
>>> built the
>>> earlier artifacts.
>>>
>>> On Mon, Apr 19, 2021 at 1:03 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 It looks like the wheels are also versioned "2.29.0.dev".

 Not sure if it's important, but the source tarball also seems
 to contain some release script changes that are not reflected in 

Re: Issues and PR names and descriptions (or should we change the contribution guide)

2021-04-21 Thread Brian Hulette
I'd argue that the history is almost always "most useful" when one PR ==
one commit on master. Intermediate commits from a PR may be useful to aid
code review, but they're not verified by presubmits and thus aren't
necessarily independently revertible, so I see little value in keeping them
around on master. In fact if you're breaking up a PR into multiple commits
to aid code review, it's worth considering if they could/should be
separately reviewed and verified PRs.
We could solve the unwanted commit issue if we have a policy to always
"Squash and Merge" PRs with rare exceptions.

I agree jira/PR titles could be better, but I'm not sure what we can do about
it aside from reminding committers of this responsibility. Perhaps the
triage process can help catch poorly titled jiras?

On Wed, Apr 21, 2021 at 11:38 AM Robert Bradshaw 
wrote:

> +1 to better descriptions for JIRA (and PRs). Thanks for bringing this up.
>
> For merging unwanted commits, can we automate a simple check (e.g. with
> github actions)?
>
> On Wed, Apr 21, 2021 at 8:00 AM Tomo Suzuki  wrote:
>
>> BEAM-12173 is on me. I'm sorry about that. Re-reading committer guide
>> [1], I see I was not following this
>>
>> > The reviewer should give the LGTM and then request that the author of
>> the pull request rebase, squash, split, etc, the commits, so that the
>> history is most useful
>>
>>
>> Thank you for the feedback on this matter! (And I don't think we
>> should change the contribution guide)
>>
>> [1] https://beam.apache.org/contribute/committer-guide/
>>
>> On Wed, Apr 21, 2021 at 10:35 AM Ismaël Mejía  wrote:
>> >
>> > Hello,
>> >
>> > I have noticed an ongoing pattern of carelessness around issues/PR
>> titles and
>> > descriptions. It is really painful to see more and more examples like:
>> >
>> > BEAM-12160 Add TODO for fixing the warning
>> > BEAM-12165 Fix ParquetIO
>> > BEAM-12173 avoid intermediate conversion (PR) and BEAM-12173 use
>> > toMinutes (commit)
>> >
>> > In all these cases, just a bit of detail in the title would be enough to
>> > make other contributors' and reviewers' lives easier, as well as to have a
>> > better project history. What astonishes me, apart from the lack of care, is
>> > that some of those are from Beam committers.
>> >
>> > We already have discussed about not paying attention during commit
>> merges where
>> > some PRs end up merging tons of 'unwanted' fixup commits, and nothing
>> has
>> > changed so I am wondering if we should maybe just totally remove that
>> rule (for
>> > commits) and also eventually for titles and descriptions.
>> >
>> > Ismaël
>> >
>> > [1] https://beam.apache.org/contribute/
>>
>>
>>
>> --
>> Regards,
>> Tomo
>>
>


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Brian Hulette
Yeah I can see the advantage in tooling like this for easy upgrades. I
suspect many of the outdated Python dependencies fall under this category,
but the toil of creating a PR and verifying it passes tests is enough of a
barrier that we just haven't done it. Having a bot create the PR and
trigger CI to verify it would be helpful IMO.

Some questions/concerns I have:
- I think many python upgrades will still require manual work:
  - We also have pinned versions for some Python dependencies in
base_image_requirements.txt [1]
  - We test with multiple major versions of pyarrow. We'd want to add a new
test environment [2] when bumping to the next major version
- Will dependabot work ok with the version ranges that we specify? For
example some Python dependencies have upper bounds for the next major
version, some for the next minor version. Is dependabot smart enough to try
bumping the appropriate version number?

Brian

[1]
https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt

[2]
https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/python/tox.ini#L231

On Fri, Apr 16, 2021 at 7:16 AM Ismaël Mejía  wrote:

> Oh forgot to mention one alternative that we do in the Avro project,
> it is that we don't create issues for the dependabot PRs and then we
> search all the commits authored by dependabot and include them in the
> release notes to track dependency upgrades.
>
> On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
> >
> > > Quite often, dependency upgrade to latest versions leads to either
> compilation errors or failed tests and it should be resolved manually or
> declined. Having this, maybe I miss something, but I don’t see what kind of
> advantages automatic upgrade will bring to us except that we don’t need to
> create a PR manually (which is not a big deal).
> >
> > The advantage is exactly that, that we don't have to create and track
> > dependency updates manually, it will be done by the bot and we will
> > only have to do the review and guarantee that no issues are
> > introduced. I forgot to mention but we can create exception rules so
> > no further upgrades will be proposed for some dependencies e.g.
> > Hadoop, Netty (Java 11 flavor) etc. I forgot to mention another big
> > advantage that is the detailed security report that will help us
> > prioritize dependency upgrades.
> >
> > > Regarding another issue - it’s already a problem, imho. Since we have
> a one Jira per package upgrade now and usually it “accumulates” all package
> upgrades and it’s not closed once upgrade is done, we don’t have a reliable
> way to notify in release notes about all dependency upgrades for current
> release. One of the ways is to mention the package upgrade in CHANGES.md,
> which seems not very reliable because it's quite easy to forget to do. I’d
> prefer to have a dedicated Jira issue for every upgrade and it will be
> included in the release notes almost automatically.
> >
> > Yes it seems the best for release note tracking to create the issue
> > and rename the PR title for this, but that would be part of the
> > review/merge process, so up to the Beam committers to do it
> > systematically and given how well we respect the commit naming /
> > squashing rules I am not sure if we will win much by having another
> > useless rule.
> >
> > On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
> >  wrote:
> > >
> > > Quite often, upgrading dependencies to the latest versions leads to either
> compilation errors or failed tests that have to be resolved manually, or the
> upgrade gets declined. Given this, and maybe I am missing something, I don’t
> see what advantages automatic upgrades will bring us except that we don’t
> need to create a PR manually (which is not a big deal).
> > >
> > > Regarding another issue - it’s already a problem, imho. Since we have
> one Jira per package upgrade now, and it usually “accumulates” all package
> upgrades and is not closed once an upgrade is done, we don’t have a reliable
> way to list all dependency upgrades for the current release in the release
> notes. One way is to mention the package upgrade in CHANGES.md, which seems
> not very reliable because it's quite easy to forget to do. I’d prefer to
> have a dedicated Jira issue for every upgrade so that it gets included in
> the release notes almost automatically.
> > >
> > > > On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> > > >
> > > > Hello,
> > > >
> > > > GitHub has a bot, called dependabot, that automatically creates
> > > > dependency update PRs and reports security issues.
> > > >
> > > > I was wondering if we should enable it for Beam. I tested it in my
> > > > personal Beam fork and it seems to be working well, it created
> > > > dependency updates for both Python and JS (website) dependencies.
> > > > The bot seems to have trouble understanding our Gradle
> > > > dependency definitions for Java, but that's something we can address
> in
> > > > the future to benefit of th

Re: BEAM - 12016

2021-04-15 Thread Brian Hulette
Hi Mike,

Welcome to the project! I'm happy to provide some pointers on this.
BEAM-12016 involves adding a couple of new operations to the Beam DataFrame
API, so it would be helpful to read up on that to get the necessary
background info. Some useful resources include user facing documentation
[1,2], and Robert's original design doc [3]. Also the talk that Robert and
I gave at Beam Summit 2020 [4,5] introduces the API from the user
perspective and provides some implementation details (note we also talk
about SqlTransform there which isn't relevant for this jira).

After that all that's left is actually writing the tests and the code!
Note that our goal with the Beam DataFrame API is to exactly match what the
pandas library does, so most of our tests take the form:
- Perform some operation with pandas
- Perform the same operation with the Beam DataFrame API
- Verify the results are equivalent
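For example, once add_prefix exists, a minimal sketch of that pattern could
look like the following (illustrative only -- the real tests in
frames_test.py use shared helpers and proper assertions rather than print):

import pandas as pd
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

source = pd.DataFrame({'val': [1, 2, 3]})

# 1. Perform the operation with plain pandas.
expected = source.add_prefix('col_')

# 2. Perform the same operation with the Beam DataFrame API.
with beam.Pipeline() as p:
    beam_df = to_dataframe(
        p | beam.Create([beam.Row(val=1), beam.Row(val=2), beam.Row(val=3)]))
    result = beam_df.add_prefix('col_')
    # 3. Verify the results are equivalent (just printed here; frames_test.py
    #    compares against the pandas result).
    _ = to_pcollection(result) | beam.Map(print)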

One place we do this is in pandas_doctests_test.py [6], which follows that
pattern with all of the examples from the pandas documentation (see the
add_prefix examples here [7]).
Currently we've explicitly skipped those examples in
pandas_doctests_test.py [8]. So a great first step would be to just remove
those two lines and run pandas_doctest_test.py to see how it fails.
We also have a lot of tests that follow this pattern in frames_test.py [9].
Those are generally easier to debug, so if it's helpful you may want to add
some test-cases for add_suffix and/or add_prefix there.

Now for the code. In order to make the tests pass you'll need to add an
implementation for add_prefix and add_suffix in frames.py [10].
In the DeferredDataFrame case these two operations will be "elementwise"
methods, which we have a special helper for - you can see how it's used for
abs here [11].
The DeferredSeries case is more complicated because these operations modify
the index of the Series, so we need to note that the operations do *not*
preserve any partitioning by Index; reorder_levels is an example of another
operation like that [12].
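A quick pandas-only illustration of why the Series case touches the index
(based on pandas' documented behavior; double-check against the pandas docs):

import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])
print(s.add_prefix('item_'))
# The index becomes ['item_a', 'item_b'], so any partitioning keyed on the
# old index values can no longer be assumed to hold.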

The partitioning concept can be a little tough to wrap your head around at
first, I recently wrote up some documentation that could be helpful there
[13].

I hope that helps! Please don't hesitate to reach out with more questions,
I know there's a lot to take in here.

Brian

Background:
[1] https://beam.apache.org/blog/dataframe-api-preview-available/
[2] https://beam.apache.org/documentation/dsls/dataframes/overview/
[3] https://s.apache.org/beam-dataframes
[4] https://s.apache.org/simpler-python-pipelines-2020
[5] https://2020.beamsummit.org/sessions/simpler-python-pipelines/

Code pointers:
[6]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/pandas_doctests_test.py
[7]
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.add_prefix.html
[8]
https://github.com/apache/beam/blob/437bdeef302cb30abad36db86aef4e89c12eadd4/sdks/python/apache_beam/dataframe/pandas_doctests_test.py#L79-L80
[9]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/frames_test.py
[10]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/dataframe/frames.py
[11]
https://github.com/apache/beam/blob/8122b33c164f131a5e9c14e740d0169c594727d3/sdks/python/apache_beam/dataframe/frames.py#L301
[12]
https://github.com/apache/beam/blob/8122b33c164f131a5e9c14e740d0169c594727d3/sdks/python/apache_beam/dataframe/frames.py#L513
[13]
https://github.com/apache/beam/blob/8122b33c164f131a5e9c14e740d0169c594727d3/sdks/python/apache_beam/dataframe/expressions.py#L156

On Thu, Apr 15, 2021 at 1:44 PM Rogelio Miguel Hernández Sandoval <
rogelio.hernan...@wizeline.com> wrote:

> Hi team,
> I'm new to the project and I wanted to start by taking the task BEAM-12016
> . Do you have any info
> or suggestions for its implementation/analysis?
>
> Thank you
> --
>
> Mike
>


Re: Codecov Bash Uploader Security Notice

2021-04-15 Thread Brian Hulette
I also got this email, it stated "Unfortunately, we can confirm that you
were impacted by this security event," but it didn't specify _how_ I
was impacted. I assumed it was through Beam, but perhaps it was through
Arrow. It looks like they use the Bash uploader [1].

The codecov notice states:
> The Bash Uploader is also used in these related uploaders:
Codecov-actions uploader for Github, the Codecov CircleCl Orb, and the
Codecov Bitrise Step (together, the “Bash Uploaders”). Therefore, these
related uploaders were also impacted by this event.

Which would seem to confirm the Python codecov tool is not impacted.

[1]
https://github.com/apache/arrow/blob/13c334e976f09d4d896c26d4b5f470e36a46572b/.github/workflows/rust.yml#L337





On Thu, Apr 15, 2021 at 12:50 PM Pablo Estrada  wrote:

> I believe that the utility that we use is the Python codecov tool[1], not
> the bash uploader[2].
> Specifically, the upload seems to happen in Python here[3].
>
> Why do I think we use the Python tool? Because it seems to be installed by
> tox around the link Udi shared[4]
>
> So it seems we're okay?
>
>
> [1] https://github.com/codecov/codecov-python
> [2] https://docs.codecov.io/docs/about-the-codecov-bash-uploader
> [3]
> https://github.com/codecov/codecov-python/blob/158a38eed7fd6f0d2f9c9f4c5258ab1f244b6e13/codecov/__init__.py#L1129-L1157
> [4]
> https://github.com/apache/beam/blob/39923d8f843ecfd3d89443dccc359c14aea8f26f/sdks/python/tox.ini#L105
>
>
> On Thu, Apr 15, 2021 at 11:38 AM Udi Meiri  wrote:
>
>> From the notice: "We strongly recommend affected users immediately
>> re-roll all of their credentials, tokens, or keys located in the
>> environment variables in their CI processes that used one of Codecov’s Bash
>> Uploaders."
>>
>>
>> On Thu, Apr 15, 2021 at 11:35 AM Udi Meiri  wrote:
>>
>>> I got this email: https://about.codecov.io/security-update/
>>>
>>> This is where we use codecov:
>>>
>>> https://github.com/apache/beam/blob/39923d8f843ecfd3d89443dccc359c14aea8f26f/sdks/python/tox.ini#L105
>>>
>>> I'm not sure if this runs the "bash uploader", but we do set
>>> a CODECOV_TOKEN environment variable.
>>>
>>


Re: protoc issues in docker container

2021-04-15 Thread Brian Hulette
Hm I thought we tested out docker builds within the development container.
Maybe this only happens for certain docker setups. Could you file a jira
for this?

I came across this GitHub issue [1] that cites a similar problem, but sadly
doesn't reveal any details about the root cause or a solution.

Brian

[1] https://github.com/google/protobuf-gradle-plugin/issues/165

On Fri, Apr 9, 2021 at 3:33 PM Evan Galpin  wrote:

> I ran into this same issue recently as well. It looks like you may have
> already found the same, but I can say that adding execute permissions did
> fix the issue for me.
>
> If I recall, adding execute perms led to another very similar issue where
> an additional exe required the same fix. Once applied to both, everything
> seemed to get along fine in my case.
>
> Thanks,
> Evan
>
> On Fri, Apr 9, 2021 at 18:28 Elliotte Rusty Harold 
> wrote:
>
>> I tried to run some tests using the Docker build environment (neat
>> idea by the way) and hit this:
>>
>> $ ./gradlew -p sdks/java/io/cassandra check
>> Starting a Gradle Daemon, 1 incompatible and 1 stopped Daemons could
>> not be reused, use --status for details
>> Configuration on demand is an incubating feature.
>> > Task :model:pipeline:generateProto FAILED
>>
>> FAILURE: Build failed with an exception.
>>
>> * What went wrong:
>> Execution failed for task ':model:pipeline:generateProto'.
>> > java.io.IOException: Cannot run program
>> "/home/elharo/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protoc/3.15.3/8fb78f9edc8192143d947edc2217aafaa5d5f79b/protoc-3.15.3-linux-x86_64.exe":
>> error=13, Permission denied
>>
>> I'm not sure immediately what's wrong, but perhaps the Docker
>> container isn't configured quite right? There could be more config I
>> need to do, but I thought this would be set up out of the box in the
>> Docker image. I do see that protoc does not have execute permission:
>>
>> $ ls -l
>> /home/elharo/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protoc/3.15.3/8fb78f9edc8192143d947edc2217aafaa5d5f79b/
>> total 5120
>> -rw-r--r-- 1 elharo users 5241944 Apr  9 22:17
>> protoc-3.15.3-linux-x86_64.exe
>>
>>
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>


Re: P1 issues report

2021-04-14 Thread Brian Hulette
+1 I think this is useful. Given the expectation set at [1], these should
be getting attention or deprioritized. Maybe one day we'll get to a point
where the report is usually empty :)
I think it is having an effect - we've gone down from 57 open P1s in the
first report (3/20) to 34 today (4/14) [2]

Brian

[1] https://beam.apache.org/contribute/jira-priorities/#p1-critical
[2]
https://lists.apache.org/thread.html/raea73b07915d456d54fa189e1d14426430630f40ded077a9f296c26b%40%3Cdev.beam.apache.org%3E

On Wed, Apr 14, 2021 at 4:33 PM Pablo Estrada  wrote:

> I personally think this is good to have.
>
> On Wed, Apr 14, 2021 at 4:08 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> I want to ask: is this automation useful?
>>
>> Every few days I choose a couple of these and investigate. Does anyone
>> else do something like this?
>>
>> Especially the old ones, I expect either we should resolve it, or declare
>> it "not a bug", or lower the priority.
>>
>> Kenn
>>
>> On Wed, Apr 14, 2021 at 3:52 PM Beam Jira Bot  wrote:
>>
>>> This is your daily summary of Beam's current P1 issues, not including
>>> flaky tests.
>>>
>>> See https://beam.apache.org/contribute/jira-priorities/#p1-critical for
>>> the meaning and expectations around P1 issues.
>>>
>>> BEAM-12170: [beam_PostCommit_Java_PVR_Spark_Batch] Multiple entries
>>> with same key (https://issues.apache.org/jira/browse/BEAM-12170)
>>> BEAM-11959: Python Beam SDK Harness hangs when installing pip
>>> packages (https://issues.apache.org/jira/browse/BEAM-11959)
>>> BEAM-11906: No trigger early repeatedly for session windows (
>>> https://issues.apache.org/jira/browse/BEAM-11906)
>>> BEAM-11875: XmlIO.Read does not handle XML encoding per spec (
>>> https://issues.apache.org/jira/browse/BEAM-11875)
>>> BEAM-11828: JmsIO is not acknowledging messages correctly (
>>> https://issues.apache.org/jira/browse/BEAM-11828)
>>> BEAM-11772: GCP BigQuery sink (file loads) uses runner determined
>>> sharding for unbounded data (
>>> https://issues.apache.org/jira/browse/BEAM-11772)
>>> BEAM-11755: Cross-language consistency (RequiresStableInputs) is
>>> quietly broken (at least on portable flink runner) (
>>> https://issues.apache.org/jira/browse/BEAM-11755)
>>> BEAM-11578: `dataflow_metrics` (python) fails with TypeError (when
>>> int overflowing?) (https://issues.apache.org/jira/browse/BEAM-11578)
>>> BEAM-11576: Go ValidatesRunner failure: TestFlattenDup on Dataflow
>>> Runner (https://issues.apache.org/jira/browse/BEAM-11576)
>>> BEAM-11434: Expose Spanner admin/batch clients in Spanner Accessor (
>>> https://issues.apache.org/jira/browse/BEAM-11434)
>>> BEAM-11227: Upgrade beam-vendor-grpc-1_26_0-0.3 to fix
>>> CVE-2020-27216 (https://issues.apache.org/jira/browse/BEAM-11227)
>>> BEAM-11148: Kafka commitOffsetsInFinalize OOM on Flink (
>>> https://issues.apache.org/jira/browse/BEAM-11148)
>>> BEAM-11017: Timer with dataflow runner can be set multiple times
>>> (dataflow runner) (https://issues.apache.org/jira/browse/BEAM-11017)
>>> BEAM-10861: Adds URNs and payloads to PubSub transforms (
>>> https://issues.apache.org/jira/browse/BEAM-10861)
>>> BEAM-10617: python CombineGlobally().with_fanout() cause duplicate
>>> combine results for sliding windows (
>>> https://issues.apache.org/jira/browse/BEAM-10617)
>>> BEAM-10573: CSV files are loaded several times if they are too large
>>> (https://issues.apache.org/jira/browse/BEAM-10573)
>>> BEAM-10569: SpannerIO tests don't actually assert anything. (
>>> https://issues.apache.org/jira/browse/BEAM-10569)
>>> BEAM-10288: Quickstart documents are out of date (
>>> https://issues.apache.org/jira/browse/BEAM-10288)
>>> BEAM-10244: Populate requirements cache fails on poetry-based
>>> packages (https://issues.apache.org/jira/browse/BEAM-10244)
>>> BEAM-10100: FileIO writeDynamic with AvroIO.sink not writing all
>>> data (https://issues.apache.org/jira/browse/BEAM-10100)
>>> BEAM-9917: BigQueryBatchFileLoads dynamic destination (
>>> https://issues.apache.org/jira/browse/BEAM-9917)
>>> BEAM-9564: Remove insecure ssl options from MongoDBIO (
>>> https://issues.apache.org/jira/browse/BEAM-9564)
>>> BEAM-9455: Environment-sensitive provisioning for Dataflow (
>>> https://issues.apache.org/jira/browse/BEAM-9455)
>>> BEAM-9293: Python direct runner doesn't emit empty pane when it
>>> should (https://issues.apache.org/jira/browse/BEAM-9293)
>>> BEAM-9154: Move Chicago Taxi Example to Python 3 (
>>> https://issues.apache.org/jira/browse/BEAM-9154)
>>> BEAM-8986: SortValues may not work correct for numerical types (
>>> https://issues.apache.org/jira/browse/BEAM-8986)
>>> BEAM-8985: SortValues should fail if SecondaryKey coder is not
>>> deterministic (https://issues.apache.org/jira/browse/BEAM-8985)
>>> BEAM-8407: [SQL] Some Hive tests throw NullPointerException, but get
>>> marked as passing (Direct Runner) (
>>> https://issues.apach

Re: Long term support versions of Beam Java

2021-04-12 Thread Brian Hulette
> Beam is also multi language, which adjusts concerns. How does GRPC handle
that? Or protos? (I'm sure there are other examples we can pull from...)

I'm not sure those are good projects to look at. They're likely much more
concerned with binary compatibility and there's probably not much change in
the user-facing API.
Arrow is another multi-language project but I don't think we can learn much
from its versioning policy [1], which is much more concerned with binary
compatibility than it is with API compatibility (for now). Perhaps one
lesson is that they track a separate format version and API version. We
could do something similar and have a separate version number for the Beam
model protos. I'm not sure if that's relevant for this discussion or not.

Spark may be a reasonable comparison since it provides an API in multiple
languages, but that doesn't seem to have any bearing on its versioning
policy [2]. It sounds similar to Flink in that every minor release gets
backported bugfixes for 18 months, but releases are slower (~6 months) so
that's not as much of a burden.

Brian

[1]
https://arrow.apache.org/docs/format/Versioning.html#backward-compatibility
[2] https://spark.apache.org/versioning-policy.html

On Thu, Apr 8, 2021 at 1:18 PM Robert Bradshaw  wrote:

> Python (again a language) has a slower release cycle, fairly strict
> backwards compatibility stance (with the ability to opt-in before changes
> become the default) and clear ownership for maintenance of each minor
> version until end-of-life (so each could be considered an LTS release).
> https://devguide.python.org/devcycle/
>
> Cython is more similar to Beam: best-effort compatibility, no LTS, but as
> a code generator rather than a runtime library, a developer is mostly free
> to upgrade at their own cadence regardless of the surrounding
> ecosystem (including downstream projects that take them on as a
> dependency).
>
> IIRC, Flink supports the latest N (3?) releases, which are infrequent
> enough to cover about 12-18 months.
>
> My take is that Beam should be supportive of LTS releases, but we're not
> in a position to commit to it (to the same level we commit to the 6-week
> cut-from-head release cycle). But certain users of Beam (which have a large
> overlap with the Beam community) could make such commitments as it helps
> them (directly or indirectly). Let's give it a try.
>
>
> On Thu, Apr 8, 2021 at 1:00 PM Robert Burke  wrote:
>
>> I don't know about other Apache projects but the Go Programming Language
>> uses a slower release cadence, two releases a year. Only the last two
>> releases are maintained with backports, except for significant security
>> issues, which are backported as far back as possible.
>>
>> Further they divide each release period into clear 3 month "commit" and "
>> fix" periods, with the main tree only open to new features in the commit
>> phase with the fix phase largely frozen outside of discovered issues.
>>
>> Beam isn't entirely the same thing as a Programming Language, and
>> certainly hasn't had such a rigorous commitment to forward compatibility as
>> Go has. There's been a very strong expectation that Go programs must
>> continue to compile, and function as expected through new versions of Go,
>> and it's standard libraries (and to a lesser extent, arbitrary Go
>> packages). But these tenets allow the Go maintainers and contributors to
>> avoid LTS as a concept to worry about.
>>
>> The way I see it, it's not necessarily the number of releases that's the
>> problem exactly, but the overall stability of the system. More frequent
>> releases "feel" less stable than fewer, even if most of them are
>> compatible on a code level. This will vary with the way language ecosystems
>> handle dependencies though.
>>
>> Beam is also multi language, which adjusts concerns. How does GRPC handle
>> that? Or protos? (I'm sure there are other examples we can pull from...)
>>
>>
>> On Thu, Apr 8, 2021, 9:31 AM Kenneth Knowles  wrote:
>>
>>> The other obvious question, which I often ask and very rarely research:
>>> what do other similar projects do? Does anyone know?
>>>
>>> On Thu, Apr 8, 2021 at 9:30 AM Kenneth Knowles  wrote:
>>>
 I want to send a short summary of my views. There is a
 tension/contradiction that I think is OK:

  - An LTS series becomes more valuable when more stakeholders care for
 it. For those who don't care about LTS, the main thing would be to
 highlight major issues that are worth backporting. So having it part of how
 we think about things is helpful, even for the people who are not using it
 or supporting users of it. And cherrypicks are very easy to open. So I
 support building consensus around this.

  - Anyone can volunteer and perform a point release for any past
 branch. IMO if a stakeholder proposed to do the work, we (the Beam
 community) should welcome it. Certainly we have no ability to make anyone
 else do the point releases or d

[PROPOSAL] Remove pylint format checks

2021-04-09 Thread Brian Hulette
Currently we have two different format checks for the Python SDK. Most
format checks are handled by yapf, which is nice since it is also capable
of re-writing the code to make it pass the checks done in CI. However we
*also* have some formatting checks still enabled in our .pylintrc [1], and
pylint has no such capability.

Generally yapf's output just passes these pylint format checks, but not
always. For example yapf is lenient about lines over the column limit, and
pylint is not. So things like [2] can happen even on a PR formatted by
yapf. This is frustrating because it requires manual changes.
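A typical offender looks like this (hypothetical line): yapf will not break a
long string literal, so the line stays over the column limit and pylint's
line-too-long check fails even though yapf considers the file formatted.

_ERROR_MESSAGE = 'Unable to parse the provided pipeline options; please double-check the --runner and --project flags.'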

I experimented with the yapf config to see if we can make it strict about
the column limit, but it doesn't seem to be possible. So instead I'd like
to propose that we just remove the pylint format checks, and rely on yapf's
checks alone.

There are a couple issues here:
- we'd need to be ok with yapf deciding that some lines can be >80
characters
- yapf has no opinion whatsoever about docstrings [3], so the only thing
checking them is pylint. We might work around this by setting up
docformatter [4].

Personally I'm ok with this if it means Python code formatting can be
completely automated with a single script that runs yapf, docformatter, and
isort.
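Roughly something like this (a sketch only -- the paths and flags are
assumptions, and in practice we would probably wire it up through tox):

import subprocess

for cmd in (
    ['isort', 'apache_beam'],
    ['yapf', '--in-place', '--recursive', 'apache_beam'],
    ['docformatter', '--in-place', '--recursive', 'apache_beam'],
):
    subprocess.run(cmd, check=True)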

Brian

[1]
https://github.com/apache/beam/blob/2408d0c11337b45e289736d4d7483868e717760c/sdks/python/.pylintrc#L165
[2]
https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/9088/console
[3] https://github.com/google/yapf/issues/279
[4] https://github.com/myint/docformatter


Re: [DISCUSS] Include inherited members in Python API Docs?

2021-04-08 Thread Brian Hulette
Yep! I'm working on tools to generate the API docs here [1]. For the most
part we'll just populate these with the pydoc from pandas, but the tooling
will give us the ability to include a note about anything that isn't
implemented in our API yet, for example operations that are sensitive to
order.

[1] https://github.com/apache/beam/pull/14382

On Thu, Apr 8, 2021 at 2:30 PM Pablo Estrada  wrote:

> Looking at [1] I see all the methods on the DF but not the pydoc. Will the
> Pydoc be included later somehow?
> Best
> -P.
>
> [1]
> https://theneuralbit.github.io/beam-site/pydoc/inherited-members/apache_beam.dataframe.frames.html
>
> On Thu, Apr 8, 2021 at 1:05 PM Brian Hulette  wrote:
>
>> I found a slightly hacky way to enable :inherited-members: just for the
>> DataFrame API. I can add the option to the .rst output generated by
>> sphinx-apidoc, before we run sphinx-build [1].
>>
>> I'm fine just doing that instead of turning it on globally.
>>
>> [1]
>> https://github.com/TheNeuralBit/beam/blob/e26760937f7a34fd72578b65f716098c74e4380b/sdks/python/scripts/generate_pydoc.sh#L86
>>
>> On Tue, Apr 6, 2021 at 1:50 PM Brian Hulette  wrote:
>>
>>> Sure, I can try cutting out PTransform.
>>>
>>> We could also look into reducing noise by:
>>> - removing undoc-members from the config [1] (this would make it so only
>>> objects with a docstring are added to the generated docs)
>>> - adding :meta private:` to docstrings for objects we don't want
>>> publicly visible
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/python/scripts/generate_pydoc.sh#L48
>>>
>>> On Tue, Apr 6, 2021 at 1:17 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> Way too many things are inherited from PTransform, can we at least cut
>>>> that out?
>>>>
>>>> On Tue, Apr 6, 2021 at 1:09 PM Brian Hulette 
>>>> wrote:
>>>>
>>>>> Just wanted to bump this - does anyone have concerns with the way the
>>>>> API docs look when inherited members are included?
>>>>>
>>>>> On Wed, Mar 31, 2021 at 5:23 PM Brian Hulette 
>>>>> wrote:
>>>>>
>>>>>> I staged my current working copy built from head here [1], see
>>>>>> CombinePerKey here [2]. Note it also has a few other changes, most 
>>>>>> notably
>>>>>> I excluded several internal-only modules that are currently in our API 
>>>>>> docs
>>>>>> (I will PR this soon regardless).
>>>>>>
>>>>>> > are these inherited members grouped in such a way that it makes it
>>>>>> easy to ignore them once they get to "low" in the stack?
>>>>>> There doesn't seem to be any grouping, but it does look like
>>>>>> inherited members are added at the end.
>>>>>>
>>>>>> > If it can't be per-module, is there a "nice" set of ancestors to
>>>>>> avoid (as it seems this option takes such an argument).
>>>>>> Ah good point, I missed this. I suppose we could avoid basic
>>>>>> constructs like PTransform, DoFn, etc. I'm not sure how realistic that is
>>>>>> though. It would be nice if this argument worked the other way
>>>>>>
>>>>>> [1] https://theneuralbit.github.io/beam-site/pydoc/inherited-members
>>>>>> [2]
>>>>>> https://theneuralbit.github.io/beam-site/pydoc/inherited-members/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey
>>>>>>
>>>>>> On Wed, Mar 31, 2021 at 4:45 PM Robert Bradshaw 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to an example. In particular, are these inherited members grouped
>>>>>>> in such a way that it makes it easy to ignore them once they get to 
>>>>>>> "low"
>>>>>>> in the stack? If it can't be per-module, is there a "nice" set of 
>>>>>>> ancestors
>>>>>>> to avoid (as it seems this option takes such an argument).
>>>>>>>
>>>>>>> On Wed, Mar 31, 2021 at 4:23 PM Pablo Estrada 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Do you have an example of what it would look like when released?
>>>>>>>>
>

Re: [DISCUSS] Include inherited members in Python API Docs?

2021-04-08 Thread Brian Hulette
I found a slightly hacky way to enable :inherited-members: just for the
DataFrame API. I can add the option to the .rst output generated by
sphinx-apidoc, before we run sphinx-build [1].
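Roughly, the post-processing amounts to something like this (a sketch that
assumes we patch the generated .rst in place and that the apidoc output lands
under target/docs/source; the actual script may do this differently, e.g.
with sed):

import pathlib

rst = pathlib.Path('target/docs/source/apache_beam.dataframe.frames.rst')
text = rst.read_text()
# Append the option to the automodule directive's existing options.
rst.write_text(text.replace(':members:', ':members:\n   :inherited-members:'))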

I'm fine just doing that instead of turning it on globally.

[1]
https://github.com/TheNeuralBit/beam/blob/e26760937f7a34fd72578b65f716098c74e4380b/sdks/python/scripts/generate_pydoc.sh#L86

On Tue, Apr 6, 2021 at 1:50 PM Brian Hulette  wrote:

> Sure, I can try cutting out PTransform.
>
> We could also look into reducing noise by:
> - removing undoc-members from the config [1] (this would make it so only
> objects with a docstring are added to the generated docs)
> - adding :meta private:` to docstrings for objects we don't want publicly
> visible
>
> [1]
> https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/python/scripts/generate_pydoc.sh#L48
>
> On Tue, Apr 6, 2021 at 1:17 PM Robert Bradshaw 
> wrote:
>
>> Way too many things are inherited from PTransform, can we at least cut
>> that out?
>>
>> On Tue, Apr 6, 2021 at 1:09 PM Brian Hulette  wrote:
>>
>>> Just wanted to bump this - does anyone have concerns with the way the
>>> API docs look when inherited members are included?
>>>
>>> On Wed, Mar 31, 2021 at 5:23 PM Brian Hulette 
>>> wrote:
>>>
>>>> I staged my current working copy built from head here [1], see
>>>> CombinePerKey here [2]. Note it also has a few other changes, most notably
>>>> I excluded several internal-only modules that are currently in our API docs
>>>> (I will PR this soon regardless).
>>>>
>>>> > are these inherited members grouped in such a way that it makes it
>>>> easy to ignore them once they get to "low" in the stack?
>>>> There doesn't seem to be any grouping, but it does look like inherited
>>>> members are added at the end.
>>>>
>>>> > If it can't be per-module, is there a "nice" set of ancestors to
>>>> avoid (as it seems this option takes such an argument).
>>>> Ah good point, I missed this. I suppose we could avoid basic constructs
>>>> like PTransform, DoFn, etc. I'm not sure how realistic that is though. It
>>>> would be nice if this argument worked the other way
>>>>
>>>> [1] https://theneuralbit.github.io/beam-site/pydoc/inherited-members
>>>> [2]
>>>> https://theneuralbit.github.io/beam-site/pydoc/inherited-members/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey
>>>>
>>>> On Wed, Mar 31, 2021 at 4:45 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> +1 to an example. In particular, are these inherited members grouped
>>>>> in such a way that it makes it easy to ignore them once they get to "low"
>>>>> in the stack? If it can't be per-module, is there a "nice" set of 
>>>>> ancestors
>>>>> to avoid (as it seems this option takes such an argument).
>>>>>
>>>>> On Wed, Mar 31, 2021 at 4:23 PM Pablo Estrada 
>>>>> wrote:
>>>>>
>>>>>> Do you have an example of what it would look like when released?
>>>>>>
>>>>>> On Wed, Mar 31, 2021 at 4:16 PM Brian Hulette 
>>>>>> wrote:
>>>>>>
>>>>>>> I'm working on generating useful API docs for the DataFrame API
>>>>>>> (BEAM-12074). In doing so, one thing I've found would be very helpful 
>>>>>>> is if
>>>>>>> we could include docstrings for inherited members in the API docs. That 
>>>>>>> way
>>>>>>> docstrings for operations defined in DeferredDataFrameOrSeries [1], 
>>>>>>> will be
>>>>>>> propagated to DeferredDataFrame [2] and DeferredSeries, and the former 
>>>>>>> can
>>>>>>> be hidden entirely. This would be more consistent with the pandas
>>>>>>> documentation [3].
>>>>>>>
>>>>>>> It looks like we can do this by specifying :inherited-members: [4],
>>>>>>> but this will apply to _all_ of our API docs, there doesn't seem to be a
>>>>>>> way to restrict it to a particular module. This seems generally useful 
>>>>>>> to
>>>>>>> me, but it would be a significant change, so I wanted to see if there 
>>>>>>

Re: [DISCUSS] Include inherited members in Python API Docs?

2021-04-06 Thread Brian Hulette
Sure, I can try cutting out PTransform.

We could also look into reducing noise by:
- removing undoc-members from the config [1] (this would make it so only
objects with a docstring are added to the generated docs)
- adding `:meta private:` to docstrings for objects we don't want publicly
visible

[1]
https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/python/scripts/generate_pydoc.sh#L48
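For the second bullet, the marker would look roughly like this (assuming a
recent enough Sphinx for :meta private:; the class here is purely
hypothetical):

class _ExpressionCache(object):
    """Internal bookkeeping for deferred expressions.

    Not part of the public API.

    :meta private:
    """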

On Tue, Apr 6, 2021 at 1:17 PM Robert Bradshaw  wrote:

> Way too many things are inherited from PTransform, can we at least cut
> that out?
>
> On Tue, Apr 6, 2021 at 1:09 PM Brian Hulette  wrote:
>
>> Just wanted to bump this - does anyone have concerns with the way the API
>> docs look when inherited members are included?
>>
>> On Wed, Mar 31, 2021 at 5:23 PM Brian Hulette 
>> wrote:
>>
>>> I staged my current working copy built from head here [1], see
>>> CombinePerKey here [2]. Note it also has a few other changes, most notably
>>> I excluded several internal-only modules that are currently in our API docs
>>> (I will PR this soon regardless).
>>>
>>> > are these inherited members grouped in such a way that it makes it
>>> easy to ignore them once they get to "low" in the stack?
>>> There doesn't seem to be any grouping, but it does look like inherited
>>> members are added at the end.
>>>
>>> > If it can't be per-module, is there a "nice" set of ancestors to avoid
>>> (as it seems this option takes such an argument).
>>> Ah good point, I missed this. I suppose we could avoid basic constructs
>>> like PTransform, DoFn, etc. I'm not sure how realistic that is though. It
>>> would be nice if this argument worked the other way
>>>
>>> [1] https://theneuralbit.github.io/beam-site/pydoc/inherited-members
>>> [2]
>>> https://theneuralbit.github.io/beam-site/pydoc/inherited-members/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey
>>>
>>> On Wed, Mar 31, 2021 at 4:45 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> +1 to an example. In particular, are these inherited members grouped in
>>>> such a way that it makes it easy to ignore them once they get to "low" in
>>>> the stack? If it can't be per-module, is there a "nice" set of ancestors to
>>>> avoid (as it seems this option takes such an argument).
>>>>
>>>> On Wed, Mar 31, 2021 at 4:23 PM Pablo Estrada 
>>>> wrote:
>>>>
>>>>> Do you have an example of what it would look like when released?
>>>>>
>>>>> On Wed, Mar 31, 2021 at 4:16 PM Brian Hulette 
>>>>> wrote:
>>>>>
>>>>>> I'm working on generating useful API docs for the DataFrame API
>>>>>> (BEAM-12074). In doing so, one thing I've found would be very helpful is 
>>>>>> if
>>>>>> we could include docstrings for inherited members in the API docs. That 
>>>>>> way
>>>>>> docstrings for operations defined in DeferredDataFrameOrSeries [1], will 
>>>>>> be
>>>>>> propagated to DeferredDataFrame [2] and DeferredSeries, and the former 
>>>>>> can
>>>>>> be hidden entirely. This would be more consistent with the pandas
>>>>>> documentation [3].
>>>>>>
>>>>>> It looks like we can do this by specifying :inherited-members: [4],
>>>>>> but this will apply to _all_ of our API docs, there doesn't seem to be a
>>>>>> way to restrict it to a particular module. This seems generally useful to
>>>>>> me, but it would be a significant change, so I wanted to see if there are
>>>>>> any objections from dev@ before doing this.
>>>>>>
>>>>>> An example of the kind of change this would produce: any PTransform
>>>>>> sub-classes, e.g. CombinePerKey [5], would now include docstrings for 
>>>>>> every
>>>>>> PTransform member, e.g. with_input_types [6], and display_data [7].
>>>>>>
>>>>>> Would there be any objections to that?
>>>>>>
>>>>>> Thanks,
>>>>>> Brian
>>>>>>
>>>>>> [1]
>>>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrameOrSeries
>>>>>> [2]
>>>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame
>>>>>> [3] https://pandas.pydata.org/docs/reference/frame.html
>>>>>> [4]
>>>>>> https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
>>>>>> [5]
>>>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.core.html?highlight=combineperkey#apache_beam.transforms.core.CombinePerKey
>>>>>> [6]
>>>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform.with_input_types
>>>>>> [7]
>>>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.display.html#apache_beam.transforms.display.HasDisplayData.display_data
>>>>>>
>>>>>


Re: [DISCUSS] Include inherited members in Python API Docs?

2021-04-06 Thread Brian Hulette
Just wanted to bump this - does anyone have concerns with the way the API
docs look when inherited members are included?

On Wed, Mar 31, 2021 at 5:23 PM Brian Hulette  wrote:

> I staged my current working copy built from head here [1], see
> CombinePerKey here [2]. Note it also has a few other changes, most notably
> I excluded several internal-only modules that are currently in our API docs
> (I will PR this soon regardless).
>
> > are these inherited members grouped in such a way that it makes it easy
> to ignore them once they get to "low" in the stack?
> There doesn't seem to be any grouping, but it does look like inherited
> members are added at the end.
>
> > If it can't be per-module, is there a "nice" set of ancestors to avoid
> (as it seems this option takes such an argument).
> Ah good point, I missed this. I suppose we could avoid basic constructs
> like PTransform, DoFn, etc. I'm not sure how realistic that is though. It
> would be nice if this argument worked the other way
>
> [1] https://theneuralbit.github.io/beam-site/pydoc/inherited-members
> [2]
> https://theneuralbit.github.io/beam-site/pydoc/inherited-members/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey
>
> On Wed, Mar 31, 2021 at 4:45 PM Robert Bradshaw 
> wrote:
>
>> +1 to an example. In particular, are these inherited members grouped in
>> such a way that it makes it easy to ignore them once they get to "low" in
>> the stack? If it can't be per-module, is there a "nice" set of ancestors to
>> avoid (as it seems this option takes such an argument).
>>
>> On Wed, Mar 31, 2021 at 4:23 PM Pablo Estrada  wrote:
>>
>>> Do you have an example of what it would look like when released?
>>>
>>> On Wed, Mar 31, 2021 at 4:16 PM Brian Hulette 
>>> wrote:
>>>
>>>> I'm working on generating useful API docs for the DataFrame API
>>>> (BEAM-12074). In doing so, one thing I've found would be very helpful is if
>>>> we could include docstrings for inherited members in the API docs. That way
>>>> docstrings for operations defined in DeferredDataFrameOrSeries [1], will be
>>>> propagated to DeferredDataFrame [2] and DeferredSeries, and the former can
>>>> be hidden entirely. This would be more consistent with the pandas
>>>> documentation [3].
>>>>
>>>> It looks like we can do this by specifying :inherited-members: [4], but
>>>> this will apply to _all_ of our API docs, there doesn't seem to be a way to
>>>> restrict it to a particular module. This seems generally useful to me, but
>>>> it would be a significant change, so I wanted to see if there are any
>>>> objections from dev@ before doing this.
>>>>
>>>> An example of the kind of change this would produce: any PTransform
>>>> sub-classes, e.g. CombinePerKey [5], would now include docstrings for every
>>>> PTransform member, e.g. with_input_types [6], and display_data [7].
>>>>
>>>> Would there be any objections to that?
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>> [1]
>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrameOrSeries
>>>> [2]
>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame
>>>> [3] https://pandas.pydata.org/docs/reference/frame.html
>>>> [4] https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
>>>> [5]
>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.core.html?highlight=combineperkey#apache_beam.transforms.core.CombinePerKey
>>>> [6]
>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform.with_input_types
>>>> [7]
>>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.display.html#apache_beam.transforms.display.HasDisplayData.display_data
>>>>
>>>


Re: [ANNOUNCE] New committer: Tomo Suzuki

2021-04-02 Thread Brian Hulette
Congratulations Tomo! Well deserved :)

On Fri, Apr 2, 2021 at 9:51 AM Yichi Zhang  wrote:

> Congratulations!
>
> On Fri, Apr 2, 2021 at 9:42 AM Ahmet Altay  wrote:
>
>> Congratulations! 🎉🎉🎉
>>
>> On Fri, Apr 2, 2021 at 9:38 AM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>>> Tomo Suzuki
>>>
>>> Since joining the Beam community in 2019, Tomo has done lots of critical
>>> work on Beam's dependencies: maintaining the dependency checker that files
>>> Jiras and sends emails, upgrading dependencies, fixing dependency
>>> configuration errors, maintaining our linkage checker. Most recently, an
>>> epic upgrade of gRPC.
>>>
>>> Considering these highlighted contributions and the rest, the Beam PMC
>>> trusts Tomo with the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Tomo, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer
>>> /#an-apache-beam-committer
>>>
>>


Re: Null checking in Beam

2021-04-01 Thread Brian Hulette
What value does it add? Is it that it enables them to use checkerframework
with our interfaces?

On Thu, Apr 1, 2021 at 8:54 AM Kenneth Knowles  wrote:

> Thanks for filing that. Once it is fixed in IntelliJ, the annotations
> actually add value for downstream users.
>
> Kenn
>
> On Thu, Apr 1, 2021 at 1:10 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> I created the issue in the JetBrains tracker [1]. I'm still not 100%
>> convinced that it is correct for the checker to actually modify the
>> bytecode. An open question is - in Guava this does not happen. Does Guava
>> apply the check to the code being released? From what is in this thread, it
>> seems to me that the answer is no.
>>
>>  Jan
>>
>> [1] https://youtrack.jetbrains.com/issue/IDEA-265658
>> On 4/1/21 6:15 AM, Kenneth Knowles wrote:
>>
>> Hi all,
>>
>> About the IntelliJ automatic method stub issue: I raised it to the
>> checkerframework list and got a helpful response:
>> https://groups.google.com/g/checker-framework-discuss/c/KHQdjF4lesk/m/dJ4u1BBNBgAJ
>>
>> It eventually reached back to Jetbrains and they would appreciate a
>> detailed report with steps to reproduce, preferably a sample project. Would
>> you - Jan or Ismaël or Reuven - provide them with this issue report? It
>> sounds like Jan you have an example ready to go.
>>
>> Kenn
>>
>> On Mon, Mar 15, 2021 at 1:29 PM Jan Lukavský  wrote:
>>
>>> Yes, annotations that we add to the code base on purpose (like @Nullable
>>> or @SuppressWarnings) are absolutely fine. What is worse is that the
>>> checker is not only a checker, but a code generator. :)
>>>
>>> For example, when one wants to implement a Coder by extending CustomCoder
>>> and auto-generate the overridden methods, they look like this:
>>>
>>> @Override
>>> public void encode(Long value,
>>>     @UnknownKeyFor @NonNull @Initialized OutputStream outStream)
>>>     throws @UnknownKeyFor @NonNull @Initialized CoderException,
>>>         @UnknownKeyFor @NonNull @Initialized IOException {
>>> }
>>>
>>> Which is really ugly. :-(
>>>
>>>  Jan
>>>
>>> On 3/15/21 6:37 PM, Ismaël Mejía wrote:
>>>
>>> +1
>>>
>>> Even if I like the strictness of the null checking, I also think that
>>> it adds too much extra time to builds (which I noticed locally when it
>>> was enabled), and I agree with Jan that the annotations are really an
>>> undesired side effect. For reference, when you auto-complete some method
>>> signatures in IntelliJ on downstream projects with C-A-v, it generates
>>> extra Checker annotations like @NonNull even if the user isn't using
>>> them, which is not desirable.
>>>
>>>
>>>
>>> On Mon, Mar 15, 2021 at 6:04 PM Kyle Weaver  
>>>  wrote:
>>>
>>> Big +1 for moving this to a separate CI job. I really don't like what
>>> annotations are currently added to the code we ship. Tools like IDEA add
>>> these annotations to code they generate when overriding classes and that's 
>>> very annoying. Users should not be exposed to internal tools like 
>>> nullability checking.
>>>
>>> I was only planning on moving this to a separate CI job. The job would 
>>> still be release blocking, so the same annotations would still be required.
>>>
>>> I'm not sure which annotations you are concerned about. There are two 
>>> annotations involved with nullness checking, @SuppressWarnings and 
>>> @Nullable. @SuppressWarnings has retention policy SOURCE, so it shouldn't 
>>> be exposed to users at all. @Nullable is not just for internal tooling, it 
>>> also provides useful information about our APIs to users. The user should 
>>> not have to guess whether a method argument etc. can be null or not, and 
>>> for better or worse, these annotations are the standard way of expressing 
>>> that in Java.
>>>
>>>


Re: [DISCUSS] Include inherited members in Python API Docs?

2021-03-31 Thread Brian Hulette
I staged my current working copy built from head here [1], see
CombinePerKey here [2]. Note it also has a few other changes, most notably
I excluded several internal-only modules that are currently in our API docs
(I will PR this soon regardless).

> are these inherited members grouped in such a way that it makes it easy
to ignore them once they get to "low" in the stack?
There doesn't seem to be any grouping, but it does look like inherited
members are added at the end.

> If it can't be per-module, is there a "nice" set of ancestors to avoid
(as it seems this option takes such an argument).
Ah good point, I missed this. I suppose we could avoid basic constructs
like PTransform, DoFn, etc. I'm not sure how realistic that is though. It
would be nice if this argument worked the other way

[1] https://theneuralbit.github.io/beam-site/pydoc/inherited-members
[2]
https://theneuralbit.github.io/beam-site/pydoc/inherited-members/apache_beam.transforms.core.html#apache_beam.transforms.core.CombinePerKey

On Wed, Mar 31, 2021 at 4:45 PM Robert Bradshaw  wrote:

> +1 to an example. In particular, are these inherited members grouped in
> such a way that it makes it easy to ignore them once they get to "low" in
> the stack? If it can't be per-module, is there a "nice" set of ancestors to
> avoid (as it seems this option takes such an argument).
>
> On Wed, Mar 31, 2021 at 4:23 PM Pablo Estrada  wrote:
>
>> Do you have an example of what it would look like when released?
>>
>> On Wed, Mar 31, 2021 at 4:16 PM Brian Hulette 
>> wrote:
>>
>>> I'm working on generating useful API docs for the DataFrame API
>>> (BEAM-12074). In doing so, one thing I've found would be very helpful is if
>>> we could include docstrings for inherited members in the API docs. That way
>>> docstrings for operations defined in DeferredDataFrameOrSeries [1], will be
>>> propagated to DeferredDataFrame [2] and DeferredSeries, and the former can
>>> be hidden entirely. This would be more consistent with the pandas
>>> documentation [3].
>>>
>>> It looks like we can do this by specifying :inherited-members: [4], but
>>> this will apply to _all_ of our API docs, there doesn't seem to be a way to
>>> restrict it to a particular module. This seems generally useful to me, but
>>> it would be a significant change, so I wanted to see if there are any
>>> objections from dev@ before doing this.
>>>
>>> An example of the kind of change this would produce: any PTransform
>>> sub-classes, e.g. CombinePerKey [5], would now include docstrings for every
>>> PTransform member, e.g. with_input_types [6], and display_data [7].
>>>
>>> Would there be any objections to that?
>>>
>>> Thanks,
>>> Brian
>>>
>>> [1]
>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrameOrSeries
>>> [2]
>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame
>>> [3] https://pandas.pydata.org/docs/reference/frame.html
>>> [4] https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
>>> [5]
>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.core.html?highlight=combineperkey#apache_beam.transforms.core.CombinePerKey
>>> [6]
>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform.with_input_types
>>> [7]
>>> https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.display.html#apache_beam.transforms.display.HasDisplayData.display_data
>>>
>>


[DISCUSS] Include inherited members in Python API Docs?

2021-03-31 Thread Brian Hulette
I'm working on generating useful API docs for the DataFrame API
(BEAM-12074). In doing so, one thing I've found would be very helpful is if
we could include docstrings for inherited members in the API docs. That way
docstrings for operations defined in DeferredDataFrameOrSeries [1], will be
propagated to DeferredDataFrame [2] and DeferredSeries, and the former can
be hidden entirely. This would be more consistent with the pandas
documentation [3].

It looks like we can do this by specifying :inherited-members: [4], but
this will apply to _all_ of our API docs; there doesn't seem to be a way to
restrict it to a particular module. This seems generally useful to me, but
it would be a significant change, so I wanted to see if there are any
objections from dev@ before doing this.
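In Sphinx terms I believe it would amount to something like the following in
the autodoc defaults (a sketch only; generate_pydoc.sh builds its Sphinx conf
programmatically, so the exact wiring may differ):

# conf.py (sketch)
extensions = ['sphinx.ext.autodoc']
autodoc_default_options = {
    'members': True,
    'undoc-members': True,
    'inherited-members': True,  # the option under discussion
}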

An example of the kind of change this would produce: any PTransform
sub-classes, e.g. CombinePerKey [5], would now include docstrings for every
PTransform member, e.g. with_input_types [6], and display_data [7].

Would there be any objections to that?

Thanks,
Brian

[1]
https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrameOrSeries
[2]
https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredDataFrame
[3] https://pandas.pydata.org/docs/reference/frame.html
[4] https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html
[5]
https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.core.html?highlight=combineperkey#apache_beam.transforms.core.CombinePerKey
[6]
https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform.with_input_types
[7]
https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.transforms.display.html#apache_beam.transforms.display.HasDisplayData.display_data


Re: BEAM-9613 - BigQueryUtils and Avro

2021-03-31 Thread Brian Hulette
Commented on the PR, it looks like there are just some double literals that
need to be changed to strings.

On Wed, Mar 31, 2021 at 7:07 AM Matthew Ouyang 
wrote:

> Hello, Apologies for not linking to the PR with the failed unit tests in
> my last e-mail.  The link is below.  I'm stuck and I need help in fixing
> this.  I can't reproduce in my dev environment because the build can't
> trigger a BQ job, and it seems like the failed tests complement each other
> (collection of two elements, first test is missing one element, second test
> misses the other).
>
> https://github.com/apache/beam/pull/14350
>
> On Mon, Mar 29, 2021 at 9:28 AM Matthew Ouyang 
> wrote:
>
>> It looks like some unit tests failed.  I'm unable to reproduce it in my
>> environment because the error I received (described in the PR) suggests I
>> can't even start the job in the first place.  Any help would be
>> appreciated.  Ideally I'd like to have this included in the next possible
>> release (2.30.0 it looks like) since my team had to do a few workarounds
>> to get past these issues.
>>
>> On Sun, Mar 28, 2021 at 2:25 AM Matthew Ouyang 
>> wrote:
>>
>>> Thank you for the feedback.  I walked back my initial approach in favour
>>> of Brian's option (2) and also implemented a fix for lists as well (
>>> https://github.com/apache/beam/pull/14350).  I agree with the tradeoff
>>> Brian pointed out as it is consistent with the rest of that component.  If
>>> performance ends up being a problem BigQueryUtils could have a different
>>> mode for TableRow and Avro.
>>>
>>> On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax  wrote:
>>>
>>>> Logically, all JSON values are strings. We have often put other objects
>>>> in there, which I believe works simply because of the implicit .toString()
>>>> method on those objects, but I'm not convinced this is really correct.
>>>>
>>>> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette 
>>>> wrote:
>>>>
>>>>> Thank you for digging into this and figuring out how this bug was
>>>>> introduced!
>>>>>
>>>>> In the long-term I think it would be preferable to avoid
>>>>> TableRow altogether in order to do a schema-aware read of avro data from
>>>>> BQ. We can go directly from Avro GenericRecord to Beam Rows now [1]. This
>>>>> would also be preferable for Arrow, where we could produce Row instances
>>>>> that are references into an underlying arrow RecordBatch (using
>>>>> RowWithGetters), rather than having to materialize each row to make a
>>>>> TableRow instance.
>>>>>
>>>>> For a short-term fix there are two options, both came up in Reuven's
>>>>> comments on BEAM-9613:
>>>>>
>>>>>> However if we expect to get Long, Double,etc. objects in the
>>>>>> TableRow, then this mapping code needs to handle those objects. Handling
>>>>>> them directly would be more efficient - converting to a String would 
>>>>>> simply
>>>>>> be a stopgap "one-line" fix.
>>>>>
>>>>>
>>>>> 1. handle them directly [in BigQueryUtils], this is what you've done
>>>>> in
>>>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>>>>> 2. convert to a String [in BigQueryAvroUtils]
>>>>>
>>>>> I don't have a strong preference but I think (2) is a cleaner, albeit
>>>>> less performant, solution. It seems that BigQueryUtils expects all values
>>>>> in TableRow instances to be String instances. Since BigQueryAvroUtils is
>>>>> just a shim to convert GenericRecord to TableRow for use in BigQueryUtils,
>>>>> it should comply with that interface, rather than making BigQueryUtils 
>>>>> work
>>>>> around the discrepancy.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>>>>>
>>>>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <
>>>>> matthew.ouy...@gmail.com> wrote:
>>>>>
>>>>>> I'm working on fixing BEAM-9613 because encountered this issue as a
>>>>>> result of using BigQueryIO.readTableRowsWithSchema()
>>>>>>
>>>>>

Re: BEAM-9114 - BigQueryIO TableRowParser Arrow & Avro (Java SDK)

2021-03-31 Thread Brian Hulette
Note this was resolved in Jira comments on BEAM-9114 and BEAM-8933.

On Mon, Mar 29, 2021 at 2:09 PM Isidro Martinez <
isidro.marti...@wizeline.com> wrote:

> Hello Team,
>
> I'm taking a look at the ticket BEAM-9114
> . It seems it was opened
> in the PR 10369
>  (currently 
> closed)
> but it wasn't merged to master. Do you think we should close this ticket,
> or is the same issue being addressed in another part of the code?
>


Re: [PROPOSAL] Preparing for Beam 2.29.0 release

2021-03-30 Thread Brian Hulette
On Tue, Mar 30, 2021 at 12:16 PM Kenneth Knowles  wrote:

> Update:
>
> Based on a number of issues raised, I have re-cut the branch.
>
> All Jiras that were resolved to 2.30.0 I have adjusted to 2.29.0.
>
> The two remaining issues are:
>
> BEAM-11227 "Upgrade beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216":
> the vendored gRPC has been released and is being debugged. We will
> cherrypick the upgrade.
>
> BEAM-12071 "DataFrame IO sinks do not correctly partition by window": this
> is not a regression but is a severe problem that makes dataframes unusable;
> a PR is already up
>

Note this makes DataFrame sinks unusable when used with windowing, not the
entire DataFrame API :)
PR is here: https://github.com/apache/beam/pull/14374


>
> If you are tracking any issue not mentioned here, be sure to let me know.
>
> Kenn
>
> On Tue, Mar 16, 2021 at 12:08 PM Kenneth Knowles  wrote:
>
>> Update:
>>
>> There are 5 issues tagged to 2.29.0 release:
>> https://issues.apache.org/jira/issues/?jql=statusCategory%20!%3D%20done%20AND%20project%20%3D%2012319527%20AND%20fixVersion%20%3D%2012349629%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>>
>> There are 4 issues in Needs Triage state with priority P0 or P1:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20%22Triage%20Needed%22%20AND%20priority%20in%20(P0%2C%20P1)%20ORDER%20BY%20updated%20DESC
>>
>> Kenn
>>
>> On Mon, Mar 15, 2021 at 5:01 PM Kenneth Knowles  wrote:
>>
>>> Update:
>>>
>>> There are 8 issues tagged to 2.29.0 release:
>>> https://issues.apache.org/jira/issues/?jql=statusCategory%20!%3D%20done%20AND%20project%20%3D%2012319527%20AND%20fixVersion%20%3D%2012349629%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>>>
>>> There are 22 issues in Needs Triage state with priority P0 or P1:
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20%22Triage%20Needed%22%20AND%20priority%20in%20(P0%2C%20P1)%20ORDER%20BY%20updated%20DESC
>>>
>>> Kenn
>>>
>>> On Thu, Mar 11, 2021 at 8:43 PM Kenneth Knowles  wrote:
>>>

 https://issues.apache.org/jira/issues/?jql=project%20%3D%2012319527%20AND%20fixVersion%20%3D%2012349629%20AND%20statusCategory%20%3D%20%22To%20Do%22%20%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

 On Thu, Mar 11, 2021 at 8:43 PM Kenneth Knowles 
 wrote:

> Correction: that search did not limit to open issues. There are 17
> open.
>
> Kenn
>
> On Thu, Mar 11, 2021 at 8:42 PM Kenneth Knowles 
> wrote:
>
>> Update: I cut the branch today.
>>
>> There are 49 (!) issues with Fix Version = 2.29.0
>>
>>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%2012319527%20AND%20fixVersion%20%3D%2012349629%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>>
>> If you are familiar with any of these, please help me by determining
>> if it is a critical release-blocking bug or if it is something else.
>>
>> Kenn
>>
>> On Wed, Mar 3, 2021 at 11:53 AM Ahmet Altay  wrote:
>>
>>> +1 and thank you!
>>>
>>> On Wed, Mar 3, 2021 at 11:51 AM Kenneth Knowles 
>>> wrote:
>>>
 Hi All,

 Beam 2.29.0 release is scheduled to be cut in a week, on March 10
 according to the release calendar [1].

 I'd like to volunteer myself to be the release manager for this
 release. I plan on cutting the release branch on the scheduled
 date.

 Any comments or objections?

 Kenn

 [1]
 https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles

>>>


Re: BEAM-449 and BEAM-621 PRs request for review

2021-03-29 Thread Brian Hulette
Hi Vitaly,
It looks like Kenn is helping out with the BEAM-449 PR, I can look at the
one for BEAM-621.

Brian

On Fri, Mar 26, 2021 at 3:27 AM Vitaly Terentyev <
vitaly.terent...@akvelon.com> wrote:

> Hello devs,
>
> I am new to Beam. I recently assigned myself and worked on these two
> issues: BEAM-449 and BEAM-621.
> Can someone check my PRs and review them? Here they are:
> PR for BEAM-449.
> PR for BEAM-621.
>
> Best regards,
> Vitaly
>
>
>


Re: Python Dataframe API issue

2021-03-29 Thread Brian Hulette
Thanks for the feedback and the bug report Xinyu! I really appreciate it.

Brian

On Thu, Mar 25, 2021 at 6:04 PM Xinyu Liu  wrote:

> Np, thanks for quickly identifying the fix.
>
> Btw, I am very happy about Beam Python supporting the same Pandas
> dataframe api. It's super user-friendly to both devs and data scientists.
> Really cool work!
>
> Thanks,
> Xinyu
>
> On Thu, Mar 25, 2021 at 4:53 PM Robert Bradshaw 
> wrote:
>
>> Thanks, Xinyu, for finding this!
>>
>> On Thu, Mar 25, 2021 at 4:48 PM Kenneth Knowles  wrote:
>>
>>> Cloned to https://issues.apache.org/jira/browse/BEAM-12056
>>>
>>> On Thu, Mar 25, 2021 at 4:46 PM Brian Hulette 
>>> wrote:
>>>
>>>> Yes this looks like https://issues.apache.org/jira/browse/BEAM-11929,
>>>> I removed it from the release blockers since there is a workaround (use a
>>>> NamedTuple type), but it's probably worth cherrypicking the fix.
>>>>
>>>> On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> This could be https://issues.apache.org/jira/browse/BEAM-11929
>>>>>
>>>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> This is definitely wrong. Looking into what's going on here, but this
>>>>>> seems severe enough to be a blocker for the next release.
>>>>>>
>>>>>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, folks,
>>>>>>>
>>>>>>> I am playing around with the Python Dataframe API, and seemingly got a
>>>>>>> schema issue when converting a pcollection to a dataframe. I wrote the
>>>>>>> following code for a simple test:
>>>>>>>
>>>>>>> import apache_beam as beam
>>>>>>> from apache_beam.dataframe.convert import to_dataframe
>>>>>>> from apache_beam.dataframe.convert import to_pcollection
>>>>>>>
>>>>>>> p = beam.Pipeline()
>>>>>>> data = p | beam.Create([('a', ''), ('b', '')]) | beam.Map(
>>>>>>> lambda x : beam.Row(word=x[0], val=x[1]))
>>>>>>> _ = data | beam.Map(print)
>>>>>>> p.run()
>>>>>>>
>>>>>>> This shows the following:
>>>>>>> Row(val='', word='a') Row(val='', word='b')
>>>>>>>
>>>>>>> But if I use to_dataframe() to convert it into a df, seems the
>>>>>>> schema was reversed:
>>>>>>>
>>>>>>> df = to_dataframe(data)
>>>>>>> dataCopy = to_pcollection(df)
>>>>>>> _ = dataCopy | beam.Map(print)
>>>>>>> p.run()
>>>>>>>
>>>>>>> I got:
>>>>>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='',
>>>>>>> val='a') BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='',
>>>>>>> val='b')
>>>>>>>
>>>>>>> Seems now the column 'word' and 'val' is swapped. The problem seems
>>>>>>> to happen during to_dataframe(). If I print out df['word'], I got ''
>>>>>>> and ''. I am not sure whether I am doing something wrong or there 
>>>>>>> is an
>>>>>>> issue in the schema conversion. Could someone help me take a look?
>>>>>>>
>>>>>>> Thanks, Xinyu
>>>>>>>
>>>>>>


Re: Python Dataframe API issue

2021-03-25 Thread Brian Hulette
Yes this looks like https://issues.apache.org/jira/browse/BEAM-11929, I
removed it from the release blockers since there is a workaround (use a
NamedTuple type), but it's probably worth cherrypicking the fix.

On Thu, Mar 25, 2021 at 4:44 PM Robert Bradshaw  wrote:

> This could be https://issues.apache.org/jira/browse/BEAM-11929
>
> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw 
> wrote:
>
>> This is definitely wrong. Looking into what's going on here, but this
>> seems severe enough to be a blocker for the next release.
>>
>> On Thu, Mar 25, 2021 at 3:39 PM Xinyu Liu  wrote:
>>
>>> Hi, folks,
>>>
>>> I am playing around with the Python Dataframe API, and seemingly got a
>>> schema issue when converting a pcollection to a dataframe. I wrote the
>>> following code for a simple test:
>>>
>>> import apache_beam as beam
>>> from apache_beam.dataframe.convert import to_dataframe
>>> from apache_beam.dataframe.convert import to_pcollection
>>>
>>> p = beam.Pipeline()
>>> data = p | beam.Create([('a', ''), ('b', '')]) | beam.Map(lambda
>>> x : beam.Row(word=x[0], val=x[1]))
>>> _ = data | beam.Map(print)
>>> p.run()
>>>
>>> This shows the following:
>>> Row(val='', word='a') Row(val='', word='b')
>>>
>>> But if I use to_dataframe() to convert it into a df, seems the schema
>>> was reversed:
>>>
>>> df = to_dataframe(data)
>>> dataCopy = to_pcollection(df)
>>> _ = dataCopy | beam.Map(print)
>>> p.run()
>>>
>>> I got:
>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='', val='a')
>>> BeamSchema_4100b64e_16e9_467d_932e_5fc2e4acaca7(word='', val='b')
>>>
>>> Seems now the column 'word' and 'val' is swapped. The problem seems to
>>> happen during to_dataframe(). If I print out df['word'], I got '' and
>>> ''. I am not sure whether I am doing something wrong or there is an
>>> issue in the schema conversion. Could someone help me take a look?
>>>
>>> Thanks, Xinyu
>>>
>>


Re: Tutorial - How to run a Beam pipeline with Flink on Kubernetes Natively

2021-03-23 Thread Brian Hulette
See
https://beam.apache.org/documentation/programming-guide/#inferring-schemas
for details on how to use schemas with various types of Java classes (POJO,
Java Beans, AutoValue).
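
For a quick illustration, here is a minimal sketch of a schema-annotated POJO
(the class and field names are hypothetical; the @DefaultSchema annotation and
JavaFieldSchema provider are the Java SDK schema support described at that
link):

import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Asking Beam to infer a schema from the public fields gives the
// PCollection a deterministic, schema-aware coder instead of falling back
// to Java serialization.
@DefaultSchema(JavaFieldSchema.class)
public class Order {
  public String id;
  public double amount;

  // A zero-argument constructor (or a constructor annotated with
  // @SchemaCreate) is needed so the schema provider can build instances
  // when decoding.
  public Order() {}
}

With the annotation in place, a PCollection<Order> can be grouped and joined
on its fields without the POJO implementing Serializable.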

On Mon, Mar 22, 2021 at 3:01 PM Robert Bradshaw  wrote:

> Thanks, that looks quite useful.
>
> A note on POJOs: Serializable has several disadvantages (e.g.
> non-deterministic coding for key grouping, less efficient serialization).
> You could look into making them compatible with Beam schemas.
>
> On Mon, Mar 22, 2021 at 1:11 PM Cristian Constantinescu 
> wrote:
>
>> Hi everyone,
>>
>> I spent the weekend putting the pieces together to run Beam with the
>> Flink Runner on Kubernetes. While I did find very good articles and videos
>> about Beam and Flink and Kubernetes separately, I didn't find one that
>> mixes all three of them in the same pot.
>>
>> So, I wrote a small demo project/tutorial combining all the bits and
>> pieces of information I found on my quest. It's here
>> https://github.com/zeidoo/tutorial-bean-flink-kubernetes-pojo. Any
>> feedback is welcome.
>>
>> I hope it helps someone.
>>
>> Cheers everyone!
>>
>> PS: I didn't find it mentioned on the Beam documentation website that the
>> easiest way to pass POJOs between transforms is to make them implement Serializable.
>> I've put a small section about that in the tutorial.
>>
>


Re: BEAM-9613 - BigQueryUtils and Avro

2021-03-23 Thread Brian Hulette
Thank you for digging into this and figuring out how this bug was
introduced!

In the long-term I think it would be preferable to avoid
TableRow altogether in order to do a schema-aware read of avro data from
BQ. We can go directly from Avro GenericRecord to Beam Rows now [1]. This
would also be preferable for Arrow, where we could produce Row instances
that are references into an underlying arrow RecordBatch (using
RowWithGetters), rather than having to materialize each row to make a
TableRow instance.

For a short-term fix there are two options, both came up in Reuven's
comments on BEAM-9613:

> However if we expect to get Long, Double,etc. objects in the TableRow,
> then this mapping code needs to handle those objects. Handling them
> directly would be more efficient - converting to a String would simply be a
> stopgap "one-line" fix.


1. handle them directly [in BigQueryUtils], this is what you've done in
https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
2. convert to a String [in BigQueryAvroUtils]

I don't have a strong preference but I think (2) is a cleaner, albeit less
performant, solution. It seems that BigQueryUtils expects all values in
TableRow instances to be String instances. Since BigQueryAvroUtils is just
a shim to convert GenericRecord to TableRow for use in BigQueryUtils, it
should comply with that interface, rather than making BigQueryUtils work
around the discrepancy.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
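
As a rough sketch of that direct path (the toBeamSchema and toBeamRowStrict
methods are from org.apache.beam.sdk.schemas.utils.AvroUtils; the surrounding
helper class is hypothetical):

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.values.Row;

class AvroToRow {
  // Convert an exported Avro record straight to a Beam Row, skipping the
  // intermediate TableRow representation entirely.
  static Row toRow(GenericRecord record) {
    Schema beamSchema = AvroUtils.toBeamSchema(record.getSchema());
    return AvroUtils.toBeamRowStrict(record, beamSchema);
  }
}

(In practice the Beam schema would be computed once from the table's Avro
schema rather than per record.)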

On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang 
wrote:

> I'm working on fixing BEAM-9613 because encountered this issue as a result
> of using BigQueryIO.readTableRowsWithSchema()
>
>1. BEAM-7526 provided support for Lists and Maps that came from the
>JSON export format
>2. BEAM-2879 switched the export format from JSON to Avro.  This
>caused issue BEAM-9613 since Avro no longer treated BQ BOOLEAN and FLOAT as
>a Java String but rather Java Boolean and Double.
>3. The switch from JSON to Avro also introduced an issue with fields
>with BQ REPEATED mode because fields of this mode.
>
> I have a simple fix to handle BQ BOOLEAN and FLOAT (
> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb)
> however I'm a bit uncomfortable with it because
>
>1. This would introduce mixing of both the JSON and Avro export
>formats,
>2. BEAM-8933 while still in progress would introduce Arrow and risk a
>regression,
>3. I haven't made a fix for REPEATED mode yet, but tests that use
>BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>
> https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>I don't know if I should change this because I don't know if Beam wishes to
>continue support for the JSON format.
>
> I'd like to incorporate these changes in my project as soon as possible,
> but I also need guidance on next steps that would be in line with the
> general direction of the project.  I'm looking forward to any feedback.
> Thanks.
>


Re: P0 (outage) report

2021-03-22 Thread Brian Hulette
Can/should we configure this to not send the message when there are no P0s?

On Sat, Mar 20, 2021 at 3:53 PM Beam Jira Bot  wrote:

> This is your daily summary of Beam's current outages. See
> https://beam.apache.org/contribute/jira-priorities/#p0-outage for the
> meaning and expectations around P0 issues.
>
>


Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Brian Hulette
+1, thank you!

On Thu, Mar 18, 2021 at 12:28 PM Daniel Collins 
wrote:

> Sounds good to me, thanks!
>
> On Thu, Mar 18, 2021 at 3:27 PM Fernando Morales Martinez <
> fernando.mora...@wizeline.com> wrote:
>
>> Brian, Daniel,
>> Really appreciate the feedback, thanks!
>>
>> I'll go through the aforementioned PRs to get a better understanding of
>> the proto support effort.
>> If it's ok with you, I'd like to implement and add the protobuf tests to
>> the PubSubTableProviderIT.
>>
>> Please let me know if that works for you.
>>
>> Thanks again,
>> - Fernando Morales
>>
>> On Thu, Mar 18, 2021 at 12:15 PM Daniel Collins 
>> wrote:
>>
>>> Also see https://github.com/apache/beam/pull/13838 which is related
>>>
>>> On Thu, Mar 18, 2021 at 3:06 PM Daniel Collins 
>>> wrote:
>>>
>>>> > appears to be dead code
>>>>
>>>> It's not - it's used as the format for the configuration row here
>>>> https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaIOProvider.java#L145
>>>>
>>>> On Thu, Mar 18, 2021 at 2:53 PM Brian Hulette 
>>>> wrote:
>>>>
>>>>> Ah the enum I linked in PubsubSchemaIOProvider appears to be dead
>>>>> code, I don't think it's referenced anywhere. So I think the 
>>>>> implementation
>>>>> should be done, we'd just need to modify PubsubTableProviderIT to exercise
>>>>> the proto format.
>>>>>
>>>>> On Thu, Mar 18, 2021 at 11:46 AM Daniel Collins 
>>>>> wrote:
>>>>>
>>>>>> > because there are a couple places where Avro, JSON are still
>>>>>> hard-coded [2,3]
>>>>>>
>>>>>> This is not a blocker, its due to the fact that PubsubTableProvider
>>>>>> is just a wrapper for PubsubSchemaIOProvider. SchemaIOProvider requires 
>>>>>> you
>>>>>> to specify all possible options, TableProvider does not.
>>>>>>
>>>>>> On Thu, Mar 18, 2021 at 2:44 PM Brian Hulette 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Fernando,
>>>>>>>
>>>>>>> Daniel Collins actually added the PayloadSerializerProvider concept
>>>>>>> very recently [1], which is why it looks like Piotr's code doesn't apply
>>>>>>> anymore. But the good news is I think that PR gets this task pretty 
>>>>>>> close
>>>>>>> to completion. It doesn't look like the PR *quite* finished adding 
>>>>>>> support
>>>>>>> for Proto to PubSubTableProvider though, because there are a couple 
>>>>>>> places
>>>>>>> where Avro, JSON are still hard-coded [2,3]. For this task to be 
>>>>>>> complete
>>>>>>> we should have tests of protobuf in PubSubTableProviderIT, which is
>>>>>>> parameterized by payload type [3].
>>>>>>>
>>>>>>> Regarding PubSubIO.readProtos: you're right to point out there's
>>>>>>> some overlap between the various PubSub readers/writers and the
>>>>>>> TableProviders. Ideally we'd define the logic for reading/writing Beam 
>>>>>>> Rows
>>>>>>> with each IO in a single place, but right now most of this logic lives 
>>>>>>> in
>>>>>>> SQL's TableProviders, and in a few places it's duplicated into the IOs, 
>>>>>>> as
>>>>>>> with readAvrosWithBeamSchema. For this task I think the right thing to 
>>>>>>> do
>>>>>>> is use PayloadSerializerProvider.
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/pull/13825
>>>>>>> [2]
>>>>>>> https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaIOProvider.java#L111
>>>>>>> [3]
>>>>>>> https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/pubsub/PubsubTableProviderIT.java#L108
>

Re: [Question] What is the best Beam datatype to map to BigQuery's BIGNUMERIC?

2021-03-18 Thread Brian Hulette
Hi Vachan,
I don't think Beam DECIMAL is really a great mapping for either BigQuery's
NUMERIC or BIGNUMERIC type. Beam's DECIMAL represents arbitrary-precision
decimals [1], designed to map well to Java's BigDecimal class [2].

Maybe we should add a fixed-precision decimal logical type [3], then have
specific instances of it with the appropriate precision that map to NUMERIC
and BIGNUMERIC. We could also shunt Beam DECIMAL to BIGNUMERIC for
compatibility.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L424
[2] https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html
[3]
https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes

On Thu, Mar 18, 2021 at 12:00 PM Vachan Shetty  wrote:

> Hello, I am currently trying to add support for BigQuery's new BIGNUMERIC
> datatype [1] in Beam's BigQueryIO. I am currently following the steps that
> were used for adding the NUMERIC datatype [2]. AFAICT Beam's DECIMAL is
> the most appropriate datatype to map to BIGNUMERIC in BQ. However, the
> Beam DECIMAL datatype is already mapped to NUMERIC in BQ [2, 3]. Given
> this, should I simply map all Beam DECIMAL to BQ BIGNUMERIC? Or should this
> conversion be done based on other information?
>
> [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#decimal_types
> [2]: https://issues.apache.org/jira/browse/BEAM-4417
> [3]: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java#L188
>


Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Brian Hulette
Ah the enum I linked in PubsubSchemaIOProvider appears to be dead code, I
don't think it's referenced anywhere. So I think the implementation should
be done, we'd just need to modify PubsubTableProviderIT to exercise the
proto format.

On Thu, Mar 18, 2021 at 11:46 AM Daniel Collins 
wrote:

> > because there are a couple places where Avro, JSON are still hard-coded
> [2,3]
>
> This is not a blocker, it's due to the fact that PubsubTableProvider is
> just a wrapper for PubsubSchemaIOProvider. SchemaIOProvider requires you to
> specify all possible options, TableProvider does not.
>
> On Thu, Mar 18, 2021 at 2:44 PM Brian Hulette  wrote:
>
>> Hi Fernando,
>>
>> Daniel Collins actually added the PayloadSerializerProvider concept very
>> recently [1], which is why it looks like Piotr's code doesn't apply
>> anymore. But the good news is I think that PR gets this task pretty close
>> to completion. It doesn't look like the PR *quite* finished adding support
>> for Proto to PubSubTableProvider though, because there are a couple places
>> where Avro, JSON are still hard-coded [2,3]. For this task to be complete
>> we should have tests of protobuf in PubSubTableProviderIT, which is
>> parameterized by payload type [3].
>>
>> Regarding PubSubIO.readProtos: you're right to point out there's some
>> overlap between the various PubSub readers/writers and the TableProviders.
>> Ideally we'd define the logic for reading/writing Beam Rows with each IO in
>> a single place, but right now most of this logic lives in SQL's
>> TableProviders, and in a few places it's duplicated into the IOs, as with
>> readAvrosWithBeamSchema. For this task I think the right thing to do is use
>> PayloadSerializerProvider.
>>
>> [1] https://github.com/apache/beam/pull/13825
>> [2]
>> https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaIOProvider.java#L111
>> [3]
>> https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/pubsub/PubsubTableProviderIT.java#L108
>>
>> On Thu, Mar 18, 2021 at 10:32 AM Fernando Morales Martinez <
>> fernando.mora...@wizeline.com> wrote:
>>
>>> Hi everyone,
>>> I'm working on the WI mentioned in the subject: "Add Proto support to
>>> Pubsub table provider" and I have a few questions . Sorry for the long mail!
>>>
>>>- The only method in KafkaTableProvider that performs some logic is
>>>buildBeamSqlTable. However, when taking a look at the tests in
>>>KafkaTableProviderIT, the only one that is calling that method appears to
>>>be testFake2. But if I’m not mistaken, that test doesn’t perform any test
>>>for the proto case. The only one that tests the proto case is testFake
>>>test, but that is only creating the KafkaTableProvider and that’s it. I
>>>wanted to base the PubSubTableProvider on the KafkaTableProvider since 
>>> that
>>>one supports Proto and looks like it accomplishes that support by using
>>>PayloadSerializer. Is that correct? Should I follow that path?
>>>
>>>
>>>- I wanted to base the new code on the commits by Piotr, but a lot
>>>of the code he submitted appears to have been removed, but I can’t grasp
>>>the reason. There are several commits referencing one another and the 
>>> Proto
>>>support. Can you shed some light on that?
>>>
>>>
>>>- Back to the PubsubTableProvider: PubSubIO, which contains
>>>reader/writer methods for protos (although via ProtoCoder), is being used
>>>by the PubsubSchemaIOProvider class for other read/write purposes. Why 
>>> are
>>>the protos reader/writer not used? Because of the ProtoCoder? Should we
>>>instead implement new readers/writers by formatting the payload directly,
>>>in a similar fashion to readAvrosWithBeamSchema?
>>>
>>> Thanks a lot for the help!
>>>
>>> - Fernando Morales
>>
>>


Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Brian Hulette
Hi Fernando,

Daniel Collins actually added the PayloadSerializerProvider concept very
recently [1], which is why it looks like Piotr's code doesn't apply
anymore. But the good news is I think that PR gets this task pretty close
to completion. It doesn't look like the PR *quite* finished adding support
for Proto to PubSubTableProvider though, because there are a couple places
where Avro, JSON are still hard-coded [2,3]. For this task to be complete
we should have tests of protobuf in PubSubTableProviderIT, which is
parameterized by payload type [3].

Regarding PubSubIO.readProtos: you're right to point out there's some
overlap between the various PubSub readers/writers and the TableProviders.
Ideally we'd define the logic for reading/writing Beam Rows with each IO in
a single place, but right now most of this logic lives in SQL's
TableProviders, and in a few places it's duplicated into the IOs, as with
readAvrosWithBeamSchema. For this task I think the right thing to do is use
PayloadSerializerProvider.

[1] https://github.com/apache/beam/pull/13825
[2]
https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubSchemaIOProvider.java#L111
[3]
https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94ce45861857206674/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/meta/provider/pubsub/PubsubTableProviderIT.java#L108

On Thu, Mar 18, 2021 at 10:32 AM Fernando Morales Martinez <
fernando.mora...@wizeline.com> wrote:

> Hi everyone,
> I'm working on the WI mentioned in the subject: "Add Proto support to
> Pubsub table provider" and I have a few questions . Sorry for the long mail!
>
>- The only method in KafkaTableProvider that performs some logic is
>buildBeamSqlTable. However, when taking a look at the tests in
>KafkaTableProviderIT, the only one that is calling that method appears to
>be testFake2. But if I’m not mistaken, that test doesn’t perform any test
>for the proto case. The only one that tests the proto case is testFake
>test, but that is only creating the KafkaTableProvider and that’s it. I
>wanted to base the PubSubTableProvider on the KafkaTableProvider since that
>one supports Proto and looks like it accomplishes that support by using
>PayloadSerializer. Is that correct? Should I follow that path?
>
>
>- I wanted to base the new code on the commits by Piotr, but a lot of
>the code he submitted appears to have been removed, but I can’t grasp the
>reason. There are several commits referencing one another and the Proto
>support. Can you shed some light on that?
>
>
>- Back to the PubsubTableProvider: PubSubIO, which contains
>reader/writer methods for protos (although via ProtoCoder), is being used
>by the PubsubSchemaIOProvider class for other read/write purposes. Why are
>the protos reader/writer not used? Because of the ProtoCoder? Should we
>instead implement new readers/writers by formatting the payload directly,
>in a similar fashion to readAvrosWithBeamSchema?
>
> Thanks a lot for the help!
>
> - Fernando Morales


Re: Missing copyright notices due to LICENSE change

2021-03-18 Thread Brian Hulette
Thanks Robert! I'm +1 for reverting and engaging pkg.go.dev

> and probably cherry pick it into the affected release branches.
Even if we do this, the Java artifacts from the affected releases are
missing the additional LICENSE text.

> I do not know how to interpret this ASF guide. As an example from another
project: airflow also has a LICENSE file, NOTICE file, and a licenses
directory. There are even overlapping mentions.
Agreed. I am a software engineer, not a lawyer, and even the ASF's guide
that presumably targets engineers is not particularly clear to me. This was
just my tenuous understanding after a quick review.

On Wed, Mar 17, 2021 at 7:49 PM Ahmet Altay  wrote:

> Thank you Rebo. I agree with reverting first and then figure out the next
> steps.
>
> Here is a PR to revert your change:
> https://github.com/apache/beam/pull/14267
>
> On Wed, Mar 17, 2021 at 4:02 PM Robert Burke  wrote:
>
>> Looking at the history it seems that before the python text was added,
>> pkg.go.dev can parse the license stack just fine. It doesn't recognize
>> the PSF license, and fails closed entirely as a result.
>>
>> I've filed an issue with pkg.go.dev (
>> https://github.com/golang/go/issues/45095). If the bug is fixed, the
>> affected versions will become visible as well.
>>
>> In the meantime, we should revert my change which clobbered the other
>> licenses and probably cherry pick it into the affected release branches.
>>
>> The PSF license is annoying as it's explicitly unique. Nothing but python
>> can use it and call it the PSF license. However it is a redistribution
>> friendly license, which is what matters.
>>
>> On Wed, Mar 17, 2021, 3:00 PM Ahmet Altay  wrote:
>>
>>> Thank you for this email.
>>>
>>>
>>> On Wed, Mar 17, 2021 at 2:32 PM Brian Hulette 
>>> wrote:
>>>
>>>> I just noticed that there was a recent change to our LICENSE file to
>>>> make it exactly match the Apache 2.0 License [1]. This seems to be the
>>>> result of two conflicting LICENSE issues.
>>>>
>>>> Go LICENSE issue: The motivation for [1] was to satisfy pkg.go.dev's
>>>> license policies [2]. Prior to the change our documentation didn't show up
>>>> there [3].
>>>>
>>>> Java artifact LICENSE issue: The removed text contained information
>>>> relevant to "convenience binary distributions". This text was added in [4]
>>>> as a result of this dev@ thread [5], where we noticed that copyright
>>>> notices were missing in binary artifacts. The suggested solution (that we
>>>> went with) was to just add the information to the root (source) LICENSE.
>>>>
>>>
>>> Python distribution is missing both files as well. (
>>> https://issues.apache.org/jira/browse/BEAM-1746)
>>>
>>>
>>>>
>>>> I'm not sure that that solution is consistent with this ASF guide [6]
>>>> which states:
>>>>
>>>> > The LICENSE and NOTICE files must *exactly* represent the contents of
>>>> the distribution they reside in. Only components and resources that are
>>>> actually included in a distribution have any bearing on the content of that
>>>> distribution's NOTICE and LICENSE.
>>>>
>>>> I would argue that *just* Apache 2.0 is the correct text for our root
>>>> (source) LICENSE, and the correct way to deal with binary artifacts is to
>>>> generate per-artifact LICENSE/NOTICE files.
>>>>
>>>
>>> I do not know how to interpret this ASF guide. As an example from
>>> another project: airflow also has a LICENSE file, NOTICE file, and a
>>> licenses directory. There are even overlapping mentions.
>>>
>>>
>>>>
>>>>
>>>> So right now the Go issue is fixed, but the Java artifact issue has
>>>> regressed. I can think of two potential solutions to resolve both:
>>>> 1) Restore the "convenience binary distributions" information, and see
>>>> if we can get pkg.go.dev to allow it.
>>>> 2) Add infrastructure to generate LICENSE and NOTICE files for Java
>>>> binary artifacts.
>>>>
>>>> I have no idea how we might implement (2) so (1) seems more tenable,
>>>> but less correct since it's adding information not relevant to the source
>>>> release.
>>>>
>>>> Brian
>>>>
>>>>
>>>> [1] https://github.com/apache/beam/pull/11657
>>>> [2] https://pkg.go.dev/license-policy
>>>> [3]
>>>> https://pkg.go.dev/github.com/apache/beam@v2.21.0+incompatible/sdks/go/pkg/beam
>>>> [4] https://github.com/apache/beam/pull/5461
>>>> [5]
>>>> https://lists.apache.org/thread.html/6ef6630e908147ee83e1f1efd4befbda43efb2a59271c5cb49473103@%3Cdev.beam.apache.org%3E
>>>> [6] https://infra.apache.org/licensing-howto.html
>>>>
>>>


Missing copyright notices due to LICENSE change

2021-03-17 Thread Brian Hulette
I just noticed that there was a recent change to our LICENSE file to make
it exactly match the Apache 2.0 License [1]. This seems to be the result of
two conflicting LICENSE issues.

Go LICENSE issue: The motivation for [1] was to satisfy pkg.go.dev's
license policies [2]. Prior to the change our documentation didn't show up
there [3].

Java artifact LICENSE issue: The removed text contained information
relevant to "convenience binary distributions". This text was added in [4]
as a result of this dev@ thread [5], where we noticed that copyright
notices were missing in binary artifacts. The suggested solution (that we
went with) was to just add the information to the root (source) LICENSE.

I'm not sure that that solution is consistent with this ASF guide [6] which
states:

> The LICENSE and NOTICE files must *exactly* represent the contents of the
distribution they reside in. Only components and resources that are
actually included in a distribution have any bearing on the content of that
distribution's NOTICE and LICENSE.

I would argue that *just* Apache 2.0 is the correct text for our root
(source) LICENSE, and the correct way to deal with binary artifacts is to
generate per-artifact LICENSE/NOTICE files.


So right now the Go issue is fixed, but the Java artifact issue has
regressed. I can think of two potential solutions to resolve both:
1) Restore the "convenience binary distributions" information, and see if
we can get pkg.go.dev to allow it.
2) Add infrastructure to generate LICENSE and NOTICE files for Java binary
artifacts.

I have no idea how we might implement (2) so (1) seems more tenable, but
less correct since it's adding information not relevant to the source
release.

Brian


[1] https://github.com/apache/beam/pull/11657
[2] https://pkg.go.dev/license-policy
[3]
https://pkg.go.dev/github.com/apache/beam@v2.21.0+incompatible/sdks/go/pkg/beam
[4] https://github.com/apache/beam/pull/5461
[5]
https://lists.apache.org/thread.html/6ef6630e908147ee83e1f1efd4befbda43efb2a59271c5cb49473103@%3Cdev.beam.apache.org%3E
[6] https://infra.apache.org/licensing-howto.html


Re: BEAM-11023: tests failing on Spark Structured Streaming runner

2021-03-17 Thread Brian Hulette
You can look through the history of the PostCommit [1]. We only keep a
couple weeks of history, but it looks like we have one successful run from
Sept 10, 2020, marked as "keep forever", that ran on commit
57055262e7a6bff447eef2df1e6efcda754939ca.
Is that what you're looking for?

(Somewhat related, I was under the impression that Jenkins always kept the
before/after runs around the last state change, but that doesn't seem to be
the case as the first failure we have is [3])

Brian

[1]
https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/
[2]
https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/2049/
[3]
https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/2098/

On Tue, Mar 16, 2021 at 4:36 PM Fernando Morales Martinez <
fernando.mora...@wizeline.com> wrote:

> Hi team,
> it is mentioned in this WI that the tests (GroupByKeyTest
> testLargeKeys100MB and testGroupByKeyWithBadEqualsHashCode) stopped working
> around five months ago.
> I took a look at the PRs prior to that date and couldn't find a report
> stating that they were working.
>
> Is there a way to get reports from before June 2020 (the farthest back I
> was able to navigate) so I can compare the tests succeeding against them
> failing?
>
> Thanks a lot!
> - Fernando Morales


Re: Beam Website Feedback

2021-03-15 Thread Brian Hulette
Hm, thanks for pointing this out Brian. It looks like the Java
WindowedWordCount example does process an input file while the Python one
processes a PubSub topic. Changing the command would be a good quick fix,
but I think the best fix would actually be to make the python example
mirror the Java one. I'm curious what other dev@ readers think about this
though.
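
(For the quick fix, the command on that page would presumably just swap the
file flag for the topic flag the Python example actually parses - an untested
sketch, assuming the flag is named --input_topic to match the
known_args.input_topic usage quoted below:

python -m apache_beam.examples.windowed_wordcount \
    --input_topic projects/YOUR_PROJECT/topics/YOUR_TOPIC \
    --output_table PROJECT:DATASET.TABLE \
    --runner DataflowRunner \
    --project YOUR_GCP_PROJECT \
    --temp_location gs://YOUR_GCS_BUCKET/tmp/
)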

(I also wanted to point out Kyle recently discovered BEAM-11944 which was
introduced in the website revamp - that's why the code block switchers
aren't working on that page)

Brian





On Sun, Mar 14, 2021 at 2:27 PM Mo Brian  wrote:

> Hi team,
>
>
>
> I’m studying the apache beam from
> https://beam.apache.org/get-started/wordcount-example/#windowedwordcount-example
>
>
>
> A bit lost on windowed_wordcount.py and its start command:
>
>
>
> *windowed_wordcount.py takes a pubsub topic as input:*
>
> lines = p | beam.io.ReadFromPubSub(known_args.input_topic)
>
>
>
> *the start command provides a file input:*
>
> python -m apache_beam.examples.windowed_wordcount --input YOUR_INPUT_FILE \
>     --output_table PROJECT:DATASET.TABLE \
>     --runner DataflowRunner \
>     --project YOUR_GCP_PROJECT \
>     --temp_location gs://YOUR_GCS_BUCKET/tmp/
>
>
>
> Should I change the command here?
>
>
>
> Thanks
>
>
>
> Brian


Re: [VOTE] Release vendor-calcite-1_26_0 version 0.1, release candidate #1

2021-03-09 Thread Brian Hulette
There are several jiras blocked by a Calcite upgrade. See
https://issues.apache.org/jira/browse/BEAM-9379

On Tue, Mar 9, 2021 at 5:17 AM Ismaël Mejía  wrote:

> Just out of curiosity is there some feature we are expecting from
> Calcite that pushes this upgrade or is this just catching up for the
> sake of security improvements + not having old dependencies?
>
>
> On Tue, Mar 9, 2021 at 12:23 AM Ahmet Altay  wrote:
> >
> > +1 (binding)
> >
> > On Mon, Mar 8, 2021 at 3:21 PM Pablo Estrada  wrote:
> >>
> >> +1 (binding) verified signature
> >>
> >> On Mon, Mar 8, 2021 at 12:05 PM Kai Jiang  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> On Mon, Mar 8, 2021 at 11:37 AM Andrew Pilloud 
> wrote:
> 
>  Hi everyone,
>  Please review and vote on the release candidate #1 for
> beam-vendor-calcite-1_26_0 version 0.1, as follows:
>  [ ] +1, Approve the release
>  [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
>  Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if no issues are found.
> 
>  The complete staging area is available for your review, which
> includes:
>  * the official Apache source release to be deployed to
> dist.apache.org [1], which is signed with the key with fingerprint
> 9E7CEC0661EFD610B632C610AE8FE17F9F8AE3D4 [2],
>  * all artifacts to be deployed to the Maven Central Repository [3],
>  * source code commit "0f52187e344dad9bca4c183fe51151b733b24e35" [4].
> 
>  The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
> 
>  Thanks,
>  Andrew
> 
>  [1]
> https://dist.apache.org/repos/dist/dev/beam/vendor/beam-vendor-calcite-1_26_0/0.1/
>  [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>  [3]
> https://repository.apache.org/content/repositories/orgapachebeam-1163/
>  [4]
> https://github.com/apache/beam/tree/0f52187e344dad9bca4c183fe51151b733b24e35
>

