Re: Ordered PCollections eventually?

2021-05-10 Thread Sam Rohde
Awesome, thanks Pablo!

On Mon, May 10, 2021 at 4:05 PM Pablo Estrada  wrote:

> CDC would also benefit. I am working on a proposal for this that is
> concerned with streaming pipelines, and per-key ordered delivery. I will
> share with you as soon as I have a draft.
> Best
> -P.
>
> On Mon, May 10, 2021 at 2:56 PM Reuven Lax  wrote:
>
>> There has been talk, but nothing concrete.
>>
>> On Mon, May 10, 2021 at 1:42 PM Sam Rohde  wrote:
>>
>>> Hi All,
>>>
>>> I was wondering if there had been any plans for creating ordered
>>> PCollections in the Beam model? Or if there might be plans for them in the
>>> future?
>>>
>>> I know that Beam SQL and Beam DataFrames would directly benefit from an
>>> ordered PCollection.
>>>
>>> Regards,
>>> Sam
>>>
>>


Re: Transform-specific thread pools in Python

2021-05-10 Thread Ahmet Altay
On Mon, May 10, 2021 at 8:01 AM Stephan Hoyer  wrote:

> Hi Beam devs,
>
> I've been exploring recently how to optimize IO bound steps for my Python
> Beam pipelines, and have come up with a solution that I think might make
> sense to upstream into Beam's Python SDK.
>
> It appears that Beam runners (at least the Cloud Dataflow runner)
> typically use only a single thread per Python process.
>

I thought the default was not 1 but something else (12?). Maybe that
changed.


> The number of threads per worker can be adjusted with flags, but only for
> the entire pipeline. This behavior makes sense *in general* under the
> worst-case assumption that user-code in Python is CPU bound and requires
> the GIL.
>
> However, multiple threads can be quite helpful in many cases, e.g.,
> 1. CPU bound tasks that release the GIL. This is typically the case when
> using libraries for numerical computing, such as NumPy and pandas.
> 2. IO bound tasks that can be run asynchronously, e.g., reading/writing
> files or RPCs. This is the use-case for which not using threads can be most
> problematic, e.g., in a recent dataflow pipeline reading/writing lots of
> relatively small files (~1-10 MB) to cloud storage with the default number
> of threads per worker, I found that I was only using ~20% of available CPU.
>
> Because the optimal number of threads for Python code can be quite
> heterogeneous, I would like to be able to indicate that particular steps of
> my Beam pipelines should be executed using more threads. This would be
> particularly valuable for writing libraries of custom IO transforms, which
> should still conservatively assume that *other* steps in user provided
> pipelines may be CPU bound.
>
> The solution I've come up with is to use beam.BatchElements with a ParDo
> function that executes tasks in separate threads (via
> concurrent.futures.ThreadPoolExecutor). I've used this to make high-level wrappers
> like beam.Map, beam.MapTuple, etc that execute with multiple threads. This
> seems to work pretty well for my use-cases. I can put these in my own
> library, of course, but perhaps these would make sense upstream into Beam's
> Python SDK itself?
>

I believe a related idea (async ParDo) was discussed and some work was done
earlier (https://issues.apache.org/jira/browse/BEAM-6550). AFAIK Flink has a
similar concept (
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/datastream/operators/asyncio/).

Perhaps you can share a few more details about your proposal along with
your code, so people can provide more feedback on it.


>
> One alternative would be supporting this sort of concurrency control
> inside Beam runners. In principle, I imagine runners could tune thread-pool
> size for each stage automatically, e.g., based on CPU usage. To be honest,
> I'm a little surprised this doesn't happen already, but I'm sure there are
> good reasons why not.
>

Runner support would be the ideal solution, because runners could decide on
the optimal pool size based on real-time information. Supporting and using
annotations would provide helpful hints for the runners. At least the latter
part is in progress IIRC.
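
To make the idea of picking a pool size from real-time information a bit more
concrete, here is a rough sketch of one possible heuristic: scale the thread
count so that measured CPU utilization moves toward a target. The function name
suggest_pool_size, the 0.8 target and the cap of 64 threads are assumptions made
purely for illustration; this is not taken from any Beam runner.

```python
import math


def suggest_pool_size(
    current_threads: int,
    cpu_utilization: float,
    target_utilization: float = 0.8,
    max_threads: int = 64,
) -> int:
    """Hypothetical heuristic: pick a thread count for a stage.

    cpu_utilization is the fraction (0, 1] of available CPU observed while
    the stage ran with current_threads threads. If the stage is mostly
    waiting on IO, utilization is low and the suggestion grows; if it is
    already CPU bound, the suggestion shrinks toward the target.
    """
    if current_threads < 1:
        raise ValueError("current_threads must be at least 1")
    if not 0.0 < cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be in (0, 1]")
    # Assume throughput (and CPU use) scales roughly linearly with threads.
    scaled = math.ceil(current_threads * target_utilization / cpu_utilization)
    # Clamp to a sane range so a noisy measurement cannot explode the pool.
    return max(1, min(max_threads, scaled))


if __name__ == "__main__":
    # A stage using a quarter of the CPU with 4 threads, aiming for half.
    print(suggest_pool_size(4, 0.25, target_utilization=0.5))
```

The linear-scaling assumption is the weak point of this sketch: it ignores GIL
contention and memory pressure, which is one reason explicit per-transform hints
could still be useful alongside any automatic tuning.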


>
> Let me know what you think!
>
> Cheers,
> Stephan
>
>


Re: Ordered PCollections eventually?

2021-05-10 Thread Pablo Estrada
CDC would also benefit. I am working on a proposal for this that is
concerned with streaming pipelines, and per-key ordered delivery. I will
share with you as soon as I have a draft.
Best
-P.

On Mon, May 10, 2021 at 2:56 PM Reuven Lax  wrote:

> There has been talk, but nothing concrete.
>
> On Mon, May 10, 2021 at 1:42 PM Sam Rohde  wrote:
>
>> Hi All,
>>
>> I was wondering if there had been any plans for creating ordered
>> PCollections in the Beam model? Or if there might be plans for them in the
>> future?
>>
>> I know that Beam SQL and Beam DataFrames would directly benefit from an
>> ordered PCollection.
>>
>> Regards,
>> Sam
>>
>


Flaky test issue report

2021-05-10 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests. These are P1 issues 
because they have a major negative impact on the community and make it hard to 
determine the quality of the software.

BEAM-12311: Python PostCommit are close to timeout 
(https://issues.apache.org/jira/browse/BEAM-12311)
BEAM-12309: PubSubIntegrationTest.test_streaming_data_only flake 
(https://issues.apache.org/jira/browse/BEAM-12309)
BEAM-12307: PubSubBigQueryIT.test_file_loads flake 
(https://issues.apache.org/jira/browse/BEAM-12307)
BEAM-12303: Flake in PubSubIntegrationTest.test_streaming_with_attributes 
(https://issues.apache.org/jira/browse/BEAM-12303)
BEAM-12293: FlinkSavepointTest.testSavepointRestoreLegacy flakes due to 
FlinkJobNotFoundException (https://issues.apache.org/jira/browse/BEAM-12293)
BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (https://issues.apache.org/jira/browse/BEAM-12291)
BEAM-12250: Java ValidatesRunner Postcommits timing out 
(https://issues.apache.org/jira/browse/BEAM-12250)
BEAM-12200: SamzaStoreStateInternalsTest is flaky 
(https://issues.apache.org/jira/browse/BEAM-12200)
BEAM-12163: Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK 
harness startup (https://issues.apache.org/jira/browse/BEAM-12163)
BEAM-12061: beam_PostCommit_SQL failing on 
KafkaTableProviderIT.testFakeNested 
(https://issues.apache.org/jira/browse/BEAM-12061)
BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (https://issues.apache.org/jira/browse/BEAM-12019)
BEAM-11792: Python precommit failed (flaked?) installing package  
(https://issues.apache.org/jira/browse/BEAM-11792)
BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (https://issues.apache.org/jira/browse/BEAM-11666)
BEAM-11662: elasticsearch tests failing 
(https://issues.apache.org/jira/browse/BEAM-11662)
BEAM-11661: hdfsIntegrationTest flake: network not found (py38 postcommit) 
(https://issues.apache.org/jira/browse/BEAM-11661)
BEAM-11646: beam_PostCommit_XVR_Spark failing 
(https://issues.apache.org/jira/browse/BEAM-11646)
BEAM-11645: beam_PostCommit_XVR_Flink failing 
(https://issues.apache.org/jira/browse/BEAM-11645)
BEAM-11541: testTeardownCalledAfterExceptionInProcessElement flakes on 
direct runner. (https://issues.apache.org/jira/browse/BEAM-11541)
BEAM-11540: Linter sometimes flakes on apache_beam.dataframe.frames_test 
(https://issues.apache.org/jira/browse/BEAM-11540)
BEAM-10995: Java + Universal Local Runner: 
WindowingTest.testWindowPreservation fails 
(https://issues.apache.org/jira/browse/BEAM-10995)
BEAM-10987: stager_test.py::StagerTest::test_with_main_session flaky on 
windows py3.6,3.7 (https://issues.apache.org/jira/browse/BEAM-10987)
BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (https://issues.apache.org/jira/browse/BEAM-10968)
BEAM-10955: Flink Java Runner test flake: Could not find Flink job  
(https://issues.apache.org/jira/browse/BEAM-10955)
BEAM-10923: Python requirements installation in docker container is flaky 
(https://issues.apache.org/jira/browse/BEAM-10923)
BEAM-10899: test_FhirIO_exportFhirResourcesGcs flake with OOM 
(https://issues.apache.org/jira/browse/BEAM-10899)
BEAM-10866: PortableRunnerTestWithSubprocesses.test_register_finalizations 
flaky on macOS (https://issues.apache.org/jira/browse/BEAM-10866)
BEAM-10590: BigQueryQueryToTableIT flaky: test_big_query_new_types 
(https://issues.apache.org/jira/browse/BEAM-10590)
BEAM-10519: 
MultipleInputsAndOutputTests.testParDoWithSideInputsIsCumulative flaky on Samza 
(https://issues.apache.org/jira/browse/BEAM-10519)
BEAM-10504: Failure / flake in ElasticSearchIOTest > 
testWriteFullAddressing and testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10504)
BEAM-10501: CheckGrafanaStalenessAlerts and PingGrafanaHttpApi fail with 
Connection refused (https://issues.apache.org/jira/browse/BEAM-10501)
BEAM-10485: Failure / flake: ElasticsearchIOTest > testWriteWithIndexFn 
(https://issues.apache.org/jira/browse/BEAM-10485)
BEAM-10272: Failure in CassandraIOTest init: cannot create cluster due to 
netty link error (https://issues.apache.org/jira/browse/BEAM-10272)
BEAM-9649: beam_python_mongoio_load_test started failing due to mismatched 
results (https://issues.apache.org/jira/browse/BEAM-9649)
BEAM-9392: TestStream tests are all flaky 
(https://issues.apache.org/jira/browse/BEAM-9392)
BEAM-9232: BigQueryWriteIntegrationTests is flaky coercing to Unicode 
(https://issues.apache.org/jira/browse/BEAM-9232)
BEAM-9119: 
apache_beam.runners.portability.fn_api_runner_test.FnApiRunnerTest[...].test_large_elements
 is flaky 

P1 issues report

2021-05-10 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests.

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

BEAM-12321: Failure in test_run_packable_combine_per_key and 
test_run_packable_combine_globally 
(https://issues.apache.org/jira/browse/BEAM-12321)
BEAM-12320: PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing 
in SQL PostCommit (https://issues.apache.org/jira/browse/BEAM-12320)
BEAM-12316: LGPL in bundled dependencies 
(https://issues.apache.org/jira/browse/BEAM-12316)
BEAM-12310: beam_PostCommit_Java_DataflowV2 failing 
(https://issues.apache.org/jira/browse/BEAM-12310)
BEAM-12308: CrossLanguageKafkaIOTest.test_kafkaio flake 
(https://issues.apache.org/jira/browse/BEAM-12308)
BEAM-12290: TestPubsub.assertThatSubscriptionEventuallyCreated timeout does 
not work (https://issues.apache.org/jira/browse/BEAM-12290)
BEAM-12287: beam_PerformanceTests_Kafka_IO failing due to 
:sdks:java:container:pullLicenses failure 
(https://issues.apache.org/jira/browse/BEAM-12287)
BEAM-12279: Implement destination-dependent sharding in FileIO.writeDynamic 
(https://issues.apache.org/jira/browse/BEAM-12279)
BEAM-12258: SQL postcommit timing out 
(https://issues.apache.org/jira/browse/BEAM-12258)
BEAM-12256: PubsubIO.readAvroGenericRecord creates SchemaCoder that fails 
to decode some Avro logical types 
(https://issues.apache.org/jira/browse/BEAM-12256)
BEAM-12231: beam_PostRelease_NightlySnapshot failing 
(https://issues.apache.org/jira/browse/BEAM-12231)
BEAM-1: Dataflow side input translation "Unknown producer for value" 
(https://issues.apache.org/jira/browse/BEAM-1)
BEAM-11959: Python Beam SDK Harness hangs when installing pip packages 
(https://issues.apache.org/jira/browse/BEAM-11959)
BEAM-11906: No trigger early repeatedly for session windows 
(https://issues.apache.org/jira/browse/BEAM-11906)
BEAM-11875: XmlIO.Read does not handle XML encoding per spec 
(https://issues.apache.org/jira/browse/BEAM-11875)
BEAM-11828: JmsIO is not acknowledging messages correctly 
(https://issues.apache.org/jira/browse/BEAM-11828)
BEAM-11755: Cross-language consistency (RequiresStableInputs) is quietly 
broken (at least on portable flink runner) 
(https://issues.apache.org/jira/browse/BEAM-11755)
BEAM-11578: `dataflow_metrics` (python) fails with TypeError (when int 
overflowing?) (https://issues.apache.org/jira/browse/BEAM-11578)
BEAM-11576: Go ValidatesRunner failure: TestFlattenDup on Dataflow Runner 
(https://issues.apache.org/jira/browse/BEAM-11576)
BEAM-11434: Expose Spanner admin/batch clients in Spanner Accessor 
(https://issues.apache.org/jira/browse/BEAM-11434)
BEAM-11227: Upgrade beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216 
(https://issues.apache.org/jira/browse/BEAM-11227)
BEAM-11148: Kafka commitOffsetsInFinalize OOM on Flink 
(https://issues.apache.org/jira/browse/BEAM-11148)
BEAM-11017: Timer with dataflow runner can be set multiple times (dataflow 
runner) (https://issues.apache.org/jira/browse/BEAM-11017)
BEAM-10861: Adds URNs and payloads to PubSub transforms 
(https://issues.apache.org/jira/browse/BEAM-10861)
BEAM-10670: Make non-portable Splittable DoFn the only option when 
executing Java "Read" transforms 
(https://issues.apache.org/jira/browse/BEAM-10670)
BEAM-10617: python CombineGlobally().with_fanout() cause duplicate combine 
results for sliding windows (https://issues.apache.org/jira/browse/BEAM-10617)
BEAM-10569: SpannerIO tests don't actually assert anything. 
(https://issues.apache.org/jira/browse/BEAM-10569)
BEAM-10288: Quickstart documents are out of date 
(https://issues.apache.org/jira/browse/BEAM-10288)
BEAM-10244: Populate requirements cache fails on poetry-based packages 
(https://issues.apache.org/jira/browse/BEAM-10244)
BEAM-10100: FileIO writeDynamic with AvroIO.sink not writing all data 
(https://issues.apache.org/jira/browse/BEAM-10100)
BEAM-9564: Remove insecure ssl options from MongoDBIO 
(https://issues.apache.org/jira/browse/BEAM-9564)
BEAM-9455: Environment-sensitive provisioning for Dataflow 
(https://issues.apache.org/jira/browse/BEAM-9455)
BEAM-9293: Python direct runner doesn't emit empty pane when it should 
(https://issues.apache.org/jira/browse/BEAM-9293)
BEAM-8986: SortValues may not work correct for numerical types 
(https://issues.apache.org/jira/browse/BEAM-8986)
BEAM-8985: SortValues should fail if SecondaryKey coder is not 
deterministic (https://issues.apache.org/jira/browse/BEAM-8985)
BEAM-8407: [SQL] Some Hive tests throw NullPointerException, but get marked 
as passing (Direct Runner) (https://issues.apache.org/jira/browse/BEAM-8407)
BEAM-7717: PubsubIO watermark tracking hovers near start of epoch 
(https://issues.apache.org/jira/browse/BEAM-7717)
BEAM-7716: PubsubIO returns empty message 

Re: P0 (outage) report

2021-05-10 Thread Kenneth Knowles
This is very important, but is not an "outage" per
https://beam.apache.org/contribute/jira-priorities/

I have adjusted the Jira. This is P1 and release blocker. Release blockers
are indicated by setting Fix Version field to an unreleased version.

Kenn

On Mon, May 10, 2021 at 10:58 AM Beam Jira Bot  wrote:

> This is your daily summary of Beam's current outages. See
> https://beam.apache.org/contribute/jira-priorities/#p0-outage for the
> meaning and expectations around P0 issues.
>
> BEAM-12316: LGPL in bundled dependencies (
> https://issues.apache.org/jira/browse/BEAM-12316)
>


Ordered PCollections eventually?

2021-05-10 Thread Sam Rohde
Hi All,

I was wondering if there had been any plans for creating ordered
PCollections in the Beam model? Or if there might be plans for them in the
future?

I know that Beam SQL and Beam DataFrames would directly benefit from an
ordered PCollection.

Regards,
Sam


Re: [DISCUSS] Warn when KafkaIO is used as a bounded source

2021-05-10 Thread Boyuan Zhang
Just added more details on BEAM-6466. In short, BEAM-6466 looks more like a
feature request than a bug to me.

On Fri, Apr 30, 2021 at 12:48 PM Pablo Estrada  wrote:

> I suppose a production-ready bounded KafkaIO may fetch until reaching the
> end of each partition(?), or receive a final offset for each partition?
>
> Let's definitely add the warning.
> Best
> -P.
>
> On Fri, Apr 30, 2021 at 11:33 AM Brian Hulette 
> wrote:
>
>> I guess that is the question. [2] and [3] above make me think that this
>> is experimental and just not labeled as such.
>>
>> It doesn't seem reasonable to have both an open feature request for
>> bounded KafkaIO (BEAM-2185), and a bug report regarding bounded KafkaIO
>> (BEAM-6466).
>>
>> On Fri, Apr 30, 2021 at 11:26 AM Pablo Estrada 
>> wrote:
>>
>>> Are they experimental? I suppose this is a valid use case, right? I am
>>> in favor of adding a warning, but I don't know if I would call them
>>> experimental.
>>>
>>> I suppose a repeated-batch use case may do this repeatedly (though then
>>> users would need to recover the latest offsets for each partition, which I
>>> guess is not possible at the moment?)
>>>
>>> On Thu, Apr 29, 2021 at 4:17 PM Brian Hulette 
>>> wrote:
>>>
>>>> Our oldest open P1 issue is BEAM-6466 - "KafkaIO doesn't commit offsets
>>>> while being used as bounded source" [1]. I'm not sure this is an actual
>>>> issue since KafkaIO doesn't seem to officially support this use-case. The
>>>> relevant parameters indicate they are "mainly used for tests and demo
>>>> applications" [2], and BEAM-2185 - "KafkaIO bounded source" [3] is still
>>>> open.
>>>>
>>>> I think we should close out BEAM-6466 by more clearly indicating that
>>>> withMaxReadTime() and withMaxRecords() are experimental, and/or logging a
>>>> warning when they are used.
>>>>
>>>> I'm happy to make such a change, but I wanted to check if there are any
>>>> objections to this first.
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-6466
>>>> [2]
>>>> https://github.com/apache/beam/blob/3d4db26cfa4ace0a0f2fbb602f422fe30670c35f/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L960
>>>> [3] https://issues.apache.org/jira/browse/BEAM-2185
>>>>
>>>


Re: Flake trends - better?

2021-05-10 Thread Ahmet Altay
Any suggestions on how to clean this up? We can organize a cleanup to
reduce the numbers a bit. Ideally we need to find a way to prevent the
future growth but a temporary reduction might make it easier for us to keep
reviewing and closing new issues.

On Mon, May 10, 2021 at 9:11 AM Brian Hulette  wrote:

> In addition to stale flake jiras, I think there are also many tracking
> tests that were disabled years ago due to flakiness.
>
> On Sat, May 8, 2021 at 1:39 PM Kenneth Knowles  wrote:
>
>> Oh the second chart is not automatically associated with the
>> board/filter. Here is the correct link:
>> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=filter-12350547=daily=300=12319527=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin=Next
>>
>> On Sat, May 8, 2021 at 1:37 PM Kenneth Knowles  wrote:
>>
>>> The second chart is clearly bad and getting worse. Our flake bugs are
>>> not getting addressed in a timely manner.
>>>
>>> Zooming in on the first chart for the last 3 months you can see a
>>> notable change:
>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464=BEAM=reporting=cumulativeFlowDiagram=1174=1175=2038=2039=2040=90.
>>> This will not change the average in the second chart very quickly.
>>>
>>> It may be just cleanup. That seems likely. Anecdotally, I have done a
>>> lot of triage recently of failures and I know of only two severe flakes
>>> (that you can count on seeing in a day/week). If so, then more cleanup
>>> would be valuable. This is why I ran the second report: I suspected that we
>>> had a lot of very old stale flake bugs that no one is looking at.
>>>
>>> Kenn
>>>
>>> On Fri, May 7, 2021 at 4:37 PM Ahmet Altay  wrote:
>>>
>>>> Thank you for sharing the charts.
>>>>
>>>> I know you are the messenger here, but I disagree with the message that
>>>> flakes are getting noticeably better. The number of open issues looks quite
>>>> large but at least stable. I will guess that some of those are stale and
>>>> seemingly we did a clean up in July 2020. We can try that again. Second
>>>> chart shows a bad picture IMO. Issues staying open for 500-600 days on
>>>> average sounds really long.
>>>>
>>>> On Fri, May 7, 2021 at 1:42 PM Kenneth Knowles wrote:
>>>>
>>>>> Alright, I think it should be fixed. The underlying saved filter had
>>>>> not been shared.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Fri, May 7, 2021 at 8:02 AM Brian Hulette
>>>>> wrote:
>>>>>
>>>>>> The first link doesn't work for me, I just see a blank page with some
>>>>>> jira header and navbar. Do I need some additional permissions?
>>>>>>
>>>>>> If I click over to "Kanban Board" on the toggle at the top right I
>>>>>> see a card with "Error: The requested board cannot be viewed because it
>>>>>> either does not exist or you do not have permission to view it."
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Thu, May 6, 2021 at 5:56 PM Kenneth Knowles
>>>>>> wrote:
>>>>>>
>>>>>>> I spoke too soon?
>>>>>>>
>>>>>>>
>>>>>>> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=project-12319527=daily=300=12319527=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin=Next
>>>>>>>
>>>>>>> On Thu, May 6, 2021 at 5:54 PM Kenneth Knowles
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I made a quick* Jira chart to see how we are doing at flakes:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464=BEAM=reporting=cumulativeFlowDiagram=1174=1175=2038=2039=2040
>>>>>>>>
>>>>>>>> Looking a lot better recently at resolving them! (whether these are
>>>>>>>> new fixes or just resolving stale bugs, I love it)
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> *AFAICT you need to make a saved search, then an agile board based
>>>>>>>> on the saved search, then you can look at reports
>>>>>>>>
>>>>>>>


Re: Extremely Slow DirectRunner

2021-05-10 Thread Boyuan Zhang
Hi Evan,

What do you mean by startup delay? Is it the time from when you start the
pipeline to when you notice the first output record from PubSub?

On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:

> Can you try running direct runner with the option
> `--experiments=use_deprecated_read`
>
> Seems like an instance of
> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
> also reported in
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>
> We should roll back using the SDF wrapper by default because of the
> usability and performance issues reported.
>
>
> On Sat, May 8, 2021 at 12:57 AM Evan Galpin  wrote:
>
>> Hi all,
>>
>> I’m experiencing very slow performance and startup delay when testing a
>> pipeline locally. I’m reading data from a Google PubSub subscription as the
>> data source, and before each pipeline execution I ensure that data is
>> present in the subscription (readable from GCP console).
>>
>> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
>> min). Is that expected? I did find a Jira ticket[1] that at first seemed
>> related, but I think it has more to do with BQ than DirectRunner.
>>
>> I’ve run the pipeline with a debugger connected and confirmed that it’s
>> minutes before the first DoFn in my pipeline receives any data. Is there a
>> way I can profile the direct runner to see what it’s churning on?
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>>
>


P0 (outage) report

2021-05-10 Thread Beam Jira Bot
This is your daily summary of Beam's current outages. See 
https://beam.apache.org/contribute/jira-priorities/#p0-outage for the meaning 
and expectations around P0 issues.

BEAM-12316: LGPL in bundled dependencies 
(https://issues.apache.org/jira/browse/BEAM-12316)


Re: Flake trends - better?

2021-05-10 Thread Brian Hulette
In addition to stale flake jiras, I think there are also many tracking
tests that were disabled years ago due to flakiness.

On Sat, May 8, 2021 at 1:39 PM Kenneth Knowles  wrote:

> Oh the second chart is not automatically associated with the board/filter.
> Here is the correct link:
> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=filter-12350547=daily=300=12319527=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin=Next
>
> On Sat, May 8, 2021 at 1:37 PM Kenneth Knowles  wrote:
>
>> The second chart is clearly bad and getting worse. Our flake bugs are not
>> getting addressed in a timely manner.
>>
>> Zooming in on the first chart for the last 3 months you can see a notable
>> change:
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464=BEAM=reporting=cumulativeFlowDiagram=1174=1175=2038=2039=2040=90.
>> This will not change the average in the second chart very quickly.
>>
>> It may be just cleanup. That seems likely. Anecdotally, I have done a lot
>> of triage recently of failures and I know of only two severe flakes (that
>> you can count on seeing in a day/week). If so, then more cleanup would be
>> valuable. This is why I ran the second report: I suspected that we had a
>> lot of very old stale flake bugs that no one is looking at.
>>
>> Kenn
>>
>> On Fri, May 7, 2021 at 4:37 PM Ahmet Altay  wrote:
>>
>>> Thank you for sharing the charts.
>>>
>>> I know you are the messenger here, but I disagree with the message that
>>> flakes are getting noticeably better. The number of open issues looks quite
>>> large but at least stable. I will guess that some of those are stale and
>>> seemingly we did a clean up in July 2020. We can try that again. Second
>>> chart shows a bad picture IMO. Issues staying open for 500-600 days on
>>> average sounds really long.
>>>
>>> On Fri, May 7, 2021 at 1:42 PM Kenneth Knowles  wrote:
>>>
>>>> Alright, I think it should be fixed. The underlying saved filter had
>>>> not been shared.
>>>>
>>>> Kenn
>>>>
>>>> On Fri, May 7, 2021 at 8:02 AM Brian Hulette
>>>> wrote:
>>>>
>>>>> The first link doesn't work for me, I just see a blank page with some
>>>>> jira header and navbar. Do I need some additional permissions?
>>>>>
>>>>> If I click over to "Kanban Board" on the toggle at the top right I see
>>>>> a card with "Error: The requested board cannot be viewed because it either
>>>>> does not exist or you do not have permission to view it."
>>>>>
>>>>> Brian
>>>>>
>>>>> On Thu, May 6, 2021 at 5:56 PM Kenneth Knowles
>>>>> wrote:
>>>>>
>>>>>> I spoke too soon?
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/secure/ConfigureReport.jspa?projectOrFilterId=project-12319527=daily=300=12319527=com.atlassian.jira.jira-core-reports-plugin%3Aaverageage-report_token=A5KQ-2QAV-T4JA-FDED_ea6ac783c727523cf6bfed04ba94ce91bb62da91_lin=Next
>>>>>>
>>>>>> On Thu, May 6, 2021 at 5:54 PM Kenneth Knowles
>>>>>> wrote:
>>>>>>
>>>>>>> I made a quick* Jira chart to see how we are doing at flakes:
>>>>>>>
>>>>>>>
>>>>>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=464=BEAM=reporting=cumulativeFlowDiagram=1174=1175=2038=2039=2040
>>>>>>>
>>>>>>> Looking a lot better recently at resolving them! (whether these are
>>>>>>> new fixes or just resolving stale bugs, I love it)
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> *AFAICT you need to make a saved search, then an agile board based
>>>>>>> on the saved search, then you can look at reports
>>>>>>>
>>>>>>


Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-10 Thread Kenneth Knowles
If nothing breaks, and we check perf, then absolutely this seems good.

Kenn

On Mon, May 10, 2021 at 12:38 AM Ismaël Mejía  wrote:

> Most issues on the previous migration were related to changes in the behavior
> of class-loading on Java 11. It seems Oracle is taking a more
> backwards-compatible approach in the latest releases, so let's hope everything
> will go well. In the meantime I tested the upgrade locally and the tests pass,
> so we should be good to go. I opened a PR [1] for the version upgrade and,
> assuming consensus on this proposal, I expect we can proceed to a vote soon.
>
> [1] https://github.com/apache/beam/pull/14766
>
>
> On Sun, May 9, 2021 at 6:13 PM Reuven Lax  wrote:
>
>> We've had some issues in the past with semantic changes in ByteBuddy (I
>> think related to new Java versions) that required rewriting code in Beam.
>>
>> On Sat, May 8, 2021 at 10:46 PM Ismaël Mejía  wrote:
>>
>>> What were the issues last time Reuven? I remember that the release and
>>> upgrade PR were pretty smooth, were there unintended consequences from the
>>> library changes themselves?
>>>
>>>
>>> On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:
>>>
>>>> Sounds good. Based on previous experience though, this might be a
>>>> difficult upgrade to do.
>>>>
>>>> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía wrote:
>>>>
>>>>> The version of bytebuddy Beam is vendoring (1.10.8) is already 16 months
>>>>> old and it is not compatible with more recent versions of Java. I would
>>>>> like to propose that we upgrade it [1] to the most recent version
>>>>> (1.11.0) [2] so we can benefit from the latest improvements for Java
>>>>> 16/17 and upgraded ASM.
>>>>>
>>>>> If everyone agrees I would like to volunteer as the release manager for
>>>>> this upgrade.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-12241
>>>>> [2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md
>>>>>
>>>>>


Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Jean-Baptiste Onofre
+1 fully agree.

Regards
JB

> On May 10, 2021, at 16:02, Jan Lukavský wrote:
> 
> +1 for blocking the release - I think we should not release something about 
> which we _know_ that it might be legally problematic. And we should 
> definitely create a check in the build process that will warn about such 
> issues in the future.
> 
>  Jan
> 
> On 5/10/21 3:44 PM, Ismaël Mejía wrote:
>> Tomo just confirmed in the ticket that if we update the gRPC vendored 
>> version we won't need the JBoss dependency anymore so we should be good to 
>> go with the upgrade. The open question is if this should be blocking for the 
>> upcoming Beam 2.31.0 release or we can fix it afterwards.
>> 
>> 
>> On Mon, May 10, 2021 at 2:46 PM Ismaël Mejía wrote:
>> We have been discussing updating the vendored dependency in BEAM-11227.
>> If I remember correctly, the newer version of gRPC does not require the
>> jboss dependency, so that is probably the best upgrade path. Can you
>> confirm, Tomo Suzuki?
>> 
>> On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk wrote:
>> Also we have a very similar discussion about it in
>> https://issues.apache.org/jira/browse/LEGAL-572
>>
>> Just to be clear about the context of it, it's not a legal requirement of
>> the Apache License, it's Apache Software Foundation policy, that we should
>> not limit our users in using our software. If the LGPL
>> dependency is "optional", it's fine to add such an optional dependency. If it
>> is "required" to run your software, then it is not allowed as it limits the
>> users of ASF software in further redistributing the software in the way they
>> want (this is at least my understanding of it).
>> 
>> On Mon, May 10, 2021 at 12:58 PM JB Onofré wrote:
>> Hi
>>
>> You can take a look at
>> 
>> https://www.apache.org/legal/resolved.html 
>> 
>> 
>> Regards 
>> JB
>> 
>>> On May 10, 2021, at 12:56, Elliotte Rusty Harold wrote:
>>> 
>>> Anyone have a link to the official Apache policy about this? Thanks.
>>> 
>>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský wrote:
>>>>
>>>> Hi,
>>>>
>>>> we are bundling dependencies with LGPL-2.1, according to license header
>>>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
>>>> might be an issue, already reported here: [1]. I created [2] to track it
>>>> on our side.
>>>>
>>>>  Jan
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>>>>
>>>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>>>>
>>> 
>>> 
>>> -- 
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org 
>> 
>> 
>> -- 
>> +48 660 796 129



Transform-specific thread pools in Python

2021-05-10 Thread Stephan Hoyer
Hi Beam devs,

I've been exploring recently how to optimize IO bound steps for my Python
Beam pipelines, and have come up with a solution that I think might make
sense to upstream into Beam's Python SDK.

It appears that Beam runners (at least the Cloud Dataflow runner) typically
use only a single thread per Python process. The number of threads per
worker can be adjusted with flags, but only for the entire pipeline. This
behavior makes sense *in general* under the worst-case assumption that
user-code in Python is CPU bound and requires the GIL.

However, multiple threads can be quite helpful in many cases, e.g.,
1. CPU bound tasks that release the GIL. This is typically the case when
using libraries for numerical computing, such as NumPy and pandas.
2. IO bound tasks that can be run asynchronously, e.g., reading/writing
files or RPCs. This is the use-case for which not using threads can be most
problematic, e.g., in a recent Dataflow pipeline reading/writing lots of
relatively small files (~1-10 MB) to cloud storage with the default number
of threads per worker, I found that I was only using ~20% of available CPU.

Because the optimal number of threads for Python code can be quite
heterogeneous, I would like to be able to indicate that particular steps of
my Beam pipelines should be executed using more threads. This would be
particularly valuable for writing libraries of custom IO transforms, which
should still conservatively assume that *other* steps in user provided
pipelines may be CPU bound.

The solution I've come up with is to use beam.BatchElements with a ParDo
function that executes tasks in separate threads (via
concurrent.futures.ThreadPoolExecutor). I've used this to make high-level
wrappers like beam.Map, beam.MapTuple, etc. that execute with multiple
threads. This seems to work pretty well for my use-cases. I can put these in
my own library, of course, but perhaps it would make sense to upstream them
into Beam's Python SDK itself?
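
To make the idea concrete, here is a minimal sketch of the per-batch
threading step. It is not the actual wrapper code: the helper name
threaded_map_batch and the default of 16 threads are assumptions made for
illustration, and the Beam wiring is shown only as a comment so the snippet
runs with the standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor


def threaded_map_batch(func, batch, num_threads=16):
    """Apply func to every element of batch using a thread pool.

    Results are returned in the same order as the input batch.
    """
    if not batch:
        return []
    # Never start more threads than there are elements to process.
    workers = min(num_threads, len(batch))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map preserves input order and re-raises the first
        # exception hit by any task when its result is consumed.
        return list(executor.map(func, batch))


# Hypothetical wiring into a pipeline (needs apache_beam, not run here);
# fetch stands for any IO-bound per-element function:
#   pcoll
#   | beam.BatchElements(min_batch_size=1, max_batch_size=100)
#   | beam.FlatMap(lambda batch: threaded_map_batch(fetch, batch))
```

Batching first matters because a thread pool only helps when one call has
several elements to work on at once; with single elements the pool would
sit idle.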

One alternative would be supporting this sort of concurrency control inside
Beam runners. In principle, I imagine runners could tune thread-pool size
for each stage automatically, e.g., based on CPU usage. To be honest, I'm a
little surprised this doesn't happen already, but I'm sure there are good
reasons why not.

Let me know what you think!

Cheers,
Stephan


Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Alexey Romanenko
+1 to block a release because of this. Agree with Jan’s arguments.

—
Alexey

> On 10 May 2021, at 16:02, Jan Lukavský  wrote:
> 
> +1 for blocking the release - I think we should not release something about 
> which we _know_ that it might be legally problematic. And we should 
> definitely create a check in the build process that will warn about such 
> issues in the future.
> 
>  Jan
> 
> On 5/10/21 3:44 PM, Ismaël Mejía wrote:
>> Tomo just confirmed in the ticket that if we update the gRPC vendored 
>> version we won't need the JBoss dependency anymore so we should be good to 
>> go with the upgrade. The open question is if this should be blocking for the 
>> upcoming Beam 2.31.0 release or we can fix it afterwards.
>> 
>> 
>> On Mon, May 10, 2021 at 2:46 PM Ismaël Mejía wrote:
>> We have been discussing updating the vendored dependency in BEAM-11227.
>> If I remember correctly,
>> the newer version of gRPC does not require the jboss dependency, so that is
>> probably the best upgrade path. Can you confirm, Tomo Suzuki?
>> 
>> On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk wrote:
>> Also we have very similar discussion about it in
>> https://issues.apache.org/jira/browse/LEGAL-572
>>
>> Just to be clear about the context of it, it's not a legal requirement of 
>> Apache Licence, it's Apache Software Foundation policy, that we should not 
>> limit our users in using our software. If the LGPL dependency is "optional", 
>> it's fine to add such optional dependency. If it is "required" to run your 
>> software, then it is not allowed as it limits the users of ASF software in 
>> further redistributing the software in the way they want (this is at least 
>> my understanding of it). 
>> 
>> On Mon, May 10, 2021 at 12:58 PM JB Onofré wrote:
>> Hi
>>
>> You can take a look at
>> 
>> https://www.apache.org/legal/resolved.html 
>> 
>> 
>> Regards 
>> JB
>> 
>>> On May 10, 2021, at 12:56, Elliotte Rusty Harold wrote:
>>> 
>>> Anyone have a link to the official Apache policy about this? Thanks.
>>> 
>>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský wrote:
>>>>
>>>> Hi,
>>>>
>>>> we are bundling dependencies with LGPL-2.1, according to license header
>>>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
>>>> might be an issue, already reported here: [1]. I created [2] to track it
>>>> on our side.
>>>>
>>>>  Jan
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>>>>
>>>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>>>>
>>> 
>>> 
>>> -- 
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org 
>> 
>> 
>> -- 
>> +48 660 796 129



Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Jan Lukavský
+1 for blocking the release - I think we should not release something 
about which we _know_ that it might be legally problematic. And we 
should definitely create a check in the build process that will warn 
about such issues in the future.
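
One way such a build check could look, as a rough sketch rather than an
existing Beam or Gradle task: scan the built jar for bundled Maven POMs and
flag any that mention the LGPL. The function name find_lgpl_poms and the
search pattern are assumptions made for illustration; a real check would
more likely rely on proper license metadata than on a text search.

```python
import re
import zipfile

# Assumption: a bundled artifact is flagged when its pom.xml mentions
# one of these strings anywhere (license header or <licenses> block).
LGPL_PATTERN = re.compile(r"LGPL|Lesser General Public License", re.IGNORECASE)


def find_lgpl_poms(jar):
    """Return the bundled META-INF/maven/**/pom.xml entries that mention LGPL.

    jar may be a filesystem path or a binary file-like object.
    """
    flagged = []
    with zipfile.ZipFile(jar) as archive:
        for name in archive.namelist():
            if name.startswith("META-INF/maven/") and name.endswith("/pom.xml"):
                text = archive.read(name).decode("utf-8", errors="replace")
                if LGPL_PATTERN.search(text):
                    flagged.append(name)
    return sorted(flagged)
```

A build step could call this on the vendored jar and print a warning, or
fail, when the returned list is not empty.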


 Jan

On 5/10/21 3:44 PM, Ismaël Mejía wrote:
Tomo just confirmed in the ticket that if we update the gRPC vendored 
version we won't need the JBoss dependency anymore so we should be 
good to go with the upgrade. The open question is if this should be 
blocking for the upcoming Beam 2.31.0 release or we can fix it afterwards.



On Mon, May 10, 2021 at 2:46 PM Ismaël Mejía wrote:


We have been discussing updating the vendored dependency in
BEAM-11227. If I remember correctly, the newer version of gRPC does
not require the jboss dependency, so that is probably the best upgrade
path. Can you confirm, Tomo Suzuki?

On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk <ja...@potiuk.com> wrote:

Also we have very similar discussion about it in
https://issues.apache.org/jira/browse/LEGAL-572

Just to be clear about the context of it, it's not a legal
requirement of Apache Licence, it's Apache Software Foundation
policy, that we should not limit our users in using our
software. If the LGPL dependency is "optional", it's fine to
add such optional dependency. If it is "required" to run your
software, then it is not allowed as it limits the users of ASF
software in further redistributing the software in the way
they want (this is at least my understanding of it).

On Mon, May 10, 2021 at 12:58 PM JB Onofré <j...@nanthrax.net> wrote:

Hi

You can take a look at

https://www.apache.org/legal/resolved.html


Regards
JB


On May 10, 2021, at 12:56, Elliotte Rusty Harold
<elh...@ibiblio.org> wrote:

Anyone have a link to the official Apache policy about
this? Thanks.

On Mon, May 10, 2021 at 10:07 AM Jan Lukavský
<je...@seznam.cz> wrote:


Hi,

we are bundling dependencies with LGPL-2.1, according to license header
in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
might be an issue, already reported here: [1]. I created [2] to track it
on our side.

 Jan

[1] https://issues.apache.org/jira/browse/FLINK-22555


[2] https://issues.apache.org/jira/browse/BEAM-12316





-- 
Elliotte Rusty Harold

elh...@ibiblio.org 




-- 
+48 660 796 129




Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Ismaël Mejía
Tomo just confirmed in the ticket that if we update the gRPC vendored
version we won't need the JBoss dependency anymore so we should be good to
go with the upgrade. The open question is if this should be blocking for
the upcoming Beam 2.31.0 release or we can fix it afterwards.


On Mon, May 10, 2021 at 2:46 PM Ismaël Mejía  wrote:

> We have been discussing updating the vendored dependency in
> BEAM-11227. If I remember correctly, the newer version of gRPC does
> not require the jboss dependency, so that is probably the best upgrade
> path. Can you confirm, Tomo Suzuki?
>
> On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk  wrote:
>
>> Also we have very similar discussion about it in
>> https://issues.apache.org/jira/browse/LEGAL-572
>> Just to be clear about the context of it, it's not a legal requirement of
>> Apache Licence, it's Apache Software Foundation policy, that we should not
>> limit our users in using our software. If the LGPL dependency is
>> "optional", it's fine to add such optional dependency. If it is "required"
>> to run your software, then it is not allowed as it limits the users of ASF
>> software in further redistributing the software in the way they want (this
>> is at least my understanding of it).
>>
>> On Mon, May 10, 2021 at 12:58 PM JB Onofré  wrote:
>>
>>> Hi
>>>
>>> You can take a look at
>>>
>>> https://www.apache.org/legal/resolved.html
>>>
>>> Regards
>>> JB
>>>
>>> On May 10, 2021, at 12:56, Elliotte Rusty Harold wrote:
>>>
>>> Anyone have a link to the official Apache policy about this? Thanks.
>>>
>>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský wrote:
>>>
>>> Hi,
>>>
>>> we are bundling dependencies with LGPL-2.1, according to license header
>>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
>>> might be an issue, already reported here: [1]. I created [2] to track it
>>> on our side.
>>>
>>>  Jan
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>>>
>>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>>>
>>> --
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org
>>>
>>>
>>
>> --
>> +48 660 796 129
>>
>


Re: Upgrading vendored gRPC from 1.26.0 to 1.36.0

2021-05-10 Thread Tomo Suzuki
I was investigating the strange timeout (
https://github.com/apache/beam/pull/14474) but was occupied with something
else lately.
Let me try the new version today to see any improvements.


On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía  wrote:

> I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
> Python!), which made me wonder: what is the current status of
> upgrading the vendored dependency, Tomo?
>
>
> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki  wrote:
>
>> We observed the cron job of Java Precommit for the master branch started
>> timing out often (not always) since upgrading the gRPC version.
>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>
>> After exchanging messages with Kenn, I reverted the change; now the master
>> branch uses the vendored gRPC 1.26.
>>
>>
>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles  wrote:
>>
>>> Merged. Let's keep an eye for trouble, and I will incorporate to the
>>> release branch.
>>>
>>> Kenn
>>>
>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki  wrote:
>>>
>>>> Regarding troubleshooting on build timeout, it seems that Docker cache
>>>> in Jenkins machines might be playing a role. As I run more "Java
>>>> Presubmit", I no longer observe timeouts in the PR.
>>>>
>>>> Kenn, would you merge the PR?
>>>> https://github.com/apache/beam/pull/14295 (all checks green, including
>>>> the new Java postcommit checks)
>>>>
>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles wrote:
>>>>
>>>>> Yes, I agree this might be a good idea. This is not the only major
>>>>> issue on the release-2.29.0 branch.
>>>>>
>>>>> The counter argument is that we will be pulling in all the bugs
>>>>> introduced to `master` since the branch cut.
>>>>>
>>>>> As far as effort goes, I have been mostly focused on burning down the
>>>>> bugs so I would not lose much work in the release process.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía wrote:
>>>>>
>>>>>> Precommit is quite unstable in the last days, so worth to check if
>>>>>> something is wrong in the CI.
>>>>>>
>>>>>> I have a question Kenn. Given that cherry picking this might be a bit
>>>>>> big as a change can we just reconsider cutting the 2.29.0 branch again
>>>>>> after the updated gRPC version use gets merged and mark the issues
>>>>>> already fixed for version 2.30.0 to version 2.29.0 ? Seems like an
>>>>>> easier upgrade path (and we will get some nice fixes/improvements like
>>>>>> official Spark 3 support for free on the release).
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki wrote:
>>>>>> >
>>>>>> > Update: I observe that Java precommit check is unstable in the PR
>>>>>> > to upgrade vendored gRPC (compared with an PR with an empty change).
>>>>>> > There's no constant failures; sometimes it succeeds and other times it
>>>>>> > faces timeout and flaky test failures.
>>>>>> >
>>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>> >
>>>>>> >
>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki wrote:
>>>>>> >>
>>>>>> >> Thank you for the voting and I see the artifact available in Maven
>>>>>> >> Central. I'll work on the PR to use the published artifact today.
>>>>>> >>
>>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>> >>
>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles wrote:
>>>>>> >>>
>>>>>> >>> Update on this: there are some minor issues and then I'll send
>>>>>> >>> out the RC.
>>>>>> >>>
>>>>>> >>> I think this is worth blocking 2.29.0 release on, so I will do
>>>>>> >>> this first. We are still eliminating other blockers from 2.29.0 anyhow.
>>>>>> >>>
>>>>>> >>> Kenn
>>>>>> >>>
>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki wrote:
>>>>>> >>>>
>>>>>> >>>> Hi Beam developers,
>>>>>> >>>>
>>>>>> >>>> I'm working on upgrading the vendored gRPC 1.36.0
>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>>> >>>> https://github.com/apache/beam/pull/14028)
>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>> >>>>
>>>>>> >>>> Background:
>>>>>> >>>> Exchanged messages with Ismaël in BEAM-11227, it seems that
>>>>>> >>>> the ticket created by some automation is a false positive, but it's nice to
>>>>>> >>>> use an artifact without being marked with CVE.
>>>>>> >>>>
>>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>>> >>>> https://s.apache.org/beam-release-vendored-artifacts) of the
>>>>>> >>>> vendored artifact.
>>>>>> >>>>
>>>>>> >>>> --
>>>>>> >>>> Regards,
>>>>>> >>>> Tomo
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Regards,
>>>>>> >> Tomo
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Regards,
>>>>>> > Tomo
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Tomo
>>>>
>>>
>>
>> --
>> Regards,
>> Tomo
>>
>

-- 
Regards,
Tomo


Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Ismaël Mejía
We have been discussing updating the vendored dependency in BEAM-11227.
If I remember correctly, the newer version of gRPC does not require the
jboss dependency, so that is probably the best upgrade path. Can you
confirm, Tomo Suzuki?

On Mon, May 10, 2021 at 2:33 PM Jarek Potiuk  wrote:

> Also we have very similar discussion about it in
> https://issues.apache.org/jira/browse/LEGAL-572
> Just to be clear about the context of it, it's not a legal requirement of
> Apache Licence, it's Apache Software Foundation policy, that we should not
> limit our users in using our software. If the LGPL dependency is
> "optional", it's fine to add such optional dependency. If it is "required"
> to run your software, then it is not allowed as it limits the users of ASF
> software in further redistributing the software in the way they want (this
> is at least my understanding of it).
>
> On Mon, May 10, 2021 at 12:58 PM JB Onofré  wrote:
>
>> Hi
>>
>> You can take a look at
>>
>> https://www.apache.org/legal/resolved.html
>>
>> Regards
>> JB
>>
>> On May 10, 2021, at 12:56, Elliotte Rusty Harold wrote:
>>
>> Anyone have a link to the official Apache policy about this? Thanks.
>>
>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský wrote:
>>
>> Hi,
>>
>> we are bundling dependencies with LGPL-2.1, according to license header
>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
>> might be an issue, already reported here: [1]. I created [2] to track it
>> on our side.
>>
>>  Jan
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>>
>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>>
>
> --
> +48 660 796 129
>


Beam Dependency Check Report (2021-05-08)

2021-05-10 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:


Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
chromedriver-binary | 88.0.4324.96.0 | 91.0.4472.19.0 | 2021-01-25 | 2021-04-26 | BEAM-10426
dill | 0.3.1.1 | 0.3.3 | 2019-10-07 | 2020-11-02 | BEAM-11167
google-cloud-bigtable | 1.7.0 | 2.2.0 | 2021-04-12 | 2021-05-03 | BEAM-8127
google-cloud-datastore | 1.15.3 | 2.1.2 | 2020-11-16 | 2021-05-10 | BEAM-8443
google-cloud-dlp | 1.0.0 | 3.0.1 | 2020-06-29 | 2021-02-01 | BEAM-10344
google-cloud-language | 1.3.0 | 2.0.0 | 2020-10-26 | 2020-10-26 | BEAM-8
google-cloud-pubsub | 1.7.0 | 2.4.1 | 2020-07-20 | 2021-04-05 | BEAM-5539
google-cloud-spanner | 1.19.1 | 3.4.0 | 2020-11-16 | 2021-05-03 | BEAM-10345
google-cloud-videointelligence | 1.16.1 | 2.1.0 | 2020-11-23 | 2021-04-05 | BEAM-11319
google-cloud-vision | 1.0.0 | 2.3.1 | 2020-03-24 | 2021-04-19 | BEAM-9581
idna | 2.10 | 3.1 | 2021-01-04 | 2021-01-11 | BEAM-9328
mock | 2.0.0 | 4.0.3 | 2019-05-20 | 2020-12-14 | BEAM-7369
mypy-protobuf | 1.18 | 2.4 | 2020-03-24 | 2021-02-08 | BEAM-10346
nbconvert | 5.6.1 | 6.0.7 | 2020-10-05 | 2020-10-05 | BEAM-11007
Pillow | 7.2.0 | 8.2.0 | 2020-10-19 | 2021-04-05 | BEAM-11071
PyHamcrest | 1.10.1 | 2.0.2 | 2020-01-20 | 2020-07-08 | BEAM-9155
pytest | 4.6.11 | 6.2.4 | 2020-07-08 | 2021-05-10 | BEAM-8606
pytest-xdist | 1.34.0 | 2.2.1 | 2020-08-17 | 2021-02-15 | BEAM-10713
tenacity | 5.1.5 | 7.0.0 | 2019-11-11 | 2021-03-08 | BEAM-8607
typing-extensions | 3.7.4.3 | 3.10.0.0 | 2021-05-03 | 2021-05-03 | BEAM-12267

High Priority Dependency Updates Of Beam Java SDK:

Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
com.azure:azure-core | 1.6.0 | 1.16.0 | 2020-07-02 | 2021-05-07 | BEAM-11888
com.azure:azure-identity | 1.0.8 | 1.3.0-beta.2 | 2020-07-07 | 2021-03-11 | BEAM-11814
com.azure:azure-storage-common | 12.8.0 | 12.11.0 | 2020-08-13 | 2021-04-29 | BEAM-11889
com.datastax.cassandra:cassandra-driver-core | 3.10.2 | 4.0.0 | 2020-08-26 | 2019-03-18 | BEAM-8674
com.esotericsoftware:kryo | 4.0.2 | 5.1.1 | 2018-03-20 | 2021-05-02 | BEAM-5809
com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.33.0 | 0.38.0 | 2020-09-14 | 2021-03-08 | BEAM-6645
com.google.api.grpc:proto-google-cloud-bigquerystorage-v1 | 1.17.0 | 1.20.3 | 2021-03-30 | 2021-05-06 | BEAM-11890
com.google.api.grpc:proto-google-cloud-bigquerystorage-v1beta2 | 0.117.0 | 0.120.3 | 2021-03-30 | 2021-05-06 | BEAM-11891
com.google.api.grpc:proto-google-cloud-dlp-v2 | 1.1.4 | 2.3.2 | 2020-05-04 | 2021-04-27 | BEAM-11892
com.google.api.grpc:proto-google-cloud-video-intelligence-v1 | 1.2.0 | 1.6.3 | 2020-03-10 | 2021-04-27 | BEAM-11894
com.google.api.grpc:proto-google-cloud-vision-v1 | 1.81.3 | 1.102.2 | 2020-04-07 | 2021-04-27 | BEAM-11895
com.google.apis:google-api-services-bigquery | v2-rev20210410-1.31.0 | v2-rev20210430-1.31.0 | 2021-04-16 | 2021-05-07 | BEAM-8684
com.google.apis:google-api-services-cloudresourcemanager | v1-rev20210331-1.31.0 | v3-rev20210411-1.31.0 | 2021-04-09 | 2021-04-20 | BEAM-8751
com.google.apis:google-api-services-dataflow | v1b3-rev20210408-1.31.0 | v1beta3-rev12-1.20.0 | 2021-04-17 | 2015-04-29 | BEAM-8752
com.google.apis:google-api-services-healthcare | v1beta1-rev20210407-1.31.0 | v1-rev20210414-1.31.0 | 2021-04-21 | 2021-04-28 | BEAM-10349
com.google.auto.service:auto-service | 1.0-rc6 | 1.0 | 2019-07-16 | 2021-04-06 | BEAM-5541
com.google.auto.service:auto-service-annotations | 1.0-rc6 | 1.0 | 2019-07-16 | 2021-04-06 | BEAM-10350
com.google.cloud:google-cloud-bigquerystorage
1.17.0

Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Jarek Potiuk
Also we have very similar discussion about it in
https://issues.apache.org/jira/browse/LEGAL-572
Just to be clear about the context of it, it's not a legal requirement of
Apache Licence, it's Apache Software Foundation policy, that we should not
limit our users in using our software. If the LGPL dependency is
"optional", it's fine to add such optional dependency. If it is "required"
to run your software, then it is not allowed as it limits the users of ASF
software in further redistributing the software in the way they want (this
is at least my understanding of it).

On Mon, May 10, 2021 at 12:58 PM JB Onofré  wrote:

> Hi
>
> You can take a look at
>
> https://www.apache.org/legal/resolved.html
>
> Regards
> JB
>
> On May 10, 2021, at 12:56, Elliotte Rusty Harold wrote:
>
> Anyone have a link to the official Apache policy about this? Thanks.
>
> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský wrote:
>
> Hi,
>
> we are bundling dependencies with LGPL-2.1, according to license header
> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
> might be an issue, already reported here: [1]. I created [2] to track it
> on our side.
>
>  Jan
>
> [1] https://issues.apache.org/jira/browse/FLINK-22555
>
> [2] https://issues.apache.org/jira/browse/BEAM-12316
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>
>

-- 
+48 660 796 129


Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread JB Onofré
Hi

You can take a look at

https://www.apache.org/legal/resolved.html

Regards 
JB

> On May 10, 2021, at 12:56, Elliotte Rusty Harold wrote:
> 
> Anyone have a link to the official Apache policy about this? Thanks.
> 
>> On Mon, May 10, 2021 at 10:07 AM Jan Lukavský  wrote:
>> 
>> Hi,
>> 
>> we are bundling dependencies with LGPL-2.1, according to license header
>> in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
>> might be an issue, already reported here: [1]. I created [2] to track it
>> on our side.
>> 
>>  Jan
>> 
>> [1] https://issues.apache.org/jira/browse/FLINK-22555
>> 
>> [2] https://issues.apache.org/jira/browse/BEAM-12316
>> 
> 
> 
> -- 
> Elliotte Rusty Harold
> elh...@ibiblio.org


Re: LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread JB Onofré
Yeah, LGPL is a Category X license, so it should not be embedded and distributed.

Regards 
JB

> On May 10, 2021, at 12:07, Jan Lukavský wrote:
> 
> Hi,
> 
> we are bundling dependencies with LGPL-2.1, according to license header in 
> META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it might be
> an issue, already reported here: [1]. I created [2] to track it on our side.
> 
>  Jan
> 
> [1] https://issues.apache.org/jira/browse/FLINK-22555
> 
> [2] https://issues.apache.org/jira/browse/BEAM-12316
> 



LGPL-2.1 in beam-vendor-grpc

2021-05-10 Thread Jan Lukavský

Hi,

we are bundling dependencies with LGPL-2.1, according to license header 
in META-INF/maven/org.jboss.modules/jboss-modules/pom.xml. I think it
might be an issue, already reported here: [1]. I created [2] to track it 
on our side.


 Jan

[1] https://issues.apache.org/jira/browse/FLINK-22555

[2] https://issues.apache.org/jira/browse/BEAM-12316



Re: Upgrading vendored gRPC from 1.26.0 to 1.36.0

2021-05-10 Thread Ismaël Mejía
I just saw that gRPC 1.37.1 is out now (and with aarch64 support for
Python!), which made me wonder: what is the current status of
upgrading the vendored dependency, Tomo?


On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki  wrote:

> We observed the cron job of Java Precommit for the master branch started
> timing out often (not always) since upgrading the gRPC version.
> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>
> After exchanging messages with Kenn, I reverted the change; now the master
> branch uses the vendored gRPC 1.26.
>
>
> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles  wrote:
>
>> Merged. Let's keep an eye for trouble, and I will incorporate to the
>> release branch.
>>
>> Kenn
>>
>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki  wrote:
>>
>>> Regarding troubleshooting on build timeout, it seems that Docker cache
>>> in Jenkins machines might be playing a role. As I run more "Java
>>> Presubmit", I no longer observe timeouts in the PR.
>>>
>>> Kenn, would you merge the PR?
>>> https://github.com/apache/beam/pull/14295 (all checks green, including
>>> the new Java postcommit checks)
>>>
>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles  wrote:
>>>
>>>> Yes, I agree this might be a good idea. This is not the only major
>>>> issue on the release-2.29.0 branch.
>>>>
>>>> The counter argument is that we will be pulling in all the bugs
>>>> introduced to `master` since the branch cut.
>>>>
>>>> As far as effort goes, I have been mostly focused on burning down the
>>>> bugs so I would not lose much work in the release process.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía wrote:
>>>>
>>>>> Precommit is quite unstable in the last days, so worth to check if
>>>>> something is wrong in the CI.
>>>>>
>>>>> I have a question Kenn. Given that cherry picking this might be a bit
>>>>> big as a change can we just reconsider cutting the 2.29.0 branch again
>>>>> after the updated gRPC version use gets merged and mark the issues
>>>>> already fixed for version 2.30.0 to version 2.29.0 ? Seems like an
>>>>> easier upgrade path (and we will get some nice fixes/improvements like
>>>>> official Spark 3 support for free on the release).
>>>>>
>>>>> WDYT?
>>>>>
>>>>>
>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki wrote:
>>>>> >
>>>>> > Update: I observe that Java precommit check is unstable in the PR to
>>>>> > upgrade vendored gRPC (compared with an PR with an empty change). There's
>>>>> > no constant failures; sometimes it succeeds and other times it faces
>>>>> > timeout and flaky test failures.
>>>>> >
>>>>> > https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>> >
>>>>> >
>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki wrote:
>>>>> >>
>>>>> >> Thank you for the voting and I see the artifact available in Maven
>>>>> >> Central. I'll work on the PR to use the published artifact today.
>>>>> >>
>>>>> >> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>> >>
>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles wrote:
>>>>> >>>
>>>>> >>> Update on this: there are some minor issues and then I'll send out
>>>>> >>> the RC.
>>>>> >>>
>>>>> >>> I think this is worth blocking 2.29.0 release on, so I will do
>>>>> >>> this first. We are still eliminating other blockers from 2.29.0 anyhow.
>>>>> >>>
>>>>> >>> Kenn
>>>>> >>>
>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki wrote:
>>>>> >>>>
>>>>> >>>> Hi Beam developers,
>>>>> >>>>
>>>>> >>>> I'm working on upgrading the vendored gRPC 1.36.0
>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>> >>>> https://github.com/apache/beam/pull/14028)
>>>>> >>>> Let me know if you have any questions or concerns.
>>>>> >>>>
>>>>> >>>> Background:
>>>>> >>>> Exchanged messages with Ismaël in BEAM-11227, it seems that
>>>>> >>>> the ticket created by some automation is a false positive, but it's nice to
>>>>> >>>> use an artifact without being marked with CVE.
>>>>> >>>>
>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>> >>>> https://s.apache.org/beam-release-vendored-artifacts) of the vendored
>>>>> >>>> artifact.
>>>>> >>>>
>>>>> >>>> --
>>>>> >>>> Regards,
>>>>> >>>> Tomo
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Regards,
>>>>> >> Tomo
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Tomo
>>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Tomo
>>>
>>
>
> --
> Regards,
> Tomo
>


Re: [PROPOSAL] Vendored bytebuddy dependency release

2021-05-10 Thread Ismaël Mejía
Most issues in the previous migration were related to changes in the
class-loading behavior of Java 11. It seems Oracle is taking a more
backwards compatible approach in the latest releases, so let's hope
everything will go well. In the meantime I tested the upgrade locally and
the tests are passing, so we should be good to go. I opened a PR [1] for the
version upgrade and, assuming consensus on this proposal, I expect we can
proceed to a vote soon.

[1] https://github.com/apache/beam/pull/14766


On Sun, May 9, 2021 at 6:13 PM Reuven Lax  wrote:

> We've had some issues in the past with semantic changes in ByteBuddy (I
> think related to new Java versions) that required rewriting code in Beam.
>
> On Sat, May 8, 2021 at 10:46 PM Ismaël Mejía  wrote:
>
>> What were the issues last time Reuven? I remember that the release and
>> upgrade PR were pretty smooth, were there unintended consequences from the
>> library changes themselves?
>>
>>
>> On Sun, May 9, 2021 at 12:36 AM Reuven Lax  wrote:
>>
>>> Sounds good. Based on previous experience though, this might be a
>>> difficult upgrade to do.
>>>
>>> On Sat, May 8, 2021 at 12:57 AM Ismaël Mejía  wrote:
>>>
>>>> The version of bytebuddy Beam is vendoring (1.10.8) is already 16
>>>> months old and it is not compatible with more recent versions of Java.
>>>> I would like to propose that we upgrade it [1] to the most recent
>>>> version (1.11.0) [2] so we can benefit from the latest improvements for
>>>> Java 16/17 and upgraded ASM.
>>>>
>>>> If everyone agrees I would like to volunteer as the release manager for
>>>> this upgrade.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-12241
>>>> [2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md