[PROPOSAL] Re-enable checkerframework by default

2022-10-19 Thread Kenneth Knowles
Hi all,

Some time ago we turned off Checker Framework locally by default; it is only
turned on with `-PenableCheckerFramework`, and always on Jenkins.

My opinion is that this causes more headaches than it solves, by delaying
the discovery of errors. The increased compilation time from Checker
Framework is real, but during iteration almost every step of a compile is
cached, so it only matters for :sdks:java:core specifically. My take is
that anyone editing that module is probably experienced enough with Beam to
know they can turn it off. So I propose we turn it on by default, with the
option to disable it.
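
For reference, today's opt-in looks roughly like this (a sketch; the
specific task is just an example, any compile that touches the module picks
up the property):

    ./gradlew :sdks:java:core:compileJava -PenableCheckerFramework

Flipping the default would presumably mean replacing that property with an
opt-out equivalent.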

Kenn


Re: Go, Java, & Python Project Starter / Example Using Terraform to build Dataflow Custom Templates

2022-10-19 Thread Damon Douglas via dev
Thank you so much, Robert, for pointing that out! I submitted a quick patch
PR to correct this.

On Wed, Oct 19, 2022 at 9:29 AM Robert Burke  wrote:

> Woohoo! Thanks Damon! This will be handy for Beam Go users on Dataflow.
>
> I have one note: It looks like the go.mod is requiring go 1.20, which
> doesn't yet exist:
> https://github.com/GoogleCloudPlatform/professional-services/blob/main/examples/dataflow-custom-templates/go/go.mod#L3
>
> The latest version is Go v1.19.
>
> Cheers,
> Robert Burke
> Beam Go Busybody
>
> On Tue, 18 Oct 2022 at 09:01, Damon Douglas via dev 
> wrote:
>
>> Hello Everyone,
>>
>> You can ignore this email if you do not use the Dataflow runner.
>>
>> Happy to announce that the Go SDK was just added to a previously shared
>> example starter project that provisions Dataflow Custom Templates within a
>> Terraform workflow on Cloud Build.
>>
>>
>> https://github.com/GoogleCloudPlatform/professional-services/tree/main/examples/dataflow-custom-templates
>>
>> Best,
>>
>> Damon
>>
>> --
>>
>>
>> *Damon Douglas*
>>
>> Strategic Cloud Engineer, Data & Analytics, Google Cloud
>>
>> damondoug...@google.com
>>
>


Re: Go, Java, & Python Project Starter / Example Using Terraform to build Dataflow Custom Templates

2022-10-19 Thread Robert Burke
Woohoo! Thanks Damon! This will be handy for Beam Go users on Dataflow.

I have one note: It looks like the go.mod is requiring go 1.20, which
doesn't yet exist:
https://github.com/GoogleCloudPlatform/professional-services/blob/main/examples/dataflow-custom-templates/go/go.mod#L3

The latest version is Go v1.19.
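
Presumably the fix is just dropping the go directive on line 3 down to a
released version, i.e.:

    go 1.19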

Cheers,
Robert Burke
Beam Go Busybody

On Tue, 18 Oct 2022 at 09:01, Damon Douglas via dev 
wrote:

> Hello Everyone,
>
> You can ignore this email if you do not use the Dataflow runner.
>
> Happy to announce that the Go SDK was just added to a previously shared
> example starter project that provisions Dataflow Custom Templates within a
> Terraform workflow on Cloud Build.
>
>
> https://github.com/GoogleCloudPlatform/professional-services/tree/main/examples/dataflow-custom-templates
>
> Best,
>
> Damon
>
> --
>
>
> *Damon Douglas*
>
> Strategic Cloud Engineer, Data & Analytics, Google Cloud
>
> damondoug...@google.com
>


Re: [DISCUSS] Jenkins -> GitHub Actions ?

2022-10-19 Thread Danny McCormick via dev
Thanks for kicking this conversation off. I'm +1 on migrating, but only
once we've found a specific replacement for easy observability (which
workflows have been failing lately, and how often) and trigger phrases (for
retries, and for workflows that aren't automatically kicked off but should
be run for extra validation, e.g. postcommits). Until we have viable
replacements, I don't think we should make the move. Publishing nightly
snapshots will eventually also be a must for a full migration, but it
probably doesn't need to block us from making progress here.

With those caveats, the reason that I'm +1 on moving is that our Jenkins
reliability has been rough. Since I joined the project in January, I can
think of 3 different incidents that significantly harmed our ability to do
work.

1. Jenkins triggers caused a multi-day outage - this led to a multi-day
code freeze, and we lost our trigger functionality for days afterwards.
Investigating/restoring our state ate up a pretty full week for me.
2. A Jenkins plugin caused a multi-day outage - this led to multiple days
of Jenkins downtime before eventually being resolved by Infra.
3. Cert issues caused many workers to go down - I don't have a thread for
this because I handled most of the investigation the day of, but many of
our workers went down for around a day, and nobody noticed until queue time
reached 6+ hours for each workflow.

There may be others that I'm overlooking.

GitHub Actions isn't a magic bullet to fix these problems, but it minimizes
the amount of infra that we're maintaining ourselves, increases the
isolation between workflows (catastrophic failure is less likely), has
uptime guarantees, and is more likely to receive investment going forward
(we're likely to get increasing benefits over time for free). We've also
done a lot of exploration in this area already, so we're not starting from
scratch.

Thanks,
Danny

On Wed, Oct 19, 2022 at 11:32 AM Kenneth Knowles  wrote:

> Hi all,
>
> As you probably noticed, there's a lot of work going on around adding more
> GitHub Actions workflows.
>
> Can we fully migrate to GitHub Actions? Similar to our GitHub Issues
> migration (but less user-facing), it would bring us onto "default"
> infrastructure that more people understand and that is maintained by GitHub.
>
> So far we have hit some serious roadblocks. It isn't just a simple
> migration. We have to weigh doing the work to get there.
>
> I started a document with a table of the things we get from Jenkins that
> we need to be sure to have for GitHub Actions before we could think about
> migrating:
>
> https://s.apache.org/beam-jenkins-to-gha
>
> Can you please help me by adding things that we get from Jenkins, and if
> you know how to get them from GitHub Actions, add that too.
>
> Thanks!
>
> Kenn
>


[DISCUSS] Jenkins -> GitHub Actions ?

2022-10-19 Thread Kenneth Knowles
Hi all,

As you probably noticed, there's a lot of work going on around adding more
GitHub Actions workflows.

Can we fully migrate to GitHub Actions? Similar to our GitHub Issues
migration (but less user-facing), it would bring us onto "default"
infrastructure that more people understand and that is maintained by GitHub.

So far we have hit some serious roadblocks. It isn't just a simple
migration. We have to weigh doing the work to get there.

I started a document with a table of the things we get from Jenkins that we
need to be sure to have for GitHub Actions before we could think about
migrating:

https://s.apache.org/beam-jenkins-to-gha

Can you please help me by adding things that we get from Jenkins, and if
you know how to get them from GitHub Actions, add that too.

Thanks!

Kenn


Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Reuven Lax via dev
PCollections are usually persistent within a pipeline, so you can reuse
them in other parts of the pipeline with no problem.

There is no notion of state across pipelines - every pipeline is
independent. If you want state across pipelines, you can write the
PCollection out to a set of files, which are then read back in by the new
pipeline.
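
A minimal sketch of that pattern with the Java SDK (the bucket path,
element values, and transform names are all placeholders):

    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;

    public class StageAcrossPipelines {
      public static void main(String[] args) {
        // Pipeline 1: compute once, then materialize to files.
        Pipeline p1 = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> rows =
            p1.apply("Expensive", Create.of(Arrays.asList("a", "b", "c")));
        rows.apply("StageToFiles",
            TextIO.write().to("gs://my-bucket/staged/rows").withSuffix(".txt"));
        p1.run().waitUntilFinish();

        // Pipeline 2 (could be a completely separate job): read the staged
        // files back in, then apply further transforms to `staged`.
        Pipeline p2 = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        PCollection<String> staged =
            p2.apply("ReadStaged", TextIO.read().from("gs://my-bucket/staged/rows*.txt"));
        p2.run().waitUntilFinish();
      }
    }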

On Tue, Oct 18, 2022 at 11:45 PM Ravi Kapoor  wrote:

> Hi Team,
> Can we stage a PCollection's data? Let's say we want to avoid repeating
> the expensive operations between two complex BQ tables time and again, and
> instead materialize the result in some temp view that is deleted after the
> session.
>
> Is it possible to do that in a Beam pipeline?
> We can later use the temp view in another pipeline to read the data from
> and do processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job, which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the
> already processed data from the previous instance, which must be staged
> somewhere?
>
> In Spark we have the notion of createOrReplaceTempView, which is used
> to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


Beam High Priority Issue Report (49)

2022-10-19 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23693 [Bug]: apache_beam.io.kinesis 
module READ_DATA_URN mismatch
https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at 
getConnection() in WriteFn
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21561 
ExternalPythonTransformTest.trivialPythonTransform flaky
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21463 NPE in Flink Portable 
ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21364 Flink load tests fail: 
NoClassDefFoundError: MessageBodyReader
https://github.com/apache/beam/issues/21333 Flink testParDoRequiresStableInput 
flaky
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21257 Either Create or DirectRunner fails 
to produce all elements to the following transform
https://github.com/apache/beam/issues/21123 Multiple jobs running on Flink 
session cluster reuse the persistent Python environment.
https://github.com/apache/beam/issues/21113 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
https://github.com/apache/beam/issues/2 Java creates an incorrect pipeline 
proto when core-construction-java jar is not in the CLASSPATH
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20977 SamzaStoreStateInternalsTest is 
flaky
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20975 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20937 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL PostCommit
https://github.com/apache/beam/issues/20815 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner.
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM 
on Flink
https://github.com/apache/beam/issues/20655 Flink PortableValidatesRunner test 
failure: GroupByKeyTest$BasicTests.testLargeKeys10MB
https://github.com/apache/beam/issues/20269 Flink postcommits failing 
testFlattenWithDifferentInputAndOutputCoders2
https://github.com/apache/beam/issues/20189 Spark failing 
testFlattenWithDifferentInputAndOutputCoders2
https://github.com/apache/beam/issues/20109 SortValues should fail if 
SecondaryKey coder is not deterministic
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19241 Python Dataflow integration tests 
should export the pipeline Job ID and console output to Jenkins Test Result 
section


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/23489 [Bug]: add DebeziumIO to the 
connectors page
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of 
`DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22891 [Bug]: 
beam_PostCommit_XVR_PythonUsingJavaDataflow is flaky
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/22011 [Bug]: 

Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Ravi Kapoor
On Wed, Oct 19, 2022 at 2:43 PM Ravi Kapoor  wrote:

> I am talking about the batch context. Can we do checkpointing in batch
> mode as well?
> I am *not* looking for any failure or retry algorithm.
> The requirement is simply to materialize a PCollection which can be used
> across jobs / within a job, in some view/temp table which is auto-deleted.
> I believe Reshuffle is for streaming. Right?
>
> Thanks,
> Ravi
>
> On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev <
> dev@beam.apache.org> wrote:
>
>> I think that would be a Reshuffle, but only within the context of the
>> same job (e.g. if there is a failure and a retry, the retry would start
>> from the checkpoint created by the reshuffle). In Dataflow, a group by
>> key, a combiner per key, a cogroup by key, stateful DoFns, and I think
>> splittable DoFns will also have the same effect of creating a checkpoint
>> (any shuffling operation will always create a checkpoint).
>>
>> If you want to start a different job (slightly updated code, starting
>> from a previous point of a previous job), in Dataflow that would be a
>> snapshot, I think. Snapshots only work in streaming pipelines.
>>
>> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor  wrote:
>>
>>> Hi Team,
>>> Can we stage a PCollection's data? Let's say we want to avoid repeating
>>> the expensive operations between two complex BQ tables time and again,
>>> and instead materialize the result in some temp view that is deleted
>>> after the session.
>>>
>>> Is it possible to do that in a Beam pipeline?
>>> We can later use the temp view in another pipeline to read the data from
>>> and do processing.
>>>
>>> Or, in general, I would like to know: do we ever stage the PCollection?
>>> Let's say I want to create another instance of the same job, which has
>>> complex processing.
>>> Does the pipeline re-perform the computation, or would it pick up the
>>> already processed data from the previous instance, which must be staged
>>> somewhere?
>>>
>>> In Spark we have the notion of createOrReplaceTempView, which is used
>>> to create a temp table from a Spark DataFrame or Dataset.
>>>
>>> Please advise.
>>>
>>> --
>>> Thanks,
>>> Ravi Kapoor
>>> +91-9818764564
>>> kapoorrav...@gmail.com
>>>
>>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com


Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Ravi Kapoor
I am talking about the batch context. Can we do checkpointing in batch mode
as well?
I am *not* looking for any failure or retry algorithm.
The requirement is simply to materialize a PCollection which can be used
across jobs / within a job, in some view/temp table which is auto-deleted.
I believe Reshuffle is for streaming. Right?

Thanks,
Ravi

On Wed, Oct 19, 2022 at 1:32 PM Israel Herraiz via dev 
wrote:

> I think that would be a Reshuffle, but only within the context of the
> same job (e.g. if there is a failure and a retry, the retry would start
> from the checkpoint created by the reshuffle). In Dataflow, a group by
> key, a combiner per key, a cogroup by key, stateful DoFns, and I think
> splittable DoFns will also have the same effect of creating a checkpoint
> (any shuffling operation will always create a checkpoint).
>
> If you want to start a different job (slightly updated code, starting
> from a previous point of a previous job), in Dataflow that would be a
> snapshot, I think. Snapshots only work in streaming pipelines.
>
> On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor  wrote:
>
>> Hi Team,
>> Can we stage a PCollection's data? Let's say we want to avoid repeating
>> the expensive operations between two complex BQ tables time and again,
>> and instead materialize the result in some temp view that is deleted
>> after the session.
>>
>> Is it possible to do that in a Beam pipeline?
>> We can later use the temp view in another pipeline to read the data from
>> and do processing.
>>
>> Or, in general, I would like to know: do we ever stage the PCollection?
>> Let's say I want to create another instance of the same job, which has
>> complex processing.
>> Does the pipeline re-perform the computation, or would it pick up the
>> already processed data from the previous instance, which must be staged
>> somewhere?
>>
>> In Spark we have the notion of createOrReplaceTempView, which is used
>> to create a temp table from a Spark DataFrame or Dataset.
>>
>> Please advise.
>>
>> --
>> Thanks,
>> Ravi Kapoor
>> +91-9818764564
>> kapoorrav...@gmail.com
>>
>

-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com


Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Israel Herraiz via dev
I think that would be a Reshuffle, but only within the context of the same
job (e.g. if there is a failure and a retry, the retry would start from the
checkpoint created by the reshuffle). In Dataflow, a group by key, a
combiner per key, a cogroup by key, stateful DoFns, and I think splittable
DoFns will also have the same effect of creating a checkpoint (any
shuffling operation will always create a checkpoint).
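
For illustration, a quick sketch in the Java SDK - `input` and ExpensiveFn
below are placeholders for your own collection and costly processing:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.Reshuffle;
    import org.apache.beam.sdk.values.PCollection;

    // Placeholder for an expensive per-element computation.
    class ExpensiveFn extends DoFn<String, String> {
      @ProcessElement
      public void process(@Element String e, OutputReceiver<String> out) {
        out.output(e.toUpperCase()); // stand-in for the real work
      }
    }

    PCollection<String> expensive =
        input.apply("ExpensiveStep", ParDo.of(new ExpensiveFn()));

    // The reshuffle materializes results as a checkpoint within this job:
    // a downstream failure/retry resumes from here rather than re-running
    // ExpensiveStep.
    PCollection<String> checkpointed =
        expensive.apply("Checkpoint", Reshuffle.viaRandomKey());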

If you want to start a different job (slightly updated code, starting from
a previous point of a previous job), in Dataflow that would be a snapshot,
I think. Snapshots only work in streaming pipelines.

On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor  wrote:

> Hi Team,
> Can we stage a PCollection's data? Let's say we want to avoid repeating
> the expensive operations between two complex BQ tables time and again, and
> instead materialize the result in some temp view that is deleted after the
> session.
>
> Is it possible to do that in a Beam pipeline?
> We can later use the temp view in another pipeline to read the data from
> and do processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job, which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the
> already processed data from the previous instance, which must be staged
> somewhere?
>
> In Spark we have the notion of createOrReplaceTempView, which is used
> to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Ravi Kapoor
Hi Team,
Can we stage a PCollection's data? Let's say we want to avoid repeating
the expensive operations between two complex BQ tables time and again, and
instead materialize the result in some temp view that is deleted after the
session.

Is it possible to do that in a Beam pipeline?
We can later use the temp view in another pipeline to read the data from
and do processing.

Or, in general, I would like to know: do we ever stage the PCollection?
Let's say I want to create another instance of the same job, which has
complex processing.
Does the pipeline re-perform the computation, or would it pick up the
already processed data from the previous instance, which must be staged
somewhere?

In Spark we have the notion of createOrReplaceTempView, which is used
to create a temp table from a Spark DataFrame or Dataset.

Please advise.

-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com