Re: [Question] Beam Schema, fields order

2022-04-05 Thread gaurav mishra
There was an annotation introduced in 2.37 to make sure we get the same
order of fields in schema inferred from a POJO.
https://javadoc.io/doc/org.apache.beam/beam-sdks-java-core/latest/org/apache/beam/sdk/schemas/annotations/SchemaFieldNumber.html

With that annotation, schemaRegistry.getSchema(dataClass) should give you a
schema with the same field order.
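
A minimal stdlib-only sketch of the idea behind such an annotation, not Beam's implementation: since reflection may return fields in any order, an explicit per-field number is used to pin the order down. The `@FieldNumber` annotation, the `Purchase` class and `orderedFieldNames` are hypothetical stand-ins invented for illustration; in Beam the corresponding annotation is `@SchemaFieldNumber`.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FieldOrderSketch {
  /** Hypothetical stand-in for a field-number annotation (not Beam's class). */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  @interface FieldNumber {
    int value();
  }

  /** Example POJO; declaration order deliberately differs from the numbering. */
  static final class Purchase {
    @FieldNumber(1) public long amountCents;
    @FieldNumber(0) public String userId;
    @FieldNumber(2) public String currency;
  }

  /** Returns annotated field names sorted by their explicit number, not by reflection order. */
  static List<String> orderedFieldNames(Class<?> clazz) {
    List<Field> fields = new ArrayList<>();
    for (Field f : clazz.getDeclaredFields()) {
      // Skip anything without the annotation (e.g. synthetic fields).
      if (f.isAnnotationPresent(FieldNumber.class)) {
        fields.add(f);
      }
    }
    fields.sort(Comparator.comparingInt((Field f) -> f.getAnnotation(FieldNumber.class).value()));
    List<String> names = new ArrayList<>();
    for (Field f : fields) {
      names.add(f.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(orderedFieldNames(Purchase.class)); // [userId, amountCents, currency]
  }
}
```

Sorting by the explicit number makes the result independent of whatever order `getDeclaredFields()` happens to return on a given JVM.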


On Wed, Apr 6, 2022 at 1:35 AM Alexey Romanenko 
wrote:

> Thanks for answers, Reuven. Please see the additional questions inline.
>
> On 5 Apr 2022, at 20:07, Reuven Lax  wrote:
>
> On Tue, Apr 5, 2022 at 9:55 AM Alexey Romanenko 
> wrote:
>
>>
>> So, the different fields order matters.
>>
>> Additionally, since "Schema.equals()” is used in "Row.equals()”, then it
>> means that two Rows with different-ordered schemas but the same values will
>> be considered as different rows. Is it correct?
>>
>
> Yes, but there are ways of dealing with this:
>
>
> But what is a point of this? Why the fields order can be important, under
> which circumstances?
>
> 1. If using Dataflow, the pipeline update feature allows you to update to
> a compatible schema (i.e. one in which the fields have the same names but a
> different order)
> 2. You can use the Convert transform to convert rows to a compatible schema
> with a different order.
>
>
> Well, for now it’s mostly related to unit tests (e.g.
> AvroSchemaTest.testPojoRecordToRow()) when we compare a manually created
> row with another row that is created from a POJO with AvroRecordSchema. I’m
> playing with an Avro version upgrade [1] and it fails because there are
> some changes in Avro and it creates an Avro schema with a different order
> of fields. So, actually I’m thinking what we can do here with that.
>
> [1] https://github.com/apache/beam/pull/17246
>
>
>> In the same time, while generating a schema with different schema
>> providers, the order of fields can be non-deterministic for some cases.
>>
>> For example, “GetterBasedSchemaProvider.toRowFunction(TypeDescriptor)”
>> says [3] that:
>> *- “schemaFor is non deterministic - it might return fields in an
>> arbitrary order. The reason why is that Java reflection does not guarantee
>> the order in which it returns fields and methods, and these schemas are
>> often based on reflective analysis of classes. “*
>>
>> So, iiuc, it means that potentially we can have the "same" schema but
>> with different fields order for the same, for example, POJO class but
>> generated on different JVMs.
>>
>
> Correct, and see above.
>
>
>>
>> And actually the questions:
>> - Two Rows with the same field values but with two schemas of different
>> fields order should be considered as two different rows or not?
>> - This behaviour explained above - is this that was expected by initial
>> schema design?
>> - If fields order is so important then why?
>>
>> PS: My question is actually related to
>> "AvroRecordSchema().toRowFunction()” but I guess other SchemaProvider’s
>> also can be affected.
>>
>>
>> —
>> Alexey
>>
>> [1]
>> https://beam.apache.org/documentation/programming-guide/#schema-definition
>> [2]
>> https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L303
>> [3]
>> https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/GetterBasedSchemaProvider.java#L91
>>
>
>


Re: [Question] Beam Schema, fields order

2022-04-05 Thread Alexey Romanenko
Thanks for answers, Reuven. Please see the additional questions inline.

> On 5 Apr 2022, at 20:07, Reuven Lax  wrote:
> 
> On Tue, Apr 5, 2022 at 9:55 AM Alexey Romanenko wrote:
> 
> So, the different fields order matters.
> 
> Additionally, since "Schema.equals()” is used in "Row.equals()”, then it 
> means that two Rows with different-ordered schemas but the same values will 
> be considered as different rows. Is it correct?
> 
> Yes, but there are ways of dealing with this:

But what is the point of this? Why can the field order be important, and under 
which circumstances?

> 1. If using Dataflow, the pipeline update feature allows you to update to a 
> compatible schema (i.e. one in which the fields have the same names but a 
> different order)
> 2. You can use the Convert transform to convert rows to a compatible schema 
> with a different order.
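
To illustrate what "convert rows to a compatible schema with a different order" means, here is a toy stdlib-only model, not Beam's Convert transform: values are looked up by field name and re-emitted in the target order. The `reorder` helper and the list-based row representation are assumptions made for this sketch, and field names are assumed to be unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class ReorderSketch {
  /**
   * Re-emits positional values in the order given by targetNames.
   * The two name lists must contain the same (unique) names, possibly in a different order.
   */
  static List<Object> reorder(
      List<String> sourceNames, List<Object> sourceValues, List<String> targetNames) {
    if (sourceNames.size() != sourceValues.size()) {
      throw new IllegalArgumentException("one value per source field is required");
    }
    if (sourceNames.size() != targetNames.size()
        || !new HashSet<>(sourceNames).equals(new HashSet<>(targetNames))) {
      throw new IllegalArgumentException("schemas are not compatible: field names differ");
    }
    List<Object> out = new ArrayList<>(targetNames.size());
    for (String name : targetNames) {
      // Look the value up by name rather than by position.
      out.add(sourceValues.get(sourceNames.indexOf(name)));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Object> converted =
        reorder(
            Arrays.asList("name", "id"),
            Arrays.<Object>asList("x", 1),
            Arrays.asList("id", "name"));
    System.out.println(converted); // [1, x]
  }
}
```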

Well, for now it’s mostly related to unit tests (e.g. 
AvroSchemaTest.testPojoRecordToRow()) where we compare a manually created row 
with another row that is created from a POJO with AvroRecordSchema. I’m playing 
with an Avro version upgrade [1], and the test fails because changes in Avro 
make it create an Avro schema with a different order of fields. So I’m 
actually wondering what we can do about that.

[1] https://github.com/apache/beam/pull/17246

> 
> In the same time, while generating a schema with different schema providers, 
> the order of fields can be non-deterministic for some cases.
> 
> For example, “GetterBasedSchemaProvider.toRowFunction(TypeDescriptor)” says 
> [3] that:
> - “schemaFor is non deterministic - it might return fields in an arbitrary 
> order. The reason why is that Java reflection does not guarantee the order in 
> which it returns fields and methods, and these schemas are often based on 
> reflective analysis of classes. “
> 
> So, iiuc, it means that potentially we can have the "same" schema but with 
> different fields order for the same, for example, POJO class but generated on 
> different JVMs. 
> 
> Correct, and see above.
>  
> 
> And actually the questions: 
> - Two Rows with the same field values but with two schemas of different 
> fields order should be considered as two different rows or not?
> - This behaviour explained above - is this that was expected by initial 
> schema design? 
> - If fields order is so important then why?
> 
> PS: My question is actually related to "AvroRecordSchema().toRowFunction()” 
> but I guess other SchemaProvider’s also can be affected.
> 
> 
> —
> Alexey
> 
> [1] 
> https://beam.apache.org/documentation/programming-guide/#schema-definition 
> 
> [2] 
> https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L303
>  
> 
> [3] 
> https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/GetterBasedSchemaProvider.java#L91
>  
> 


Beam Metrics Report (2022-04-05)

2022-04-05 Thread Apache Jenkins Server
ERROR: File 'src/.test-infra/jenkins/metrics_report/beam-metrics_report.html' does not exist

Re: Updating output watermark on bundle boundaries

2022-04-05 Thread Kenneth Knowles
Coming to this a bit late, but to remind that we switched to a whole-window
definition of lateness/droppability. Elements in front of / behind a
watermark or other elements should hardly matter for lateness. Anything
output within a window should be fine as long as the window itself is
treated consistently, no?

I think the idea is to allow runners to have as much flexibility as
possible by making things as unobservable as possible. For example a timer
firing only allows you to bound the input elements' watermark on one side.

Regarding the pure 1:1 mapping transform, we deliberately can observe if a
vanilla DoFn has a startbundle/finishbundle method, by design. So if the
runner knows that the user code cannot possibly observe bundle boundaries,
the runner can have even more flexibility.

Kenn

On Wed, Mar 30, 2022 at 8:43 AM Robert Bradshaw  wrote:

> On Tue, Mar 29, 2022 at 1:11 AM Jan Lukavský  wrote:
> >
> > > There's another interesting API, which is being discussed for the
> > > internal variant of Dataflow, which is that rather than allowing one
> > > to fabricate timestamps (or windows) ex nihilo one would instead need
> > > to ask for a "timestamped" or "windowed" element in the Process
> > > method, from which one could construct a new timestamped/windowed
> > > element (with a new value, but the same timestamp/window/paneinfo)
> > > that could then be safely emitted. I'm curious how constraining this
> > > would be.
> >
> > I'm not sure I follow. Do you suggest that - for the case of in-memory
> > batching - one would store a TimestampedValue in the buffer and when
> > flushing the buffer one would say "I'm emitting this value, that was
> > created based on this input element"?
>
> Correct. The "that was created based on this input element" is
> implicit--it's just that the only way to get timestamps/windows (in an
> ordinary DoFn) is to get them from an input element. (One would also
> separately have "timestamping" and "windowing" operations that do
> assignment to timestamp/window.)
>
> > That seems to work fine, though I
> > suppose this is probably not the main motivation for such API. :)
>
> Actually, this is the motivation for the API. In order to correctly
> buffer, one needs to keep the timestamp/window/paneinfo information
> associated with each element around, so this encourages (or even
> forces) one to do so in the correct way.
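
A rough sketch of the API shape described above, under stated assumptions: user code cannot construct a timestamp itself and can only derive a new element from an input element, so the timestamp travels with the buffered value. `Stamped`, `fromRunner`, `withValue` and `UppercasingBuffer` are hypothetical names, not Beam or Dataflow APIs, and window/pane info is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TimestampedBufferSketch {
  /** Hypothetical timestamped element; user code is meant to obtain it only from the runner. */
  static final class Stamped<T> {
    final T value;
    final long timestampMillis;

    private Stamped(T value, long timestampMillis) {
      this.value = value;
      this.timestampMillis = timestampMillis;
    }

    /** Stand-in for the runner handing an input element to user code. */
    static <T> Stamped<T> fromRunner(T value, long timestampMillis) {
      return new Stamped<>(value, timestampMillis);
    }

    /** New value, same timestamp: the only way user code produces an output element. */
    <U> Stamped<U> withValue(U newValue) {
      return new Stamped<>(newValue, timestampMillis);
    }
  }

  /** In-memory batching that keeps each element's timestamp alongside its transformed value. */
  static final class UppercasingBuffer {
    private final List<Stamped<String>> buffer = new ArrayList<>();

    void process(Stamped<String> element) {
      buffer.add(element.withValue(element.value.toUpperCase(Locale.ROOT)));
    }

    /** Returns the buffered elements and empties the buffer. */
    List<Stamped<String>> flush() {
      List<Stamped<String>> out = new ArrayList<>(buffer);
      buffer.clear();
      return out;
    }
  }

  public static void main(String[] args) {
    UppercasingBuffer fn = new UppercasingBuffer();
    fn.process(Stamped.fromRunner("a", 10L));
    fn.process(Stamped.fromRunner("b", 25L));
    for (Stamped<String> e : fn.flush()) {
      System.out.println(e.value + " @ " + e.timestampMillis);
    }
  }
}
```

Because the buffer stores the derived element rather than the bare value, flushing cannot accidentally emit a value with a timestamp that did not come from some input element.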
>
> > On 3/28/22 20:54, Robert Bradshaw wrote:
> > > On Mon, Mar 28, 2022 at 11:45 AM Jan Lukavský  wrote:
> > >> On 3/28/22 20:17, Reuven Lax wrote:
> > >>
> > >> On Mon, Mar 28, 2022 at 11:08 AM Robert Bradshaw 
> wrote:
> > >>> On Mon, Mar 28, 2022 at 11:04 AM Reuven Lax 
> wrote:
> >  On Mon, Mar 28, 2022 at 10:59 AM Evan Galpin 
> wrote:
> > > I don't believe that the issue is Flink specific but rather that
> Flink is one example of many potential examples.  Enforcing that watermark
> updates can only happen at bundle boundaries would ensure that any data
> buffered while processing a single bundle in a DoFn could be output
> ON_TIME, especially without any need for a TimerSpec to explicitly hold the
> watermark for that purpose.  This is in reference to data buffered within a
> single bundle, and not cross-bundle buffering such as in the case of
> GroupIntoBatches.
> > 
> >  Any in-flight data (i.e. data being processed that is not yet
> committed back to the runner) must hold up the output watermark. Since in
> the Beam model all records in a bundle are somewhat atomic (e.g. if the
bundle succeeds, none of them should be replayed in a proper
exactly-once runner), I think this implicitly means that any elements in an
in-flight bundle must hold up the watermark. This doesn't mean that the
watermark can't advance while the bundle is in flight, just that it can't
> advance past any of the timestamps outstanding in the bundle.
> > >>> Yes. The difficulty is that we don't have much visibility into
> > >>> "timestamps outstanding in the bundle" so we have to take
> > >>> min(timestamps of input elements in the bundle) which is not that
> > >>> different from only having watermark updates at bundle boundaries.
> > >>
> > >> Exactly.
> > >>
> > >> Agree, this works exactly the same. The requirement is not to not
> update the watermark, but not to update it past any on-time element in the
> bundle. Not updating the watermark at all is one solution, computing
> min(timestamps in bundle) works the same. Unfortunately, Flink does not
> construct bundles in advance, it is more an ad-hoc concept. Therefore the
> only way to hold the watermark is not to update it, because the timestamps
> of elements that will be part of the bundle are not known.
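
The min(timestamps in bundle) hold discussed above can be sketched as follows. This is a simplified model with timestamps as plain milliseconds, not any runner's actual watermark code; `outputWatermark` is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

final class WatermarkHoldSketch {
  /**
   * The output watermark may follow the input watermark, but never past the
   * earliest timestamp of an element that is still in flight in the bundle.
   */
  static long outputWatermark(long inputWatermark, Collection<Long> inFlightTimestamps) {
    long hold = Long.MAX_VALUE; // no in-flight elements means no hold
    for (long ts : inFlightTimestamps) {
      hold = Math.min(hold, ts);
    }
    return Math.min(inputWatermark, hold);
  }

  public static void main(String[] args) {
    System.out.println(outputWatermark(100L, Arrays.asList(40L, 70L))); // 40
    System.out.println(outputWatermark(100L, Collections.<Long>emptyList())); // 100
  }
}
```

A runner that does not know the bundle's contents in advance cannot compute this minimum, which is why not advancing the watermark during the bundle is the fallback described above.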
> > >>
> > >> Two more questions:
> > >>
> > >>   a) it seems that we are missing some @ValidatesRunner tests for
> this, right?
> > >>
> > >>   b) should we relax the restriction of not allowing
> outputWithTimestamp() output element before the current element? I think it
> should be "before lowest

Re: Re: RE: Re: unvendoring bytebuddy

2022-04-05 Thread Kenneth Knowles
Hmm. Too bad the information on the jira is inadequate to explain or
justify the change. TBH if faced with a conflict between bytebuddy and
mockito, working to use mocks less, or in more straightforward ways, would
have been my preference. This isn't actually a diamond dep problem that
impacts users, or I would feel differently.

I guess we want to have some thorough testing & survey of the versions of
bytebuddy and asm used by transitive deps and any possible breaking
changes. Should be pretty easy since unvendoring is much easier
to experiment with. Apologies if this already exists in some other thread
or on the ticket.

Kenn

On Wed, Mar 23, 2022 at 11:16 AM Reuven Lax  wrote:

> We also use ASM directly. If we unshade Bytebuddy, does that also unshade
> ASM? Does ASM provide similar stability guarantees?
>
> On Wed, Mar 23, 2022 at 10:50 AM Liam Miller-Cushon 
> wrote:
>
>> On 2022/03/21 19:36:29 Robert Bradshaw wrote:
>> > My understanding was that Bytebuddy was originally unvendored, and we
>> > vendored it in reaction to version incompatibility issues (despite the
>> > promise of API stability). I think we should have a good justification
>> > for why we won't get bitten by this again before moving back to
>> > unvendored.
>>
>> Does anyone have context on the specific issues that
>> https://issues.apache.org/jira/browse/BEAM-1019 was a reaction to?
>>
>> The bug was filed in 2016 and mentions upgrading to mockito 2.0. One
>> potential justification for trying to unvendor is that the issue is fairly
>> old, and the library has continued to stabilize, so it might be safer to
>> unvendor today than it was in 2016.
>>
>


P1 issues report (75)

2022-04-05 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-14253: pubsublite.ReadWriteIT 
failing in beam_PostCommit_Java_DataflowV1 (created 2022-04-05)
https://issues.apache.org/jira/browse/BEAM-14252: 
beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors 
(created 2022-04-05)
https://issues.apache.org/jira/browse/BEAM-14244: Processing time timers 
should use outputTimestamp rather than input watermark for their timestamp 
(created 2022-04-04)
https://issues.apache.org/jira/browse/BEAM-14227: CVE-2022-22965 
vulnerability found in java-io-kafka component (created 2022-04-01)
https://issues.apache.org/jira/browse/BEAM-14191: CrossLanguageJdbcIOTest 
broken with "Cannot load JDBC driver class 'com.mysql.cj.jdbc.Driver'" (created 
2022-03-28)
https://issues.apache.org/jira/browse/BEAM-14138: Python PostCommit BQ test 
failures due to NOT_FOUND for Dataset (created 2022-03-21)
https://issues.apache.org/jira/browse/BEAM-14135: BigQuery Storage API 
insert with writeResult retry and write to error table (created 2022-03-20)
https://issues.apache.org/jira/browse/BEAM-14126: Python 3.10 Support 
(created 2022-03-18)
https://issues.apache.org/jira/browse/BEAM-14064: ElasticSearchIO#Write 
buffering and outputting across windows (created 2022-03-07)
https://issues.apache.org/jira/browse/BEAM-14017: 
beam_PreCommit_CommunityMetrics_Cron is failing. (created 2022-03-01)
https://issues.apache.org/jira/browse/BEAM-13953: Document BigQueryIO 
Storage Write API methods (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13952: Dataflow streaming tests 
failing new AfterSynchronizedProcessingTime test (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13950: PVR_Spark2_Streaming 
perma-red (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13920: Beam x-lang Dataflow 
tests failing due to _InactiveRpcError (created 2022-02-10)
https://issues.apache.org/jira/browse/BEAM-13852: 
KafkaIO.read.withDynamicRead() doesn't pick up new TopicPartitions (created 
2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13850: 
beam_PostCommit_Python_Examples_Dataflow failing (created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13822: GBK and CoGBK streaming 
Java load tests failing (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13805: Simplify version override 
for Dev versions of the Go SDK. (created 2022-02-02)
https://issues.apache.org/jira/browse/BEAM-13798: Upgrade Kubernetes 
Clusters (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13747: Add integration testing 
for BQ Storage API  write modes (created 2022-01-26)
https://issues.apache.org/jira/browse/BEAM-13715: Kafka commit offset drop 
data on failure for runners that have non-checkpointing shuffle (created 
2022-01-21)
https://issues.apache.org/jira/browse/BEAM-13657: Sunset Python 3.6 support 
in Beam (created 2022-01-13)
https://issues.apache.org/jira/browse/BEAM-13582: Beam website precommit 
mentions broken links, but passes. (created 2021-12-30)
https://issues.apache.org/jira/browse/BEAM-13487: WriteToBigQuery Dynamic 
table destinations returns wrong tableId (created 2021-12-17)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13204: Missing code snippets in 
website (created 2021-11-08)
https://issues.apache.org/jira/browse/BEAM-13164: Race between member 
variable being accessed due to leaking uninitialized state via 
OutboundObserverFactory (created 2021-11-01)
https://issues.apache.org/jira/browse/BEAM-13132: WriteToBigQuery submits a 
duplicate BQ load job if a 503 error code is returned from googleapi (created 
2021-10-27)
https://issues.apache.org/jira/browse/BEAM-13087: 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible 
(created 2021-10-20)
https://issues.apache.org/jira/browse/BEAM-13078: Python DirectRunner does 
not emit data at GC time (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13076: Python AfterAny, AfterAll 
do not follow spec (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13010: Delete orphaned files 
(created 2021-10-06)
https://issues.apache.org/jira/browse/BEAM-12995: Consumer group with 
random prefix (created 2021-10-04)
https://issues.apache.org/jira/browse/BEAM

Flaky test issue report (51)

2022-04-05 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-14216: Multiple XVR Suites 
having similar flakes simultaneously (created 2022-03-31)
https://issues.apache.org/jira/browse/BEAM-14172: beam_PreCommit_PythonDocs 
failing (jinja2) (created 2022-03-24)
https://issues.apache.org/jira/browse/BEAM-13952: Dataflow streaming tests 
failing new AfterSynchronizedProcessingTime test (created 2022-02-15)
https://issues.apache.org/jira/browse/BEAM-13859: Test flake: 
test_split_half_sdf (created 2022-02-09)
https://issues.apache.org/jira/browse/BEAM-13850: 
beam_PostCommit_Python_Examples_Dataflow failing (created 2022-02-08)
https://issues.apache.org/jira/browse/BEAM-13822: GBK and CoGBK streaming 
Java load tests failing (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13810: Flaky tests: Gradle build 
daemon disappeared unexpectedly (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13809: beam_PostCommit_XVR_Flink 
flaky: Connection refused (created 2022-02-03)
https://issues.apache.org/jira/browse/BEAM-13797: Flakes: Failed to load 
cache entry (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13783: 
apache_beam.transforms.combinefn_lifecycle_test.LocalCombineFnLifecycleTest.test_combine
 is flaky (created 2022-02-01)
https://issues.apache.org/jira/browse/BEAM-13708: flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored (created 2022-01-20)
https://issues.apache.org/jira/browse/BEAM-13575: Flink 
testParDoRequiresStableInput flaky (created 2021-12-28)
https://issues.apache.org/jira/browse/BEAM-13519: Java precommit flaky 
(timing out) (created 2021-12-22)
https://issues.apache.org/jira/browse/BEAM-13500: NPE in Flink Portable 
ValidatesRunner streaming suite (created 2021-12-21)
https://issues.apache.org/jira/browse/BEAM-13453: Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use 
(created 2021-12-13)
https://issues.apache.org/jira/browse/BEAM-13393: GroupIntoBatchesTest is 
failing (created 2021-12-07)
https://issues.apache.org/jira/browse/BEAM-13367: 
[beam_PostCommit_Python36] [ 
apache_beam.io.gcp.experimental.spannerio_read_it_test] Failure summary 
(created 2021-12-01)
https://issues.apache.org/jira/browse/BEAM-13312: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
 is flaky in Java Spark ValidatesRunner suite  (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13311: 
org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
 is flaky in Java ValidatesRunner Flink suite. (created 2021-11-23)
https://issues.apache.org/jira/browse/BEAM-13237: 
org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
 flaky on Dataflow Runner V2 (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13234: Flake in 
StreamingWordCountIT.test_streaming_wordcount_it (created 2021-11-12)
https://issues.apache.org/jira/browse/BEAM-13025: pubsublite.ReadWriteIT 
flaky in beam_PostCommit_Java_DataflowV2   (created 2021-10-08)
https://issues.apache.org/jira/browse/BEAM-12928: beam_PostCommit_Python36 
- CrossLanguageSpannerIOTest - flakey failing (created 2021-09-21)
https://issues.apache.org/jira/browse/BEAM-12859: 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12858: 
org.apache.beam.sdk.io.gcp.datastore.RampupThrottlingFnTest.testRampupThrottler 
is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12809: 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky (created 2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12794: 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12793: 
beam_PostRelease_NightlySnapshot failed (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12673: 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey (created 2021-07-28)
https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12322: Python precommit flaky: 
Failed to read inputs in th

[Question] Beam Schema, fields order

2022-04-05 Thread Alexey Romanenko
Hi everyone,

I have several Beam Schema-related questions since I didn’t find an exact 
answer for that. Let me give some brief intro before.

The order of fields in a Schema is something users have to pay attention to, 
iiuc. In other words, two schemas with the same set of fields but in a 
different order will be considered two different schemas, right? 

In Beam Schema doc [1] it’s said:
- “The schema for a PCollection defines elements of that PCollection as an 
ordered list of named fields.”

Also, "Schema.equals(Object)" says [2]:
- “Returns true if two Schemas have the same fields in the same order."

So, a different field order matters.

Additionally, since "Schema.equals()" is used in "Row.equals()", it means 
that two Rows with differently ordered schemas but the same values will be 
considered different rows. Is that correct?
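
A toy stdlib-only model of the behaviour being asked about, not Beam's Row or Schema classes: equality that compares field names and values positionally treats reordered rows as different, while a name-keyed comparison treats them as the same. `ToyRow` and `sameContentIgnoringOrder` are hypothetical names, and field names are assumed to be unique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class RowEqualitySketch {
  /** Toy stand-in for a row: ordered field names plus positional values. */
  static final class ToyRow {
    final List<String> fieldNames;
    final List<Object> values;

    ToyRow(List<String> fieldNames, List<Object> values) {
      if (fieldNames.size() != values.size()) {
        throw new IllegalArgumentException("one value per field is required");
      }
      this.fieldNames = new ArrayList<>(fieldNames);
      this.values = new ArrayList<>(values);
    }

    /** Order-sensitive equality: same names in the same order, same values in the same order. */
    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ToyRow)) {
        return false;
      }
      ToyRow other = (ToyRow) o;
      return fieldNames.equals(other.fieldNames) && values.equals(other.values);
    }

    @Override
    public int hashCode() {
      return Objects.hash(fieldNames, values);
    }

    /** View of the row keyed by field name, which discards the field order. */
    Map<String, Object> byName() {
      Map<String, Object> m = new HashMap<>();
      for (int i = 0; i < fieldNames.size(); i++) {
        m.put(fieldNames.get(i), values.get(i));
      }
      return m;
    }
  }

  /** Order-insensitive comparison, e.g. for a unit test that should not depend on field order. */
  static boolean sameContentIgnoringOrder(ToyRow a, ToyRow b) {
    return a.byName().equals(b.byName());
  }

  public static void main(String[] args) {
    ToyRow a = new ToyRow(Arrays.asList("id", "name"), Arrays.<Object>asList(1, "x"));
    ToyRow b = new ToyRow(Arrays.asList("name", "id"), Arrays.<Object>asList("x", 1));
    System.out.println(a.equals(b)); // false
    System.out.println(sameContentIgnoringOrder(a, b)); // true
  }
}
```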

At the same time, when generating a schema with different schema providers, 
the order of fields can be non-deterministic in some cases.

For example, "GetterBasedSchemaProvider.toRowFunction(TypeDescriptor)" says [3] 
that:
- "schemaFor is non deterministic - it might return fields in an arbitrary 
order. The reason why is that Java reflection does not guarantee the order in 
which it returns fields and methods, and these schemas are often based on 
reflective analysis of classes."

So, iiuc, it means that we can potentially have the "same" schema but with a 
different field order for the same class (a POJO, for example) when it is 
generated on different JVMs. 

And now the actual questions: 
- Should two Rows with the same field values but with schemas of different 
field order be considered two different rows or not?
- Is the behaviour explained above what was expected by the initial schema 
design? 
- If field order is so important, then why?

PS: My question is actually related to "AvroRecordSchema().toRowFunction()" but 
I guess other SchemaProviders can also be affected.


—
Alexey

[1] https://beam.apache.org/documentation/programming-guide/#schema-definition
[2] 
https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L303
[3] 
https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/GetterBasedSchemaProvider.java#L91

[Question] Connect Cloud SQL postgres instance for load test.

2022-04-05 Thread Andoni Guzman Becerra
Hi All,
I have a question related to Cloud SQL. I'm adding a new load test that
works with Debezium, checking that readFromDebezium works well. For this I
need to create a Cloud SQL Postgres instance to run some queries and check
that Debezium catches the changes correctly. The problem is that the Cloud
SQL instance does not accept new authorized connections because of this
constraint (Restrict Authorized Networks on Cloud SQL instances). Is there a
way to connect to the instance, or should the constraint be removed? The
load test will run on a Jenkins instance, and the Cloud SQL instance will be
created and deleted in each test.

Thank you! Regards

-- 

Andoni Guzman | WIZELINE

Software Engineer II

andoni.guz...@wizeline.com

Amado Nervo 2200, Esfera P6, Col. Ciudad del Sol, 45050 Zapopan, Jal.



Re: Beam Metrics Report (2022-04-05)

2022-04-05 Thread Manu Zhang
Is this still working?

On Tue, Apr 5, 2022 at 19:56, Apache Jenkins Server wrote:

> ERROR: File
> 'src/.test-infra/jenkins/metrics_report/beam-metrics_report.html' does not
> exist


Re: [DISCUSS] Deprecation of AWS SDK v1 IO connectors

2022-04-05 Thread Moritz Mack
Thank you, David!
Fan Out Reads are certainly on my mind, though probably nothing I’ll be able to 
tackle soon. I agree, that would be a great feature to get attention for v2.
Best,
Moritz


From: David Hollands 
Date: Monday, 21. March 2022 at 18:53
To: dev@beam.apache.org 
Subject: Re: [DISCUSS] Deprecation of AWS SDK v1 IO connectors
Hi Moritz -

I just wanted to say a big THANK YOU for the huge improvements you have 
contributed to the AWS IOs in recent releases!

Such great use of Testcontainers and localstack! 🙂

+1 to deprecating v1 gracefully.  Adding KinesisIO Fan Out Read in v2 and not 
v1 would be a nice carrot to entice some users I know away from v1... 😜

Best, David


Want to work with us? 
https://datalab.rocks

___

David Hollands

Principal Data Engineer, Datalab

BBC Product Group



London:  Broadcast Centre, 201 Wood Lane, London, W12 7TP


From: Moritz Mack 
Sent: 21 March 2022 12:58
To: dev@beam.apache.org 
Subject: Re: [DISCUSS] Deprecation of AWS SDK v1 IO connectors


Thank you both! Absolutely agree on reaching out to users!



The release of 2.38 seems to be a very good time to do so to also announce 
feature parity between the two versions.

I’ll get back to the two of you for some help reaching out to users on Twitter 
then!



Thanks, Moritz







From: Alexey Romanenko 
Date: Sunday, 20. March 2022 at 00:23
To: dev@beam.apache.org 
Subject: Re: [DISCUSS] Deprecation of AWS SDK v1 IO connectors

Yes, good point about asking on Twitter, Ahmet!



On 19 Mar 2022, at 04:14, Ahmet Altay <al...@google.com> wrote:



Moritz, thank you for your thoughtful approach.



+1 to Alexey's suggestion. One minor additional suggestion, in addition to 
asking on user@ you can also ask on beam's twitter. Historically we were able 
to reach more users on Twitter compared to the user list.



On Fri, Mar 18, 2022 at 10:01 AM Alexey Romanenko <aromanenko@gmail.com> wrote:

First of all - many thanks for your work to make "amazon-web-services2” stable 
and up-to-date with already supported features of v1 version.



+1 to deprecating all "amazon-web-services" (AWS SDK v1) IOs and recommending 
only the v2 API, but first I’d suggest asking people on user@ whether they 
have specific reasons to delay it or any other feedback on this that we might 
miss.



Regarding the major releases and removing deprecated code, IIUC, we agreed that 
deprecated code should stay (but is not required to be supported) for at least 
3 minor releases (x.y), and it can finally be deleted if we don’t discover any 
regressions or user complaints by then.



On the other hand, Beam 3.0 should be the next major release, but I’m not sure 
it’s even on the distant horizon for now, since this is a topic that we 
haven’t discussed for a long time (maybe it’s a good time to come back to it).



—

Alexey





On 18 Mar 2022, at 12:19, Moritz Mack <mm...@talend.com> wrote:



Dear all,



I’d like to bring up an old discussion again [1].

Currently we have two different versions of AWS IO connectors in Beam for the 
Java SDK:

  *   amazon-web-services  [2] and kinesis [3] for the AWS Java SDK v1
  *   amazon-web-services2 (including kinesis) [4] for the AWS Java SDK v2



Maintaining two different versions is obviously painful, so working towards 
sunsetting the earlier v1 is important.

Though, historically v1 had (and likely still has) the broader adoption due to 
a lack of awareness, but also certainly a lack of features in v2.



I’ve recently focused a lot on preparing the deprecation of v1, specifically:

  *   implementing all outstanding features in v2 (above all write support for 
KinesisIO) [5]
  *   full integration test coverage for all IOs in v2 (using localstack), but 
also improved general test coverage & quality
  *   more consistent configuration and tons of bug fixes



Where I’m looking for general feedback is how to proceed next: