Re: IO Connector

2021-10-18 Thread Matt Casters
Thanks a lot for the advice given last week.
Just to circle back: I've updated Gradle to a recent version, which appears
to be 7.1.1 ... to no avail.

tasks like:

./gradlew idea

simply result in:

Task 'idea' not found in root project 'beam'.

The same goes for the other suggestions.

As for Google Auto: is this project still maintained? The docs and so on
seem to be getting quite old.
The annotation processor in the latest IntelliJ doesn't seem to get picked
up even if you configure it manually in the settings.
So I'll skip that one for now.

Are there any build instructions I can follow for Beam to at least try to
build the Java SDK and go from there?

Thanks,

Matt


On Tue, Oct 12, 2021 at 9:58 PM Evan Galpin  wrote:

> @Matt have you tried any of the "IDE Tasks" available through Gradle?
> "./gradlew tasks" from the Beam top-level will list available tasks, and the
> IDE Tasks subsection includes tasks specific to bootstrapping or cleaning
> up the Beam project in either Eclipse or IntelliJ.  Ex. "./gradlew idea"
> should set up the project files for use in IntelliJ.  There's also
> "./gradlew cleanIdea" which may be helpful to you.
>
> With respect to Google Auto, I've experienced plenty of IDE complaints
> around missing types and the like, and those will likely persist until the
> code area that you're working on is compiled because the types won't exist
> until the sources are generated at the pre-processing stage. Not sure if that
> was the issue you were having, but if so hopefully this helps.
>
> Thanks,
> Evan
>
> On Tue, Oct 12, 2021 at 2:51 PM Matt Casters 
> wrote:
>
>> Thanks Chamikara, but I'm quite familiar with the Beam API, and the
>> contribution guide did not answer my questions.
>>
>> On Tue, Oct 12, 2021 at 8:49 PM Chamikara Jayalath 
>> wrote:
>>
>>> If you haven't already, going through the Beam contribution guide and various
>>> links from there might help: https://beam.apache.org/contribute/
>>> Regarding developing I/O connectors, please see the guide here:
>>> https://beam.apache.org/documentation/io/developing-io-overview/
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Oct 12, 2021 at 6:08 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Hi Matt,

 On 12 Oct 2021, at 10:02, Matt Casters 
 wrote:

 1) Setting up my Beam development environment for IDEA 2021.2 is something
 that's going wrong, probably around Gradle configurations.

 2) I can't get Google Auto to work in my IDE (IDEA) because of what seems
 to be outdated documentation?


 Could you elaborate more on what is wrong with 1) and 2)?

>>>
 3) Since I'm obviously planning to generate a PR at the end of this
 exercise: what is the suggested code format for Java in the Beam project?


 Please, run this command before committing your changes:

 ./gradlew spotlessApply && ./gradlew
 -PenableCheckerFramework=true checkstyleMain checkstyleTest javadoc
 spotbugsMain compileJava compileTestJava

 To save time, run it only against the package where you made the changes.

 —
 Alexey


>>
>> --
>> Neo4j Chief Solutions Architect
>> *✉   *matt.cast...@neo4j.com
>>
>>
>>
>>


Re: Performance of Apache Beam

2021-10-18 Thread Jan Lukavský

Hi Azhar,

-dev  +user 

this kind of question cannot be answered in general. The overhead will 
depend on the job and the SDK you use. Using the Java SDK with the 
(classical) FlinkRunner should give the best performance on Flink, although 
the overhead will not be completely eliminated. The way Beam is constructed - 
with portability being one of the main concerns - necessarily brings 
some overhead compared to a job written and optimized for a single 
runner only (using Flink's native API in this case). I'd suggest you 
evaluate the programming model and portability guarantees that Apache 
Beam gives you, rather than pure performance alone. On the other hand, 
Apache Beam tries hard to minimize the overhead, so you should not expect 
*vastly* worse performance. I'd say the best way to go is to implement a 
simplistic pipeline somewhat representing your use case and then measure 
the performance of that specific instance.


Regarding fault tolerance and backpressure, the Apache Beam model does not 
handle those (with the exception of bundles being processed as atomic 
units), so these are delegated to the runner - FlinkRunner will 
therefore behave the way Apache Flink defines these concepts.


Hope this helps,

 Jan

On 10/17/21 17:53, azhar mirza wrote:

Hi Team
Could you please help me with the following questions?

I need to know the performance of Apache Beam vs. Flink: if we use Flink as 
the runner for Beam, what will be the additional overhead of translating 
Beam to Flink?

How are fault tolerance and resiliency handled in Apache Beam?
How does Apache Beam handle backpressure?

Thanks
Azhar
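[Editor's note] As a concrete starting point for Jan's suggestion of measuring a simplistic pipeline, a sketch like the one below can be run once with `--runner=FlinkRunner` and once rewritten against Flink's native API over the same data, comparing wall-clock times. This assumes the Beam Java SDK and the classical FlinkRunner are on the classpath; the file paths and the per-element work are placeholders to replace with something representative of your use case.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ThroughputProbe {
  public static void main(String[] args) {
    // Pass e.g. --runner=FlinkRunner --flinkMaster=... to target Flink.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("Read", TextIO.read().from("/path/to/input*"))       // placeholder input
        .apply("Key", MapElements.into(TypeDescriptors.strings())
            .via((String line) -> line.split(",")[0]))           // representative per-element work
        .apply("CountPerKey", Count.perElement())                // forces a shuffle, exercising the runner
        .apply("Format", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply("Write", TextIO.write().to("/path/to/output"));   // placeholder output

    p.run().waitUntilFinish();
  }
}
```

Timing this against a hand-written Flink job reading the same files gives a rough measure of the translation overhead for this specific shape of pipeline.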

Beam Dependency Check Report (2021-10-18)

2021-10-18 Thread Apache Jenkins Server
ERROR: File 'src/build/dependencyUpdates/beam-dependency-check-report.html' does not exist

Re: Why is Avro Date field using InstantCoder?

2021-10-18 Thread Cristian Constantinescu
I will have a look after I find a workaround as I really need to deliver
some things and using Avro 1.8 isn't really an option.

But once that's done, I'd love to find ways to make Beam less dependent on
Avro 1.8 considering it was released in 2017.

On Mon, Oct 18, 2021 at 12:34 PM Reuven Lax  wrote:

> Do you know if it's easy to detect which version of Avro is being used?
>
> On Sun, Oct 17, 2021 at 10:20 PM Cristian Constantinescu 
> wrote:
>
>> If I had to change things, I would:
>>
>> 1. When deriving the Schema, add a few new types (JAVA_TIME, JAVA_DATE, or
>> something along those lines).
>> 2. RowCoderGenerator around line 159 calls
>> "SchemaCoder.coderForFieldType(schema.getField(rowIndex).getType().withNullable(false));"
>> which eventually gets to SchemaCoderHelpers.coderForFieldType. There,
>> CODER_MAP has a hard reference on InstantCoder for DATETIME. Maybe that map
>> can be augmented (possibly dynamically) with new
>> fieldtypes-coder combinations to take care of the new types from #1.
>>
>> I would also like to ask. Looking through the Beam code, I see a lot of
>> static calls. Just wondering why it's done this way. I'm used to projects
>> having some form of dependency injection involved and static calls being
>> frowned upon (lack of mockability, hidden dependencies, tight coupling
>> etc). The only reason I can think of is serializability given Beam's
>> multi-node processing?
>>
>>
>> On Sat, Oct 16, 2021 at 3:11 AM Reuven Lax  wrote:
>>
>>> Is the Schema inference the only reason we can't upgrade Avro, or are
>>> there other blockers? Is there any way we can tell at runtime which version
>>> of Avro is running? Since we generate the conversion code at runtime with
>>> ByteBuddy, we could potentially just generate different conversions
>>> depending on the Avro version.
>>>
>>> On Fri, Oct 15, 2021 at 11:56 PM Cristian Constantinescu <
>>> zei...@gmail.com> wrote:
>>>
 Those are fair points. However please consider that there might be new
 users who will decide that Beam isn't suitable because of things like
 requiring Avro 1.8, Joda time, old Confluent libraries, and, when I started
 using Beam about a year ago, Java 8 (I think we're okay with Java 11 now).

 I guess what I'm saying is that there's definitely a non-negligible
 cost associated with old 3rd party libs in Beam's code (even if efforts are
 put in to minimize them).

 On Sat, Oct 16, 2021 at 2:33 AM Reuven Lax  wrote:

>
>
> On Fri, Oct 15, 2021 at 11:13 PM Cristian Constantinescu <
> zei...@gmail.com> wrote:
>
>> To my knowledge and reading through AVRO's Jira[1], it does not
>> support jodatime anymore.
>>
>> It seems everything related to this Avro 1.8 dependency is tricky. If
>> you recall, it also prevents us from upgrading to the latest Confluent
>> libs... for enabling Beam to use protobufs schemas with the schema
>> registry. (I was also the one who brought that issue up, also made an
>> exploratory PR to move AVRO outside of Beam core.)
>>
>> I understand that Beam tries to maintain public APIs stable, but I'd
>> like to put forward two points:
>> 1) Schemas are experimental, hence there shouldn't be any API
>> stability guarantees there.
>>
>
> Unfortunately at this point, they aren't really. As a community we've
> been bad about removing the Experimental label - many core, core parts of
> Beam are still labeled experimental (sources, triggering, state, timers).
> Realistically they are no longer experimental.
>
> 2) Maybe this is the perfect opportunity for the Beam community to
>> think about the long term effects of old dependencies within Beam's
>> codebase, and especially how to deal with them. Perhaps
>> starting/maintaining an "experimental" branch/maven-published-artifacts
>> where Beam does not guarantee backwards compatibility (or maintains it 
>> for
>> a shorter period of time) is something to think about.
>>
>
> This is why we usually try to prevent third-party libraries from being
> in our public API. However in this case, that's tricky.
>
> The beam community can of course decide to break backwards
> compatibility. However as stated today, it is maintained. The last time we
> broke backwards compatibility was when the old Dataflow API was
> transitioned to Beam, and it was very painful. It took multiple years to
> get some users onto the Beam API due to the code changes required to
> migrate (and those required code changes weren't terribly invasive).
>
>
>>
>> [1] https://issues.apache.org/jira/browse/AVRO-2335
>>
>> On Sat, Oct 16, 2021 at 12:40 AM Reuven Lax  wrote:
>>
>>> Does this mean more recent versions of Avro aren't backwards
>>> compatible with Avro 1.8? If so, this might be tricky to fix, since Beam
>>> maintains backwards compatibility.
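
[Editor's note] On Reuven's question about detecting the Avro version at runtime, one hedged approach is a classpath probe via reflection. The sketch below keys off `TimeConversions$TimestampMillisConversion`, which to my knowledge exists in Avro 1.9+ but not in 1.8 (where the Joda-based `TimestampConversion` is used instead); treat the exact marker class as an assumption to verify against the Avro javadocs.

```java
public class AvroProbe {
  /** Returns "1.9+", "1.8", or "absent", based on which marker classes load. */
  public static String probe() {
    if (!classExists("org.apache.avro.Schema")) {
      return "absent"; // Avro is not on the classpath at all
    }
    // Assumption: this java.time-based conversion was introduced in Avro 1.9.
    return classExists("org.apache.avro.data.TimeConversions$TimestampMillisConversion")
        ? "1.9+"
        : "1.8";
  }

  private static boolean classExists(String name) {
    try {
      // Load without initializing, so probing has no side effects.
      Class.forName(name, false, AvroProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("Detected Avro: " + probe());
  }
}
```

ByteBuddy-generated conversions could then branch on the probe result, as suggested earlier in the thread.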

P1 issues report (49)

2021-10-18 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-13074: Metrics are not reported 
by the Flink runner (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13060: Daily Python SDK build is 
not publicly accessible (created 2021-10-15)
https://issues.apache.org/jira/browse/BEAM-13059: Migrate GKE workloads to 
Containerd (created 2021-10-15)
https://issues.apache.org/jira/browse/BEAM-13058: Upgrade Kubernetes APIs 
(created 2021-10-15)
https://issues.apache.org/jira/browse/BEAM-13056: 
DoFnSignature.fieldAccessDeclarations missing implicit accesses (created 
2021-10-14)
https://issues.apache.org/jira/browse/BEAM-13025: 
beam_PostCommit_Java_DataflowV2 failing pubsublite.ReadWriteIT (created 
2021-10-08)
https://issues.apache.org/jira/browse/BEAM-13010: Delete orphaned files 
(created 2021-10-06)
https://issues.apache.org/jira/browse/BEAM-12995: Consumer group with 
random prefix (created 2021-10-04)
https://issues.apache.org/jira/browse/BEAM-12959: Dataflow error in 
CombinePerKey operation (created 2021-09-26)
https://issues.apache.org/jira/browse/BEAM-12867: Either Create or 
DirectRunner fails to produce all elements to the following transform (created 
2021-09-09)
https://issues.apache.org/jira/browse/BEAM-12843: (Broken Pipe induced) 
Bricked Dataflow Pipeline  (created 2021-09-06)
https://issues.apache.org/jira/browse/BEAM-12818: When writing to GCS, 
spread prefix of temporary files and reuse autoscaling of the temporary 
directory (created 2021-08-30)
https://issues.apache.org/jira/browse/BEAM-12807: Java creates an incorrect 
pipeline proto when core-construction-java jar is not in the CLASSPATH (created 
2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12792: Beam worker only installs 
--extra_package once (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12632: ElasticsearchIO: Enabling 
both User/Pass auth and SSL overwrites User/Pass (created 2021-07-16)
https://issues.apache.org/jira/browse/BEAM-12540: 
beam_PostRelease_NightlySnapshot - Task 
:runners:direct-java:runMobileGamingJavaDirect FAILED (created 2021-06-25)
https://issues.apache.org/jira/browse/BEAM-12525: SDF BoundedSource seems 
to execute significantly slower than 'normal' BoundedSource (created 2021-06-22)
https://issues.apache.org/jira/browse/BEAM-12505: codecov/patch has poor 
behavior (created 2021-06-17)
https://issues.apache.org/jira/browse/BEAM-12500: Dataflow SocketException 
(SSLException) error while trying to send message from Cloud Pub/Sub to 
BigQuery (created 2021-06-16)
https://issues.apache.org/jira/browse/BEAM-12484: JdbcIO date conversion is 
sensitive to OS (created 2021-06-14)
https://issues.apache.org/jira/browse/BEAM-12467: 
java.io.InvalidClassException With Flink Kafka (created 2021-06-09)
https://issues.apache.org/jira/browse/BEAM-12279: Implement 
destination-dependent sharding in FileIO.writeDynamic (created 2021-05-04)
https://issues.apache.org/jira/browse/BEAM-12256: 
PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some 
Avro logical types (created 2021-04-29)
https://issues.apache.org/jira/browse/BEAM-11959: Python Beam SDK Harness 
hangs when installing pip packages (created 2021-03-11)
https://issues.apache.org/jira/browse/BEAM-11906: No trigger early 
repeatedly for session windows (created 2021-03-01)
https://issues.apache.org/jira/browse/BEAM-11875: XmlIO.Read does not 
handle XML encoding per spec (created 2021-02-26)
https://issues.apache.org/jira/browse/BEAM-11828: JmsIO is not 
acknowledging messages correctly (created 2021-02-17)
https://issues.apache.org/jira/browse/BEAM-11755: Cross-language 
consistency (RequiresStableInputs) is quietly broken (at least on portable 
flink runner) (created 2021-02-05)
https://issues.apache.org/jira/browse/BEAM-11578: `dataflow_metrics` 
(python) fails with TypeError (when int overflowing?) (created 2021-01-06)
https://issues.apache.org/jira/browse/BEAM-11148: Kafka 
commitOffsetsInFinalize OOM on Flink (created 2020-10-28)
https://issues.apache.org/jira/browse/BEAM-11017: Timer with dataflow 
runner can be set multiple times (dataflow runner) (created 2020-10-05)
https://issues.apache.org/jira/browse/BEAM-10670: Make non-portable 
Splittable DoFn the only option when executing Java "Read" transforms (created 
2020-08-10)
https://issues.apache.org/ji

Flaky test issue report (30)

2021-10-18 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-13025: 
beam_PostCommit_Java_DataflowV2 failing pubsublite.ReadWriteIT (created 
2021-10-08)
https://issues.apache.org/jira/browse/BEAM-12928: beam_PostCommit_Python36 
- CrossLanguageSpannerIOTest - flakey failing (created 2021-09-21)
https://issues.apache.org/jira/browse/BEAM-12859: 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12809: 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky (created 2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12794: 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12540: 
beam_PostRelease_NightlySnapshot - Task 
:runners:direct-java:runMobileGamingJavaDirect FAILED (created 2021-06-25)
https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12322: Python precommit flaky: 
Failed to read inputs in the data plane (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12320: 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL 
PostCommit (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-11837: Java build flakes: 
"Memory constraints are impeding performance" (created 2021-02-18)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11645: beam_PostCommit_XVR_Flink 
failing (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11641: Bigquery Read tests are 
flaky on Flink runner in Python PostCommit suites (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11541: 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner. 
(created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-10955: Flink Java Runner test 
flake: Could not find Flink job (FlinkJobNotFoundException) (created 2020-09-23)
https://issues.apache.org/jira/browse/BEAM-10866: 
PortableRunnerTestWithSubprocesses.test_register_finalizations flaky on macOS 
(created 2020-09-09)
https://issues.apache.org/jira/browse/BEAM-10485: Failure / flake: 
ElasticsearchIOTest > testWriteWithIndexFn (created 2020-07-14)
https://issues.apache.org/jira/browse/BEAM-9649: 
beam_python_mongoio_load_test started failing due to mismatched results 
(created 2020-03-31)
https://issues.apache.org/jira/browse/BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (created 2019-08-27)
https://issues.apache.org/jira/browse/BEAM-8035: 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in timestamp 
order (sickbayed) (created 2019-08-22)
https://issues.apache.org/jira/browse/BEAM-7827: 
MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on 
DirectRunner (created 2019-07-26)
https://issues.apache.org/jira/browse/BEAM-7752: Java Validates 
DirectRunner: testTeardownCalledAfterExceptionInFinishBundleStateful flaky 
(created 2019-07-16)
https://issues.apache.org/jira/browse/BEAM-6804: [beam_PostCommit_Java] 
[PubsubReadIT.testReadPublicData] Timeout waiting on Sub (created 2019-03-11)
https://issues.apache.org/jira/browse/BEAM-5286: 
[beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
 .sh script: text file busy. (created 2018-09-01)
https://issues.apache.org/jira/browse/BEAM-5172: 
org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky (created 
2018-08-20)


MongoDB Change Streams

2021-10-18 Thread David Morales de Frias
Hello everyone,

I would like to read data from change streams to build real-time pipelines and 
I have not found anything related to it.

Is this connector being considered? 

Thanks


Re: MongoDB Change Streams

2021-10-18 Thread David Morales de Frias
Thanks Pablo, our goal is to take advantage of Google Dataflow to sync MongoDB 
and BigQuery.

Something like this: https://github.com/snarvaez/Atlas-BQ__ODS-EDW

But with a direct approach, MongoDB > BigQuery, rather than involving Pub/Sub.

I have found this attempt 
https://github.com/alexvanboxel/beam-demo/blob/master/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbChangeStreamFn.java

And I was wondering if there is any official initiative to build such a 
connector.


On 2021/10/18 21:27:04, Pablo Estrada  wrote: 
> Hi David,
> 
> Debezium supports reading change streams from MongoDB[1] - you may be able
> to use DebeziumIO to consume these [2]. Have you considered that?
> Best
> -P/
> 
> [1] https://debezium.io/documentation/reference/connectors/mongodb.html
> [2]
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/io/debezium/DebeziumIO.html
> 
> On Mon, Oct 18, 2021 at 2:03 PM David Morales de Frias <
> david.mora...@syncrtc.com> wrote:
> 
> > Hello everyone,
> >
> > I would like to read data from change streams to build real-time pipelines
> > and I have not found anything related to it.
> >
> > Is this connector being considered?
> >
> > Thanks
> >
>
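
[Editor's note] Until an official connector exists, Pablo's DebeziumIO suggestion might look roughly like the sketch below. This is adapted from memory of the DebeziumIO javadoc example (which uses the MySQL connector); the MongoDB-specific connection properties and credentials here are assumptions to verify against the Debezium MongoDB connector documentation, and the Beam Debezium IO and Debezium MongoDB connector artifacts must be on the classpath.

```java
import io.debezium.connector.mongodb.MongoDbConnector;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.debezium.DebeziumIO;
import org.apache.beam.sdk.io.debezium.SourceRecordJson;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class MongoChangeStreamSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read MongoDB change events as JSON strings via the Debezium connector.
    // Hostname, port, credentials, and the "mongodb.name" property are
    // illustrative placeholders, not a tested configuration.
    PCollection<String> changes = p.apply(
        DebeziumIO.<String>read()
            .withConnectorConfiguration(
                DebeziumIO.ConnectorConfiguration.create()
                    .withConnectorClass(MongoDbConnector.class)
                    .withUsername("user")
                    .withPassword("password")
                    .withHostName("localhost")
                    .withPort("27017")
                    .withConnectionProperty("mongodb.name", "dbserver1"))
            .withFormatFunction(new SourceRecordJson.SourceRecordJsonMapper())
            .withCoder(StringUtf8Coder.of()));

    // From here, the JSON events could be parsed and written to BigQuery
    // with BigQueryIO, giving the direct mongo > bigquery path on Dataflow.

    p.run().waitUntilFinish();
  }
}
```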