Flaky test issue report

2021-04-16 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests. These are P1 issues 
because they have a major negative impact on the community and make it hard to 
determine the quality of the software.

BEAM-12178: ReadCacheTest flakes on Windows 
(https://issues.apache.org/jira/browse/BEAM-12178)
BEAM-12163: Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK 
harness startup (https://issues.apache.org/jira/browse/BEAM-12163)
BEAM-12096: Flake: test_progress_in_HTML_JS_when_in_notebook 
(https://issues.apache.org/jira/browse/BEAM-12096)
BEAM-12061: beam_PostCommit_SQL failing on 
KafkaTableProviderIT.testFakeNested 
(https://issues.apache.org/jira/browse/BEAM-12061)
BEAM-12020: :sdks:java:container:java8:docker failing missing licenses 
(https://issues.apache.org/jira/browse/BEAM-12020)
BEAM-12019: 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky (https://issues.apache.org/jira/browse/BEAM-12019)
BEAM-11792: Python precommit failed (flaked?) installing package  
(https://issues.apache.org/jira/browse/BEAM-11792)
BEAM-11733: [beam_PostCommit_Java] [testFhirIO_Import|export] flaky 
(https://issues.apache.org/jira/browse/BEAM-11733)
BEAM-11666: 
apache_beam.runners.interactive.recording_manager_test.RecordingManagerTest.test_basic_execution
 is flaky (https://issues.apache.org/jira/browse/BEAM-11666)
BEAM-11662: elasticsearch tests failing 
(https://issues.apache.org/jira/browse/BEAM-11662)
BEAM-11661: hdfsIntegrationTest flake: network not found (py38 postcommit) 
(https://issues.apache.org/jira/browse/BEAM-11661)
BEAM-11646: beam_PostCommit_XVR_Spark failing 
(https://issues.apache.org/jira/browse/BEAM-11646)
BEAM-11645: beam_PostCommit_XVR_Flink failing 
(https://issues.apache.org/jira/browse/BEAM-11645)
BEAM-11541: testTeardownCalledAfterExceptionInProcessElement flakes on 
direct runner. (https://issues.apache.org/jira/browse/BEAM-11541)
BEAM-11540: Linter sometimes flakes on apache_beam.dataframe.frames_test 
(https://issues.apache.org/jira/browse/BEAM-11540)
BEAM-11493: Spark test failure: 
org.apache.beam.sdk.transforms.GroupByKeyTest$WindowTests.testGroupByKeyAndWindows
 (https://issues.apache.org/jira/browse/BEAM-11493)
BEAM-11492: Spark test failure: 
org.apache.beam.sdk.transforms.GroupByKeyTest$WindowTests.testGroupByKeyMergingWindows
 (https://issues.apache.org/jira/browse/BEAM-11492)
BEAM-11491: Spark test failure: 
org.apache.beam.sdk.transforms.GroupByKeyTest$WindowTests.testGroupByKeyMultipleWindows
 (https://issues.apache.org/jira/browse/BEAM-11491)
BEAM-11490: Spark test failure: 
org.apache.beam.sdk.transforms.ReifyTimestampsTest.inValuesSucceeds 
(https://issues.apache.org/jira/browse/BEAM-11490)
BEAM-11489: Spark test failure: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (https://issues.apache.org/jira/browse/BEAM-11489)
BEAM-11488: Spark test failure: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedCounterMetrics
 (https://issues.apache.org/jira/browse/BEAM-11488)
BEAM-11487: Spark test failure: 
org.apache.beam.sdk.transforms.WithTimestampsTest.withTimestampsShouldApplyTimestamps
 (https://issues.apache.org/jira/browse/BEAM-11487)
BEAM-11486: Spark test failure: 
org.apache.beam.sdk.testing.PAssertTest.testSerializablePredicate 
(https://issues.apache.org/jira/browse/BEAM-11486)
BEAM-11485: Spark test failure: 
org.apache.beam.sdk.transforms.CombineFnsTest.testComposedCombineNullValues 
(https://issues.apache.org/jira/browse/BEAM-11485)
BEAM-11484: Spark test failure: 
org.apache.beam.runners.core.metrics.MetricsPusherTest.pushesUserMetrics 
(https://issues.apache.org/jira/browse/BEAM-11484)
BEAM-11483: Spark portable streaming PostCommit Test Improvements 
(https://issues.apache.org/jira/browse/BEAM-11483)
BEAM-10995: Java + Universal Local Runner: 
WindowingTest.testWindowPreservation fails 
(https://issues.apache.org/jira/browse/BEAM-10995)
BEAM-10987: stager_test.py::StagerTest::test_with_main_session flaky on 
windows py3.6,3.7 (https://issues.apache.org/jira/browse/BEAM-10987)
BEAM-10968: flaky test: 
org.apache.beam.sdk.metrics.MetricsTest$AttemptedMetricTests.testAttemptedDistributionMetrics
 (https://issues.apache.org/jira/browse/BEAM-10968)
BEAM-10955: Flink Java Runner test flake: Could not find Flink job  
(https://issues.apache.org/jira/browse/BEAM-10955)
BEAM-10923: Python requirements installation in docker container is flaky 
(https://issues.apache.org/jira/browse/BEAM-10923)
BEAM-10901: Flaky test: 
PipelineInstrumentTest.test_able_to_cache_intermediate_unbounded_source_pcollection
 (https://issues.apache.org/jira/browse/BEAM-10901)
BEAM-10899: test_FhirIO_exportFhirResourcesGcs flake with OOM 
(https://issues.apache.org/jira/browse/BEAM-10899)
BEAM-10866: 

P1 issues report

2021-04-16 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests.

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

BEAM-12177: [beam_PerformanceTests_*] Missing output IP address 
(https://issues.apache.org/jira/browse/BEAM-12177)
BEAM-12174: Samza Portable Runner Support 
(https://issues.apache.org/jira/browse/BEAM-12174)
BEAM-12170: [beam_PostCommit_Java_PVR_Spark_Batch] Multiple entries with 
same key (https://issues.apache.org/jira/browse/BEAM-12170)
BEAM-11959: Python Beam SDK Harness hangs when installing pip packages 
(https://issues.apache.org/jira/browse/BEAM-11959)
BEAM-11906: No trigger early repeatedly for session windows 
(https://issues.apache.org/jira/browse/BEAM-11906)
BEAM-11875: XmlIO.Read does not handle XML encoding per spec 
(https://issues.apache.org/jira/browse/BEAM-11875)
BEAM-11828: JmsIO is not acknowledging messages correctly 
(https://issues.apache.org/jira/browse/BEAM-11828)
BEAM-11772: GCP BigQuery sink (file loads) uses runner determined sharding 
for unbounded data (https://issues.apache.org/jira/browse/BEAM-11772)
BEAM-11755: Cross-language consistency (RequiresStableInputs) is quietly 
broken (at least on portable flink runner) 
(https://issues.apache.org/jira/browse/BEAM-11755)
BEAM-11578: `dataflow_metrics` (python) fails with TypeError (when int 
overflowing?) (https://issues.apache.org/jira/browse/BEAM-11578)
BEAM-11576: Go ValidatesRunner failure: TestFlattenDup on Dataflow Runner 
(https://issues.apache.org/jira/browse/BEAM-11576)
BEAM-11434: Expose Spanner admin/batch clients in Spanner Accessor 
(https://issues.apache.org/jira/browse/BEAM-11434)
BEAM-11227: Upgrade beam-vendor-grpc-1_26_0-0.3 to fix CVE-2020-27216 
(https://issues.apache.org/jira/browse/BEAM-11227)
BEAM-11148: Kafka commitOffsetsInFinalize OOM on Flink 
(https://issues.apache.org/jira/browse/BEAM-11148)
BEAM-11017: Timer with dataflow runner can be set multiple times (dataflow 
runner) (https://issues.apache.org/jira/browse/BEAM-11017)
BEAM-10861: Adds URNs and payloads to PubSub transforms 
(https://issues.apache.org/jira/browse/BEAM-10861)
BEAM-10617: python CombineGlobally().with_fanout() cause duplicate combine 
results for sliding windows (https://issues.apache.org/jira/browse/BEAM-10617)
BEAM-10573: CSV files are loaded several times if they are too large 
(https://issues.apache.org/jira/browse/BEAM-10573)
BEAM-10569: SpannerIO tests don't actually assert anything. 
(https://issues.apache.org/jira/browse/BEAM-10569)
BEAM-10288: Quickstart documents are out of date 
(https://issues.apache.org/jira/browse/BEAM-10288)
BEAM-10244: Populate requirements cache fails on poetry-based packages 
(https://issues.apache.org/jira/browse/BEAM-10244)
BEAM-10100: FileIO writeDynamic with AvroIO.sink not writing all data 
(https://issues.apache.org/jira/browse/BEAM-10100)
BEAM-9564: Remove insecure ssl options from MongoDBIO 
(https://issues.apache.org/jira/browse/BEAM-9564)
BEAM-9455: Environment-sensitive provisioning for Dataflow 
(https://issues.apache.org/jira/browse/BEAM-9455)
BEAM-9293: Python direct runner doesn't emit empty pane when it should 
(https://issues.apache.org/jira/browse/BEAM-9293)
BEAM-8986: SortValues may not work correct for numerical types 
(https://issues.apache.org/jira/browse/BEAM-8986)
BEAM-8985: SortValues should fail if SecondaryKey coder is not 
deterministic (https://issues.apache.org/jira/browse/BEAM-8985)
BEAM-8407: [SQL] Some Hive tests throw NullPointerException, but get marked 
as passing (Direct Runner) (https://issues.apache.org/jira/browse/BEAM-8407)
BEAM-7717: PubsubIO watermark tracking hovers near start of epoch 
(https://issues.apache.org/jira/browse/BEAM-7717)
BEAM-7716: PubsubIO returns empty message bodies for all messages read 
(https://issues.apache.org/jira/browse/BEAM-7716)
BEAM-7195: BigQuery - 404 errors for 'table not found' when using dynamic 
destinations - sometimes, new table fails to get created 
(https://issues.apache.org/jira/browse/BEAM-7195)
BEAM-6839: User reports protobuf ClassChangeError running against 2.6.0 or 
above (https://issues.apache.org/jira/browse/BEAM-6839)
BEAM-6466: KafkaIO doesn't commit offsets while being used as bounded 
source (https://issues.apache.org/jira/browse/BEAM-6466)


Re: [VOTE] Release 2.29.0, release candidate #1

2021-04-16 Thread Tyson Hamilton
+1 (non-binding)

I ran the java local quickstarts and verified the nexmark tests.

On Fri, Apr 16, 2021 at 1:50 PM Valentyn Tymofieiev 
wrote:

> Hi Kenn,
>
> There is a version mismatch between the Beam version and the Dataflow Python
> worker version (the latter currently has version 2.28.0.dev); you can fix
> the Dataflow containers themselves without rebuilding the RC.
>
> Also, Dataflow containers for Python 3.6 do not include 'dataclasses', a
> Beam dependency, which will cause Dataflow jobs to fail in an environment
> without internet access.
>
> Thanks,
> Valentyn
>
> On Fri, Apr 16, 2021 at 1:06 PM Pablo Estrada  wrote:
>
>> +1 (binding)
>> I built and ran basic tests with existing Dataflow Templates.
>> Best
>> -P.
>>
>> On Fri, Apr 16, 2021 at 3:42 AM Elliotte Rusty Harold 
>> wrote:
>>
>>> On Fri, Apr 16, 2021 at 4:02 AM Kenneth Knowles  wrote:
>>>
>>> > The complete staging area is available for your review, which includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint
>>> 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.29.0-RC1" [5],
>>> > * website pull request listing the release [6], publishing the API
>>> reference manual [7], and the blog post [8].
>>> > * Java artifacts were built with Maven MAVEN_VERSION and
>>> OpenJDK/Oracle JDK JDK_VERSION.
>>>
>>> Are the MAVEN_VERSION and OpenJDK/Oracle JDK JDK_VERSION supposed to
>>> be filled in with numbers?
>>>
>>>
>>> --
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org
>>>
>>


Re: [VOTE] Release 2.29.0, release candidate #1

2021-04-16 Thread Valentyn Tymofieiev
Hi Kenn,

There is a version mismatch between the Beam version and the Dataflow Python
worker version (the latter currently has version 2.28.0.dev); you can fix the
Dataflow containers themselves without rebuilding the RC.

Also, Dataflow containers for Python 3.6 do not include 'dataclasses', a
Beam dependency, which will cause Dataflow jobs to fail in an environment
without internet access.

Thanks,
Valentyn

On Fri, Apr 16, 2021 at 1:06 PM Pablo Estrada  wrote:

> +1 (binding)
> I built and ran basic tests with existing Dataflow Templates.
> Best
> -P.
>
> On Fri, Apr 16, 2021 at 3:42 AM Elliotte Rusty Harold 
> wrote:
>
>> On Fri, Apr 16, 2021 at 4:02 AM Kenneth Knowles  wrote:
>>
>> > The complete staging area is available for your review, which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint
>> 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C [3],
>> > * all artifacts to be deployed to the Maven Central Repository [4],
>> > * source code tag "v2.29.0-RC1" [5],
>> > * website pull request listing the release [6], publishing the API
>> reference manual [7], and the blog post [8].
>> > * Java artifacts were built with Maven MAVEN_VERSION and OpenJDK/Oracle
>> JDK JDK_VERSION.
>>
>> Are the MAVEN_VERSION and OpenJDK/Oracle JDK JDK_VERSION supposed to
>> be filled in with numbers?
>>
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>


Re: [VOTE] Release 2.29.0, release candidate #1

2021-04-16 Thread Pablo Estrada
+1 (binding)
I built and ran basic tests with existing Dataflow Templates.
Best
-P.

On Fri, Apr 16, 2021 at 3:42 AM Elliotte Rusty Harold 
wrote:

> On Fri, Apr 16, 2021 at 4:02 AM Kenneth Knowles  wrote:
>
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint
> 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.29.0-RC1" [5],
> > * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> > * Java artifacts were built with Maven MAVEN_VERSION and OpenJDK/Oracle
> JDK JDK_VERSION.
>
> Are the MAVEN_VERSION and OpenJDK/Oracle JDK JDK_VERSION supposed to
> be filled in with numbers?
>
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>


Re: [PROPOSAL] Upgrade Cassandra driver from 3.x to 4.x in CassandraIO

2021-04-16 Thread Alexey Romanenko
Thank you for the design doc and for starting a discussion on the mailing list!

I’m next after Kenn in asking about the potential breaking changes with this 
upgrade. Could you elaborate a bit on this, and can we support both versions at 
the same time?

Alexey

> On 15 Apr 2021, at 12:32, S Bhandiwad, Satwik (Nokia - IN/Bangalore) 
>  wrote:
> 
> Hi All,
>  
> We would like to upgrade Cassandra driver version from 3.x to 4.x in 
> CassandraIO Connector.
> Design Document - link 
> 
> Pull Request - https://github.com/apache/beam/pull/14457/ 
> 
>  
> Please go through the design doc & PR and let us know your thoughts.
>  
> Regards,
> Satwik



Re: [Question] Amazon Neptune I/O connector

2021-04-16 Thread Alexey Romanenko
I’d also recommend taking a look at the batch writes in the SnowflakeIO 
connector [1] - it supports batch loads only from GCS for now, but it could be 
a good reference too.

[1] 
https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L1051

> On 16 Apr 2021, at 16:55, Gabriel Levcovitz  wrote:
> 
> On Fri, Apr 16, 2021 at 6:36 AM Ismaël Mejía  wrote:
> I had not seen that the query API of Neptune is Gremlin-based, so this
> could be an even more generic IO connector.
> That's probably beyond scope because you care most about the write side, but
> interesting anyway.
> 
> https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html
>  
> 
> 
> Well, in theory the Gremlin API could even be used for writing too, but I 
> know for a fact that it's not very performant, and Amazon recommends using 
> the Bulk Loader when creating a lot of vertices/edges at once. But if they 
> optimize this in the future, it could be even more interesting.
> 
> Gabriel
>  
> 
> On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía  wrote:
> >
> > Hello Gabriel,
> >
> > Another interesting reference, because of its similar batch-loads API use +
> > Amazon, is the unfinished Amazon Redshift connector PR from this ticket:
> > https://issues.apache.org/jira/browse/BEAM-3032
> >
> > The reason that one was not merged into Beam is that it lacked tests.
> > You should probably look at how to test Neptune in advance; it seems
> > that localstack does not support Neptune (only in the paying version),
> > so mocking would probably be the right way.
> >
> > We would be really interested if you want to contribute the
> > NeptuneIO connector to Beam, so don't hesitate to contact us.
> >
> >
> > > On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz  wrote:
> > >
> > > Hi Daniel, Kenneth,
> > >
> > > Thank you very much for your answers! I'll be looking carefully into the 
> > > info you've provided and if we eventually decide it's worth implementing, 
> > > I'll get back to you.
> > >
> > > Best,
> > > Gabriel
> > >
> > >
> > > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles  wrote:
> > >>
> > >>
> > >>
> > >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins  wrote:
> > >>>
> > >>> Hi Gabriel,
> > >>>
> > >>> Write-side adapters for systems tend to be easier than read-side 
> > >>> adapters to implement. That being said, looking at the documentation 
> > >>> for neptune, it looks to me like there's no direct data load API, only 
> > >>> a batch data load from a file on S3? This is usable but perhaps a bit 
> > >>> more difficult to work with.
> > >>>
> > >>> You could implement a write side adapter for neptune (either on your 
> > >>> own or as a contribution to beam) by writing a standard DoFn which, in 
> > >>> its ProcessElement method, buffers received records in memory, and in 
> > >>> its FinishBundle method, writes all collected records to a file on S3, 
> > >>> notifies neptune, and waits for neptune to ingest them. You can see 
> > >>> documentation on the DoFn API here. Someone else here might have more 
> > >>> experience working with microbatch-style APIs like this, and could have 
> > >>> more suggestions.
> > >>
> > >>
> > >> In fact, our BigQueryIO connector has a mode of operation that does 
> > >> batch loads from files on GCS: 
> > >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> > >>  
> > >> 
> > >>
> > >> The connector overall is large and complex, because it is old and 
> > >> mature. But it may be helpful as a point of reference.
> > >>
> > >> Kenn
> > >>
> > >>>
> > >>> A read-side API would likely be only a minimally higher lift. This 
> > >>> could be done in a simple loading step (Create with a single element 
> > >>> followed by MapElements), although much of the complexity likely lies 
> > >>> around how to provide the necessary properties to the cluster 
> > >>> construction on the beam worker task, and how to define the query the 
> > >>> user would need to execute. I'd also wonder if this could be done in an 
> > >>> engine-agnostic way, "TinkerPopIO" instead of "NeptuneIO".
> > >>>
> > >>> If you'd like to pursue adding such an integration, 
> > >>> https://beam.apache.org/contribute/ provides documentation on the 
> > >>> contribution process. Contributions to beam are always appreciated!
> > >>>
> > >>> -Daniel
> > >>>
> > 
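
A minimal sketch of the write pattern Daniel describes above, for reference:
buffer in ProcessElement, then in FinishBundle stage the batch to S3 and drive
Neptune's bulk loader. S3Staging and NeptuneBulkClient are hypothetical
placeholder helpers, not real AWS or Beam classes:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    public class NeptuneBatchWriteFn extends DoFn<String, Void> {

      // Per-bundle buffer; reinitialized in @StartBundle because DoFn
      // instances may be reused across bundles.
      private transient List<String> buffer;

      @StartBundle
      public void startBundle() {
        buffer = new ArrayList<>();
      }

      @ProcessElement
      public void processElement(@Element String record) {
        buffer.add(record); // collect records in memory for this bundle
      }

      @FinishBundle
      public void finishBundle() throws Exception {
        if (buffer.isEmpty()) {
          return;
        }
        // Hypothetical helpers: stage the bundle to S3, then start and
        // await the Neptune bulk-load job, as described in the thread.
        String s3Uri = S3Staging.writeToUniqueFile(buffer);   // assumption
        String loadId = NeptuneBulkClient.startLoad(s3Uri);   // assumption
        NeptuneBulkClient.awaitLoadCompletion(loadId);        // assumption
        buffer.clear();
      }
    }

Bundle sizes are runner-determined, so a real connector would likely also cap
the buffer and flush to S3 early to bound memory use.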

Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Brian Hulette
Yeah I can see the advantage in tooling like this for easy upgrades. I
suspect many of the outdated Python dependencies fall under this category,
but the toil of creating a PR and verifying it passes tests is enough of a
barrier that we just haven't done it. Having a bot create the PR and
trigger CI to verify it would be helpful IMO.

Some questions/concerns I have:
- I think many python upgrades will still require manual work:
  - We also have pinned versions for some Python dependencies in
base_image_requirements.txt [1]
  - We test with multiple major versions of pyarrow. We'd want to add a new
test environment [2] when bumping to the next major version
- Will dependabot work ok with the version ranges that we specify? For
example some Python dependencies have upper bounds for the next major
version, some for the next minor version. Is dependabot smart enough to try
bumping the appropriate version number?

Brian

[1]
https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt

[2]
https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/python/tox.ini#L231

On Fri, Apr 16, 2021 at 7:16 AM Ismaël Mejía  wrote:

> Oh, I forgot to mention one alternative that we use in the Avro project:
> we don't create issues for the dependabot PRs, and instead we
> search all the commits authored by dependabot and include them in the
> release notes to track dependency upgrades.
>
> On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
> >
> > > Quite often, upgrading dependencies to the latest versions leads to either
> compilation errors or failed tests, and it has to be resolved manually or
> declined. Given this, maybe I’m missing something, but I don’t see what kind of
> advantages automatic upgrades will bring us, except that we don’t need to
> create a PR manually (which is not a big deal).
> >
> > The advantage is exactly that: we don't have to create and track
> > dependency updates manually, it will be done by the bot, and we will
> > only have to do the review and guarantee that no issues are
> > introduced. I forgot to mention that we can create exception rules so
> > no further upgrades will be proposed for some dependencies, e.g.
> > Hadoop, Netty (Java 11 flavor), etc. I also forgot to mention another big
> > advantage: the detailed security report that will help us
> > prioritize dependency upgrades.
> >
> > > Regarding the other issue - it’s already a problem, imho. Since we have
> one Jira per package upgrade now, and usually it “accumulates” all package
> upgrades and it’s not closed once an upgrade is done, we don’t have a reliable
> way to announce in the release notes all dependency upgrades for the current
> release. One way is to mention the package upgrade in CHANGES.md,
> which seems not very reliable because it's quite easy to forget to do. I’d
> prefer to have a dedicated Jira issue for every upgrade; then it will be
> included in the release notes almost automatically.
> >
> > Yes, it seems best for release note tracking to create the issue
> > and rename the PR title accordingly, but that would be part of the
> > review/merge process, so it's up to the Beam committers to do it
> > systematically, and given how well we respect the commit naming /
> > squashing rules I am not sure we will gain much by having another
> > useless rule.
> >
> > On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
> >  wrote:
> > >
> > > Quite often, upgrading dependencies to the latest versions leads to either
> compilation errors or failed tests, and it has to be resolved manually or
> declined. Given this, maybe I’m missing something, but I don’t see what kind of
> advantages automatic upgrades will bring us, except that we don’t need to
> create a PR manually (which is not a big deal).
> > >
> > > Regarding the other issue - it’s already a problem, imho. Since we have
> one Jira per package upgrade now, and usually it “accumulates” all package
> upgrades and it’s not closed once an upgrade is done, we don’t have a reliable
> way to announce in the release notes all dependency upgrades for the current
> release. One way is to mention the package upgrade in CHANGES.md,
> which seems not very reliable because it's quite easy to forget to do. I’d
> prefer to have a dedicated Jira issue for every upgrade; then it will be
> included in the release notes almost automatically.
> > >
> > > > On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> > > >
> > > > Hello,
> > > >
> > > > GitHub has a bot, called Dependabot, that automatically creates
> > > > dependency update PRs and reports security issues.
> > > >
> > > > I was wondering if we should enable it for Beam. I tested it in my
> > > > personal Beam fork and it seems to be working well: it created
> > > > dependency updates for both Python and JS (website) dependencies.
> > > > The bot seems to have problems understanding our Gradle
> > > > dependency definitions for Java, but that's something we can address in
> > > > the future to benefit from the updates.

Re: [Question] Amazon Neptune I/O connector

2021-04-16 Thread Gabriel Levcovitz
On Fri, Apr 16, 2021 at 6:36 AM Ismaël Mejía  wrote:

> I had not seen that the query API of Neptune is Gremlin-based, so this
> could be an even more generic IO connector.
> That's probably beyond scope because you care most about the write side, but
> interesting anyway.
>
>
> https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html


Well, in theory the Gremlin API could even be used for writing too, but I
know for a fact that it's not very performant, and Amazon recommends using
the Bulk Loader when creating a lot of vertices/edges at once. But if they
optimize this in the future, it could be even more interesting.

Gabriel


On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía  wrote:
> >
> > Hello Gabriel,
> >
> > Another interesting reference, because of its similar batch-loads API use +
> > Amazon, is the unfinished Amazon Redshift connector PR from this ticket:
> > https://issues.apache.org/jira/browse/BEAM-3032
> >
> > The reason that one was not merged into Beam is that it lacked tests.
> > You should probably look at how to test Neptune in advance; it seems
> > that localstack does not support Neptune (only in the paying version),
> > so mocking would probably be the right way.
> >
> > We would be really interested if you want to contribute the
> > NeptuneIO connector to Beam, so don't hesitate to contact us.
> >
> >
> > On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz 
> wrote:
> > >
> > > Hi Daniel, Kenneth,
> > >
> > > Thank you very much for your answers! I'll be looking carefully into
> the info you've provided and if we eventually decide it's worth
> implementing, I'll get back to you.
> > >
> > > Best,
> > > Gabriel
> > >
> > >
> > > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles 
> wrote:
> > >>
> > >>
> > >>
> > >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins 
> wrote:
> > >>>
> > >>> Hi Gabriel,
> > >>>
> > >>> Write-side adapters for systems tend to be easier than read-side
> adapters to implement. That being said, looking at the documentation for
> neptune, it looks to me like there's no direct data load API, only a batch
> data load from a file on S3? This is usable but perhaps a bit more
> difficult to work with.
> > >>>
> > >>> You could implement a write side adapter for neptune (either on your
> own or as a contribution to beam) by writing a standard DoFn which, in its
> ProcessElement method, buffers received records in memory, and in its
> FinishBundle method, writes all collected records to a file on S3, notifies
> neptune, and waits for neptune to ingest them. You can see documentation on
> the DoFn API here. Someone else here might have more experience working
> with microbatch-style APIs like this, and could have more suggestions.
> > >>
> > >>
> > >> In fact, our BigQueryIO connector has a mode of operation that does
> batch loads from files on GCS:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> > >>
> > >> The connector overall is large and complex, because it is old and
> mature. But it may be helpful as a point of reference.
> > >>
> > >> Kenn
> > >>
> > >>>
> > >>> A read-side API would likely be only a minimally higher lift. This
> could be done in a simple loading step (Create with a single element
> followed by MapElements), although much of the complexity likely lies
> around how to provide the necessary properties to the cluster construction
> on the beam worker task, and how to define the query the user would need to
> execute. I'd also wonder if this could be done in an engine-agnostic way,
> "TinkerPopIO" instead of "NeptuneIO".
> > >>>
> > >>> If you'd like to pursue adding such an integration,
> https://beam.apache.org/contribute/ provides documentation on the
> contribution process. Contributions to beam are always appreciated!
> > >>>
> > >>> -Daniel
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <
> g.levcov...@gmail.com> wrote:
> > 
> >  Dear Beam Dev community,
> > 
> >  I'm working on a project where we have a graph database on Amazon
> Neptune (https://aws.amazon.com/neptune) and we have data coming from
> Google Cloud.
> > 
> >  So I was wondering if anyone has ever worked with a similar
> architecture and has considered developing an Amazon Neptune custom Beam
> I/O connector. Is it feasible? Is it worth it?
> > 
> >  Honestly I'm not that experienced with Apache Beam / Dataflow, so
> I'm not sure if something like that would make sense. Currently we're
> connecting Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
> > 
> >  Thank you all very much in advance!
> > 
> >  Best,
> >  Gabriel Levcovitz
>
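
A minimal sketch of the read-side shape Daniel outlines in this thread: a
single-element Create seed followed by a flat-map that runs the Gremlin query
on a worker. GremlinSession and its execute call are hypothetical stand-ins,
not a real driver API; the cluster/connection setup is the hard part Daniel
flags:

    import java.util.List;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class NeptuneReadSketch {
      public static PCollection<String> readResults(Pipeline pipeline, String gremlinQuery) {
        return pipeline
            // Seed the pipeline with the query itself as a single element.
            .apply("Seed", Create.of(gremlinQuery))
            // Run the traversal on the worker, one output element per result.
            .apply("RunGremlinQuery",
                FlatMapElements.into(TypeDescriptors.strings())
                    .via((String query) -> {
                      List<String> results = GremlinSession.execute(query); // assumption
                      return results;
                    }));
      }
    }

Since all results come from one seed element, this reads on a single worker;
splitting the query space into several seed elements is the usual way to get
parallelism out of this pattern.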


Re: Contribution apache beam

2021-04-16 Thread Alexey Romanenko
Hi,

I added you as a contributor. Welcome to Beam!

Alexey

> On 16 Apr 2021, at 16:19, Iurii Kazanov  wrote:
> 
> Hi!
> My name is Iurii Kazanov and my username is iurii.kazanov. I want to contribute 
> to Apache Beam. Can someone add me as a contributor for Beam's Jira issue 
> tracker? I would like to create/assign tickets for my work.
> Thanks!



Contribution apache beam

2021-04-16 Thread Iurii Kazanov
Hi!

My name is Iurii Kazanov and my username is iurii.kazanov. I want to contribute 
to Apache Beam. Can someone add me as a contributor for Beam's Jira issue tracker? 
I would like to create/assign tickets for my work.

Thanks!


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Ismaël Mejía
Oh, I forgot to mention one alternative that we use in the Avro project:
we don't create issues for the dependabot PRs, and instead we
search all the commits authored by dependabot and include them in the
release notes to track dependency upgrades.

On Fri, Apr 16, 2021 at 4:02 PM Ismaël Mejía  wrote:
>
> > Quite often, upgrading dependencies to the latest versions leads to either 
> > compilation errors or failed tests, and it has to be resolved manually or 
> > declined. Given this, maybe I’m missing something, but I don’t see what kind of 
> > advantages automatic upgrades will bring us, except that we don’t need to 
> > create a PR manually (which is not a big deal).
>
> The advantage is exactly that: we don't have to create and track
> dependency updates manually, it will be done by the bot, and we will
> only have to do the review and guarantee that no issues are
> introduced. I forgot to mention that we can create exception rules so
> no further upgrades will be proposed for some dependencies, e.g.
> Hadoop, Netty (Java 11 flavor), etc. I also forgot to mention another big
> advantage: the detailed security report that will help us
> prioritize dependency upgrades.
>
> > Regarding the other issue - it’s already a problem, imho. Since we have one 
> > Jira per package upgrade now, and usually it “accumulates” all package 
> > upgrades and it’s not closed once an upgrade is done, we don’t have a reliable 
> > way to announce in the release notes all dependency upgrades for the current 
> > release. One way is to mention the package upgrade in CHANGES.md, 
> > which seems not very reliable because it's quite easy to forget to do. I’d 
> > prefer to have a dedicated Jira issue for every upgrade; then it will be 
> > included in the release notes almost automatically.
>
> Yes, it seems best for release note tracking to create the issue
> and rename the PR title accordingly, but that would be part of the
> review/merge process, so it's up to the Beam committers to do it
> systematically, and given how well we respect the commit naming /
> squashing rules I am not sure we will gain much by having another
> useless rule.
>
> On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
>  wrote:
> >
> > Quite often, upgrading dependencies to the latest versions leads to either 
> > compilation errors or failed tests, and it has to be resolved manually or 
> > declined. Given this, maybe I’m missing something, but I don’t see what kind of 
> > advantages automatic upgrades will bring us, except that we don’t need to 
> > create a PR manually (which is not a big deal).
> >
> > Regarding the other issue - it’s already a problem, imho. Since we have one 
> > Jira per package upgrade now, and usually it “accumulates” all package 
> > upgrades and it’s not closed once an upgrade is done, we don’t have a reliable 
> > way to announce in the release notes all dependency upgrades for the current 
> > release. One way is to mention the package upgrade in CHANGES.md, 
> > which seems not very reliable because it's quite easy to forget to do. I’d 
> > prefer to have a dedicated Jira issue for every upgrade; then it will be 
> > included in the release notes almost automatically.
> >
> > > On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> > >
> > > Hello,
> > >
> > > GitHub has a bot, called Dependabot, that automatically creates
> > > dependency update PRs and reports security issues.
> > >
> > > I was wondering if we should enable it for Beam. I tested it in my
> > > personal Beam fork and it seems to be working well: it created
> > > dependency updates for both Python and JS (website) dependencies.
> > > The bot seems to have problems understanding our Gradle
> > > dependency definitions for Java, but that's something we can address in
> > > the future to benefit from the updates. It also did not propose Go
> > > updates (probably for the same reason).
> > >
> > > If the community agrees I will create a ticket for INFRA to enable it.
> > > We might get extra PRs (at the beginning) and we have to be
> > > cautious about updates that might have unintended consequences. For
> > > example, we should not merge non-stable dependency updates (those
> > > ending in -rc1 or -beta in Java) that
> > > might be proposed, or dependencies that committers know we should
> > > not update, for example projects whose main stable version is not
> > > the most recent one (like Hadoop), or dependencies that do not support
> > > our current language target version (e.g. Java 11-only deps).
> > >
> > > Another issue is that these dependency updates might not get a JIRA
> > > associated with them, so we need to decide whether (1) we create one and
> > > rename/associate the PR with it, or (2) we simply don't have
> > > JIRAs for dependency updates.
> > >
> > > WDYT? Any other pros/cons that I may be missing?
> > >
> > > Ismaël
> >


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Ismaël Mejía
> Quite often, upgrading dependencies to the latest versions leads to either 
> compilation errors or failed tests, and it has to be resolved manually or 
> declined. Given this, maybe I’m missing something, but I don’t see what kind of 
> advantages automatic upgrades will bring us, except that we don’t need to 
> create a PR manually (which is not a big deal).

The advantage is exactly that: we don't have to create and track
dependency updates manually, it will be done by the bot, and we will
only have to do the review and guarantee that no issues are
introduced. I forgot to mention that we can create exception rules so
no further upgrades will be proposed for some dependencies, e.g.
Hadoop, Netty (Java 11 flavor), etc. I also forgot to mention another big
advantage: the detailed security report that will help us
prioritize dependency upgrades.

> Regarding the other issue - it’s already a problem, imho. Since we have one 
> Jira per package upgrade now, and usually it “accumulates” all package 
> upgrades and it’s not closed once an upgrade is done, we don’t have a reliable 
> way to announce in the release notes all dependency upgrades for the current 
> release. One way is to mention the package upgrade in CHANGES.md, 
> which seems not very reliable because it's quite easy to forget to do. I’d 
> prefer to have a dedicated Jira issue for every upgrade; then it will be 
> included in the release notes almost automatically.

Yes, it seems best for release note tracking to create the issue
and rename the PR title accordingly, but that would be part of the
review/merge process, so it's up to the Beam committers to do it
systematically, and given how well we respect the commit naming /
squashing rules I am not sure we will gain much by having another
useless rule.

On Fri, Apr 16, 2021 at 3:24 PM Alexey Romanenko
 wrote:
>
> Quite often, upgrading dependencies to the latest versions leads to either 
> compilation errors or failed tests, and it has to be resolved manually or 
> declined. Given this, maybe I’m missing something, but I don’t see what kind of 
> advantages automatic upgrades will bring us, except that we don’t need to 
> create a PR manually (which is not a big deal).
>
> Regarding the other issue - it’s already a problem, imho. Since we have one 
> Jira per package upgrade now, and usually it “accumulates” all package 
> upgrades and it’s not closed once an upgrade is done, we don’t have a reliable 
> way to announce in the release notes all dependency upgrades for the current 
> release. One way is to mention the package upgrade in CHANGES.md, 
> which seems not very reliable because it's quite easy to forget to do. I’d 
> prefer to have a dedicated Jira issue for every upgrade; then it will be 
> included in the release notes almost automatically.
>
> > On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> >
> > Hello,
> >
> > GitHub has a bot, called Dependabot, that automatically creates
> > dependency update PRs and reports security issues.
> >
> > I was wondering if we should enable it for Beam. I tested it in my
> > personal Beam fork and it seems to be working well: it created
> > dependency updates for both Python and JS (website) dependencies.
> > The bot seems to have problems understanding our Gradle
> > dependency definitions for Java, but that's something we can address in
> > the future to benefit from the updates. It also did not propose Go
> > updates (probably for the same reason).
> >
> > If the community agrees I will create a ticket for INFRA to enable it.
> > We might get extra PRs (at the beginning) and we have to be
> > cautious about updates that might have unintended consequences. For
> > example, we should not merge non-stable dependency updates (those
> > ending in -rc1 or -beta in Java) that
> > might be proposed, or dependencies that committers know we should
> > not update, for example projects whose main stable version is not
> > the most recent one (like Hadoop), or dependencies that do not support
> > our current language target version (e.g. Java 11-only deps).
> >
> > Another issue is that these dependency updates might not get a JIRA
> > associated with them, so we need to decide whether (1) we create one and
> > rename/associate the PR with it, or (2) we simply don't have
> > JIRAs for dependency updates.
> >
> > WDYT? Any other pros/cons that I may be missing?
> >
> > Ismaël
>


Re: [DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Alexey Romanenko
Quite often, upgrading dependencies to the latest versions leads to either 
compilation errors or failed tests, and it has to be resolved manually or 
declined. Given this, maybe I’m missing something, but I don’t see what kind of 
advantages automatic upgrades will bring us, except that we don’t need to 
create a PR manually (which is not a big deal).

Regarding the other issue - it’s already a problem, imho. Since we have one 
Jira per package upgrade now, and usually it “accumulates” all package upgrades 
and it’s not closed once an upgrade is done, we don’t have a reliable way to 
announce in the release notes all dependency upgrades for the current release. 
One way is to mention the package upgrade in CHANGES.md, which seems not very 
reliable because it's quite easy to forget to do. I’d prefer to have a dedicated 
Jira issue for every upgrade; then it will be included in the release notes 
almost automatically.

> On 16 Apr 2021, at 14:15, Ismaël Mejía  wrote:
> 
> Hello,
> 
> GitHub has a bot, called Dependabot, that automatically creates
> dependency update PRs and reports security issues.
> 
> I was wondering if we should enable it for Beam. I tested it in my
> personal Beam fork and it seems to be working well: it created
> dependency updates for both Python and JS (website) dependencies.
> The bot seems to have problems understanding our Gradle
> dependency definitions for Java, but that's something we can address in
> the future to benefit from the updates. It also did not propose Go
> updates (probably for the same reason).
> 
> If the community agrees I will create a ticket for INFRA to enable it.
> We might get extra PRs (at the beginning) and we have to be
> cautious about updates that might have unintended consequences. For
> example, we should not merge non-stable dependency updates (those
> ending in -rc1 or -beta in Java) that
> might be proposed, or dependencies that committers know we should
> not update, for example projects whose main stable version is not
> the most recent one (like Hadoop), or dependencies that do not support
> our current language target version (e.g. Java 11-only deps).
> 
> Another issue is that these dependency updates might not get a JIRA
> associated with them, so we need to decide whether (1) we create one and
> rename/associate the PR with it, or (2) we simply don't have
> JIRAs for dependency updates.
> 
> WDYT? Any other pros/cons that I may be missing?
> 
> Ismaël



[DISCUSS] Enable automatic dependency updates with Github's dependabot

2021-04-16 Thread Ismaël Mejía
Hello,

GitHub has a bot, called Dependabot, that automatically creates
dependency update PRs and reports security issues.

I was wondering if we should enable it for Beam. I tested it in my
personal Beam fork and it seems to be working well: it created
dependency updates for both Python and JS (website) dependencies.
The bot seems to have problems understanding our Gradle
dependency definitions for Java, but that's something we can address in
the future to benefit from the updates. It also did not propose Go
updates (probably for the same reason).

If the community agrees I will create a ticket for INFRA to enable it.
We might get extra PRs (at the beginning) and we have to be
cautious about updates that might have unintended consequences. For
example, we should not merge non-stable dependency updates (those
ending in -rc1 or -beta in Java) that
might be proposed, or dependencies that committers know we should
not update, for example projects whose main stable version is not
the most recent one (like Hadoop), or dependencies that do not support
our current language target version (e.g. Java 11-only deps).

Another issue is that these dependency updates might not get a JIRA
associated with them, so we need to decide whether (1) we create one and
rename/associate the PR with it, or (2) we simply don't have
JIRAs for dependency updates.

WDYT? Any other pros/cons that I may be missing?

Ismaël
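
For concreteness, a sketch of what a .github/dependabot.yml along the lines
discussed in this thread could look like. The directories and ignore rules are
illustrative assumptions only: the pyarrow rule reflects Brian's note that
major bumps need a new test environment first, and the Hadoop rule is the kind
of exception Ismaël mentions:

    version: 2
    updates:
      - package-ecosystem: "pip"
        directory: "/sdks/python"       # assumed manifest location
        schedule:
          interval: "weekly"
        ignore:
          # Hold back major pyarrow bumps until a tox environment exists.
          - dependency-name: "pyarrow"
            update-types: ["version-update:semver-major"]
      - package-ecosystem: "gradle"
        directory: "/"
        schedule:
          interval: "weekly"
        ignore:
          # Example exception rule: keep Hadoop on its current stable line.
          - dependency-name: "org.apache.hadoop:*"

Whether Dependabot honors the existing upper bounds when proposing bumps is
exactly Brian's open question above; explicit ignore rules like these are one
way to enforce holds either way.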


Re: [VOTE] Release 2.29.0, release candidate #1

2021-04-16 Thread Elliotte Rusty Harold
On Fri, Apr 16, 2021 at 4:02 AM Kenneth Knowles  wrote:

> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2], 
> which is signed with the key with fingerprint 
> 03DBA3E6ABDD04BFD1558DC16ED551A8AE02461C [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.29.0-RC1" [5],
> * website pull request listing the release [6], publishing the API reference 
> manual [7], and the blog post [8].
> * Java artifacts were built with Maven MAVEN_VERSION and OpenJDK/Oracle JDK 
> JDK_VERSION.

Are the MAVEN_VERSION and OpenJDK/Oracle JDK JDK_VERSION supposed to
be filled in with numbers?


--
Elliotte Rusty Harold
elh...@ibiblio.org


Re: Long term support versions of Beam Java

2021-04-16 Thread Ismaël Mejía
As Kenn points out clearly, anyone can do an Apache release of an
earlier version, so this should cover most maintenance fixes for old
versions. Any person (or company) can decide to work on supporting
one version.

The real challenge of having an LTS "backed by the community" is that ALL
of the community should care about backporting fixes, and that's exactly
what made the previous LTS trial fail.

Why should I (a contributor/maintainer) care about backporting things if
users (or my employer) can move rapidly to a more recent version and
get even more benefits? It does not help that
backporting fixes has been absolutely painful in the past due to the
rapid changes of Beam internals + build system + CI runs.

But even if most people (or just one company) are interested in
backporting fixes and maintaining an LTS, there is still more to
clarify. The devil is in the details: what are the goals or guarantees of
the LTS version, and how do they impact the backporting of fixes? Some
already mentioned upgrade and state compatibility as candidates (which
of course will require additional regression tests), but I can think
of more mundane ones, like: can we upgrade dependency versions in the
LTS for reasons other than security, to make migration easier? What
happens if changes in master introduce incompatible transitive (or
direct) APIs into the LTS version - should they be backported or not? There
are without doubt more details to clarify.

Another aspect of having an LTS version that has not been mentioned is
that the bugs users currently report early, because they move to new
versions regularly, would instead be discovered and reported a year
later. This has a negative impact on project quality too. We are
not perfect and errors will happen, so the earlier we can find and fix
them the better. It is up to us in the project to keep the quality
good enough that users are motivated to upgrade and don't prefer to stay
on older versions just out of fear.

On Tue, Apr 13, 2021 at 2:22 AM Robert Burke  wrote:
>
> I'll note that "binary compatibility" can be substituted for 
> "upgrade compatibility" or "state compatibility".
>
> On Mon, Apr 12, 2021, 5:04 PM Brian Hulette  wrote:
>>
>> > Beam is also multi-language, which adjusts concerns. How does gRPC handle 
>> > that? Or protos? (I'm sure there are other examples we can pull from...)
>>
>> I'm not sure those are good projects to look at. They're likely much more 
>> concerned with binary compatibility and there's probably not much change in 
>> the user-facing API.
>> Arrow is another multi-language project but I don't think we can learn much 
>> from its versioning policy [1], which is much more concerned with binary 
>> compatibility than it is with API compatibility (for now). Perhaps one 
>> lesson is that they track a separate format version and API version. We 
>> could do something similar and have a separate version number for the Beam 
>> model protos. I'm not sure if that's relevant for this discussion or not.
>>
>> Spark may be a reasonable comparison since it provides an API in multiple 
>> languages, but that doesn't seem to have any bearing on its versioning 
>> policy [2]. It sounds similar to Flink in that every minor release gets 
>> backported bugfixes for 18 months, but releases are slower (~6 months) so 
>> that's not as much of a burden.
>>
>> Brian
>>
>> [1] 
>> https://arrow.apache.org/docs/format/Versioning.html#backward-compatibility
>> [2] https://spark.apache.org/versioning-policy.html
>>
>> On Thu, Apr 8, 2021 at 1:18 PM Robert Bradshaw  wrote:
>>>
>>> Python (again a language) has a slower release cycle, fairly strict 
>>> backwards compatibility stance (with the ability to opt-in before changes 
>>> become the default) and clear ownership for maintenance of each minor 
>>> version until end-of-life (so each could be considered an LTS release). 
>>> https://devguide.python.org/devcycle/
>>>
>>> Cython is more similar to Beam: best-effort compatibility, no LTS, but as a 
>>> code generator rather than a runtime library, a developer is mostly free to 
>>> upgrade at their own cadence regardless of the surrounding ecosystem 
>>> (including downstream projects that take them on as a dependency).
>>>
>>> IIRC, Flink supports the latest N (3?) releases, which are infrequent 
>>> enough to cover about 12-18 months.
>>>
>>> My take is that Beam should be supportive of LTS releases, but we're not in 
>>> a position to commit to it (to the same level we commit to the 6-week 
>>> cut-from-head release cycle). But certain users of Beam (which have a large 
>>> overlap with the Beam community) could make such commitments as it helps 
>>> them (directly or indirectly). Let's give it a try.
>>>
>>>
>>> On Thu, Apr 8, 2021 at 1:00 PM Robert Burke  wrote:

 I don't know about other Apache projects but the Go Programming Language 
 uses a slower release cadence, two releases a year. Only the last two 
 releases are supported.

Re: [Question] Amazon Neptune I/O connector

2021-04-16 Thread Ismaël Mejía
I had not seen that the query API of Neptune is Gremlin-based, so this
could be an even more generic IO connector.
That's probably beyond scope because you care most about the write side, but
interesting anyway.

https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html

On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía  wrote:
>
> Hello Gabriel,
>
> Another interesting reference, because of its similar batch-loads API use +
> Amazon, is the unfinished Amazon Redshift connector PR from this ticket:
> https://issues.apache.org/jira/browse/BEAM-3032
>
> The reason that one was not merged into Beam is that it lacked tests.
> You should probably look at how to test Neptune in advance; it seems
> that localstack does not support Neptune (only in the paying version),
> so mocking would probably be the right way.
>
> We would be really interested if you want to contribute the
> NeptuneIO connector to Beam, so don't hesitate to contact us.
>
>
> On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz  
> wrote:
> >
> > Hi Daniel, Kenneth,
> >
> > Thank you very much for your answers! I'll be looking carefully into the 
> > info you've provided and if we eventually decide it's worth implementing, 
> > I'll get back to you.
> >
> > Best,
> > Gabriel
> >
> >
> > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles  wrote:
> >>
> >>
> >>
> >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins  
> >> wrote:
> >>>
> >>> Hi Gabriel,
> >>>
> >>> Write-side adapters for systems tend to be easier than read-side adapters 
> >>> to implement. That being said, looking at the documentation for neptune, 
> >>> it looks to me like there's no direct data load API, only a batch data 
> >>> load from a file on S3? This is usable but perhaps a bit more difficult 
> >>> to work with.
> >>>
> >>> You could implement a write side adapter for neptune (either on your own 
> >>> or as a contribution to beam) by writing a standard DoFn which, in its 
> >>> ProcessElement method, buffers received records in memory, and in its 
> >>> FinishBundle method, writes all collected records to a file on S3, 
> >>> notifies neptune, and waits for neptune to ingest them. You can see 
> >>> documentation on the DoFn API here. Someone else here might have more 
> >>> experience working with microbatch-style APIs like this, and could have 
> >>> more suggestions.
> >>
> >>
> >> In fact, our BigQueryIO connector has a mode of operation that does batch 
> >> loads from files on GCS: 
> >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> >>
> >> The connector overall is large and complex, because it is old and mature. 
> >> But it may be helpful as a point of reference.
> >>
> >> Kenn
> >>
> >>>
> >>> A read-side API would likely be only a minimally higher lift. This could 
> >>> be done in a simple loading step (Create with a single element followed 
> >>> by MapElements), although much of the complexity likely lies around how 
> >>> to provide the necessary properties to the cluster construction on the 
> >>> beam worker task, and how to define the query the user would need to 
> >>> execute. I'd also wonder if this could be done in an engine-agnostic way, 
> >>> "TinkerPopIO" instead of "NeptuneIO".
> >>>
> >>> If you'd like to pursue adding such an integration, 
> >>> https://beam.apache.org/contribute/ provides documentation on the 
> >>> contribution process. Contributions to beam are always appreciated!
> >>>
> >>> -Daniel
> >>>
> >>>
> >>>
> >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz 
> >>>  wrote:
> 
>  Dear Beam Dev community,
> 
>  I'm working on a project where we have a graph database on Amazon 
>  Neptune (https://aws.amazon.com/neptune) and we have data coming from 
>  Google Cloud.
> 
>  So I was wondering if anyone has ever worked with a similar architecture 
>  and has considered developing an Amazon Neptune custom Beam I/O 
>  connector. Is it feasible? Is it worth it?
> 
>  Honestly I'm not that experienced with Apache Beam / Dataflow, so I'm 
>  not sure if something like that would make sense. Currently we're 
>  connecting Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
> 
>  Thank you all very much in advance!
> 
>  Best,
>  Gabriel Levcovitz


Re: [Question] Amazon Neptune I/O connector

2021-04-16 Thread Ismaël Mejía
Hello Gabriel,

Another interesting reference, because of its similar batch-loads API use +
Amazon, is the unfinished Amazon Redshift connector PR from this ticket:
https://issues.apache.org/jira/browse/BEAM-3032

The reason that one was not merged into Beam is that it lacked tests.
You should probably look at how to test Neptune in advance; it seems
that localstack does not support Neptune (only in the paying version),
so mocking would probably be the right way.

We would be really interested if you want to contribute the
NeptuneIO connector to Beam, so don't hesitate to contact us.


On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz  wrote:
>
> Hi Daniel, Kenneth,
>
> Thank you very much for your answers! I'll be looking carefully into the info 
> you've provided and if we eventually decide it's worth implementing, I'll get 
> back to you.
>
> Best,
> Gabriel
>
>
> On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles  wrote:
>>
>>
>>
>> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins  wrote:
>>>
>>> Hi Gabriel,
>>>
>>> Write-side adapters for systems tend to be easier than read-side adapters 
>>> to implement. That being said, looking at the documentation for neptune, it 
>>> looks to me like there's no direct data load API, only a batch data load 
>>> from a file on S3? This is usable but perhaps a bit more difficult to work 
>>> with.
>>>
>>> You could implement a write side adapter for neptune (either on your own or 
>>> as a contribution to beam) by writing a standard DoFn which, in its 
>>> ProcessElement method, buffers received records in memory, and in its 
>>> FinishBundle method, writes all collected records to a file on S3, notifies 
>>> neptune, and waits for neptune to ingest them. You can see documentation on 
>>> the DoFn API here. Someone else here might have more experience working 
>>> with microbatch-style APIs like this, and could have more suggestions.
>>
>>
>> In fact, our BigQueryIO connector has a mode of operation that does batch 
>> loads from files on GCS: 
>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
>>
>> The connector overall is large and complex, because it is old and mature. 
>> But it may be helpful as a point of reference.
>>
>> Kenn
>>
>>>
>>> A read-side API would likely be only a minimally higher lift. This could be 
>>> done in a simple loading step (Create with a single element followed by 
>>> MapElements), although much of the complexity likely lies around how to 
>>> provide the necessary properties to the cluster construction on the beam 
>>> worker task, and how to define the query the user would need to execute. 
>>> I'd also wonder if this could be done in an engine-agnostic way, 
>>> "TinkerPopIO" instead of "NeptuneIO".
>>>
>>> If you'd like to pursue adding such an integration, 
>>> https://beam.apache.org/contribute/ provides documentation on the 
>>> contribution process. Contributions to beam are always appreciated!
>>>
>>> -Daniel
>>>
>>>
>>>
>>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz  
>>> wrote:

 Dear Beam Dev community,

 I'm working on a project where we have a graph database on Amazon Neptune 
 (https://aws.amazon.com/neptune) and we have data coming from Google Cloud.

 So I was wondering if anyone has ever worked with a similar architecture 
 and has considered developing an Amazon Neptune custom Beam I/O connector. 
 Is it feasible? Is it worth it?

 Honestly I'm not that experienced with Apache Beam / Dataflow, so I'm not 
 sure if something like that would make sense. Currently we're connecting 
 Beam to AWS Kinesis and AWS S3, and from there, to Neptune.

 Thank you all very much in advance!

 Best,
 Gabriel Levcovitz