I’d also recommend taking a look at batch writes in the SnowflakeIO connector [1]. It supports batch loads only from GCS for now, but it could be a good reference too.
[1] https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L1051

> On 16 Apr 2021, at 16:55, Gabriel Levcovitz <[email protected]> wrote:
>
> On Fri, Apr 16, 2021 at 6:36 AM Ismaël Mejía <[email protected]> wrote:
> I had not seen that the query API of Neptune is Gremlin based, so this
> could be an even more generic IO connector.
> That's probably beyond scope, since you care most about the write, but
> it's interesting anyway.
>
> https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html
>
> Well, in theory the Gremlin API could even be used for writing too, but I
> know for a fact that it's not very performant, and Amazon recommends using
> the Bulk Loader when creating a lot of vertices/edges at once. But if they
> optimize this in the future, it could be even more interesting.
>
> Gabriel
>
>
> On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía <[email protected]> wrote:
> >
> > Hello Gabriel,
> >
> > Another interesting reference, because of its similar batch-loads API +
> > Amazon combination, is the unfinished Amazon Redshift connector PR from
> > this ticket:
> > https://issues.apache.org/jira/browse/BEAM-3032
> >
> > The reason that one was not merged into Beam is that it lacked tests.
> > You should probably look at how to test Neptune in advance; it seems
> > that localstack does not support Neptune (only in the paid version),
> > so mocking would probably be the right way.
> >
> > We would be really interested if you want to contribute a NeptuneIO
> > connector to Beam, so don't hesitate to contact us.
> >
> >
> > On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz <[email protected]> wrote:
> > >
> > > Hi Daniel, Kenneth,
> > >
> > > Thank you very much for your answers! I'll be looking carefully into the
> > > info you've provided, and if we eventually decide it's worth implementing,
> > > I'll get back to you.
> > >
> > > Best,
> > > Gabriel
> > >
> > >
> > > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <[email protected]> wrote:
> > >>
> > >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <[email protected]> wrote:
> > >>>
> > >>> Hi Gabriel,
> > >>>
> > >>> Write-side adapters for systems tend to be easier to implement than
> > >>> read-side adapters. That said, looking at the documentation for
> > >>> Neptune, it looks to me like there's no direct data-load API, only
> > >>> a batch data load from a file on S3. This is usable, but perhaps a bit
> > >>> more difficult to work with.
> > >>>
> > >>> You could implement a write-side adapter for Neptune (either on your
> > >>> own or as a contribution to Beam) by writing a standard DoFn which, in
> > >>> its ProcessElement method, buffers received records in memory, and in
> > >>> its FinishBundle method, writes all collected records to a file on S3,
> > >>> notifies Neptune, and waits for Neptune to ingest them. You can see
> > >>> documentation on the DoFn API here. Someone else here might have more
> > >>> experience working with microbatch-style APIs like this and could have
> > >>> more suggestions.
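To make Daniel's suggestion above more concrete, here is a rough, untested
sketch of such a buffering DoFn. The two S3/Neptune helpers are placeholders
for AWS SDK calls and Neptune loader-API calls, not real library methods:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

public class NeptuneBulkWriteFn extends DoFn<String, Void> {

  private transient List<String> buffer;

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(@Element String record) {
    // Each record is assumed to already be a line in Neptune's
    // bulk-load CSV format.
    buffer.add(record);
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    if (buffer.isEmpty()) {
      return;
    }
    String s3Uri = uploadToS3(buffer);   // one file per bundle
    startNeptuneLoadAndWait(s3Uri);      // trigger + poll the bulk loader
    buffer.clear();
  }

  private String uploadToS3(List<String> lines) {
    // Placeholder: PUT the lines as an object via the AWS SDK and
    // return its s3:// URI.
    throw new UnsupportedOperationException("sketch only");
  }

  private void startNeptuneLoadAndWait(String s3Uri) {
    // Placeholder: POST the S3 URI to Neptune's /loader endpoint and
    // poll the load status until the job finishes.
    throw new UnsupportedOperationException("sketch only");
  }
}

Bundle sizes are runner-dependent, so in practice you'd probably also want
GroupIntoBatches or a size threshold in front of the flush.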
> > >>
> > >> In fact, our BigQueryIO connector has a mode of operation that does
> > >> batch loads from files on GCS:
> > >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> > >>
> > >> The connector overall is large and complex, because it is old and
> > >> mature. But it may be helpful as a point of reference.
> > >>
> > >> Kenn
> > >>
> > >>>
> > >>> A read-side API would likely be only a slightly higher lift. It
> > >>> could be done in a simple loading step (Create with a single element
> > >>> followed by MapElements), although much of the complexity likely lies
> > >>> in how to provide the necessary properties for cluster construction
> > >>> on the Beam worker, and how to define the query the user would need
> > >>> to execute. I'd also wonder if this could be done in an
> > >>> engine-agnostic way: "TinkerPopIO" instead of "NeptuneIO".
> > >>>
> > >>> If you'd like to pursue adding such an integration,
> > >>> https://beam.apache.org/contribute/ provides documentation on the
> > >>> contribution process. Contributions to Beam are always appreciated!
> > >>>
> > >>> -Daniel
> > >>>
> > >>>
> > >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <[email protected]> wrote:
> > >>>>
> > >>>> Dear Beam Dev community,
> > >>>>
> > >>>> I'm working on a project where we have a graph database on Amazon
> > >>>> Neptune (https://aws.amazon.com/neptune) and we have data coming
> > >>>> from Google Cloud.
> > >>>>
> > >>>> So I was wondering if anyone has ever worked with a similar
> > >>>> architecture and has considered developing an Amazon Neptune custom
> > >>>> Beam I/O connector. Is it feasible? Is it worth it?
> > >>>>
> > >>>> Honestly, I'm not that experienced with Apache Beam / Dataflow, so I'm
> > >>>> not sure if something like that would make sense. Currently we're
> > >>>> connecting Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
> > >>>>
> > >>>> Thank you all very much in advance!
> > >>>>
> > >>>> Best,
> > >>>> Gabriel Levcovitz
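For the read side Daniel describes above (Create with a single element
followed by a map step), a minimal, untested sketch could look like the
following. I used FlatMapElements rather than MapElements since a traversal
returns many results; runGremlinQuery() is a placeholder for TinkerPop/
Neptune client code, and the traversal string is just an example:

import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class NeptuneReadSketch {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(Create.of("seed"))  // single dummy element to kick off the query
        .apply(FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String seed) ->
                runGremlinQuery("g.V().hasLabel('person').values('name')")));

    p.run().waitUntilFinish();
  }

  // Placeholder: open a Gremlin client against the Neptune endpoint on the
  // worker, submit the traversal, and return the results.
  private static List<String> runGremlinQuery(String gremlin) {
    return Collections.emptyList(); // sketch only
  }
}

Note this runs the whole query on a single worker, so it only fits small
result sets; a real connector would need some way to split the query in
order to parallelize reads.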
