I’d also recommend taking a look at batch writes in the SnowflakeIO connector [1] 
- it supports batch loads only from GCS for now, but it could be a good 
reference too.

[1] 
https://github.com/apache/beam/blob/985e2f095d150261e998f58cf048e48a909d5b2b/sdks/java/io/snowflake/src/main/java/org/apache/beam/sdk/io/snowflake/SnowflakeIO.java#L1051
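
For illustration, a write through that connector might look roughly like
this (the connection configuration, table name, bucket, integration name,
and mapper below are all placeholders, not working values):

  data.apply(
      SnowflakeIO.<KV<String, Long>>write()
          .withDataSourceConfiguration(dataSourceConfiguration)
          .to("MY_TABLE")
          .withStagingBucketName("gs://my-staging-bucket/")
          .withStorageIntegrationName("my_integration")
          .withUserDataMapper(userDataMapper));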

> On 16 Apr 2021, at 16:55, Gabriel Levcovitz <[email protected]> wrote:
> 
> On Fri, Apr 16, 2021 at 6:36 AM Ismaël Mejía <[email protected]> wrote:
> I had not seen that the query API of Neptune is Gremlin-based, so this
> could be an even more generic IO connector.
> That's probably beyond scope, since you care most about the write side,
> but it's interesting anyway.
> 
> https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html
> 
> Well, in theory the Gremlin API could be used for writing too, but I 
> know for a fact that it's not very performant, and Amazon recommends using 
> the Bulk Loader when creating many vertices/edges at once. But if they 
> optimize this in the future, it could be even more interesting.
> 
> Gabriel
>  
> 
> On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía <[email protected]> wrote:
> >
> > Hello Gabriel,
> >
> > Another interesting reference, given its similar batch-loads API use and
> > its Amazon focus, is the unfinished Amazon Redshift connector PR from this
> > ticket: https://issues.apache.org/jira/browse/BEAM-3032
> >
> > The reason that one was not merged into Beam is that it lacked tests.
> > You should probably look at how to test Neptune in advance; it seems
> > that Localstack does not support Neptune (only in the paid version),
> > so mocking would probably be the right way.
> >
> > We would be really interested if you want to contribute a
> > NeptuneIO connector to Beam, so don't hesitate to contact us.
> >
> >
> > On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz <[email protected]> wrote:
> > >
> > > Hi Daniel, Kenneth,
> > >
> > > Thank you very much for your answers! I'll be looking carefully into the 
> > > info you've provided and if we eventually decide it's worth implementing, 
> > > I'll get back to you.
> > >
> > > Best,
> > > Gabriel
> > >
> > >
> > > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <[email protected]> wrote:
> > >>>
> > >>> Hi Gabriel,
> > >>>
> > >>> Write-side adapters for systems tend to be easier to implement than 
> > >>> read-side adapters. That being said, looking at the documentation 
> > >>> for Neptune, it looks to me like there's no direct data load API, only 
> > >>> a batch data load from a file on S3? This is usable but perhaps a bit 
> > >>> more difficult to work with.
> > >>>
> > >>> You could implement a write-side adapter for Neptune (either on your 
> > >>> own or as a contribution to Beam) by writing a standard DoFn which, in 
> > >>> its ProcessElement method, buffers received records in memory, and in 
> > >>> its FinishBundle method, writes all collected records to a file on S3, 
> > >>> notifies Neptune, and waits for Neptune to ingest them. You can find 
> > >>> documentation on the DoFn API in the Beam programming guide. Someone 
> > >>> else here might have more experience working with microbatch-style 
> > >>> APIs like this and could have more suggestions.
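> > >>>
> > >>> A minimal sketch of that shape might look like the following, with
> > >>> the S3 upload and the bulk-load call left as hypothetical helpers
> > >>> (uploadBundleToS3 and startNeptuneBulkLoadAndWait are not real APIs,
> > >>> just stubs for the two external steps):
> > >>>
> > >>> import java.util.ArrayList;
> > >>> import java.util.List;
> > >>> import org.apache.beam.sdk.transforms.DoFn;
> > >>>
> > >>> class NeptuneBatchWriteFn extends DoFn<String, Void> {
> > >>>   // Records buffered for the current bundle only.
> > >>>   private transient List<String> buffer;
> > >>>
> > >>>   @StartBundle
> > >>>   public void startBundle() {
> > >>>     buffer = new ArrayList<>();
> > >>>   }
> > >>>
> > >>>   @ProcessElement
> > >>>   public void processElement(@Element String record) {
> > >>>     buffer.add(record);
> > >>>   }
> > >>>
> > >>>   @FinishBundle
> > >>>   public void finishBundle() throws Exception {
> > >>>     if (buffer.isEmpty()) {
> > >>>       return;
> > >>>     }
> > >>>     // Stage the bundle on S3, then trigger the Neptune bulk loader
> > >>>     // and block until ingestion completes.
> > >>>     String s3Uri = uploadBundleToS3(buffer);
> > >>>     startNeptuneBulkLoadAndWait(s3Uri);
> > >>>   }
> > >>>
> > >>>   // Hypothetical helper: write the buffered lines to an S3 object
> > >>>   // (e.g. with the AWS SDK) and return its URI.
> > >>>   private String uploadBundleToS3(List<String> lines) {
> > >>>     throw new UnsupportedOperationException("not implemented here");
> > >>>   }
> > >>>
> > >>>   // Hypothetical helper: call the Neptune loader endpoint for the
> > >>>   // given S3 URI and poll its status until the load finishes.
> > >>>   private void startNeptuneBulkLoadAndWait(String s3Uri) {
> > >>>     throw new UnsupportedOperationException("not implemented here");
> > >>>   }
> > >>> }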
> > >>
> > >>
> > >> In fact, our BigQueryIO connector has a mode of operation that does 
> > >> batch loads from files on GCS: 
> > >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> > >>
> > >> The connector overall is large and complex, because it is old and 
> > >> mature. But it may be helpful as a point of reference.
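> > >>
> > >> For a quick sense of it from the user side, that batch-load path is
> > >> the FILE_LOADS write method (the table name here is a placeholder):
> > >>
> > >>   rows.apply(
> > >>       BigQueryIO.writeTableRows()
> > >>           .to("my-project:my_dataset.my_table")
> > >>           .withMethod(BigQueryIO.Write.Method.FILE_LOADS));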
> > >>
> > >> Kenn
> > >>
> > >>>
> > >>> A read-side adapter would likely be only a slightly higher lift. This 
> > >>> could be done in a simple loading step (Create with a single element 
> > >>> followed by MapElements), although much of the complexity likely lies 
> > >>> in how to provide the necessary properties for cluster construction 
> > >>> on the Beam worker, and how to define the query the user would need 
> > >>> to execute. I'd also wonder if this could be done in an 
> > >>> engine-agnostic way, as "TinkerPopIO" instead of "NeptuneIO".
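> > >>>
> > >>> A rough sketch of that read-side shape, assuming the TinkerPop Java
> > >>> driver (the endpoint and query below are placeholder values):
> > >>>
> > >>>   pipeline
> > >>>       .apply(Create.of("g.V().hasLabel('person').values('name')"))
> > >>>       .apply(ParDo.of(new DoFn<String, String>() {
> > >>>         @ProcessElement
> > >>>         public void process(@Element String query,
> > >>>             OutputReceiver<String> out) {
> > >>>           // Cluster construction happens on the worker; in a real
> > >>>           // connector the endpoint would come from configuration.
> > >>>           Cluster cluster = Cluster.build()
> > >>>               .addContactPoint("my-neptune-endpoint")
> > >>>               .port(8182)
> > >>>               .create();
> > >>>           Client client = cluster.connect();
> > >>>           // Submit the Gremlin query and emit one output per result.
> > >>>           for (Result r : client.submit(query)) {
> > >>>             out.output(r.getString());
> > >>>           }
> > >>>           cluster.close();
> > >>>         }
> > >>>       }));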
> > >>>
> > >>> If you'd like to pursue adding such an integration, 
> > >>> https://beam.apache.org/contribute/ provides documentation on the 
> > >>> contribution process. Contributions to Beam are always appreciated!
> > >>>
> > >>> -Daniel
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz 
> > >>> <[email protected]> wrote:
> > >>>>
> > >>>> Dear Beam Dev community,
> > >>>>
> > >>>> I'm working on a project where we have a graph database on Amazon 
> > >>>> Neptune (https://aws.amazon.com/neptune) and we have data coming from 
> > >>>> Google Cloud.
> > >>>>
> > >>>> So I was wondering if anyone has ever worked with a similar 
> > >>>> architecture and has considered developing an Amazon Neptune custom 
> > >>>> Beam I/O connector. Is it feasible? Is it worth it?
> > >>>>
> > >>>> Honestly I'm not that experienced with Apache Beam / Dataflow, so I'm 
> > >>>> not sure if something like that would make sense. Currently we're 
> > >>>> connecting Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
> > >>>>
> > >>>> Thank you all very much in advance!
> > >>>>
> > >>>> Best,
> > >>>> Gabriel Levcovitz
