Hi Daniel, Kenneth,

Thank you very much for your answers! I'll be looking carefully into the info you've provided, and if we eventually decide it's worth implementing, I'll get back to you.
Best,
Gabriel

On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <k...@apache.org> wrote:

> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <dpcoll...@google.com> wrote:
>
>> Hi Gabriel,
>>
>> Write-side adapters for systems tend to be easier to implement than
>> read-side adapters. That said, looking at the documentation for Neptune,
>> it looks to me like there is no direct data-load API, only a batch data
>> load from a file on S3
>> <https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-data.html>.
>> This is usable, but perhaps a bit more difficult to work with.
>>
>> You could implement a write-side adapter for Neptune (either on your own
>> or as a contribution to Beam) by writing a standard DoFn which, in its
>> ProcessElement method, buffers received records in memory, and in its
>> FinishBundle method, writes all collected records to a file on S3,
>> notifies Neptune, and waits for Neptune to ingest them (a sketch of this
>> approach appears after the thread). You can see documentation on the
>> DoFn API here
>> <https://beam.apache.org/releases/javadoc/2.28.0/org/apache/beam/sdk/transforms/DoFn.html>.
>> Someone else here might have more experience working with
>> microbatch-style APIs like this and could have more suggestions.
>>
>
> In fact, our BigQueryIO connector has a mode of operation that does batch
> loads from files on GCS:
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
>
> The connector overall is large and complex, because it is old and mature,
> but it may be helpful as a point of reference.
>
> Kenn
>
>> A read-side adapter would likely be only a slightly higher lift. It
>> could be done in a simple loading step (Create with a single element,
>> followed by MapElements; see the second sketch after the thread),
>> although much of the complexity likely lies in how to provide the
>> necessary properties for cluster construction
>> <https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html>
>> on the Beam worker, and how to let the user define the query to execute.
>> I'd also wonder whether this could be done in an engine-agnostic way:
>> "TinkerPopIO" instead of "NeptuneIO".
>>
>> If you'd like to pursue adding such an integration,
>> https://beam.apache.org/contribute/ provides documentation on the
>> contribution process. Contributions to Beam are always appreciated!
>>
>> -Daniel
>>
>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <g.levcov...@gmail.com> wrote:
>>
>>> Dear Beam Dev community,
>>>
>>> I'm working on a project where we have a graph database on Amazon
>>> Neptune (https://aws.amazon.com/neptune) and we have data coming from
>>> Google Cloud.
>>>
>>> So I was wondering if anyone has ever worked with a similar
>>> architecture and has considered developing a custom Beam I/O connector
>>> for Amazon Neptune. Is it feasible? Is it worth it?
>>>
>>> Honestly, I'm not that experienced with Apache Beam / Dataflow, so I'm
>>> not sure whether something like that would make sense. Currently we're
>>> connecting Beam to AWS Kinesis and AWS S3, and from there to Neptune.
>>>
>>> Thank you all very much in advance!
>>>
>>> Best,
>>> Gabriel Levcovitz
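A minimal sketch of the write-side approach Daniel describes above: a DoFn that buffers records per bundle, stages them on S3 in FinishBundle, and then asks the Neptune bulk loader to ingest the staged file. The bucket name, IAM role ARN, Neptune endpoint, and region are all placeholders, and a production connector would also parse the returned loadId, poll the loader's status endpoint until ingestion completes, and add error handling and retries.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.beam.sdk.transforms.DoFn;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

/** Buffers rows per bundle, stages them on S3, then triggers a Neptune bulk load. */
class NeptuneBulkWriteFn extends DoFn<String, Void> {
  // All three values are placeholders for illustration.
  private static final String BUCKET = "my-staging-bucket";
  private static final String LOADER_ENDPOINT =
      "https://my-neptune-cluster.example.com:8182/loader";
  private static final String LOAD_ROLE_ARN =
      "arn:aws:iam::123456789012:role/NeptuneLoadFromS3";

  private transient List<String> buffer;
  private transient S3Client s3;

  @Setup
  public void setup() {
    s3 = S3Client.create();
  }

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(@Element String csvRow) {
    // Collect the bundle's records in memory, as Daniel describes.
    buffer.add(csvRow);
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    if (buffer.isEmpty()) {
      return;
    }
    // Stage the collected records as a single object on S3.
    String key = "neptune-staging/" + UUID.randomUUID() + ".csv";
    s3.putObject(
        PutObjectRequest.builder().bucket(BUCKET).key(key).build(),
        RequestBody.fromString(String.join("\n", buffer)));

    // Ask the Neptune bulk loader (an HTTP API on the cluster endpoint,
    // per the bulk-load docs linked above) to ingest the staged file.
    String payload = String.format(
        "{\"source\": \"s3://%s/%s\", \"format\": \"csv\","
            + " \"iamRoleArn\": \"%s\", \"region\": \"us-east-1\"}",
        BUCKET, key, LOAD_ROLE_ARN);
    HttpResponse<String> response = HttpClient.newHttpClient().send(
        HttpRequest.newBuilder()
            .uri(URI.create(LOADER_ENDPOINT))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build(),
        HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new RuntimeException("Neptune load request failed: " + response.body());
    }
    // A real connector would parse the returned loadId here and poll the
    // loader's status endpoint until the load finishes; omitted for brevity.
    buffer = null;
  }
}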
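A companion sketch of the read side along the lines Daniel suggests: a single seed element from Create, expanded into the query's results. FlatMapElements is used instead of MapElements since one element fans out into many results. The endpoint and the example traversal are placeholders, and the sketch omits the SSL/IAM configuration that connecting to Neptune from a Beam worker would actually require (see the Gremlin-Java docs linked in the thread).

import java.util.stream.Collectors;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class NeptuneReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // A single seed element fans out into the traversal's results, so the
    // Gremlin query runs exactly once, on a single worker.
    PCollection<String> names = pipeline
        .apply(Create.of("seed"))
        .apply(FlatMapElements.into(TypeDescriptors.strings()).via((String unused) -> {
          // Placeholder endpoint; real code must also wire up SSL and IAM auth.
          Cluster cluster = Cluster.build()
              .addContactPoint("my-neptune-cluster.example.com")
              .port(8182)
              .enableSsl(true)
              .create();
          try {
            GraphTraversalSource g = AnonymousTraversalSource.traversal()
                .withRemote(DriverRemoteConnection.using(cluster));
            // Placeholder query; a real NeptuneIO/TinkerPopIO would let the
            // user supply the traversal to execute.
            return g.V().hasLabel("person").values("name").toList().stream()
                .map(Object::toString)
                .collect(Collectors.toList());
          } finally {
            cluster.close();
          }
        }));

    pipeline.run().waitUntilFinish();
  }
}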