On Fri, Apr 16, 2021 at 6:36 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> I had not seen that the query API of Neptune is Gremlin based, so this
> could be an even more generic IO connector.
> That's probably beyond scope, because you care most about the write, but
> it is interesting anyway.
>
> https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html

Well, in theory the Gremlin API could be used for writing too, but I know
for a fact that it is not very performant, and Amazon recommends using the
Bulk Loader when creating a lot of vertices/edges at once. But if they
optimize this in the future, it could be even more interesting.

Gabriel

On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > Hello Gabriel,
> >
> > Another interesting reference, given its similar use of a batch-load
> > API plus Amazon, is the unfinished Amazon Redshift connector PR from
> > this ticket: https://issues.apache.org/jira/browse/BEAM-3032
> >
> > The reason that one was not merged into Beam is that it lacked tests.
> > You should probably look at how to test Neptune in advance; it seems
> > that localstack does not support Neptune (only in the paid version),
> > so mocking would probably be the right way.
> >
> > We would be really interested if you want to contribute the NeptuneIO
> > connector to Beam, so don't hesitate to contact us.
> >
> >
> > On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz <g.levcov...@gmail.com> wrote:
> > >
> > > Hi Daniel, Kenneth,
> > >
> > > Thank you very much for your answers! I'll be looking carefully into
> > > the info you've provided, and if we eventually decide it's worth
> > > implementing, I'll get back to you.
> > >
> > > Best,
> > > Gabriel
> > >
> > >
> > > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <k...@apache.org> wrote:
> > >>
> > >>
> > >>
> > >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <dpcoll...@google.com> wrote:
> > >>>
> > >>> Hi Gabriel,
> > >>>
> > >>> Write-side adapters for systems tend to be easier to implement than
> > >>> read-side adapters.
> > >>> That being said, looking at the documentation for Neptune, it looks
> > >>> to me like there's no direct data load API, only a batch data load
> > >>> from a file on S3? This is usable, but perhaps a bit more difficult
> > >>> to work with.
> > >>>
> > >>> You could implement a write-side adapter for Neptune (either on your
> > >>> own or as a contribution to Beam) by writing a standard DoFn which,
> > >>> in its ProcessElement method, buffers received records in memory,
> > >>> and in its FinishBundle method, writes all collected records to a
> > >>> file on S3, notifies Neptune, and waits for Neptune to ingest them.
> > >>> You can see documentation on the DoFn API here. Someone else here
> > >>> might have more experience working with microbatch-style APIs like
> > >>> this and could have more suggestions.
> > >>
> > >>
> > >> In fact, our BigQueryIO connector has a mode of operation that does
> > >> batch loads from files on GCS:
> > >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> > >>
> > >> The connector overall is large and complex, because it is old and
> > >> mature. But it may be helpful as a point of reference.
> > >>
> > >> Kenn
> > >>
> > >>>
> > >>> A read-side adapter would likely be only a slightly higher lift.
> > >>> This could be done in a simple loading step (Create with a single
> > >>> element followed by MapElements), although much of the complexity
> > >>> likely lies in how to provide the necessary properties for cluster
> > >>> construction on the Beam worker, and how to define the query the
> > >>> user would need to execute. I'd also wonder if this could be done
> > >>> in an engine-agnostic way, "TinkerPopIO" instead of "NeptuneIO".
> > >>>
> > >>> If you'd like to pursue adding such an integration,
> > >>> https://beam.apache.org/contribute/ provides documentation on the
> > >>> contribution process. Contributions to Beam are always appreciated!
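The buffer-then-flush lifecycle Daniel describes (accumulate records in ProcessElement, flush in FinishBundle) can be sketched without the Beam SDK on the classpath. The class below mimics that lifecycle with plain methods; in a real connector the two methods would carry Beam's @ProcessElement and @FinishBundle annotations, and the flush callback would upload a file to S3 and kick off a Neptune load. All class and method names here are illustrative, not from any SDK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the buffer-then-flush pattern a Neptune write transform could use.
 * In a real Beam DoFn, processElement/finishBundle would be the methods
 * annotated with @ProcessElement and @FinishBundle.
 */
public class BufferingWriter {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBufferSize;
    // Stand-in for the real work: e.g. write a CSV to S3, then start a load job.
    private final Consumer<List<String>> flusher;

    public BufferingWriter(int maxBufferSize, Consumer<List<String>> flusher) {
        this.maxBufferSize = maxBufferSize;
        this.flusher = flusher;
    }

    // Beam calls this once per element in the bundle.
    public void processElement(String record) {
        buffer.add(record);
        if (buffer.size() >= maxBufferSize) {
            flush(); // guard against unbounded memory use on large bundles
        }
    }

    // Beam calls this once when the bundle is complete.
    public void finishBundle() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        flusher.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

One design note: flushing early when the buffer hits a size cap matters because bundle sizes are runner-controlled and can be large in batch mode.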
> > >>>
> > >>> -Daniel
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <g.levcov...@gmail.com> wrote:
> > >>>>
> > >>>> Dear Beam Dev community,
> > >>>>
> > >>>> I'm working on a project where we have a graph database on Amazon
> > >>>> Neptune (https://aws.amazon.com/neptune) and we have data coming
> > >>>> from Google Cloud.
> > >>>>
> > >>>> So I was wondering if anyone has ever worked with a similar
> > >>>> architecture and has considered developing a custom Amazon Neptune
> > >>>> Beam I/O connector. Is it feasible? Is it worth it?
> > >>>>
> > >>>> Honestly, I'm not that experienced with Apache Beam / Dataflow, so
> > >>>> I'm not sure whether something like that would make sense.
> > >>>> Currently we're connecting Beam to AWS Kinesis and AWS S3, and
> > >>>> from there, to Neptune.
> > >>>>
> > >>>> Thank you all very much in advance!
> > >>>>
> > >>>> Best,
> > >>>> Gabriel Levcovitz
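For reference on the "batch data load from a file on S3" path discussed above: the Neptune Bulk Loader ingests CSV files in which Gremlin vertices use the system columns `~id` and `~label` plus typed property columns (e.g. `name:String`), and edges additionally use `~from` and `~to`. A minimal, illustrative sketch of producing such rows follows; the class and method names are made up for this example, and a real connector would also need to handle quoting and escaping of property values:

```java
/**
 * Illustrative helpers that format rows in the Neptune Gremlin
 * bulk-load CSV shape (vertex and edge files are separate).
 */
public class NeptuneCsvRows {
    // Vertex file header: system columns, then typed property columns.
    public static String vertexHeader(String... propertyColumns) {
        return "~id,~label," + String.join(",", propertyColumns);
    }

    public static String vertexRow(String id, String label, String... propertyValues) {
        return id + "," + label + "," + String.join(",", propertyValues);
    }

    // Edge file header: edges additionally reference source and target vertex ids.
    public static String edgeHeader(String... propertyColumns) {
        String base = "~id,~from,~to,~label";
        return propertyColumns.length == 0 ? base : base + "," + String.join(",", propertyColumns);
    }

    public static String edgeRow(String id, String from, String to, String label, String... props) {
        String base = id + "," + from + "," + to + "," + label;
        return props.length == 0 ? base : base + "," + String.join(",", props);
    }
}
```

In the DoFn pattern Daniel outlines, FinishBundle would assemble these rows into a file, upload it to S3, and then call Neptune's loader endpoint to ingest it.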