On Fri, Apr 16, 2021 at 6:36 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> I had not seen that the query API of Neptune is Gremlin based, so this
> could be an even more generic IO connector.
> That's probably beyond scope, because you care most about the write, but
> it is interesting anyway.
>
> https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-java.html

Well, in theory the Gremlin API could be used for writing too, but I know
for a fact that it is not very performant, and Amazon recommends using the
Bulk Loader when creating a lot of vertices/edges at once. But if they
optimize this in the future, it could be even more interesting.

Gabriel

On Fri, Apr 16, 2021 at 9:58 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > Hello Gabriel,
> >
> > Another interesting reference, given its similar use of a batch-load
> > API plus Amazon, is the unfinished Amazon Redshift connector PR from
> > this ticket: https://issues.apache.org/jira/browse/BEAM-3032
> >
> > The reason that one was not merged into Beam is that it lacked tests.
> > You should probably look at how to test Neptune in advance; it seems
> > that localstack does not support Neptune (only in the paid version),
> > so mocking would probably be the right way.
> >
> > We would be really interested if you want to contribute the NeptuneIO
> > connector to Beam, so don't hesitate to contact us.
> >
> >
> > On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz <g.levcov...@gmail.com> wrote:
> > >
> > > Hi Daniel, Kenneth,
> > >
> > > Thank you very much for your answers! I'll be looking carefully into
> > > the info you've provided, and if we eventually decide it's worth
> > > implementing, I'll get back to you.
> > >
> > > Best,
> > > Gabriel
> > >
> > >
> > > On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <k...@apache.org> wrote:
> > >>
> > >>
> > >>
> > >> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <dpcoll...@google.com> wrote:
> > >>>
> > >>> Hi Gabriel,
> > >>>
> > >>> Write-side adapters for systems tend to be easier to implement than
> > >>> read-side adapters.
> > >>> That being said, looking at the documentation for Neptune, it looks
> > >>> to me like there's no direct data load API, only a batch data load
> > >>> from a file on S3? This is usable, but perhaps a bit more difficult
> > >>> to work with.
> > >>>
> > >>> You could implement a write-side adapter for Neptune (either on your
> > >>> own or as a contribution to Beam) by writing a standard DoFn which,
> > >>> in its ProcessElement method, buffers received records in memory,
> > >>> and in its FinishBundle method, writes all collected records to a
> > >>> file on S3, notifies Neptune, and waits for Neptune to ingest them.
> > >>> You can see documentation on the DoFn API here. Someone else here
> > >>> might have more experience working with microbatch-style APIs like
> > >>> this and could have more suggestions.
> > >>
> > >>
> > >> In fact, our BigQueryIO connector has a mode of operation that does
> > >> batch loads from files on GCS:
> > >> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
> > >>
> > >> The connector overall is large and complex, because it is old and
> > >> mature. But it may be helpful as a point of reference.
> > >>
> > >> Kenn
> > >>
> > >>>
> > >>> A read-side adapter would likely be only a slightly higher lift.
> > >>> This could be done in a simple loading step (Create with a single
> > >>> element followed by MapElements), although much of the complexity
> > >>> likely lies in how to provide the necessary properties for cluster
> > >>> construction on the Beam worker, and how to define the query the
> > >>> user would need to execute. I'd also wonder if this could be done
> > >>> in an engine-agnostic way, "TinkerPopIO" instead of "NeptuneIO".
> > >>>
> > >>> If you'd like to pursue adding such an integration,
> > >>> https://beam.apache.org/contribute/ provides documentation on the
> > >>> contribution process. Contributions to Beam are always appreciated!
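The buffer-then-flush lifecycle Daniel describes (accumulate records in ProcessElement, flush in FinishBundle) can be sketched without the Beam SDK on the classpath. The class below mimics that lifecycle with plain methods; in a real connector the two methods would carry Beam's @ProcessElement and @FinishBundle annotations, and the flush callback would upload a file to S3 and kick off a Neptune load. All class and method names here are illustrative, not from any SDK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the buffer-then-flush pattern a Neptune write transform could use.
 * In a real Beam DoFn, processElement/finishBundle would be the methods
 * annotated with @ProcessElement and @FinishBundle.
 */
public class BufferingWriter {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBufferSize;
    // Stand-in for the real work: e.g. write a CSV to S3, then start a load job.
    private final Consumer<List<String>> flusher;

    public BufferingWriter(int maxBufferSize, Consumer<List<String>> flusher) {
        this.maxBufferSize = maxBufferSize;
        this.flusher = flusher;
    }

    // Beam calls this once per element in the bundle.
    public void processElement(String record) {
        buffer.add(record);
        if (buffer.size() >= maxBufferSize) {
            flush(); // guard against unbounded memory use on large bundles
        }
    }

    // Beam calls this once when the bundle is complete.
    public void finishBundle() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        flusher.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

One design note: flushing early when the buffer hits a size cap matters because bundle sizes are runner-controlled and can be large in batch mode.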
> > >>>
> > >>> -Daniel
> > >>>
> > >>>
> > >>>
> > >>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <g.levcov...@gmail.com> wrote:
> > >>>>
> > >>>> Dear Beam Dev community,
> > >>>>
> > >>>> I'm working on a project where we have a graph database on Amazon
> > >>>> Neptune (https://aws.amazon.com/neptune) and we have data coming
> > >>>> from Google Cloud.
> > >>>>
> > >>>> So I was wondering if anyone has ever worked with a similar
> > >>>> architecture and has considered developing a custom Amazon Neptune
> > >>>> Beam I/O connector. Is it feasible? Is it worth it?
> > >>>>
> > >>>> Honestly, I'm not that experienced with Apache Beam / Dataflow, so
> > >>>> I'm not sure whether something like that would make sense.
> > >>>> Currently we're connecting Beam to AWS Kinesis and AWS S3, and
> > >>>> from there, to Neptune.
> > >>>>
> > >>>> Thank you all very much in advance!
> > >>>>
> > >>>> Best,
> > >>>> Gabriel Levcovitz
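For reference on the "batch data load from a file on S3" path discussed above: the Neptune Bulk Loader ingests CSV files in which Gremlin vertices use the system columns `~id` and `~label` plus typed property columns (e.g. `name:String`), and edges additionally use `~from` and `~to`. A minimal, illustrative sketch of producing such rows follows; the class and method names are made up for this example, and a real connector would also need to handle quoting and escaping of property values:

```java
/**
 * Illustrative helpers that format rows in the Neptune Gremlin
 * bulk-load CSV shape (vertex and edge files are separate).
 */
public class NeptuneCsvRows {
    // Vertex file header: system columns, then typed property columns.
    public static String vertexHeader(String... propertyColumns) {
        return "~id,~label," + String.join(",", propertyColumns);
    }

    public static String vertexRow(String id, String label, String... propertyValues) {
        return id + "," + label + "," + String.join(",", propertyValues);
    }

    // Edge file header: edges additionally reference source and target vertex ids.
    public static String edgeHeader(String... propertyColumns) {
        String base = "~id,~from,~to,~label";
        return propertyColumns.length == 0 ? base : base + "," + String.join(",", propertyColumns);
    }

    public static String edgeRow(String id, String from, String to, String label, String... props) {
        String base = id + "," + from + "," + to + "," + label;
        return props.length == 0 ? base : base + "," + String.join(",", props);
    }
}
```

In the DoFn pattern Daniel outlines, FinishBundle would assemble these rows into a file, upload it to S3, and then call Neptune's loader endpoint to ingest it.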