Hello Gabriel,

Another interesting reference, because of its similar use of a batch-loads
API with Amazon, is the unfinished Amazon Redshift connector PR from this
ticket:
https://issues.apache.org/jira/browse/BEAM-3032

The reason why that one was not merged into Beam is that it lacked tests.
You should probably look into how to test Neptune in advance; it seems
that LocalStack does not support Neptune (only in the paid version),
so mocking would probably be the right way to go.
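
For instance, here is a minimal sketch (all names hypothetical) of how the
write DoFn could take its client through a small serializable interface, so
a test can inject an in-memory fake instead of a real Neptune cluster:

    import java.io.Serializable;
    import org.apache.beam.sdk.transforms.DoFn;

    // Hypothetical thin wrapper around the Gremlin/Neptune client, so
    // tests can substitute a fake implementation.
    interface NeptuneWriter extends Serializable {
      void write(String record);
    }

    class NeptuneWriteFn extends DoFn<String, Void> {
      private final NeptuneWriter writer;

      NeptuneWriteFn(NeptuneWriter writer) {
        this.writer = writer;
      }

      @ProcessElement
      public void processElement(@Element String record) {
        // Delegate to the injected writer; a test passes a fake that
        // records what it receives, production passes the real client.
        writer.write(record);
      }
    }

A test can then assert on what the fake received, without any cluster.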

We would be really interested if you want to contribute the
NeptuneIO connector to Beam, so don't hesitate to contact us.


On Fri, Apr 16, 2021 at 5:41 AM Gabriel Levcovitz <g.levcov...@gmail.com> wrote:
>
> Hi Daniel, Kenneth,
>
> Thank you very much for your answers! I'll be looking carefully into the info 
> you've provided and if we eventually decide it's worth implementing, I'll get 
> back to you.
>
> Best,
> Gabriel
>
>
> On Thu, Apr 15, 2021 at 2:32 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>
>>
>> On Wed, Apr 14, 2021 at 11:07 PM Daniel Collins <dpcoll...@google.com> wrote:
>>>
>>> Hi Gabriel,
>>>
>>> Write-side adapters for systems tend to be easier to implement than
>>> read-side adapters. That being said, looking at the documentation for
>>> Neptune, it looks to me like there's no direct data load API, only a
>>> batch data load from a file on S3? This is usable but perhaps a bit more
>>> difficult to work with.
>>>
>>> You could implement a write-side adapter for Neptune (either on your own
>>> or as a contribution to Beam) by writing a standard DoFn which, in its
>>> ProcessElement method, buffers received records in memory, and in its
>>> FinishBundle method, writes all collected records to a file on S3,
>>> notifies Neptune, and waits for Neptune to ingest them. You can see
>>> documentation on the DoFn API here. Someone else here might have more
>>> experience working with microbatch-style APIs like this, and could have
>>> more suggestions.
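>>>
>>> As a rough sketch of that shape (the S3 and loader calls below are
>>> hypothetical placeholders, not an existing API):
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.List;
>>>     import org.apache.beam.sdk.transforms.DoFn;
>>>
>>>     class NeptuneBatchLoadFn extends DoFn<String, Void> {
>>>       private transient List<String> buffer;
>>>
>>>       @StartBundle
>>>       public void startBundle() {
>>>         buffer = new ArrayList<>();
>>>       }
>>>
>>>       @ProcessElement
>>>       public void processElement(@Element String record) {
>>>         buffer.add(record);
>>>       }
>>>
>>>       @FinishBundle
>>>       public void finishBundle() throws Exception {
>>>         if (buffer.isEmpty()) {
>>>           return;
>>>         }
>>>         // Hypothetical helpers: upload the buffered records as one S3
>>>         // object, start a Neptune bulk load pointing at it, and poll
>>>         // the loader until it reports completion.
>>>         String s3Uri = writeRecordsToS3(buffer);
>>>         String loadId = startNeptuneBulkLoad(s3Uri);
>>>         waitForNeptuneLoad(loadId);
>>>         buffer.clear();
>>>       }
>>>
>>>       private String writeRecordsToS3(List<String> records) { /* AWS SDK upload */ return null; }
>>>       private String startNeptuneBulkLoad(String s3Uri) { /* request a bulk load */ return null; }
>>>       private void waitForNeptuneLoad(String loadId) { /* poll load status */ }
>>>     }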
>>
>>
>> In fact, our BigQueryIO connector has a mode of operation that does batch 
>> loads from files on GCS: 
>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java
>>
>> The connector overall is large and complex, because it is old and mature. 
>> But it may be helpful as a point of reference.
>>
>> Kenn
>>
>>>
>>> A read-side API would likely be only a minimally higher lift. This could be 
>>> done in a simple loading step (Create with a single element followed by 
>>> MapElements), although much of the complexity likely lies around how to 
>>> provide the necessary properties to the cluster construction on the Beam 
>>> worker task, and how to define the query the user would need to execute. 
>>> I'd also wonder if this could be done in an engine-agnostic way, 
>>> "TinkerPopIO" instead of "NeptuneIO".
>>>
>>> If you'd like to pursue adding such an integration, 
>>> https://beam.apache.org/contribute/ provides documentation on the 
>>> contribution process. Contributions to Beam are always appreciated!
>>>
>>> -Daniel
>>>
>>>
>>>
>>> On Thu, Apr 15, 2021 at 12:44 AM Gabriel Levcovitz <g.levcov...@gmail.com> 
>>> wrote:
>>>>
>>>> Dear Beam Dev community,
>>>>
>>>> I'm working on a project where we have a graph database on Amazon Neptune 
>>>> (https://aws.amazon.com/neptune) and we have data coming from Google Cloud.
>>>>
>>>> So I was wondering if anyone has ever worked with a similar architecture 
>>>> and has considered developing an Amazon Neptune custom Beam I/O connector. 
>>>> Is it feasible? Is it worth it?
>>>>
>>>> Honestly I'm not that experienced with Apache Beam / Dataflow, so I'm not 
>>>> sure if something like that would make sense. Currently we're connecting 
>>>> Beam to AWS Kinesis and AWS S3, and from there, to Neptune.
>>>>
>>>> Thank you all very much in advance!
>>>>
>>>> Best,
>>>> Gabriel Levcovitz
