[ https://issues.apache.org/jira/browse/BEAM-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208098#comment-16208098 ]

Alexander Hoem Rosbach edited comment on BEAM-3039 at 10/17/17 6:28 PM:
------------------------------------------------------------------------

Would it offer any advantage to use GroupByKey instead of Distinct?

What do you think about adding features to DatastoreIO to avoid the issue? 
They could be optional parameters on the write transform if you don't agree 
that this is a bug. In my opinion it is a bug, since streaming data from 
Pub/Sub into Datastore is presumably a common use case for Dataflow 
pipelines.

For instance:
{code}
.apply(DatastoreIO.v1().write().withProjectId(options.getProject()).removeDuplicatesWithinCommits());
{code}
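
Until something like that exists, a pipeline-level workaround seems possible: 
key each entity by its Datastore key and keep a single mutation per key within 
each window before the write. A rough sketch only, assuming entities is an 
already-windowed PCollection<Entity> read from Pub/Sub (the transform names 
and the keep-the-first choice are illustrative, not an existing API):
{code}
import com.google.datastore.v1.Entity;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

// GroupByKey on an unbounded source needs a non-global window or a
// trigger, so `entities` is assumed to be windowed already.
entities
    .apply("KeyByEntityKey",
        WithKeys.of((Entity e) -> e.getKey().toString())
            .withKeyType(TypeDescriptors.strings()))
    .apply(GroupByKey.<String, Entity>create())
    .apply("KeepOnePerKey",
        ParDo.of(new DoFn<KV<String, Iterable<Entity>>, Entity>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // All values in the group mutate the same entity, so emitting
            // just one avoids duplicate mutations in a single commit.
            c.output(c.element().getValue().iterator().next());
          }
        }))
    .apply(DatastoreIO.v1().write().withProjectId(options.getProject()));
{code}
This also partly answers my own question above: Distinct alone would not catch 
this case, because two updates to the same entity only compare equal if all of 
their properties match, which is why grouping on the key looks necessary.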



> DatastoreIO.Write fails multiple mutations of same entity
> ---------------------------------------------------------
>
>                 Key: BEAM-3039
>                 URL: https://issues.apache.org/jira/browse/BEAM-3039
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>    Affects Versions: 2.1.0
>            Reporter: Alexander Hoem Rosbach
>            Assignee: Chamikara Jayalath
>            Priority: Minor
>
> When streaming messages from a source that guarantees at-least-once rather 
> than exactly-once delivery, DatastoreIO.Write throws an exception, which 
> leads to Dataflow retrying the same commit multiple times before giving up. 
> This creates a significant bottleneck in the pipeline, with the end result 
> that the data is dropped. This should be handled better.
> There are a number of ways to fix this. One of them could be to drop any 
> duplicate mutations within one batch (a rough sketch of this idea follows 
> the stack trace below). Non-duplicate mutations of the same entity should 
> also be handled in some way: perhaps use a NON-TRANSACTIONAL commit, or 
> make sure the mutations are committed in separate commits.
> {code}
> com.google.datastore.v1.client.DatastoreException: A non-transactional commit may not contain multiple mutations affecting the same entity., code=INVALID_ARGUMENT
>         com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
>         com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
>         com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
>         com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
>         org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
>         org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.processElement(DatastoreV1.java:1253)
> {code}
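
The in-batch deduplication proposed in the description could look roughly like 
the following. This is a sketch only, assuming v1 Mutation protos and that 
keeping the last mutation per key is acceptable; the helper name and its 
placement inside DatastoreWriterFn are hypothetical, not the actual DatastoreV1 
internals:
{code}
import com.google.datastore.v1.Key;
import com.google.datastore.v1.Mutation;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: keep only the last mutation per entity key so a
// single non-transactional commit never touches the same entity twice.
static List<Mutation> dropDuplicatesWithinBatch(List<Mutation> batch) {
  Map<Key, Mutation> lastPerKey = new LinkedHashMap<>();
  for (Mutation m : batch) {
    // Deletes carry the key directly; upsert/insert/update carry it
    // inside the entity. Only upsert and delete are handled here.
    Key key = m.getOperationCase() == Mutation.OperationCase.DELETE
        ? m.getDelete()
        : m.getUpsert().getKey();
    lastPerKey.put(key, m);  // a later mutation for the same key wins
  }
  return new ArrayList<>(lastPerKey.values());
}
{code}
DatastoreIO's write() produces upsert mutations and its delete variants 
produce delete mutations, so the two cases above should cover the writer's 
own batches.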


