Thanks. IMHO it might be better to perform this splitting as a part of your
pipeline instead of making source splitting customizable. The reason is,
it's easy for users to shoot themselves in the foot if we allow specifying
a custom splitter. A bug in a custom QuerySplitter can result in a
hard-to-catch data loss or data duplication bug. So I'd rather not make it a part
of the user API.

I can think of two ways for performing this splitting as a part of your
pipeline.
(1) Split the query during job construction and create a source per query.
This can be followed by a Flatten transform that creates a single
PCollection (see the first sketch below). One caveat is, you might run into
the 10MB request size limit if you create too many splits here. So try
reducing the number of splits if you run into this.
(2) Add a ReadAll transform to DatastoreIO. This will allow you to precede
the step that performs reading with a ParDo step that splits your query and
creates a PCollection of queries (see the second sketch below). You should
not run into size limits here since splitting happens in the data plane.
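
For (1), here is a minimal sketch, assuming a hypothetical splitQuery()
helper that holds your custom splitting logic:

    import java.util.List;
    import com.google.datastore.v1.Entity;
    import com.google.datastore.v1.Query;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    // Split at job construction time and create one Read per sub-query.
    List<Query> subQueries = splitQuery(baseQuery);  // hypothetical helper
    PCollectionList<Entity> shards = PCollectionList.empty(p);
    for (int i = 0; i < subQueries.size(); i++) {
      shards = shards.and(
          p.apply("Read-" + i,
              DatastoreIO.v1().read()
                  .withProjectId(projectId)
                  .withQuery(subQueries.get(i))));
    }
    // Flatten the per-query reads into a single PCollection.
    PCollection<Entity> entities = shards.apply(Flatten.pCollections());

For (2), the pipeline would look roughly like the sketch below. Note that
DatastoreIO has no ReadAll transform today, so readAll() here is an
assumption about what such an API could look like:

    // Build a PCollection of sub-queries in a ParDo, then read them all.
    // Splitting happens in the data plane, so job size is not a concern.
    PCollection<Entity> entities = p
        .apply(Create.of(baseQuery).withCoder(ProtoCoder.of(Query.class)))
        .apply("SplitQuery", ParDo.of(new DoFn<Query, Query>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (Query q : splitQuery(c.element())) {  // hypothetical helper
              c.output(q);
            }
          }
        }))
        .apply(DatastoreIO.v1().readAll()  // hypothetical ReadAll
            .withProjectId(projectId));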

Thanks,
Cham

On Wed, May 2, 2018 at 12:50 PM Frank Yellin <[email protected]> wrote:

> TLDR:
> Is it okay for me to expose Datastore in apache beam's DatastoreIO, and
> thus indirectly expose com.google.rpc.Code?
> Is there a better solution?
>
>
> As I explain in Beam 4186
> <https://issues.apache.org/jira/browse/BEAM-4186>, I would like to be
> able to extend DatastoreV1.Read to have a
>        withQuerySplitter(QuerySplitter querySplitter)
> method, which would use an alternative query splitter.   The standard one
> shards by key and is very limited.
>
> I have already written such a query splitter.  In fact, the query splitter
> I've written goes further than what is specified in Beam, and reads the
> minimum or maximum value of the field from the datastore if no minimum or
> maximum is specified in the query, and uses that value for the sharding.
> I can write:
>        SELECT * FROM ledger where type = 'purchase'
> and then ask it to shard on the eventTime field, and it will shard nicely!  I
> am working with the Datastore folks to separately add my new query splitter
> as an option in DatastoreHelper.
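>
> Concretely, the usage I have in mind looks something like this (the
> splitter class name is just illustrative):
>
>        DatastoreV1.Read read = DatastoreIO.v1().read()
>            .withProjectId(projectId)
>            .withLiteralGqlQuery("SELECT * FROM ledger WHERE type = 'purchase'")
>            .withQuerySplitter(new EventTimeQuerySplitter("eventTime"));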
>
>
> I have already written the code to add withQuerySplitter.
>
>        https://github.com/apache/beam/pull/5246
>
> However, the problem is that I am increasing the API surface of Dataflow:
>        QuerySplitter exposes Datastore, which exposes DatastoreException,
>        which exposes com.google.rpc.Code,
> and com.google.rpc.Code is not (yet) part of the API surface.
>
> As a solution, I've added package com.google.rpc to the list of classes
> exposed.  This package contains protobuf enums.  Is this okay?  Is there a
> better solution?
>
> Thanks.
>
>
