[ 
https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-2661:
-------------------------------
    Component/s: io-java-kudu

> Add KuduIO
> ----------
>
>                 Key: BEAM-2661
>                 URL: https://issues.apache.org/jira/browse/BEAM-2661
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas, io-java-kudu
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Tim Robertson
>            Priority: Major
>             Fix For: 2.7.0
>
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).
> This work is in progress [on this 
> branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with 
> design aspects documented below.
> h2. The API
> The {{KuduIO}} API requires the user to provide a function to convert objects 
> into operations. This is similar to {{JdbcIO}} but differs from others, 
> such as {{HBaseIO}}, which requires a pre-transform stage to convert 
> objects into the mutations to apply. It was originally intended to copy the 
> {{HBaseIO}} approach, but this was not possible:
>  # The Kudu 
> [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html]
>  is a fat class, and is a subclass of {{KuduRpc<OperationResponse>}}. It 
> holds RPC logic, callbacks and a Kudu client. Because of this, the 
> {{Operation}} does not serialize. Furthermore, the logic for encoding the 
> operations (Insert, Upsert etc.) in the Kudu Java API is one-way only (no 
> decode) because the server is written in C++.
>  # An alternative could be to introduce a new object to Beam (e.g. 
> {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable 
> {{PCollection<KuduOperation>}}. This was considered but was discounted 
> because:
>  ## It is not a familiar API to those who already know Kudu
>  ## It still requires serialization and deserialization of the operations. 
> Using the existing Kudu approach of serializing into compact byte arrays 
> would require a decoder along the lines of [this almost complete 
> example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e].
>  This is possible but fragile, given that the Kudu code itself continues to 
> evolve. 
>  ## Deferring the object-to-mutation mapping to within the KuduIO transform 
> leaves only a trivial codebase to maintain in Beam. {{JdbcIO}} gives us the 
> precedent to do this.
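
The design choice above (ship a serializable conversion function, and create the non-serializable operation only inside the write step) can be illustrated with a minimal, dependency-free sketch. Every name here ({{FakeOperation}}, {{FormatFunction}}, {{Writer}}) is a hypothetical stand-in invented for illustration; none of it is the actual KuduIO or Kudu client API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FormatFunctionSketch {

  /** Hypothetical stand-in for a Kudu Operation: deliberately NOT Serializable. */
  static final class FakeOperation {
    final String kind;
    final String row;

    FakeOperation(String kind, String row) {
      this.kind = kind;
      this.row = row;
    }

    @Override
    public String toString() {
      return kind + "(" + row + ")";
    }
  }

  /** Hypothetical user-supplied function; it is the only user code that is serialized. */
  interface FormatFunction<T> extends Serializable {
    FakeOperation apply(T element);
  }

  /** Hypothetical write step: holds the function and builds operations only at write time. */
  static final class Writer<T> implements Serializable {
    private final FormatFunction<T> format;

    Writer(FormatFunction<T> format) {
      this.format = format;
    }

    List<String> write(List<T> elements) {
      List<String> applied = new ArrayList<>();
      for (T element : elements) {
        // The operation exists only here and never crosses a serialization boundary.
        applied.add(format.apply(element).toString());
      }
      return applied;
    }
  }

  /** Java-serializes and deserializes a value, as a runner might when shipping a transform. */
  @SuppressWarnings("unchecked")
  static <T extends Serializable> T roundTrip(T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (T) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  static List<String> demo() {
    Writer<String> writer =
        new Writer<>(name -> new FakeOperation("upsert", "name=" + name));
    Writer<String> shipped = roundTrip(writer); // only the function travels
    return shipped.write(Arrays.asList("ada", "grace"));
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [upsert(name=ada), upsert(name=grace)]
  }
}
```

The writer survives a serialization round trip because it holds only the function, while the fat operation object is created and consumed locally inside {{write}}.
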
> h2. Testing framework
> {{Kudu}} is written in C++. While a 
> [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java]
>  does exist in Java, it requires binaries to be available for the target 
> environment, which is not portable (edit: this is now a [work in 
> progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore 
> we opt for the following:
>  # Unit tests will use a mock Kudu client
>  # Integration tests will cover all aspects of {{KuduIO}} and use a 
> Docker-based Kudu instance
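
As a rough illustration of the unit-test half of this plan, the sketch below puts client access behind a small interface so that a test can substitute an in-memory mock. The {{RowSink}} interface, {{RecordingSink}} mock and {{writeAll}} method are assumptions made up for this example, not the seam actually used in the branch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MockClientSketch {

  /** Hypothetical seam: the only client behaviour the write logic depends on. */
  interface RowSink {
    void apply(String operation);

    void flush();
  }

  /** In-memory mock that a unit test uses in place of a real Kudu client. */
  static final class RecordingSink implements RowSink {
    final List<String> pending = new ArrayList<>();
    final List<String> flushed = new ArrayList<>();

    @Override
    public void apply(String operation) {
      pending.add(operation);
    }

    @Override
    public void flush() {
      flushed.addAll(pending);
      pending.clear();
    }
  }

  /** Code under test: applies every operation, then flushes once. */
  static void writeAll(List<String> operations, RowSink sink) {
    for (String operation : operations) {
      sink.apply(operation);
    }
    sink.flush();
  }

  public static void main(String[] args) {
    RecordingSink sink = new RecordingSink();
    writeAll(Arrays.asList("upsert id=1", "upsert id=2"), sink);
    System.out.println(sink.flushed); // prints [upsert id=1, upsert id=2]
  }
}
```

A mock like this checks the write logic without needing Kudu binaries. Behaviour that depends on a real server is left to the Docker-based integration tests.
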



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
