[ https://issues.apache.org/jira/browse/BEAM-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Robertson resolved BEAM-2661.
---------------------------------
    Resolution: Fixed
    Fix Version/s: 2.7.0

> Add KuduIO
> ----------
>
>                 Key: BEAM-2661
>                 URL: https://issues.apache.org/jira/browse/BEAM-2661
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-ideas
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Tim Robertson
>            Priority: Major
>             Fix For: 2.7.0
>
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> New IO for Apache Kudu ([https://kudu.apache.org/overview.html]).
> This work is in progress [on this branch|https://github.com/timrobertson100/beam/tree/BEAM-2661-KuduIO] with design aspects documented below.
> h2. The API
> The {{KuduIO}} API requires the user to provide a function to convert objects into operations. This is similar to {{JdbcIO}} but different from others, such as {{HBaseIO}}, which requires a pre-transform stage beforehand to convert into the mutations to apply. It was originally intended to copy the {{HBaseIO}} approach, but this was not possible:
> # The Kudu [Operation|https://kudu.apache.org/apidocs/org/apache/kudu/client/Operation.html] is a fat class, and is a subclass of {{KuduRpc<OperationResponse>}}. It holds RPC logic, callbacks and a Kudu client. Because of this the {{Operation}} does not serialize and, furthermore, the logic for encoding the operations (Insert, Upsert etc.) in the Kudu Java API is one way only (no decode) because the server is written in C++.
> # An alternative could be to introduce a new object to Beam (e.g. {{o.a.b.sdk.io.kudu.KuduOperation}}) to enable {{PCollection<KuduOperation>}}. This was considered but was discounted because:
> ## It is not a familiar API to those already knowing Kudu
> ## It still requires serialization and deserialization of the operations.
> Using the existing Kudu approach of serializing into compact byte arrays would require a decoder along the lines of [this almost complete example|https://gist.github.com/timrobertson100/df77d1337ba8f5609319751ee7c6e01e]. This is possible but fragile, given that the Kudu code itself continues to evolve.
> ## It becomes a trivial codebase in Beam to maintain by deferring the object-to-mutation mapping to within the {{KuduIO}} transform. {{JdbcIO}} gives us the precedent to do this.
> h2. Testing framework
> {{Kudu}} is written in C++. While a [TestMiniKuduCluster|https://github.com/cloudera/kudu/blob/master/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKuduCluster.java] does exist in Java, it requires binaries to be available for the target environment, which is not portable (edit: this is now a [work in progress|https://issues.apache.org/jira/browse/KUDU-2411] in Kudu). Therefore we opt for the following:
> # Unit tests will use a mock Kudu client
> # Integration tests will cover the full aspects of the {{KuduIO}} and use a Docker based Kudu instance

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
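
The design choice described under "The API" (ship a serializable user function to the workers and build the non-serializable operations only inside the write transform) can be illustrated with a minimal, self-contained sketch. This is not the KuduIO implementation: {{Mutation}}, {{FormatFn}}, {{roundTrip}}, {{write}} and {{upsertByName}} are hypothetical names invented for this example, and it uses plain Java serialization rather than Beam or the Kudu client.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KuduWriteSketch {

  /** Hypothetical stand-in for Kudu's Operation; deliberately NOT Serializable. */
  static final class Mutation {
    final String description;

    Mutation(String description) {
      this.description = description;
    }
  }

  /** Hypothetical user-supplied mapping; only this function has to be serializable. */
  interface FormatFn<T> extends Serializable {
    Mutation apply(T element);
  }

  /** Round-trips the function through Java serialization, much as a runner might when shipping it. */
  @SuppressWarnings("unchecked")
  static <T> FormatFn<T> roundTrip(FormatFn<T> fn) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(fn);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (FormatFn<T>) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("format function is not serializable", e);
    }
  }

  /** Stand-in for the write transform: mutations are created and consumed here, never serialized. */
  static <T> List<String> write(List<T> elements, FormatFn<T> fn) {
    FormatFn<T> shipped = roundTrip(fn);
    List<String> applied = new ArrayList<>();
    for (T element : elements) {
      Mutation mutation = shipped.apply(element);
      // A real sink would hand the operation to a Kudu session at this point.
      applied.add(mutation.description);
    }
    return applied;
  }

  /** Example mapping from a user object (here just a String) to a mutation. */
  static FormatFn<String> upsertByName() {
    return name -> new Mutation("UPSERT name=" + name);
  }

  public static void main(String[] args) {
    System.out.println(write(Arrays.asList("alice", "bob"), upsertByName()));
  }
}
```

Only the mapping function crosses the serialization boundary in this sketch, so the element type of the collection stays whatever the user already has and no decoder for operations is needed.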