Hi,
We are proposing to develop a connector for AWS Aurora. Aurora, being a clustered
relational (MySQL-compatible) database, has no Java API for reading/writing other
than a JDBC client. Although there is a JdbcIO available, it does not appear to
read in parallel. The proposal is to provide split functionality and then use a
transform to parallelize the operation. As mentioned above, this is a typical
SQL-based database and not comparable with the likes of Hive. The Hive
implementation is based on an abstraction over Hadoop's HDFS file system, which
provides splits. Here, neither of these is applicable.
During implementation of the Hive connector there was a lot of discussion about
how to implement a connector while strictly following Beam design principles
using a BoundedSource. I am not sure how the Aurora connector will fit into
these design principles.
Here is our proposal.
1. Split functionality: if the table contains 'x' rows, it will be split
into 'n' bundles in the split method. This would be computed as follows:
noOfSplits = 'x' * size of a single row / bundleSize hint from the runner.
2. Each of these 'pseudo' splits would then be read in parallel.
3. Each of these reads will use a DB connection from a connection pool.
This should also allow for better benchmarking. Please let us know your views.
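To make step 1 concrete, here is a minimal sketch of the split computation, as a plain Java method rather than a full BoundedSource. The class and method names (AuroraSplitSketch, split, RowRange) and the example figures (1,000,000 rows, ~200-byte rows, 64 MB bundle hint) are illustrative assumptions, not part of the proposal:

```java
import java.util.ArrayList;
import java.util.List;

public class AuroraSplitSketch {

    /** Half-open row range [startRow, endRow) assigned to one bundle. */
    public static final class RowRange {
        public final long startRow;
        public final long endRow;
        public RowRange(long startRow, long endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /**
     * Computes pseudo-splits over a table of totalRows rows using the
     * formula from the proposal:
     *   noOfSplits = totalRows * avgRowSizeBytes / desiredBundleSizeBytes
     * (clamped to at least 1 split).
     */
    public static List<RowRange> split(long totalRows, long avgRowSizeBytes,
                                       long desiredBundleSizeBytes) {
        long noOfSplits = Math.max(1,
                totalRows * avgRowSizeBytes / desiredBundleSizeBytes);
        // Ceiling division so the last range picks up any remainder rows.
        long rowsPerSplit = (totalRows + noOfSplits - 1) / noOfSplits;
        List<RowRange> ranges = new ArrayList<>();
        for (long start = 0; start < totalRows; start += rowsPerSplit) {
            ranges.add(new RowRange(start,
                    Math.min(start + rowsPerSplit, totalRows)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. 1,000,000 rows of ~200 bytes with a 64 MB bundle-size hint
        List<RowRange> ranges = split(1_000_000L, 200L, 64L * 1024 * 1024);
        for (RowRange r : ranges) {
            System.out.println(r.startRow + ".." + r.endRow);
        }
    }
}
```

Each RowRange could then be read on a pooled connection with a paginated query (e.g. LIMIT/OFFSET over a stable ORDER BY, or a keyset range on the primary key); without a deterministic ordering the pseudo-splits would overlap or miss rows.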

Thanks
Madhu Borkar
