Hi,

We are proposing to develop a connector for AWS Aurora. Aurora, being a clustered relational database (MySQL-compatible), has no Java API for reading/writing other than the JDBC client. Although JdbcIO is available, it does not appear to read in parallel. The proposal is to provide split functionality and then use a transform to parallelize the operation.

As mentioned above, this is a typical SQL-based database and not comparable to the likes of Hive. The Hive implementation is built on an abstraction over Hadoop's HDFS file system, which provides splits; neither of these is applicable here. During the implementation of the Hive connector there was a lot of discussion about how to implement a connector while strictly following Beam design principles using BoundedSource. I am not sure how an Aurora connector will fit into these design principles.

Here is our proposal:
1. Split functionality: if the table contains 'x' rows, it will be split into 'n' bundles in the split method, computed as follows: noOfSplits = 'x' * (size of a single row) / (bundle size hint from the runner).
2. Each of these 'pseudo' splits would then be read in parallel.
3. Each of these reads will use a DB connection from a connection pool. This will allow better benchmarking.

Please let us know your views.
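To make the split arithmetic concrete, here is a minimal sketch of the idea. The class name `SplitPlanner`, both helper methods, and the LIMIT/OFFSET sub-query strategy are illustrative assumptions, not Beam API; an actual BoundedSource implementation would wire this logic into split() and its readers.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanner {
    // Assumed helper: number of bundles for 'totalRows', given an estimated
    // row size and the desired bundle size hint from the runner.
    static long numSplits(long totalRows, long rowSizeBytes, long desiredBundleSizeBytes) {
        long estimatedTotalBytes = totalRows * rowSizeBytes;
        return Math.max(1, estimatedTotalBytes / desiredBundleSizeBytes);
    }

    // Assumed helper: turns each 'pseudo' split into a LIMIT/OFFSET sub-query
    // that a reader can execute on its own pooled JDBC connection.
    static List<String> splitQueries(String table, long totalRows, long splits) {
        long rowsPerSplit = (totalRows + splits - 1) / splits; // ceiling division
        List<String> queries = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += rowsPerSplit) {
            queries.add("SELECT * FROM " + table
                + " LIMIT " + rowsPerSplit + " OFFSET " + offset);
        }
        return queries;
    }
}
```

For example, a table of 1,000 rows at ~100 bytes per row with a 10 KB bundle hint yields 10 splits of 100 rows each. Note that LIMIT/OFFSET splits are only stable against concurrent writes if the query has a deterministic ordering; splitting on a primary-key range is a common alternative.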
Thanks,
Madhu Borkar
