I'm trying to figure out the best approach to getting sharded data from PostgreSQL into Spark.
Our production PGSQL cluster has 12 shards with TiB of data on each shard. (I won't be accessing all of the data on a shard at once, but I don't think it's feasible to use Sqoop to copy tables whose data will go out of date rather quickly.) We are using RDS replication in AWS, so read-heavy queries against the sharded data are okay.

I see that JdbcRDD is really designed to take data in chunks from a single datasource, with partitioning spreading the chunks across the cluster. This is neat for a single JDBC datasource but inconvenient when the data is already sharded.

My current plan is to create a small text file with the shard names of our cluster and partition it across the Spark cluster. From there I will use custom code to process a SQL statement in the context of a JdbcRDD, generating 1..*n* Connections (and 1..*n* JdbcRDDs) with a partition size of 1, so each worker will handle 1..*n* shards [ideally 1]; those RDDs will then be unioned together [when more than one RDD] to get a shard's worth of data satisfying the SQL query onto the worker for further processing.

It seems like there should already be an established solution for this pattern, though, so I want to check whether I am going about this entirely the wrong way and should instead be using something else. (If it matters, we also have a Cassandra cluster in a second datacenter for executing analytic queries that I could use if necessary. Originally I was going to ETL the PGSQL data into that cluster, but that poses its own set of challenges.)
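For what it's worth, here is a minimal Scala sketch of the per-shard plan described above: one JdbcRDD per shard with `numPartitions = 1`, unioned into a single RDD. The shard URLs, table/query, credentials, and row mapping are all placeholders, not our real schema, and this would need a Spark cluster (and the PostgreSQL JDBC driver on the classpath) to actually run:

```scala
import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{JdbcRDD, RDD}

object ShardedQuerySketch {
  def run(sc: SparkContext): RDD[(Long, String)] = {
    // One JDBC URL per shard; in practice these would come from the
    // small shard-list text file mentioned above. Hostnames are made up.
    val shardUrls = Seq(
      "jdbc:postgresql://shard01.example.com/app",
      "jdbc:postgresql://shard02.example.com/app"
      // ... shards 3..12
    )

    // JdbcRDD requires a query with exactly two '?' placeholders for the
    // partition bounds. With numPartitions = 1 the bounds cover everything,
    // so each shard's result set stays on a single worker.
    val sql = "SELECT id, payload FROM events WHERE id >= ? AND id <= ?"

    val perShard: Seq[RDD[(Long, String)]] = shardUrls.map { url =>
      new JdbcRDD(
        sc,
        () => DriverManager.getConnection(url, "user", "pass"),
        sql,
        lowerBound = 0L,
        upperBound = Long.MaxValue,
        numPartitions = 1,
        mapRow = rs => (rs.getLong(1), rs.getString(2))
      )
    }

    // Union the per-shard RDDs into one logical dataset for further processing.
    sc.union(perShard)
  }
}
```

One caveat with this shape: the connection function and row mapper are serialized to the workers, so credentials end up in the closure; pulling them from worker-side configuration may be preferable.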