I'm trying to figure out the best approach to getting sharded data from
PostgreSQL into Spark.

Our production PGSQL cluster has 12 shards with TiB of data on each shard.
(I won't be accessing all of the data on a shard at once, but I don't think
it's feasible to use Sqoop to copy tables whose data will go stale
rather quickly.) We are using RDS replication in AWS, so read-heavy queries
against the sharded data are okay.

I see that the JdbcRDD is really designed to take data in chunks from a
single datasource, where partitioning will spread the chunks across the
cluster. This is neat for a single JDBC datasource but inconvenient when
the data is already sharded.
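
For reference, my mental model of the single-source case looks roughly
like this (host, table, credentials, and key range are all made up, and I'm
assuming a SparkContext named sc; the two '?' placeholders get filled in by
Spark with per-partition bounds):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Single datasource: Spark splits [lowerBound, upperBound] into
// numPartitions ranges and substitutes each range for the two '?'s.
val events = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:postgresql://db.example.com/app", "user", "pass"),
  "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
  1L,        // lowerBound on the partitioning key
  10000000L, // upperBound on the partitioning key
  20,        // numPartitions: chunks spread across the cluster
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))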

My current plan is to create a small text file with the shard names of our
cluster and partition it across the Spark cluster. From there I will use
custom code to process a SQL statement in the context of a JdbcRDD,
generating 1..n Connections (and 1..n JdbcRDDs) with a partition size
of 1, so each worker will handle 1..n shards [ideally 1]; those RDDs will
then be unioned together [when there is more than one RDD] to get a shard's
worth of data satisfying the SQL query onto the worker for further
processing.
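
For concreteness, here is a rough sketch of what I have in mind (shard
hostnames, credentials, and the query are placeholders, and again I'm
assuming a SparkContext named sc):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.{JdbcRDD, RDD}

// Placeholder shard endpoints; in practice these would come from the
// small text file of shard names described above, read on the driver.
val shardUrls = Seq(
  "jdbc:postgresql://shard01.example.com/app",
  "jdbc:postgresql://shard02.example.com/app"
  // ... one entry per shard
)

val query = "SELECT id, payload FROM events WHERE id >= ? AND id <= ?"

// One single-partition JdbcRDD per shard, so each shard's rows land on
// a single worker.
val perShard: Seq[RDD[(Long, String)]] = shardUrls.map { url =>
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(url, "user", "pass"),
    query,
    1L, Long.MaxValue, // with one partition the bounds pass through
                       // unchanged, so use a range covering all rows
    1,                 // numPartitions = 1 => one connection per shard
    (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))
}

// Union the per-shard RDDs into one logical dataset; the union keeps one
// partition per shard, so with >= 12 workers each shard should map to
// roughly one worker.
val all: RDD[(Long, String)] = sc.union(perShard)

One caveat: this builds the per-shard RDDs on the driver rather than from a
partitioned text file, since RDDs can't be created inside worker tasks; the
file of shard names would just be read driver-side to produce the URL list.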

It seems like there should already be an established solution for this
pattern, though, so I want to see whether I am going about this entirely
the wrong way and should be using something else instead.

(If it matters, we also have a Cassandra cluster in a second datacenter for
executing analytic queries against, which I could use if necessary.
Originally I was going to ETL the PGSQL data into that cluster, but that
poses its own set of challenges.)
