I'm facing a similar problem except my data is already pre-sharded in PostgreSQL.
I'm going to attempt to solve it like this: - Submit the shard names (database names) across the Spark cluster as a text file and partition it so workers get 0 or more - hopefully 1 - shard name. In this case you could partition ranges - if your primary key is a datetime, then a start/end datetime pair; or if its a long then a start/end long pair. (You may need to run a separate job to get your overall start/end pair and then calculate how many partitions you need from there.) - Write the job so that the worker loads data from its shard(s) and unions the RDDs together. In the case of pairs the concept is the same. Basically look at how the JdbcRDD constructor requires a start, end, and query (disregard numPartitions in this case since we're manually partitioning in the step above). Your query will be its initial filter conditions plus a between condition for the primary key and its pair. - Operate on the union RDDs with other transformations or filters. If everything works as planned then the data should be spread out across the cluster and no one node will be responsible for loading TiBs of data and then distributing it to its peers. That should help with your OOM problem. Of course this does not guarantee that the data is balanced across nodes. With a large amount of data it should balance well enough to get the job done though. (You may need to run several refinements against the complete dataset to figure out the appropriate start/end pair values to get an RDD that is partitioned and balanced across the workers. This is a task best performed using aggregate query logic or stored procedures. With my shard problem I don't have this option available.) Unless someone has a better idea, in which case I'd love to hear it. On Sun Jan 25 2015 at 4:19:38 AM Denis Mikhalkin <deni...@yahoo.com.invalid> wrote: > Hi Nicholas, > > thanks for your reply. I checked spark-redshift - it's just for the unload > data files stored on hadoop, not for online result sets from DB. > > Do you know of any example of a custom RDD which fetches the data on the > fly (not reading from HDFS)? > > Thanks. > > Denis > > ------------------------------ > *From:* Nicholas Chammas <nicholas.cham...@gmail.com> > *To:* Denis Mikhalkin <deni...@yahoo.com>; "user@spark.apache.org" < > user@spark.apache.org> > *Sent:* Sunday, 25 January 2015, 3:06 > *Subject:* Re: Analyzing data from non-standard data sources (e.g. AWS > Redshift) > > I believe databricks provides an rdd interface to redshift. Did you check > spark-packages.org? > On 2015년 1월 24일 (토) at 오전 6:45 Denis Mikhalkin <deni...@yahoo.com.invalid> > wrote: > > Hello, > > we've got some analytics data in AWS Redshift. The data is being > constantly updated. > > I'd like to be able to write a query against Redshift which would return a > subset of data, and then run a Spark job (Pyspark) to do some analysis. > > I could not find an RDD which would let me do it OOB (Python), so I tried > writing my own. For example, tried combination of a generator (via yield) > with parallelize. It appears though that "parallelize" reads all the data > first into memory as I get either OOM or Python swaps as soon as I increase > the number of rows beyond trivial limits. > > I've also looked at Java RDDs (there is an example of MySQL RDD) but it > seems that it also reads all the data into memory. > > So my question is - how to correctly feed Spark with huge datasets which > don't initially reside in HDFS/S3 (ideally for Pyspark, but would > appreciate any tips)? > > Thanks. > > Denis > > > > >