I'm facing a similar problem except my data is already pre-sharded in
PostgreSQL.

I'm going to attempt to solve it like this:

- Distribute the shard names (database names) across the Spark cluster as a
text file and partition it so each worker gets zero or more shard names -
hopefully exactly one. In your case you could partition ranges instead - if
your primary key is a datetime, then a start/end datetime pair; or if it's a
long, then a start/end long pair. (You may need to run a separate job to get
your overall start/end pair and then calculate how many partitions you need
from there.)

- Write the job so that each worker loads data from its shard(s) and the
resulting RDDs are unioned together. With start/end pairs the concept is the
same: look at how the JdbcRDD constructor takes a query plus a start and end
value (disregard numPartitions here, since we're partitioning manually in the
step above). Your query is your initial filter conditions plus a BETWEEN-style
condition on the primary key using its start/end pair (see the sketch after
this list).

- Operate on the unioned RDD with further transformations or filters.
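
Here's a rough sketch in Scala of the range-pair variant: one single-partition
JdbcRDD per start/end pair, then a union. The table "events", the columns
"id"/"payload"/"active", the connection URL, and the example ranges are all
placeholders - substitute your own schema and the pairs you computed:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    object RangePartitionedLoad {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("range-partitioned-load"))

        // Pre-computed start/end pairs, one per desired partition (see the
        // note further down on deriving these with an aggregate query).
        val ranges: Seq[(Long, Long)] = Seq((0L, 999999L), (1000000L, 1999999L))

        // One single-partition JdbcRDD per range; whichever worker gets a
        // range pulls only its slice over JDBC. The query keeps your initial
        // filter conditions plus the primary-key bounds as '?' placeholders.
        val perRange = ranges.map { case (lo, hi) =>
          new JdbcRDD(
            sc,
            () => DriverManager.getConnection(
              "jdbc:postgresql://dbhost/mydb", "user", "secret"),
            "SELECT id, payload FROM events WHERE active = true AND id >= ? AND id <= ?",
            lo, hi,
            1, // numPartitions = 1: we're partitioning manually via `ranges`
            (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))
        }

        // Union the per-range RDDs so the load is spread across the cluster.
        val all = sc.union(perRange)

        // Operate on the unioned RDD with further transformations or filters.
        println(all.count())
        sc.stop()
      }
    }

Because each JdbcRDD has exactly one partition, the range pairs themselves
decide how the load is spread, which is the whole point of the exercise.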

If everything works as planned, the data should be spread out across the
cluster and no single node will be responsible for loading TiBs of data and
then distributing it to its peers. That should help with your OOM problem.

Of course, this does not guarantee that the data is balanced across nodes.
With a large amount of data, though, it should balance well enough to get the
job done.

(You may need to run several refinements against the complete dataset to
figure out the appropriate start/end pair values and end up with an RDD that
is partitioned and balanced across the workers. This is a task best performed
with aggregate query logic or stored procedures; with my shard problem I don't
have that option available.)
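
For the non-sharded case, here's a minimal sketch of deriving evenly spaced
start/end pairs from a single min/max aggregate query (same placeholder table
and key column as above). Evenly spaced bounds only balance well if the key is
roughly uniform; with a skewed key you'd refine these further:

    import java.sql.DriverManager

    // Split [min(id), max(id)] into `numPartitions` contiguous start/end
    // pairs. Assumes the hypothetical "events" table and "id" key above.
    def keyRanges(url: String, user: String, pass: String,
                  numPartitions: Int): Seq[(Long, Long)] = {
      val conn = DriverManager.getConnection(url, user, pass)
      try {
        val rs = conn.createStatement()
          .executeQuery("SELECT min(id), max(id) FROM events")
        rs.next()
        val (lo, hi) = (rs.getLong(1), rs.getLong(2))
        val step = math.max(1L, (hi - lo + 1) / numPartitions)
        (0 until numPartitions).map { i =>
          val start = lo + i * step
          val end   = if (i == numPartitions - 1) hi else start + step - 1
          (start, end)
        }
      } finally conn.close()
    }

Feeding the result in as the `ranges` value in the earlier sketch gives one
JdbcRDD per pair.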

Unless someone has a better idea, in which case I'd love to hear it.


On Sun Jan 25 2015 at 4:19:38 AM Denis Mikhalkin <deni...@yahoo.com.invalid>
wrote:

> Hi Nicholas,
>
> thanks for your reply. I checked spark-redshift - it's just for the unloaded
> data files stored on Hadoop, not for online result sets from the DB.
>
> Do you know of any example of a custom RDD which fetches the data on the
> fly (not reading from HDFS)?
>
> Thanks.
>
> Denis
>
>   ------------------------------
>  *From:* Nicholas Chammas <nicholas.cham...@gmail.com>
> *To:* Denis Mikhalkin <deni...@yahoo.com>; "user@spark.apache.org" <
> user@spark.apache.org>
> *Sent:* Sunday, 25 January 2015, 3:06
> *Subject:* Re: Analyzing data from non-standard data sources (e.g. AWS
> Redshift)
>
> I believe Databricks provides an RDD interface to Redshift. Did you check
> spark-packages.org?
> On Sat, 24 Jan 2015 at 6:45 AM Denis Mikhalkin <deni...@yahoo.com.invalid>
> wrote:
>
> Hello,
>
> we've got some analytics data in AWS Redshift. The data is being
> constantly updated.
>
> I'd like to be able to write a query against Redshift which would return a
> subset of data, and then run a Spark job (Pyspark) to do some analysis.
>
> I could not find an RDD which would let me do it OOB (Python), so I tried
> writing my own. For example, I tried a combination of a generator (via yield)
> with parallelize. It appears, though, that "parallelize" reads all the data
> into memory first, as I get either OOM or Python swapping as soon as I
> increase the number of rows beyond trivial limits.
>
> I've also looked at Java RDDs (there is an example of MySQL RDD) but it
> seems that it also reads all the data into memory.
>
> So my question is - how to correctly feed Spark with huge datasets which
> don't initially reside in HDFS/S3 (ideally for Pyspark, but would
> appreciate any tips)?
>
> Thanks.
>
> Denis
>
