My first problem was somewhat similar to yours. You won't find a whole
lot of JDBC-to-Spark examples, since I think a lot of Spark adoption
comes from teams that are already experienced with Hadoop and already
have an established big data solution (so their data has already been
extracted from whatever sources, e.g., log files, Hive, other M/R
jobs). JDBC support is somewhat... lacking.

Our application uses a 12-node distributed PostgreSQL RDBMS that is
sharded at the application tier. I had to write my own JDBC RDD to
support that logical schema. However, because you are coming from a
single MySQL DB, you should be able to get away with using the
JdbcRDD [1]... but I cannot find a reference to it in the Python API,
so someone familiar with using Python and Spark will have to chime in
on that.
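
For reference, here's roughly what JdbcRDD usage looks like from
Scala (just a sketch: the table, columns, bounds, and connection
details below are made up, so adjust them for your schema, and you'll
need the MySQL Connector/J jar on the executor classpath):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // Hypothetical connection details -- substitute your own.
    val url = "jdbc:mysql://dbhost:3306/yourdb"
    val (user, password) = ("spark", "secret")

    // `sc` is the SparkContext (e.g., from spark-shell).
    val peopleRdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, user, password),
      // The two '?' placeholders are required; JdbcRDD fills them in
      // with each partition's bounds so every worker reads only its
      // own slice of the table.
      "SELECT person_id, first_name, last_name FROM person " +
        "WHERE person_id >= ? AND person_id <= ?",
      lowerBound = 1L,        // smallest person_id (assumed)
      upperBound = 1000000L,  // largest person_id (assumed)
      numPartitions = 20,
      mapRow = (rs: ResultSet) => (rs.getLong("person_id"),
        rs.getString("first_name"), rs.getString("last_name"))
    )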

You need to consider _how_ the data gets from MySQL to the workers. It
might work to pull all of the data to a single node and then
parallelize that data across the cluster, but it's not going to be as
efficient as range querying the database from each worker in the
cluster. If you're working with TBs of data then you will see very big
benefits from distributing the data across workers from the get-go; if
you don't, every execution will pay the startup cost of copying all
the data to a single worker and then distributing it. (By range
querying I mean basically what the JdbcRDD does - it forces you to
include a conditional statement like "id > ? AND id <= ?" in your SQL,
which it fills in on each worker so each worker only gets its own
piece of the pie.) Note that the JdbcRDD assumes a numeric key for its
range querying.
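
To make the contrast concrete, the pull-everything-to-one-node
approach looks something like this (again a sketch with made-up
names; every row funnels through the driver before any worker sees
it):

    import java.sql.DriverManager
    import scala.collection.mutable.ArrayBuffer

    // All rows are read through one JVM (the driver) and only then
    // shipped out to the cluster.
    val conn = DriverManager.getConnection(url, user, password)
    val rows = ArrayBuffer[(Long, String)]()
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT person_id, last_name FROM person")
      while (rs.next())
        rows += ((rs.getLong("person_id"), rs.getString("last_name")))
    } finally {
      conn.close()
    }
    val peopleRdd = sc.parallelize(rows)  // fine for GBs, painful for TBs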

The next thing to consider: if you're going against your production
database, will massive reads cause degradation for production users? I
am using read replicas to mitigate this for our production
installation, as copying TBs of data out of PostgreSQL would have some
negative effect on our users. Running your jobs during low-traffic
periods is obviously an option here, as is restoring a read-only copy
from backup and explicitly querying that instance (in which case
parallelizing user IDs and querying MySQL directly might get you close
to the JdbcRDD behavior). And of course, if the MySQL instance is
already your analytics solution, then query on.
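
That "parallelize the IDs and query directly" approach would look
roughly like this in Scala (again only a sketch -- the table and
column names are invented, it assumes the ID list fits comfortably on
the driver, and that every ID actually exists):

    import java.sql.DriverManager

    val url = "jdbc:mysql://dbhost:3306/yourdb"  // same made-up DB as above
    val (user, password) = ("spark", "secret")

    // Placeholder: really the person IDs you pulled from MySQL.
    val ids: Seq[Long] = Seq(1L, 2L, 3L)

    val people = sc.parallelize(ids, numSlices = 20).mapPartitions { idIter =>
      // One connection and prepared statement per partition, not per ID.
      val conn = DriverManager.getConnection(url, user, password)
      val stmt = conn.prepareStatement(
        "SELECT person_id, first_name, last_name FROM person " +
          "WHERE person_id = ?")
      val rows = idIter.map { id =>
        stmt.setLong(1, id)
        val rs = stmt.executeQuery()
        rs.next()
        (rs.getLong("person_id"), rs.getString("first_name"),
          rs.getString("last_name"))
      }.toList  // materialize before closing the connection
      conn.close()
      rows.iterator
    }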

[1] https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/rdd/JdbcRDD.html

On Mon Feb 16 2015 at 4:42:30 PM Eric Bell <e...@ericjbell.com> wrote:

> Thanks Charles. I just realized a few minutes ago that I neglected to
> show the step where I generated the key on the person ID. Thanks for the
> pointer on the HDFS URL.
>
> Next step is to process data from multiple RDDs. My data originates from
> 7 tables in a MySQL database. I used Sqoop to create Avro files from
> these tables, and in turn created RDDs using Spark SQL from the Avro
> files. Since groupByKey only operates on a single RDD, I'm not quite
> sure yet how I'm going to process 7 tables as a transformation to get
> all the data I need into my objects.
>
> I'm vacillating on whether I should be doing it this way, or if it
> would be a lot simpler to query MySQL to get all the Person IDs,
> parallelize them, and have my Person class make queries directly to the
> MySQL database. Since in theory I only have to do this once, I'm not
> sure there's much to be gained in moving the data from MySQL to Spark
> first.
>
> I have yet to find any non-trivial examples of ETL logic on the web ...
> it seems like it's mostly word count map-reduce replacements.
>
> On 02/16/2015 01:32 PM, Charles Feduke wrote:
> > I cannot comment about the correctness of Python code. I will assume
> > your caper_kv is keyed on something that uniquely identifies all the
> > rows that make up the person's record so your group by key makes
> > sense, as does the map. (I will also assume all of the rows that
> > comprise a single person's record will always fit in memory. If not,
> > you will need another approach.)
> >
> > You should be able to get away with removing the "localhost:9000" from
> > your HDFS URL, i.e., "hdfs:///sma/processJSON/people" and let your
> > HDFS configuration for Spark supply the missing pieces.
> >
>
>
