I'm a Spark newbie working on my first attempt to write an ETL
program. I could use some feedback to make sure I'm on the right path.
I've written a basic proof of concept that runs without errors and seems
to work, although I might be missing some issues that will only show up
when this is actually run on more data.
I cannot comment on the correctness of the Python code. I will assume your
caper_kv is keyed on something that uniquely identifies all the rows that
make up the person's record, so your groupByKey makes sense, as does the
map. (I will also assume all of the rows that comprise a single person's
record fit comfortably in memory, since groupByKey collects every value
for a key onto a single executor.)
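Something like this is what I picture, for anyone following along; the
contents of caper_kv, the paths, and the helper names are guesses on my
part, not your actual code:

    from pyspark import SparkContext

    sc = SparkContext(appName="etl-poc")

    # Guessed shape: caper_kv is (person_id, row) pairs, with the person
    # ID assumed to be the first comma-separated field of each row.
    caper_kv = sc.textFile("hdfs:///data/caper").keyBy(
        lambda line: line.split(",")[0])

    # groupByKey gathers every row for one person onto one executor;
    # the map then folds them into a single output record.
    def build_record(rows):
        # Placeholder merge: order the rows and concatenate them.
        return "|".join(sorted(rows))

    records = caper_kv.groupByKey().mapValues(build_record)
    records.saveAsTextFile("hdfs:///data/caper_records")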
My first problem was somewhat similar to yours. You won't find a whole lot
of JDBC-to-Spark examples, since I think a lot of Spark adoption is from
teams already experienced with Hadoop who already have an established
big data solution (so their data has already been extracted from whatever
relational source it came from and is sitting in HDFS).
Thanks, Charles. I just realized a few minutes ago that I neglected to
show the step where I generated the key on the person ID. Thanks for the
pointer on the HDFS URL.
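Roughly, that omitted keying step is something like the following; the
fully qualified HDFS URL and the field position are placeholders rather
than the real code:

    # Read the extract via a fully qualified HDFS URL and key each row
    # on the person ID (assumed first comma-separated field).
    raw = sc.textFile("hdfs://namenode:8020/user/etl/caper")
    caper_kv = raw.keyBy(lambda line: line.split(",")[0])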
Next step is to process data from multiple RDDs. My data originates from
7 tables in a MySQL database. I used Sqoop to create an HDFS extract of
each table.
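My rough plan, with made-up paths and the simplifying assumption that
every extract is comma-separated with the person ID first:

    # Each Sqoop extract gets loaded and keyed the same way.
    def load_keyed(path):
        return sc.textFile(path).keyBy(lambda line: line.split(",")[0])

    persons = load_keyed("hdfs://namenode:8020/user/etl/persons")
    addresses = load_keyed("hdfs://namenode:8020/user/etl/addresses")
    # ... and so on for the remaining tables ...

    # join pairs up rows that share a person ID; with 7 tables, cogroup
    # (groupWith in PySpark) can combine several sources in one pass
    # instead of chaining six separate joins.
    combined = persons.join(addresses)

I haven't decided yet between chaining joins and cogrouping everything at
once; I suspect it depends on how sparse the side tables are.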