Spark newbie desires feedback on first program

2015-02-16 Thread Eric Bell
I'm a Spark newbie working on my first attempt to write an ETL program. I could use some feedback to make sure I'm on the right path. I've written a basic proof of concept that runs without errors and seems to work, although I might be missing some issues when this is actually run on more

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
I cannot comment about the correctness of Python code. I will assume your caper_kv is keyed on something that uniquely identifies all the rows that make up the person's record so your group by key makes sense, as does the map. (I will also assume all of the rows that comprise a single person's
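
For readers following along, here is a minimal plain-Python sketch of the pattern described above: key each row on a person ID, group the rows by that key, then map each group to one record. It does not use Spark, and it is not Eric's code. The row layout and the names key_by_person, group_by_key and build_record are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input rows: (person_id, field, value). Not from the original post.
rows = [
    (1, "name", "Ada"),
    (2, "name", "Grace"),
    (1, "city", "London"),
    (2, "city", "Arlington"),
]

def key_by_person(row):
    """Turn a row into a (key, value) pair keyed on the person ID."""
    person_id, field, value = row
    return person_id, (field, value)

def group_by_key(pairs):
    """Collect all values that share a key, similar in spirit to groupByKey."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def build_record(person_id, fields):
    """Merge one person's grouped (field, value) pairs into a single record."""
    record = {"person_id": person_id}
    record.update(fields)
    return record

pairs = [key_by_person(r) for r in rows]
records = [build_record(k, v) for k, v in group_by_key(pairs).items()]
```

In PySpark the same shape would presumably be written with RDD.map and RDD.groupByKey, for example rdd.map(key_by_person).groupByKey().map(lambda kv: build_record(kv[0], kv[1])). The grouping only makes sense if the key really identifies every row belonging to one person, which is the assumption stated above.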

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
My first problem was somewhat similar to yours. You won't find a whole lot of JDBC to Spark examples since I think a lot of the adoption for Spark is from teams that are already experienced with Hadoop and already have an established big data solution (so their data is already extracted from whatever

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Eric Bell
Thanks Charles. I just realized a few minutes ago that I neglected to show the step where I generated the key on the person ID. Thanks for the pointer on the HDFS URL. Next step is to process data from multiple RDDs. My data originates from 7 tables in a MySQL database. I used Sqoop to create