I can't comment on the correctness of the Python code itself. I will assume
your caper_kv is keyed on something that uniquely identifies all the rows
that make up a person's record, so your groupByKey makes sense, as does the
map. (I will also assume all of the rows that comprise a single person's
record will always fit in memory; if not, you will need another approach.)
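If the per-person row sets turn out not to fit, one memory-friendlier pattern is to aggregate only the fields you actually need with aggregateByKey, rather than grouping whole rows. A rough sketch, assuming the same caper_kv RDD, CPT_1 attribute, and JSON shape as your code below:

```python
import json

# Each value is assumed to be a row object with a CPT_1 attribute,
# mirroring your Person.addRowData method.

def add_row(cpts, row):
    """Fold one row into the running list of CPT codes for a person."""
    cpt = row.CPT_1
    if cpt:
        cpts.append(cpt)
    return cpts

def merge_lists(a, b):
    """Combine partial per-partition CPT lists for the same person."""
    a.extend(b)
    return a

# On the cluster, this would replace groupByKey().mapValues(...):
#   peopleRDD = (caper_kv
#                .aggregateByKey([], add_row, merge_lists)
#                .mapValues(lambda cpts: json.dumps({'cpt': cpts})))
```

This shuffles only the accumulated CPT lists rather than every full row for a person, which keeps per-key memory bounded by the data you keep, not the data you read.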

You should be able to get away with removing the "localhost:9000" from your
HDFS URL, i.e., "hdfs:///sma/processJSON/people" and let your HDFS
configuration for Spark supply the missing pieces.
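For reference, the authority that gets filled in comes from fs.defaultFS in your Hadoop core-site.xml; a typical entry looks like this:

```xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
```

With that in place, "hdfs:///sma/processJSON/people" resolves against the configured namenode, and the same code runs unchanged if the namenode ever moves.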

On Mon Feb 16 2015 at 3:38:31 PM Eric Bell <e...@ericjbell.com> wrote:

> I'm a Spark newbie working on his first attempt to write an ETL
> program. I could use some feedback to make sure I'm on the right path.
> I've written a basic proof of concept that runs without errors and seems
> to work, although I might be missing some issues when this is actually
> run on more than a single node.
>
> I am working with data about people (actually healthcare patients). I
> have an RDD that contains multiple rows per person. My overall goal is
> to create a single Person object for each person in my data. In this
> example, I am serializing to JSON, mostly because this is what I know
> how to do at the moment.
>
> Other than general feedback, is my use of the groupByKey() and
> mapValues() methods appropriate?
>
> Thanks!
>
>
> import json
>
> class Person:
>     def __init__(self):
>         self.mydata = {}
>         self.cpts = []
>         self.mydata['cpt'] = self.cpts
>
>     def addRowData(self, dataRow):
>         # Get the CPT codes
>         cpt = dataRow.CPT_1
>         if cpt:
>             self.cpts.append(cpt)
>
>     def serializeToJSON(self):
>         return json.dumps(self.mydata)
>
> def makeAPerson(rows):
>     person = Person()
>     for row in rows:
>         person.addRowData(row)
>     return person.serializeToJSON()
>
> peopleRDD = caper_kv.groupByKey().mapValues(
>     lambda personDataRows: makeAPerson(personDataRows))
> peopleRDD.saveAsTextFile("hdfs://localhost:9000/sma/processJSON/people")
>
>