Katrina,

Not sure if this is what you had in mind, but here's some simple PySpark code that I recently wrote to deal with JSON files:
from pyspark import SparkContext, SparkConf
import json

def concatenate_paragraphs(sentence_array):
    return ' '.join(sentence_array).split(' ')

logFile = 'foo.json'
conf = SparkConf()
conf.setMaster("spark://cluster-master:7077") \
    .setAppName("example") \
    .set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)

logData = sc.textFile(logFile).cache()
num_lines = logData.count()
print 'Number of lines: %d' % num_lines

# Each line is a JSON object with the structure:
# {"key": ..., "paragraphs": [sentence1, sentence2, ...]}
tm = logData.map(json.loads) \
            .map(lambda d: (d['key'], len(concatenate_paragraphs(d['paragraphs']))))
tm = tm.reduceByKey(lambda a, b: a + b)
op = tm.collect()
for key, num_words in op:
    print 'key: %s, num_words: %d' % (key, num_words)

On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User List] <ml-node+s1001560n2752...@n3.nabble.com> wrote:

> I don't actually have any data. I'm writing a course that teaches students
> how to do this sort of thing and am interested in looking at a variety of
> real-life examples of people doing things like that. I'd love to see some
> working code implementing the "obvious work-around" you mention... do you
> have any to share? It's an approach that makes a lot of sense, and as I
> said, I'd love to not have to re-invent the wheel if someone else has
> already written that code. Thanks!
>
> Diana
>
> On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas <[hidden email]> wrote:
>
>> There was a previous discussion about this here:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>>
>> How big are the XML or JSON files you're looking to deal with?
>>
>> It may not be practical to deserialize the entire document at once.
>> In that case an obvious work-around would be to have some kind of
>> pre-processing step that separates XML nodes/JSON objects with newlines
>> so that you *can* analyze the data with Spark in a "line-oriented
>> format". Your preprocessor wouldn't have to parse/deserialize the
>> massive document; it would just have to track open/closed tags/braces
>> to know when to insert a newline.
>>
>> Then you'd just open the line-delimited result and deserialize the
>> individual objects/nodes with map().
>>
>> Nick
>>
>> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll <[hidden email]> wrote:
>>
>>> Has anyone got a working example of a Spark application that analyzes
>>> data in a non-line-oriented format, such as XML or JSON? I'd like to do
>>> this without re-inventing the wheel... anyone care to share? Thanks!
>>>
>>> Diana
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/example-of-non-line-oriented-input-data-tp2750p2754.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
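For what it's worth, the brace-tracking pre-processing step Nick describes in the quoted message could be sketched roughly like this. This is a hypothetical helper of my own (the names `split_objects`, `depth`, etc. are not from the thread), and it only handles JSON: it splits a character stream into one string per top-level object by counting braces, with a small in-string/escape flag so braces inside quoted strings don't confuse the count. It never parses the whole document.

```python
import json

def split_objects(stream):
    """Yield one top-level JSON object per string from a character stream,
    tracking brace depth and quoted strings instead of parsing the document."""
    depth = 0
    in_string = False
    escaped = False
    buf = []
    for ch in stream:
        if in_string:
            # Inside a string literal: braces and quotes are just data,
            # except a closing quote (and backslash escapes) end the string.
            buf.append(ch)
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            buf.append(ch)
        elif ch == '{':
            depth += 1
            buf.append(ch)
        elif ch == '}':
            depth -= 1
            buf.append(ch)
            if depth == 0:
                # A complete top-level object has closed; emit it.
                yield ''.join(buf)
                buf = []
        elif depth > 0:
            buf.append(ch)
        # Characters outside any object (whitespace, commas, [ ]) are dropped.

# Example: a JSON array that is NOT line-oriented
doc = '[{"key": "a", "paragraphs": ["x y"]},\n {"key": "b", "paragraphs": ["z"]}]'
lines = list(split_objects(doc))
records = [json.loads(s) for s in lines]
```

You'd write each yielded string out on its own line, and the result is something `sc.textFile()` can read one object per line, with `json.loads` applied in a `map()` exactly as in the code at the top of this message.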