Katrina,

Not sure if this is what you had in mind, but here's some simple PySpark
code I recently wrote to deal with JSON files.

from pyspark import SparkContext, SparkConf
from operator import add
import json

def concatenate_paragraphs(sentence_array):
    # Join the sentences into one string, then split it back into words.
    return ' '.join(sentence_array).split(' ')

logFile = 'foo.json'
conf = SparkConf()
conf.setMaster("spark://cluster-master:7077") \
    .setAppName("example") \
    .set("spark.executor.memory", "1g")
sc = SparkContext(conf=conf)
logData = sc.textFile(logFile).cache()
num_lines = logData.count()
print 'Number of lines: %d' % num_lines

# Each line is a JSON object of the form:
# {"key": ..., "paragraphs": [sentence1, sentence2, ...]}
# Map each line to (key, word count), then sum the counts per key.
tm = logData.map(lambda s: json.loads(s)) \
            .map(lambda d: (d['key'], len(concatenate_paragraphs(d['paragraphs']))))
tm = tm.reduceByKey(add)
op = tm.collect()
for key, num_words in op:
    print 'state: %s, num_words: %d' % (key, num_words)
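
For reference, each line of foo.json is expected to look roughly like the
following (the key and sentences here are made up, just to show the shape
the code assumes):

{"key": "CA", "paragraphs": ["This is sentence one.", "And here is sentence two."]}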
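
Diana, since you also asked for code implementing the "obvious work-around"
Nick describes below: here's a rough, untested sketch of a brace-tracking
pre-processor that turns one big JSON document (an array or stream of
objects) into one object per line. The script name and function are just
placeholders, and it assumes the top-level records are brace-delimited
objects; treat it as a starting point rather than a drop-in solution.

import sys

def split_json_objects(stream, out):
    # Emit one top-level JSON object per output line by tracking brace
    # depth. Braces inside string literals (and escaped quotes) are ignored.
    depth = 0
    in_string = False
    escaped = False
    buf = []
    for chunk in iter(lambda: stream.read(4096), ''):
        for ch in chunk:
            if in_string:
                buf.append(ch)
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
                buf.append(ch)
            elif ch == '{':
                depth += 1
                buf.append(ch)
            elif ch == '}':
                depth -= 1
                buf.append(ch)
                if depth == 0:
                    # Collapse internal newlines so each object really is one line.
                    out.write(''.join(buf).replace('\n', ' ').replace('\r', ' ') + '\n')
                    buf = []
            elif depth > 0:
                buf.append(ch)

if __name__ == '__main__':
    # Usage: python split_json.py < big.json > line_delimited.json
    split_json_objects(sys.stdin, sys.stdout)

You'd run that once over the big file, write out the line-delimited result,
and then point sc.textFile() at it exactly as with foo.json above.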





On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User
List] <ml-node+s1001560n2752...@n3.nabble.com> wrote:

> I don't actually have any data.  I'm writing a course that teaches
> students how to do this sort of thing and am interested in looking at a
> variety of real life examples of people doing things like that.  I'd love
> to see some working code implementing the "obvious work-around" you
> mention...do you have any to share?  It's an approach that makes a lot of
> sense, and as I said, I'd love to not have to re-invent the wheel if
> someone else has already written that code.  Thanks!
>
> Diana
>
>
> On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas <[hidden email]> wrote:
>
>> There was a previous discussion about this here:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>>
>> How big are the XML or JSON files you're looking to deal with?
>>
>> It may not be practical to deserialize the entire document at once. In
>> that case an obvious work-around would be to have some kind of
>> pre-processing step that separates XML nodes/JSON objects with newlines so
>> that you *can* analyze the data with Spark in a "line-oriented format".
>> Your preprocessor wouldn't have to parse/deserialize the massive document;
>> it would just have to track open/closed tags/braces to know when to insert
>> a newline.
>>
>> Then you'd just open the line-delimited result and deserialize the
>> individual objects/nodes with map().
>>
>> Nick
>>
>>
>> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll <[hidden email]> wrote:
>>
>>> Has anyone got a working example of a Spark application that analyzes
>>> data in a non-line-oriented format, such as XML or JSON?  I'd like to do
>>> this without re-inventing the wheel...anyone care to share?  Thanks!
>>>
>>> Diana
>>>
>>
>>
>
>