Thanks, Matei.  That makes sense.  I have a dataset of many, many smallish
XML files, so using mapPartitions that way seems like a good fit.  I'd
love to see a code example, though ... it's not as obvious to me how to do
that as it probably should be.
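
Just to check my understanding, is the idea something along these lines?
(Completely untested, and I'm guessing at the details; "record" is just a
placeholder tag and the path is made up:)

import xml.etree.ElementTree as ET

def parse_partition_as_xml(lines):
    # Reassemble this partition's lines (i.e. one small file, if each file
    # ends up in its own partition) into a single string and parse it as
    # one XML document.
    doc = "\n".join(lines)
    if not doc.strip():
        return
    root = ET.fromstring(doc)
    # Pull out whatever elements are actually needed here
    for elem in root.iter("record"):
        yield elem.text

records = sc.textFile("path/to/xml/files").mapPartitions(parse_partition_as_xml)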

Thanks,
Diana


On Mon, Mar 17, 2014 at 1:02 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Hi Diana,
>
> Non-text input formats are only supported in Java and Scala right now,
> where you can use sparkContext.hadoopFile or .hadoopDataset to load data
> with any InputFormat that Hadoop MapReduce supports. In Python, you
> unfortunately only have textFile, which gives you one record per line. For
> JSON, you'd have to fit the whole JSON object on one line as you said.
> Hopefully we'll also have some other forms of input soon.
>
> If your input is a collection of separate files (say many .xml files), you
> can also use mapPartitions on it to group together the lines because each
> input file will end up being a single dataset partition (or map task). This
> will let you concatenate the lines in each file and parse them as one XML
> object.
>
> Matei
>
> On Mar 17, 2014, at 9:52 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
> Thanks, Krakna, very helpful.  The way I read the code, it looks like you
> are assuming that each line in foo.json contains a complete JSON object?
>  (That is, that the data doesn't contain any records that are split across
> multiple lines.)  If so, is that because you know that to be true of your
> data?  Or did you do as Nicholas suggests and have some preprocessing on
> the text input to flatten the data in that way?
>
> Thanks,
> Diana
>
>
> On Mon, Mar 17, 2014 at 12:09 PM, Krakna H <shankark+...@gmail.com> wrote:
>
>> Diana,
>>
>> Not sure if this is what you had in mind, but here's some simple pyspark
>> code that I recently wrote to deal with JSON files.
>>
>> from pyspark import SparkContext, SparkConf
>> from operator import add
>> import json
>>
>>
>> def concatenate_paragraphs(sentence_array):
>>     # Join the sentences into one string, then split it into words
>>     return ' '.join(sentence_array).split(' ')
>>
>>
>> logFile = 'foo.json'
>> conf = (SparkConf()
>>         .setMaster("spark://cluster-master:7077")
>>         .setAppName("example")
>>         .set("spark.executor.memory", "1g"))
>>
>> sc = SparkContext(conf=conf)
>> logData = sc.textFile(logFile).cache()
>>
>> num_lines = logData.count()
>> print 'Number of lines: %d' % num_lines
>>
>> # Each line is a JSON object of the form:
>> # {"key": ..., "paragraphs": [sentence1, sentence2, ...]}
>> tm = logData.map(lambda s: (json.loads(s)['key'],
>>                             len(concatenate_paragraphs(json.loads(s)['paragraphs']))))
>>
>> # Sum the per-record word counts for each key
>> tm = tm.reduceByKey(add)
>>
>> op = tm.collect()
>> for key, num_words in op:
>>     print 'key: %s, num_words: %d' % (key, num_words)
>>
>> On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User
>> List] wrote:
>>
>>> I don't actually have any data.  I'm writing a course that teaches
>>> students how to do this sort of thing and am interested in looking at a
>>> variety of real life examples of people doing things like that.  I'd love
>>> to see some working code implementing the "obvious work-around" you
>>> mention...do you have any to share?  It's an approach that makes a lot of
>>> sense, and as I said, I'd love to not have to re-invent the wheel if
>>> someone else has already written that code.  Thanks!
>>>
>>> Diana
>>>
>>>
>>> On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas wrote:
>>>
>>>> There was a previous discussion about this here:
>>>>
>>>>
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>>>>
>>>> How big are the XML or JSON files you're looking to deal with?
>>>>
>>>> It may not be practical to deserialize the entire document at once. In
>>>> that case an obvious work-around would be to have some kind of
>>>> pre-processing step that separates XML nodes/JSON objects with newlines so
>>>> that you *can* analyze the data with Spark in a "line-oriented
>>>> format". Your preprocessor wouldn't have to parse/deserialize the massive
>>>> document; it would just have to track open/closed tags/braces to know when
>>>> to insert a newline.
>>>>
>>>> Then you'd just open the line-delimited result and deserialize the
>>>> individual objects/nodes with map().
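>>>>
>>>> A very rough, untested sketch of that kind of brace-tracking
>>>> preprocessor for JSON (it assumes the input is a stream or array of
>>>> top-level objects, and it skips braces that appear inside quoted
>>>> strings):
>>>>
>>>> def split_top_level_json_objects(in_path, out_path):
>>>>     depth = 0
>>>>     in_string = False
>>>>     escaped = False
>>>>     buf = []
>>>>     with open(in_path) as src, open(out_path, 'w') as dst:
>>>>         for ch in iter(lambda: src.read(1), ''):
>>>>             if in_string:
>>>>                 buf.append(ch)
>>>>                 if escaped:
>>>>                     escaped = False
>>>>                 elif ch == '\\':
>>>>                     escaped = True
>>>>                 elif ch == '"':
>>>>                     in_string = False
>>>>             elif ch == '"':
>>>>                 in_string = True
>>>>                 buf.append(ch)
>>>>             elif ch == '{':
>>>>                 depth += 1
>>>>                 buf.append(ch)
>>>>             elif ch == '}':
>>>>                 depth -= 1
>>>>                 buf.append(ch)
>>>>                 if depth == 0:
>>>>                     # A complete top-level object: emit it on its own line
>>>>                     dst.write(''.join(buf) + '\n')
>>>>                     buf = []
>>>>             elif depth > 0:
>>>>                 # Keep everything inside an object, but flatten raw
>>>>                 # newlines so each object stays on one output line
>>>>                 buf.append(' ' if ch == '\n' else ch)
>>>>
>>>> The output is then safe to read with textFile() and parse with
>>>> json.loads() inside a map().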
>>>>
>>>> Nick
>>>>
>>>>
>>>> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll wrote:
>>>>
>>>>> Has anyone got a working example of a Spark application that analyzes
>>>>> data in a non-line-oriented format, such as XML or JSON?  I'd like to do
>>>>> this without re-inventing the wheel...anyone care to share?  Thanks!
>>>>>
>>>>> Diana
>>>>>
>>>>
>>>>
>>>
>>>