Thanks, Matei. That makes sense. The dataset I have here is many smallish XML files, so using mapPartitions that way would work well. I'd love to see a code example, though ... it's not as obvious to me how to do that as it probably should be.
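Here's my rough guess at what you mean, so you can tell me where I'm off base. It's completely untested, and the app name and input path are made up; I'm assuming each small .xml file ends up as exactly one partition:

import xml.etree.ElementTree as ET
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("xmlExample")  # made-up app name
sc = SparkContext(conf=conf)

def parse_file(lines):
    # All the lines of one small input file should arrive in the same
    # partition, so re-join them and parse them as a single XML document.
    text = '\n'.join(lines)
    if text.strip():
        root = ET.fromstring(text)
        yield (root.tag, len(root))  # root tag and its child count

# 'mydata/*.xml' is a made-up path; each small file -> one partition.
records = sc.textFile('mydata/*.xml').mapPartitions(parse_file)
print records.collect()

Is that roughly what you had in mind?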
Thanks,
Diana

On Mon, Mar 17, 2014 at 1:02 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Hi Diana,
>
> Non-text input formats are only supported in Java and Scala right now,
> where you can use sparkContext.hadoopFile or .hadoopDataset to load data
> with any InputFormat that Hadoop MapReduce supports. In Python, you
> unfortunately only have textFile, which gives you one record per line.
> For JSON, you'd have to fit the whole JSON object on one line, as you
> said. Hopefully we'll also have some other forms of input soon.
>
> If your input is a collection of separate files (say, many .xml files),
> you can also use mapPartitions on it to group together the lines, because
> each input file will end up being a single dataset partition (or map
> task). This will let you concatenate the lines in each file and parse
> them as one XML object.
>
> Matei
>
> On Mar 17, 2014, at 9:52 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
> Thanks, Krakna, very helpful. The way I read the code, it looks like you
> are assuming that each line in foo.log contains a complete JSON object
> (that is, that the data doesn't contain any records that are split across
> multiple lines). If so, is that because you know that to be true of your
> data? Or did you do as Nicholas suggests and preprocess the text input to
> flatten the data that way?
>
> Thanks,
> Diana
>
> On Mon, Mar 17, 2014 at 12:09 PM, Krakna H <shankark+...@gmail.com> wrote:
>
>> Katrina,
>>
>> Not sure if this is what you had in mind, but here's some simple pyspark
>> code that I recently wrote to deal with JSON files.
>>
>> from pyspark import SparkContext, SparkConf
>> from operator import add
>> import json
>>
>> def concatenate_paragraphs(sentence_array):
>>     return ' '.join(sentence_array).split(' ')
>>
>> logFile = 'foo.json'
>> conf = SparkConf()
>> conf.setMaster("spark://cluster-master:7077") \
>>     .setAppName("example") \
>>     .set("spark.executor.memory", "1g")
>> sc = SparkContext(conf=conf)
>>
>> logData = sc.textFile(logFile).cache()
>> num_lines = logData.count()
>> print 'Number of lines: %d' % num_lines
>>
>> # Each line is a JSON object of the form:
>> # {"key": ..., "paragraphs": [sentence1, sentence2, ...]}
>> tm = logData.map(lambda s: (json.loads(s)['key'],
>>                             len(concatenate_paragraphs(json.loads(s)['paragraphs']))))
>> tm = tm.reduceByKey(add)
>>
>> op = tm.collect()
>> for key, num_words in op:
>>     print 'key: %s, num_words: %d' % (key, num_words)
>>
>> On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User
>> List] wrote:
>>
>>> I don't actually have any data. I'm writing a course that teaches
>>> students how to do this sort of thing, and I'm interested in looking at
>>> a variety of real-life examples. I'd love to see some working code
>>> implementing the "obvious work-around" you mention ... do you have any
>>> to share? It's an approach that makes a lot of sense, and as I said,
>>> I'd love not to have to re-invent the wheel if someone else has already
>>> written that code. Thanks!
>>>
>>> Diana
>>>
>>> On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas wrote:
>>>
>>>> There was a previous discussion about this here:
>>>>
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
>>>>
>>>> How big are the XML or JSON files you're looking to deal with?
>>>>
>>>> It may not be practical to deserialize the entire document at once. In
>>>> that case an obvious work-around would be to have some kind of
>>>> pre-processing step that separates XML nodes/JSON objects with newlines
>>>> so that you *can* analyze the data with Spark in a "line-oriented
>>>> format". Your preprocessor wouldn't have to parse/deserialize the
>>>> massive document; it would just have to track open/closed tags/braces
>>>> to know when to insert a newline.
>>>>
>>>> Then you'd just open the line-delimited result and deserialize the
>>>> individual objects/nodes with map().
>>>>
>>>> Nick
>>>>
>>>> On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll wrote:
>>>>
>>>>> Has anyone got a working example of a Spark application that analyzes
>>>>> data in a non-line-oriented format, such as XML or JSON? I'd like to
>>>>> do this without re-inventing the wheel ... anyone care to share?
>>>>> Thanks!
>>>>>
>>>>> Diana
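P.S. Nicholas: would something like this do for the brace-tracking pre-processor you describe above? It's a quick, untested sketch in plain Python (no Spark needed for this step); the file names are made up, and it assumes the input's top-level values are JSON objects:

def flatten_json(in_path, out_path):
    # Stream the input a character at a time, tracking brace depth so we
    # know where each top-level JSON object ends. Braces inside strings
    # (and escaped quotes) are ignored, and no actual JSON parsing happens.
    depth = 0
    in_string = False
    escaped = False
    buf = []
    with open(in_path) as fin, open(out_path, 'w') as fout:
        while True:
            ch = fin.read(1)
            if not ch:
                break
            if in_string:
                buf.append(ch)
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
                buf.append(ch)
            elif ch == '{':
                depth += 1
                buf.append(ch)
            elif ch == '}':
                depth -= 1
                buf.append(ch)
                if depth == 0:
                    # End of a top-level object: emit it on its own line.
                    fout.write(''.join(buf) + '\n')
                    buf = []
            elif depth > 0:
                # Inside an object: keep the character, but replace the
                # newlines that would otherwise split the record.
                buf.append(' ' if ch == '\n' else ch)

flatten_json('big.json', 'one-object-per-line.json')  # made-up file names

The same idea would presumably work for XML, except the pre-processor would track the document's open/close tags instead of braces. Then, as you say, the line-delimited output can be read with sc.textFile() and deserialized with map().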