Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia


Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll






Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll







Re: example of non-line oriented input data?

2014-03-19 Thread Jeremy Freeman


Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll




Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
Hi Diana,

Non-text input formats are only supported in Java and Scala right now, where 
you can use sparkContext.hadoopFile or .hadoopDataset to load data with any 
InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only 
have textFile, which gives you one record per line. For JSON, you’d have to fit 
the whole JSON object on one line as you said. Hopefully we’ll also have some 
other forms of input soon.

If your input is a collection of separate files (say many .xml files), you can 
also use mapPartitions on it to group together the lines because each input 
file will end up being a single dataset partition (or map task). This will let 
you concatenate the lines in each file and parse them as one XML object.

Matei
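
A minimal pyspark sketch of the mapPartitions approach Matei describes, assuming
each small input file lands in its own partition as he notes. The path, app name,
and function name below are illustrative, not from the thread:

import json
from pyspark import SparkContext

sc = SparkContext(appName='wholeFileJson')

def parse_whole_file(lines):
    # 'lines' iterates over every line in this partition, i.e. one whole
    # input file when each file is small enough to occupy a single partition.
    doc = '\n'.join(lines)
    if doc.strip():
        yield json.loads(doc)  # deserialize the whole file as one object

docs = sc.textFile('data/*.json').mapPartitions(parse_whole_file)
print 'Parsed %d documents' % docs.count()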

On Mar 17, 2014, at 9:52 AM, Diana Carroll dcarr...@cloudera.com wrote:

 Thanks, Krakna, very helpful.  The way I read the code, it looks like you are 
 assuming that each line in foo.log contains a complete JSON object?  (That 
 is, that the data doesn't contain any records that are split into multiple 
 lines.)  If so, is that because you know that to be true of your data?  Or 
 did you do as Nicholas suggests and have some preprocessing on the text input 
 to flatten the data in that way?
 
 Thanks,
 Diana
 
 
 On Mon, Mar 17, 2014 at 12:09 PM, Krakna H shankark+...@gmail.com wrote:
 Diana, 
 
 Not sure if this is what you had in mind, but here's some simple pyspark code 
 that I recently wrote to deal with JSON files.
 
from pyspark import SparkContext, SparkConf
from operator import add
import json

def concatenate_paragraphs(sentence_array):
    # Join the sentences, then split on spaces to get the individual words.
    return ' '.join(sentence_array).split(' ')

logFile = 'foo.json'
conf = SparkConf()
conf.setMaster('spark://cluster-master:7077') \
    .setAppName('example') \
    .set('spark.executor.memory', '1g')

sc = SparkContext(conf=conf)

logData = sc.textFile(logFile).cache()

num_lines = logData.count()
print 'Number of lines: %d' % num_lines

# Each line holds one JSON object of the form:
# {"key": "state_name", "paragraphs": ["sentence1", "sentence2", ...]}
tm = logData.map(lambda s: (json.loads(s)['key'],
                            len(concatenate_paragraphs(json.loads(s)['paragraphs']))))

# Sum the per-line word counts for each key.
tm = tm.reduceByKey(add)

op = tm.collect()
for key, num_words in op:
    print 'state: %s, num_words: %d' % (key, num_words)
 
 On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User List] 
 [hidden email] wrote:
 I don't actually have any data.  I'm writing a course that teaches students 
 how to do this sort of thing and am interested in looking at a variety of 
 real life examples of people doing things like that.  I'd love to see some 
 working code implementing the obvious work-around you mention...do you have 
 any to share?  It's an approach that makes a lot of sense, and as I said, I'd 
 love to not have to re-invent the wheel if someone else has already written 
 that code.  Thanks!
 
 Diana
 
 
 On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas [hidden email] wrote:
 There was a previous discussion about this here:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html
 
 How big are the XML or JSON files you're looking to deal with? 
 
 It may not be practical to deserialize the entire document at once. In that 
 case an obvious work-around would be to have some kind of pre-processing step 
 that separates XML nodes/JSON objects with newlines so that you can analyze 
 the data with Spark in a line-oriented format. Your preprocessor wouldn't 
 have to parse/deserialize the massive document; it would just have to track 
 open/closed tags/braces to know when to insert a newline.
 
 Then you'd just open the line-delimited result and deserialize the individual 
 objects/nodes with map().
 
 Nick
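
A rough sketch of the kind of preprocessor Nick describes, assuming the input is
a stream of concatenated top-level JSON objects. It never parses the document; it
only tracks brace depth (and string literals, so braces inside string values
don't throw off the count). The function name and paths are illustrative:

def split_json_objects(in_path, out_path):
    # Emit one top-level JSON object per output line by tracking brace depth.
    depth = 0
    in_string = False
    escaped = False
    buf = []
    with open(in_path) as src, open(out_path, 'w') as dst:
        for chunk in iter(lambda: src.read(8192), ''):
            for ch in chunk:
                if depth == 0:
                    if ch == '{':  # start of a new top-level object
                        depth = 1
                        buf.append(ch)
                    continue  # skip whitespace between objects
                buf.append(ch)
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == '\\':
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                elif ch == '{':
                    depth += 1
                elif ch == '}':
                    depth -= 1
                    if depth == 0:  # object complete: write it as one line
                        dst.write(''.join(buf) + '\n')
                        buf = []

The line-delimited output can then be loaded with sc.textFile() and deserialized
per line with json.loads(), as in the pyspark code quoted earlier in this message.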
 
 
 On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll [hidden email] wrote:
 Has anyone got a working example of a Spark application that analyzes data in 
 a non-line-oriented format, such as XML or JSON?  I'd like to do this without 
 re-inventing the wheel...anyone care to share?  Thanks!
 
 Diana
 
 
 
 
 



Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia



Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll




Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia