Re: example of non-line oriented input data?
Hi Diana,

Non-text input formats are only supported in Java and Scala right now, where you can use sparkContext.hadoopFile or .hadoopRDD to load data with any InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only have textFile, which gives you one record per line. For JSON, you’d have to fit the whole JSON object on one line, as you said. Hopefully we’ll also have some other forms of input soon.

If your input is a collection of separate files (say, many .xml files), you can also use mapPartitions on it to group the lines together, because each input file will end up being a single dataset partition (or map task). This will let you concatenate the lines in each file and parse them as one XML object.

Matei

On Mar 17, 2014, at 9:52 AM, Diana Carroll dcarr...@cloudera.com wrote:

Thanks, Krakna, very helpful. The way I read the code, it looks like you are assuming that each line in foo.json contains a complete JSON object? (That is, that the data doesn't contain any records that are split across multiple lines.) If so, is that because you know that to be true of your data? Or did you do as Nicholas suggests and preprocess the text input to flatten the data in that way?

Thanks,
Diana

On Mon, Mar 17, 2014 at 12:09 PM, Krakna H shankark+...@gmail.com wrote:

Diana,

Not sure if this is what you had in mind, but here's some simple pyspark code that I recently wrote to deal with JSON files.

    from operator import add
    import json

    from pyspark import SparkContext, SparkConf

    def concatenate_paragraphs(sentence_array):
        # Join all sentences, then split on spaces to get a flat list of words.
        return ' '.join(sentence_array).split(' ')

    def key_and_word_count(line):
        # Each input line must hold one complete JSON object.
        record = json.loads(line)
        return (record['key'], len(concatenate_paragraphs(record['paragraphs'])))

    logFile = 'foo.json'
    conf = SparkConf()
    conf.setMaster('spark://cluster-master:7077').setAppName('example').set('spark.executor.memory', '1g')
    sc = SparkContext(conf=conf)

    logData = sc.textFile(logFile).cache()
    num_lines = logData.count()
    print('Number of lines: %d' % num_lines)

    # JSON object has the structure: {'key': ..., 'paragraphs': [sentence1, sentence2, ...]}
    tm = logData.map(key_and_word_count)
    # Sum the word counts for each key.
    tm = tm.reduceByKey(add)
    op = tm.collect()
    for key, num_words in op:
        print('state: %s, num_words: %d' % (key, num_words))

On Mon, Mar 17, 2014 at 11:58 AM, Diana Carroll [via Apache Spark User List] [hidden email] wrote:

I don't actually have any data. I'm writing a course that teaches students how to do this sort of thing, and I'm interested in looking at a variety of real-life examples of people doing things like that.

I'd love to see some working code implementing the obvious work-around you mention...do you have any to share? It's an approach that makes a lot of sense, and as I said, I'd love to not have to re-invent the wheel if someone else has already written that code.

Thanks!
Diana

On Mon, Mar 17, 2014 at 11:35 AM, Nicholas Chammas [hidden email] wrote:

There was a previous discussion about this here:
http://apache-spark-user-list.1001560.n3.nabble.com/Having-Spark-read-a-JSON-file-td1963.html

How big are the XML or JSON files you're looking to deal with? It may not be practical to deserialize the entire document at once. In that case an obvious work-around would be to have some kind of pre-processing step that separates XML nodes/JSON objects with newlines so that you can analyze the data with Spark in a line-oriented format. Your preprocessor wouldn't have to parse/deserialize the massive document; it would just have to track open/closed tags/braces to know when to insert a newline.

Then you'd just open the line-delimited result and deserialize the individual objects/nodes with map().

Nick

On Mon, Mar 17, 2014 at 11:18 AM, Diana Carroll [hidden email] wrote:

Has anyone got a working example of a Spark application that analyzes data in a non-line-oriented format, such as XML or JSON? I'd like to do this without re-inventing the wheel...anyone care to share?

Thanks!
Diana

If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/example-of-non-line-oriented-input-data-tp2750p2752.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.
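For reference, Matei's mapPartitions suggestion might look roughly like the sketch below. It is an illustration, not code from the thread: parse_xml_partition, summarize_xml_files, and the 'data/*.xml' path are hypothetical names, and it assumes every file lands in exactly one partition, which holds only when a file is small enough not to be divided into several input splits.

```python
import xml.etree.ElementTree as ET


def parse_xml_partition(lines):
    """Join all lines of one partition and parse them as a single XML document.

    Assumption: the partition contains exactly one complete file.
    Yields a (root tag, number of direct children) tuple per document.
    """
    text = "\n".join(lines)
    if text.strip():  # an empty partition produces no record
        root = ET.fromstring(text)
        yield (root.tag, len(root))


def summarize_xml_files(sc, path):
    """Hypothetical driver helper.

    sc is an existing SparkContext; path is a glob such as 'data/*.xml'.
    """
    return sc.textFile(path).mapPartitions(parse_xml_partition).collect()
```

Calling summarize_xml_files(sc, 'data/*.xml') with an existing SparkContext would return one tuple per file. Yielding plain tuples rather than Element objects keeps the results simple to send back to the driver.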
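Nick's preprocessing idea can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not code from the thread: split_json_objects and flatten_file are hypothetical helpers, the input is assumed to be a sequence (or top-level array) of JSON objects, and the scanner also tracks string literals, since a brace inside a quoted value must not count toward the nesting depth.

```python
import io
import json


def split_json_objects(chars):
    """Yield each top-level JSON object in an iterable of characters as one-line text.

    The text is never deserialized; only brace depth and string state are tracked.
    """
    depth = 0
    in_string = False
    escaped = False
    buf = []
    for ch in chars:
        if in_string:
            # Inside a string literal braces are data, not structure.
            buf.append(ch)
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if depth == 0 and ch != '{':
            continue  # skip whitespace, commas and brackets between objects
        if ch in '\r\n':
            continue  # newlines outside strings are insignificant whitespace
        buf.append(ch)
        if ch == '"':
            in_string = True
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield ''.join(buf)
                buf = []


def flatten_file(src_path, dst_path, chunk_size=1 << 16):
    """Stream src_path in chunks and write one JSON object per line to dst_path."""
    with io.open(src_path, encoding='utf-8') as src, \
            io.open(dst_path, 'w', encoding='utf-8') as dst:
        chunks = iter(lambda: src.read(chunk_size), '')
        chars = (ch for chunk in chunks for ch in chunk)
        for obj in split_json_objects(chars):
            dst.write(obj + u'\n')
```

After flatten_file('big.json', 'big.jsonl') (both file names are placeholders), the output could be loaded with sc.textFile and each line passed to json.loads inside map(), as described in the thread.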