JSON handling works great, although you have to be a little careful about just what is loaded and used where. One approach that works is:
- Jackson Scala 2.3.1 (or your favorite JSON lib) shipped as a JAR for the job.
- Read data as RDD[String].
- Implement your per-line JSON binding in a method on an object, e.g., apply(...) on the companion object of a case class that models your line items. For the Jackson case, this means an ObjectMapper as a val in the companion object (you only need one ObjectMapper instance).
- .map(YourObject.apply) to get RDD[YourObject]

And there you go. Something similar works for writing out JSON.

Probably obvious if you're a seasoned Spark user, but DO NOT write your JSON serialization/deserialization as inline blocks, else you'll be transporting your ObjectMapper instances around the cluster when you don't need to (and depending on your specific configuration, it may not work). That is a facility that should (IMHO) be encapsulated with the pieces of the system that directly touch the data, i.e., on the worker.

— [email protected] | Multifarious, Inc. | http://mult.ifario.us/

On Sun, Feb 23, 2014 at 9:10 PM, nicholas.chammas <[email protected]> wrote:

> I'm new to this field, but it seems like most "Big Data" examples --
> Spark's included -- begin with reading in flat lines of text from a file.
>
> How would I go about having Spark turn a large JSON file into an RDD?
>
> So the file would just be a text file that looks like this:
>
> [{...}, {...}, ...]
>
> where the individual JSON objects are arbitrarily complex (i.e. not
> necessarily flat) and may or may not be on separate lines.
>
> Basically, I'm guessing Spark would need to parse the JSON since it cannot
> rely on newlines as a delimiter. That sounds like a costly thing.
>
> Is JSON a "bad" format to have to deal with, or can Spark efficiently
> ingest and work with data in this format? If it can, can I get a pointer as
> to how I would do that?
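A minimal sketch of the steps above, assuming Jackson with the Scala module on the job's classpath; the case class name (LineItem), its fields, and the input path are hypothetical:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical case class modeling one JSON line of your input.
case class LineItem(id: Long, name: String)

object LineItem {
  // Single ObjectMapper per JVM, initialized when the companion
  // object is first loaded on each worker -- it never travels
  // inside a task closure.
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  // Binds one line of JSON to a LineItem.
  def apply(json: String): LineItem =
    mapper.readValue(json, classOf[LineItem])
}

// In the driver program:
// val items = sc.textFile("hdfs://.../items.json").map(LineItem.apply)
// items: RDD[LineItem]
```

Because the mapping lives in a method on a companion object rather than an inline closure, Spark only serializes a reference to the object, and each worker lazily creates its own ObjectMapper.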
> Nick
