Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Ewan Higgs
FWIW, CSV has the same problem, which renders it immune to naive partitioning. Consider the following RFC 4180 compliant record: 1,2,"all,of,these,are,just,one,field",4,5. Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this…
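
Ewan's point can be illustrated with Python's standard csv module (the record below is a hypothetical variant of the one in the message, with line breaks added inside the quoted field): one RFC 4180 record may span several physical lines, so a naive line-based split would cut it apart.

```python
import csv
import io

# A single RFC 4180 record: the quoted third field contains commas
# AND line breaks, so the record spans three physical lines.
raw = '1,2,"all,of\nthese,are\njust,one,field",4,5\n'

# csv.reader, given a file-like object, correctly yields ONE row.
rows = list(csv.reader(io.StringIO(raw)))
# A splitter that cut the input at physical newlines would instead
# produce three broken fragments.
```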

Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Joe Halliwell
@reynold, I'll raise a JIRA today. @olivier, let's discuss on the ticket? I suspect the algorithm is going to be a bit fiddly and would definitely benefit from multiple heads. If possible, I think we should handle pathological cases like {":":":",{"{":"}"}} correctly, rather than bailing out.

Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Joe Halliwell
I've raised the JSON-related ticket at https://issues.apache.org/jira/browse/SPARK-7366. @Ewan I think it would be great to support multiline CSV records too. The motivation is very similar but my instinct is that little/nothing of the implementation could be usefully shared, so it's better as a…

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Emre Sevinc
You can check out the following library: https://github.com/alexholmes/json-mapreduce -- Emre Sevinç

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string rather than a…
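
Reynold's concern can be reproduced in a few lines of Python (the two-record input is hypothetical and purely illustrative): scanning forward from an arbitrary offset to the first { can land on a brace that sits inside a string literal, and parsing from there fails.

```python
import json

# Two newline-delimited JSON records; the first contains a '{' inside a string.
data = '{"note": "curly {braces} inside a string"}\n{"id": 2}'

split_point = 10                  # an arbitrary offset that falls inside the string
i = data.find("{", split_point)   # lands on the '{' of "{braces}" -- not a record start

try:
    json.loads(data[i:])          # '{braces} inside ...' is not valid JSON
    resumed = True
except ValueError:
    resumed = False
# resynchronizing on the first '{' has failed; the brace was part of a string
```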

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Joe Halliwell
I think Reynold’s argument shows the impossibility of the general case. But a “maximum object depth” hint could enable a new input format to do its job both efficiently and correctly in the common case where the input is an array of similarly structured objects! I’d certainly be interested…
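
One way such a hint might be used, sketched in Python (the function names and the acceptance rule are assumptions, not Joe's design): try each candidate {, and accept a parse only if it both succeeds and respects the hinted maximum depth. A complete record embedded verbatim inside a string could still fool this heuristic, which is part of why the algorithm is fiddly.

```python
import json

def nesting_depth(value):
    """Maximum nesting depth of a parsed JSON value (scalars have depth 0)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0

def resync(data, offset, max_depth_hint):
    """Heuristically find the next record boundary at or after `offset`:
    accept the first '{' from which a parse succeeds without exceeding
    the hinted maximum depth. Returns -1 if no candidate works."""
    decoder = json.JSONDecoder()
    i = data.find("{", offset)
    while i != -1:
        try:
            obj, _end = decoder.raw_decode(data, i)
            if nesting_depth(obj) <= max_depth_hint:
                return i
        except ValueError:
            pass  # candidate was inside a string or mid-record; keep scanning
        i = data.find("{", i + 1)
    return -1

# An array of similarly structured objects, depth hint 2; the scan starts
# inside the first record's string and recovers at the second record.
data = '[{"a": {"b": "see {x} here"}}, {"a": {"b": 2}}]'
start = resync(data, data.index("see"), 2)
```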

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
I was wondering if it's possible to use existing Hive SerDes for this?

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Paul Brown
It's not JSON per se, but data formats like Smile ( http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29 ) provide support for markers that can't be confused with content, and also provide reasonably similar ergonomics. — p...@mult.ifario.us | Multifarious, Inc.

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
Joe - I think that's a legit and useful thing to do. Do you want to give it a shot?

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Matei Zaharia
I don't know whether this is common, but we might also allow another separator for JSON objects, such as two blank lines. Matei
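
Matei's suggestion, sketched in plain Python (the concrete separator "\n\n\n", i.e. two blank lines, is an assumption): splitting then needs no JSON parsing at all, provided the writer guarantees the separator never occurs inside a record. Standard pretty-printers don't emit consecutive blank lines, though a string literal containing literal blank lines would still break this. In Spark, such a delimiter could plausibly be handled by Hadoop's textinputformat.record.delimiter setting rather than custom code.

```python
import json

# Two pretty-printed records separated by two blank lines ("\n\n\n").
blob = '{"a": 1,\n "b": [2, 3]}\n\n\n{"a": 4,\n "b": []}'

# Split on the separator, then parse each chunk independently --
# exactly what a record reader could do per input split.
records = [json.loads(chunk) for chunk in blob.split("\n\n\n") if chunk.strip()]
```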

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Olivier Girardot
@joe, I'd be glad to help if you need.

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
How does the pivotal format decide where to split the files? It seems to me the challenge is to decide that, and off the top of my head the only way to do this is to scan from the beginning and parse the JSON properly, which makes it not possible with large files (doable for whole input with a lot…
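
The "scan from the beginning" requirement can be sketched as follows (a hypothetical helper, not an existing Spark or Hadoop API): a single sequential, string-aware pass records the byte span of every top-level JSON value, and those offsets could then be handed to workers as split points. This sketch assumes a stream of top-level objects/arrays and valid input; it does not handle top-level scalars.

```python
import json

def index_records(data):
    """One sequential pass over the whole input, tracking string context and
    brace/bracket depth, recording (start, end) byte spans of top-level values.
    The pass must be sequential because string context (is this '{' inside a
    string literal?) cannot be recovered from an arbitrary offset."""
    spans = []
    depth, start = 0, None
    in_string = escaped = False
    for i, ch in enumerate(data):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                spans.append((start, i + 1))
    return spans

# A '}' inside a string must not be mistaken for a record boundary.
data = '{"a": "}"}\n{"b": 2}'
spans = index_records(data)
```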

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
I'll try to study that and get back to you. Regards, Olivier.

Multi-Line JSON in SparkSQL

2015-05-03 Thread Olivier Girardot
Hi everyone, Is there any way in Spark SQL to load multi-line JSON data efficiently? I think there was a reference on the mailing list to http://pivotal-field-engineering.github.io/pmr-common/ for its JSONInputFormat. But it's rather inaccessible considering the dependency is not available in any…