FWIW, CSV has the same problem, which makes it immune to naive partitioning.
Consider the following RFC 4180 compliant record:
1,2,"
all,of,these,are,just,one,field
",4,5
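A minimal sketch of the point, using Python's standard csv module and the record above with the RFC 4180 quoting written out explicitly: splitting on newlines produces three "lines", but a conforming parser sees one five-field record.

```python
import csv
import io

# The record above: a quoted field containing embedded newlines and commas,
# which RFC 4180 permits. Line boundaries are not record boundaries.
data = '1,2,"\nall,of,these,are,just,one,field\n",4,5\r\n'

# Naive partitioning by newline sees three "lines"...
naive_records = data.strip().split("\n")
print(len(naive_records))  # 3

# ...but a conforming CSV parser sees a single record with 5 fields.
real_records = list(csv.reader(io.StringIO(data)))
print(len(real_records))      # 1
print(len(real_records[0]))   # 5
```

So a splitter that cuts the file at an arbitrary newline can land inside a quoted field, exactly as with JSON strings.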
Now, it's probably a terrible idea to give a file system awareness of
actual file types, but couldn't HDFS handle this?
@reynold, I’ll raise a JIRA today. @oliver, let’s discuss on the ticket?
I suspect the algorithm is going to be a bit fiddly and would definitely benefit
from multiple heads. If possible, I think we should handle pathological cases
like {":":":",{"{":"}"}} correctly, rather than bailing out.
I've raised the JSON-related ticket at
https://issues.apache.org/jira/browse/SPARK-7366.
@Ewan I think it would be great to support multiline CSV records too.
The motivation is very similar, but my instinct is that little/nothing
of the implementation could be usefully shared, so it's better as a separate ticket.
You can check out the following library:
https://github.com/alexholmes/json-mapreduce
--
Emre Sevinç
On Sun, May 3, 2015 at 10:04 PM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Hi everyone,
Is there any way in Spark SQL to load multi-line JSON data efficiently? I
think
I took a quick look at that implementation. I'm not sure it actually
handles JSON correctly, because it attempts to find the first { starting
from a random point. However, that random point could be in the middle of a
string, so the first { might just be part of a string rather than the start of an object.
I think Reynold’s argument shows the impossibility of the general case.
But a “maximum object depth” hint could enable a new input format to do its job
both efficiently and correctly in the common case where the input is an array
of similarly structured objects! I’d certainly be interested in working on it.
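To make concrete what such an input format must track, here is a hedged sketch (plain Python, not Spark code) of a scanner that, starting from a known-good offset, finds top-level object boundaries by maintaining string context, escape state, and brace depth. A "maximum object depth" hint would bound how far a splitter starting mid-file has to scan to resynchronize on such a boundary.

```python
def object_boundaries(text):
    """Yield (start, end) index pairs of top-level JSON objects in text.

    Assumes scanning begins outside any string/object (a known-good offset).
    Tracks string context and escapes so braces inside strings are ignored.
    """
    depth = 0
    in_string = False
    escaped = False
    start = None
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield (start, i + 1)

data = '{"a": "{not a record}"} {"b": 2}'
print(list(object_boundaries(data)))  # [(0, 23), (24, 32)]
```

Note the braces inside the string value are correctly ignored; the hard part Reynold identifies remains choosing a correct starting state when dropped at an arbitrary offset.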
I was wondering if it's possible to use existing Hive SerDes for this?
On Mon, May 4, 2015 at 8:36 AM, Joe Halliwell joe.halliw...@gmail.com
wrote:
I think Reynold’s argument shows the impossibility of the general case.
But a “maximum object depth” hint could enable a new input format to do
It's not JSON, per se, but data formats like smile (
http://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29) provide
support for markers that can't be confused with content and also provide
reasonably similar ergonomics.
—
p...@mult.ifario.us | Multifarious, Inc. |
Joe - I think that's a legit and useful thing to do. Do you want to give it
a shot?
On Mon, May 4, 2015 at 12:36 AM, Joe Halliwell joe.halliw...@gmail.com
wrote:
I think Reynold’s argument shows the impossibility of the general case.
But a “maximum object depth” hint could enable a new input
I don't know whether this is common, but we might also allow another separator
for JSON objects, such as two blank lines.
Matei
On May 4, 2015, at 2:28 PM, Reynold Xin r...@databricks.com wrote:
Joe - I think that's a legit and useful thing to do. Do you want to give it
a shot?
@joe, I'd be glad to help if you need it.
On Mon, May 4, 2015 at 8:06 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I don't know whether this is common, but we might also allow another
separator for JSON objects, such as two blank lines.
Matei
On May 4, 2015, at 2:28 PM, Reynold Xin
How does the pivotal format decide where to split the files? It seems to
me the challenge is to decide that, and off the top of my head the only way
to do this is to scan from the beginning and parse the JSON properly, which
makes it not possible with large files (doable for whole input with a lot
I'll try to study that and get back to you.
Regards,
Olivier.
On Mon, May 4, 2015 at 4:05 AM, Reynold Xin r...@databricks.com wrote:
How does the pivotal format decide where to split the files? It seems to
me the challenge is to decide that, and off the top of my head the only way
to do this
Hi everyone,
Is there any way in Spark SQL to load multi-line JSON data efficiently? I
think there was a reference on the mailing list to
http://pivotal-field-engineering.github.io/pmr-common/ for its
JSONInputFormat
But it's rather inaccessible, considering the dependency is not available in
any public repository.