Re: Dataframe nested schema inference from Json without type conflicts

Ewan Leith Thu, 01 Oct 2015 14:28:11 -0700

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like; our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.


Thanks,

Ewan


------ Original message ------
From: Reynold Xin
Date: Thu, 1 Oct 2015 22:12
To: Ewan Leith
Cc: [email protected]
Subject: Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <[email protected]> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should. For example, if the 
JSON in a small batch only contains integer values for a String field, Spark 
will class the field as an Integer type in that Streaming batch, then as a 
String in the next one.
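
To make the batch-to-batch flip concrete, here is a toy illustration in plain
Python. It is not Spark's inference algorithm; infer_field_types is a
hypothetical helper that only looks at top-level fields and falls back to
"str" when a field has mixed types within one batch.

```python
import json

def infer_field_types(json_lines):
    """Toy per-batch inference: pick one type name per top-level field."""
    seen = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            seen.setdefault(field, set()).add(type(value).__name__)
    # A field seen with a single type keeps it; mixed types fall back to "str".
    return {
        field: next(iter(types)) if len(types) == 1 else "str"
        for field, types in seen.items()
    }

# Two small batches for the same logical field "ref" (made-up sample data).
batch_a = ['{"ref": 1001}', '{"ref": 1002}']
batch_b = ['{"ref": 1003}', '{"ref": "A-17"}']

print(infer_field_types(batch_a))  # {'ref': 'int'}
print(infer_field_types(batch_b))  # {'ref': 'str'}
```

Because each batch is inferred on its own, the first batch has no way of
knowing that "ref" can also hold non-numeric values.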

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.
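
A minimal sketch of that idea as a preprocessing step, assuming each record is
available as a raw JSON string before any schema is inferred. It uses only
Python's standard json module; stringify_leaves and stringify_record are
hypothetical names, not Spark functions.

```python
import json

def stringify_leaves(value):
    """Recursively turn every scalar leaf into a string, keeping the nesting."""
    if isinstance(value, dict):
        return {key: stringify_leaves(item) for key, item in value.items()}
    if isinstance(value, list):
        return [stringify_leaves(item) for item in value]
    if value is None:
        return None  # keep nulls as nulls rather than the string "None"
    if isinstance(value, bool):
        return "true" if value else "false"  # JSON spelling, not "True"/"False"
    return str(value)

def stringify_record(line):
    """Parse one JSON document and re-serialise it with string-only leaves."""
    # parse_int/parse_float=str keep numbers exactly as written (e.g. "2.50").
    parsed = json.loads(line, parse_int=str, parse_float=str)
    return json.dumps(stringify_leaves(parsed))

# Made-up sample record for illustration.
record = '{"id": 42, "user": {"zip": 90210, "active": true}}'
print(stringify_record(record))
# {"id": "42", "user": {"zip": "90210", "active": "true"}}
```

If something like this were mapped over the RDD of JSON strings before the
schema is inferred, every leaf should be seen as a string in every batch, at
the cost of parsing each record twice.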

As far as I can see, there's currently no simple way to avoid this, but we 
could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan
