Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this? Also, sending out a PR would be great. For JSONRelation, I think we can pass all user-specified options to it (see org.apache.spark.sql.execution.datasources.json.DefaultSource's createRelation), just like what we do for ParquetRelation. Then, inside JSONRelation, we can figure out what kinds of options have been specified.
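Roughly what I have in mind, as a sketch only (the primitivesAsString option name below is just illustrative, and the constructor is trimmed down):

    // Sketch: JSONRelation receives the whole user-supplied options map
    // (as ParquetRelation already does) and interprets it itself.
    class JSONRelation(parameters: Map[String, String] /*, ...existing args... */) {

      // Existing behaviour, now read from the generic options map.
      val samplingRatio: Double =
        parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)

      // Hypothetical name for the new switch; the default keeps today's inference.
      val primitivesAsString: Boolean =
        parameters.get("primitivesAsString").map(_.toBoolean).getOrElse(false)
    }

Users would then turn it on through the normal options path, e.g. sqlContext.read.format("json").option("primitivesAsString", "true").load(path).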
Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

> I've done some digging today and, as a quick and ugly fix, altering the
> case statement of the JSON inferField function in InferSchema.scala
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
>
> to have
>
>   case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE |
>     VALUE_FALSE => StringType
>
> rather than the rules for each type works as we'd want.
>
> If we were to wrap this up in a configuration setting in JSONRelation, like
> the samplingRatio setting, with the default being to behave as it currently
> does, does anyone think a pull request would plausibly get into the Spark
> main codebase?
>
> Thanks,
> Ewan
>
> *From:* Ewan Leith [mailto:ewan.le...@realitymine.com]
> *Sent:* 02 October 2015 01:57
> *To:* yh...@databricks.com
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type conflicts
>
> Exactly, that's a much better way to put it.
>
> Thanks,
> Ewan
>
> ------ Original message ------
> *From:* Yin Huai
> *Date:* Thu, 1 Oct 2015 23:54
> *To:* Ewan Leith
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type conflicts
>
> Hi Ewan,
>
> For your use case, you only need the schema inference to pick up the
> structure of your data (basically you want Spark SQL to infer the type of
> complex values like arrays and structs but keep the type of primitive
> values as strings), right?
>
> Thanks,
> Yin
>
> On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>
> We could, but if a client sends some unexpected records in the schema
> (which happens more often than I'd like; our schema seems to constantly
> evolve), it's fantastic how Spark picks up on that data and includes it.
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
> Thanks,
> Ewan
>
> ------ Original message ------
> *From:* Reynold Xin
> *Date:* Thu, 1 Oct 2015 22:12
> *To:* Ewan Leith
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type conflicts
>
> You can pass the schema into json directly, can't you?
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>
> Hi all,
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we're using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should. For
> example, if the JSON in that small batch only contains integer values for
> a String field, it'll class the field as an Integer type on one streaming
> batch, then a String on the next one.
>
> Instead, we'd rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
> I don't think there's currently any simple way to avoid this, but we could
> add the functionality in the JacksonParser.scala file, probably in
> convertField.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
> Does anyone know an easier and cleaner way to do this?
>
> Thanks,
> Ewan
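For concreteness, the quick fix Ewan describes above would look roughly like this inside InferSchema's inferField, gated behind an option so that the default behaviour is unchanged (the flag name and wiring are illustrative only, not a settled design):

    import com.fasterxml.jackson.core.JsonParser
    import com.fasterxml.jackson.core.JsonToken._
    import org.apache.spark.sql.types._

    // Sketch of a guarded inferField: with the flag on, every primitive
    // collapses to StringType, so two small streaming batches can never
    // disagree about a primitive field's type.
    def inferField(parser: JsonParser, primitivesAsString: Boolean): DataType = {
      parser.getCurrentToken match {
        case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT |
             VALUE_TRUE | VALUE_FALSE if primitivesAsString =>
          StringType
        // ...otherwise fall through to the existing per-type rules,
        // e.g. case VALUE_TRUE | VALUE_FALSE => BooleanType, and the
        // recursive handling of START_OBJECT / START_ARRAY stays as-is.
        case _ => NullType // stand-in for the existing cases
      }
    }

And for anyone who can fix their schema up front, Reynold's suggestion already works today: sqlContext.read.schema(expectedSchema).json(path).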