Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this? Also, sending out a PR would be great. For JSONRelation, I think we can pass all user-specified options to it (see org.apache.spark.sql.execution.datasources.json.DefaultSource's createRelation), just like what we do for ParquetRelation. Then, inside JSONRelation, we can figure out what kinds of options have been specified.
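Roughly what I have in mind, as a sketch only (the primitivesAsString option name below is just illustrative, and the constructor is trimmed down):

    // Sketch: JSONRelation receives the whole user-supplied options map
    // (as ParquetRelation already does) and interprets it itself.
    class JSONRelation(parameters: Map[String, String] /*, ...existing args... */) {

      // Existing behaviour, now read from the generic options map.
      val samplingRatio: Double =
        parameters.get("samplingRatio").map(_.toDouble).getOrElse(1.0)

      // Hypothetical name for the new switch; the default keeps today's inference.
      val primitivesAsString: Boolean =
        parameters.get("primitivesAsString").map(_.toBoolean).getOrElse(false)
    }

Users would then turn it on through the normal options path, e.g. sqlContext.read.format("json").option("primitivesAsString", "true").load(path).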
Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

> I've done some digging today and, as a quick and ugly fix, altering the
> case statement of the JSON inferField function in InferSchema.scala
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala
>
> to have
>
>   case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE |
>     VALUE_FALSE => StringType
>
> rather than the rules for each type works as we'd want.
>
> If we were to wrap this up in a configuration setting in JSONRelation, like
> the samplingRatio setting, with the default being to behave as it currently
> does, does anyone think a pull request would plausibly get into the Spark
> main codebase?
>
> Thanks,
> Ewan
>
> *From:* Ewan Leith [mailto:ewan.le...@realitymine.com]
> *Sent:* 02 October 2015 01:57
> *To:* yh...@databricks.com
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type conflicts
>
> Exactly, that's a much better way to put it.
>
> Thanks,
> Ewan
>
> ------ Original message ------
> *From:* Yin Huai
> *Date:* Thu, 1 Oct 2015 23:54
> *To:* Ewan Leith
> *Cc:* r...@databricks.com; dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type conflicts
>
> Hi Ewan,
>
> For your use case, you only need the schema inference to pick up the
> structure of your data (basically you want Spark SQL to infer the type of
> complex values like arrays and structs but keep the type of primitive
> values as strings), right?
>
> Thanks,
> Yin
>
> On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>
> We could, but if a client sends some unexpected records in the schema
> (which happens more often than I'd like; our schema seems to constantly
> evolve), it's fantastic how Spark picks up on that data and includes it.
>
> Passing in a fixed schema loses that nice additional ability, though it's
> what we'll probably have to adopt if we can't come up with a way to keep
> the inference working.
>
> Thanks,
> Ewan
>
> ------ Original message ------
> *From:* Reynold Xin
> *Date:* Thu, 1 Oct 2015 22:12
> *To:* Ewan Leith
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Dataframe nested schema inference from Json without type conflicts
>
> You can pass the schema into json directly, can't you?
>
> On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:
>
> Hi all,
>
> We really like the ability to infer a schema from JSON contained in an
> RDD, but when we're using Spark Streaming on small batches of data, we
> sometimes find that Spark infers a more specific type than it should. For
> example, if the JSON in that small batch only contains integer values for
> a String field, it'll class the field as an Integer type on one streaming
> batch, then a String on the next one.
>
> Instead, we'd rather match every value as a String type, then handle any
> casting to a desired type later in the process.
>
> I don't think there's currently any simple way to avoid this, but we could
> add the functionality in the JacksonParser.scala file, probably in
> convertField.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
>
> Does anyone know an easier and cleaner way to do this?
>
> Thanks,
> Ewan
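For concreteness, the quick fix Ewan describes above would look roughly like this inside InferSchema's inferField, gated behind an option so that the default behaviour is unchanged (the flag name and wiring are illustrative only, not a settled design):

    import com.fasterxml.jackson.core.JsonParser
    import com.fasterxml.jackson.core.JsonToken._
    import org.apache.spark.sql.types._

    // Sketch of a guarded inferField: with the flag on, every primitive
    // collapses to StringType, so two small streaming batches can never
    // disagree about a primitive field's type.
    def inferField(parser: JsonParser, primitivesAsString: Boolean): DataType = {
      parser.getCurrentToken match {
        case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT |
             VALUE_TRUE | VALUE_FALSE if primitivesAsString =>
          StringType
        // ...otherwise fall through to the existing per-type rules,
        // e.g. case VALUE_TRUE | VALUE_FALSE => BooleanType, and the
        // recursive handling of START_OBJECT / START_ARRAY stays as-is.
        case _ => NullType // stand-in for the existing cases
      }
    }

And for anyone who can fix their schema up front, Reynold's suggestion already works today: sqlContext.read.schema(expectedSchema).json(path).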