Thanks. Once you create the JIRA, just reply to this email with the link.

On Wednesday, March 2, 2016, Ewan Leith wrote:

> Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if
> we can. I'm not sure my own Scala skills will be up to it, but perhaps one of
> my colleagues' will :)
> Ewan
> I don't think that exists right now, but it's definitely a good option to
> have. I myself have run into this issue a few times.
> Can you create a JIRA ticket so we can track it? Would be even better if
> you are interested in working on a patch! Thanks.
> On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith wrote:
>> Hi Reynold, yes that would be perfect for our use case.
>> I assume it doesn't exist though, otherwise I really need to go re-read the 
>> docs!
>> Thanks to both of you for replying by the way, I know you must be hugely 
>> busy.
>> Ewan
>> Are you looking for a "relaxed" mode that simply returns nulls for fields
>> that don't exist or have an incompatible schema?
>> On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith wrote:
>>> Thanks Michael, it's not a great example really, as the data I'm working 
>>> with has some source files that do fit the schema, and some that don't (out 
>>> of millions that do work, perhaps 10 might not).
>>> In an ideal world for us the select would probably return the valid records 
>>> only.
>>> We're trying out the new dataset APIs to see if we can do some 
>>> pre-filtering that way.
>>> Thanks,
>>> Ewan
>>> -dev +user
>>> StructType(StructField(data,ArrayType(StructType(StructField(
>>>> *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>>> StructField(name,StringType,true)),true),true), StructField(othertype,
>>>> ArrayType(StructType(StructField(company,StringType,true),
>>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>> It's not a great error message, but as the schema above shows, stuff is
>>> an array, not a struct.  So, you need to pick a particular element (using
>>> []) before you can pull out a specific field.  It would be easier to see
>>> this if you ran, which gives
>>> you a tree view.  Try the following.
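
The snippet from the original message is not preserved in this archive. As a rough illustration of the point above (index into each array with [] before asking for a struct field), here is a minimal sketch. It is an assumption, not the code that was originally posted: the object name, the `sampleJson` record, the `pickOnetype` helper, and the choice of element 0 are all hypothetical, and it uses the Spark 1.6-era `SQLContext` entry point.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical sketch, not the snippet from the original email.
object NestedArraySelect {
  // Made-up record shaped like the schema quoted above:
  // data is an array of structs, and each struct's stuff is again an array.
  val sampleJson: String =
    """{"data":[{"stuff":[{"onetype":[{"id":1,"name":"John Doe"}]},""" +
    """{"othertype":[{"company":"ACME","id":2}]}]}]}"""

  // Pick one element of each array with [] before extracting the field.
  def pickOnetype(df: DataFrame): DataFrame =
    df.selectExpr("data[0].stuff[0].onetype AS onetype")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nested-array-select").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.json(sc.parallelize(Seq(sampleJson)))

    // Tree view of the inferred schema; arrays show up as "array" nodes.
    df.printSchema()

    pickOnetype(df).show()

    sc.stop()
  }
}
```

Indexing with a fixed position only reads one element per array. Whether that is appropriate depends on the data; flattening the arrays first is another option.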
>>> On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith wrote:
>>>> When you create a dataframe using the ** API,
>>>> if you pass in a schema that’s compatible with some of the records, but
>>>> incompatible with others, it seems you can’t do a .select on the
>>>> problematic columns; instead you get an AnalysisException.
>>>> I know loading the wrong data isn’t good behaviour, but if you’re
>>>> reading data from (for example) JSON files, there are going to be malformed
>>>> files along the way. I think it would be nice to handle this error more
>>>> gracefully, though I don’t know the best way to approach it.
>>>> Before I raise a JIRA ticket about it, would people consider this to be
>>>> a bug or expected behaviour?
>>>> I’ve attached a couple of sample JSON files and the steps below to
>>>> reproduce it, by taking the inferred schema from the simple1.json file, and
>>>> applying it to a union of simple1.json and simple2.json. You can visually
>>>> see the data has been parsed as I think you’d want if you do a .select on
>>>> the parent column and print out the output, but when you do a select on the
>>>> problem column you instead get an exception.
>>>> *scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x =>
>>>> x._2)*
>>>> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map
>>>> at <console>:27
>>>> *scala> val s1schema =*
>>>> s1schema: org.apache.spark.sql.types.StructType =
>>>> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
>>>> StructField(name,StringType,true)),true),true),
>>>> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
>>>> StructField(id,LongType,true)),true),true)),true),true)),true),true))
>>>> *scala>
>>>> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don
>>>> Joeh]),null], [null,WrappedArray([ACME,2])]))]
>>>> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])],
>>>> [WrappedArray([2,null]),null]))]
>>>> *scala>
>>>> org.apache.spark.sql.AnalysisException: cannot resolve
>>>> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires
>>>> integral type, however, 'onetype' is of string type.;
>>>>     at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>>>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>>>>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>>     at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>>>> (The full exception is attached too).
>>>> What do people think, is this a bug?
>>>> Thanks,
>>>> Ewan
