Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

Ewan Leith Wed, 02 Mar 2016 11:14:04 -0800

Thanks Michael, it's not a great example really, as the data I'm working with 
has some source files that do fit the schema, and some that don't (out of 
millions that do work, perhaps 10 might not).


In an ideal world for us the select would probably return the valid records 
only.

We're trying out the new dataset APIs to see if we can do some pre-filtering 
that way.

Thanks,
Ewan

-dev +user

StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

Its not a great error message, but as the schema above shows, stuff is an 
array, not a struct.  So, you need to pick a particular element (using []) 
before you can pull out a specific field.  It would be easier to see this if 
you ran sqlContext.read.json(s1Rdd).printSchema(), which gives you a tree view. 
 Try the following.

sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")

On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith 
<[email protected]<mailto:[email protected]>> wrote:
When you create a dataframe using the sqlContext.read.schema() API, if you pass 
in a schema that's compatible with some of the records, but incompatible with 
others, it seems you can't do a .select on the problematic columns, instead you 
get an AnalysisException error.

I know loading the wrong data isn't good behaviour, but if you're reading data 
from (for example) JSON files, there's going to be malformed files along the 
way. I think it would be nice to handle this error in a nicer way, though I 
don't know the best way to approach it.

Before I raise a JIRA ticket about it, would people consider this to be a bug 
or expected behaviour?

I've attached a couple of sample JSON files and the steps below to reproduce 
it, by taking the inferred schema from the simple1.json file, and applying it 
to a union of simple1.json and simple2.json. You can visually see the data has 
been parsed as I think you'd want if you do a .select on the parent column and 
print out the output, but when you do a select on the problem column you 
instead get an exception.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at 
<console>:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

scala> 
sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], 
[null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], 
[WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;
                at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
                at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
                at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
                at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
                at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)

(The full exception is attached too).

What do people think, is this a bug?

Thanks,
Ewan


---------------------------------------------------------------------
To unsubscribe, e-mail: 
[email protected]<mailto:[email protected]>
For additional commands, e-mail: 
[email protected]<mailto:[email protected]>

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

Reply via email to