Are you looking for "relaxed" mode that simply return nulls for fields that doesn't exist or have incompatible schema?
On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.le...@realitymine.com> wrote: > Thanks Michael, it's not a great example really, as the data I'm working with > has some source files that do fit the schema, and some that don't (out of > millions that do work, perhaps 10 might not). > > In an ideal world for us the select would probably return the valid records > only. > > We're trying out the new dataset APIs to see if we can do some pre-filtering > that way. > > Thanks, > Ewan > > -dev +user > > StructType(StructField(data,ArrayType(StructType(StructField( >> *stuff,ArrayType(*StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), >> StructField(name,StringType,true)),true),true), StructField(othertype, >> ArrayType(StructType(StructField(company,StringType,true), >> StructField(id,LongType,true)),true),true)),true),true)),true),true)) > > > Its not a great error message, but as the schema above shows, stuff is an > array, not a struct. So, you need to pick a particular element (using []) > before you can pull out a specific field. It would be easier to see this > if you ran sqlContext.read.json(s1Rdd).printSchema(), which gives you a > tree view. Try the following. > > > sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype") > > On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com> > wrote: > >> When you create a dataframe using the *sqlContext.read.schema()* API, if >> you pass in a schema that’s compatible with some of the records, but >> incompatible with others, it seems you can’t do a .select on the >> problematic columns, instead you get an AnalysisException error. >> >> >> >> I know loading the wrong data isn’t good behaviour, but if you’re reading >> data from (for example) JSON files, there’s going to be malformed files >> along the way. I think it would be nice to handle this error in a nicer >> way, though I don’t know the best way to approach it. >> >> >> >> Before I raise a JIRA ticket about it, would people consider this to be a >> bug or expected behaviour? >> >> >> >> I’ve attached a couple of sample JSON files and the steps below to >> reproduce it, by taking the inferred schema from the simple1.json file, and >> applying it to a union of simple1.json and simple2.json. You can visually >> see the data has been parsed as I think you’d want if you do a .select on >> the parent column and print out the output, but when you do a select on the >> problem column you instead get an exception. >> >> >> >> *scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)* >> >> s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at >> <console>:27 >> >> >> >> *scala> val s1schema = sqlContext.read.json(s1Rdd).schema* >> >> s1schema: org.apache.spark.sql.types.StructType = >> StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true), >> StructField(name,StringType,true)),true),true), >> StructField(othertype,ArrayType(StructType(StructField(company,StringType,true), >> StructField(id,LongType,true)),true),true)),true),true)),true),true)) >> >> >> >> *scala> >> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)* >> >> [WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don >> Joeh]),null], [null,WrappedArray([ACME,2])]))] >> >> [WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], >> [WrappedArray([2,null]),null]))] >> >> >> >> *scala> >> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")* >> >> org.apache.spark.sql.AnalysisException: cannot resolve >> 'data.stuff[onetype]' due to data type mismatch: argument 2 requires >> integral type, however, 'onetype' is of string type.; >> >> at >> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) >> >> at >> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65) >> >> at >> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) >> >> at >> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) >> >> at >> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319) >> >> >> >> (The full exception is attached too). >> >> >> >> What do people think, is this a bug? >> >> >> >> Thanks, >> >> Ewan >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> > >