Thanks for the answer. As for the next step, I am thinking of writing the dfKV DataFrame out to disk and then using the Avro APIs to read the data back.
This smells like a bug somewhere.

Cheers,
Hien

On Thu, Feb 28, 2019 at 4:02 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> No, just take a look at the schema of dfStruct since you've converted its
> value column with to_avro:
>
> scala> dfStruct.printSchema
> root
>  |-- id: integer (nullable = false)
>  |-- name: string (nullable = true)
>  |-- age: integer (nullable = false)
>  |-- value: struct (nullable = false)
>  |    |-- name: string (nullable = true)
>  |    |-- age: integer (nullable = false)
>
>
> On Wed, Feb 27, 2019 at 6:51 PM Hien Luu <hien...@gmail.com> wrote:
>
>> Thanks for looking into this. Does this mean string fields should always
>> be nullable?
>>
>> You are right that the result is not yet correct and further digging is
>> needed :(
>>
>> On Wed, Feb 27, 2019 at 1:19 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I was dealing with Avro stuff lately and most of the time it has
>>> something to do with the schema.
>>> One thing I've pinpointed quickly (where I was struggling also) is that
>>> the name field should be nullable, but the result is not yet correct, so
>>> further digging is needed...
>>>
>>> scala> val expectedSchema = StructType(Seq(StructField("name", StringType,true),StructField("age", IntegerType, false)))
>>> expectedSchema: org.apache.spark.sql.types.StructType =
>>> StructType(StructField(name,StringType,true),
>>> StructField(age,IntegerType,false))
>>>
>>> scala> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
>>> avroTypeStruct: String =
>>> {"type":"record","name":"topLevelRecord","fields":[{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]}
>>>
>>> scala> dfKV.select(from_avro('value, avroTypeStruct)).show
>>> +---------------------------------------------+
>>> |from_avro(value, struct<name:string,age:int>)|
>>> +---------------------------------------------+
>>> |                              [Mary Jane, 25]|
>>> |                              [Mary Jane, 25]|
>>> +---------------------------------------------+
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Wed, Feb 27, 2019 at 7:43 AM Hien Luu <hien...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I ran into a pretty weird issue with to_avro and from_avro where it was
>>>> not able to parse the data in a struct correctly. Please see the simple
>>>> and self contained example below. I am using Spark 2.4. I am not sure if
>>>> I missed something.
>>>>
>>>> This is how I start the spark-shell on my Mac:
>>>>
>>>> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
>>>>
>>>> import org.apache.spark.sql.types._
>>>> import org.apache.spark.sql.avro._
>>>> import org.apache.spark.sql.functions._
>>>>
>>>>
>>>> spark.version
>>>>
>>>> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25)).toDF("id", "name", "age")
>>>>
>>>> val dfStruct = df.withColumn("value", struct("name","age"))
>>>>
>>>> dfStruct.show
>>>> dfStruct.printSchema
>>>>
>>>> val dfKV = dfStruct.select(to_avro('id).as("key"), to_avro('value).as("value"))
>>>>
>>>> val expectedSchema = StructType(Seq(StructField("name", StringType, false),StructField("age", IntegerType, false)))
>>>>
>>>> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
>>>>
>>>> val avroTypeStr = s"""
>>>>   |{
>>>>   |  "type": "int",
>>>>   |  "name": "key"
>>>>   |}
>>>> """.stripMargin
>>>>
>>>>
>>>> dfKV.select(from_avro('key, avroTypeStr)).show
>>>>
>>>> // output
>>>> +-------------------+
>>>> |from_avro(key, int)|
>>>> +-------------------+
>>>> |                  1|
>>>> |                  2|
>>>> +-------------------+
>>>>
>>>> dfKV.select(from_avro('value, avroTypeStruct)).show
>>>>
>>>> // output
>>>> +---------------------------------------------+
>>>> |from_avro(value, struct<name:string,age:int>)|
>>>> +---------------------------------------------+
>>>> |                                        [, 9]|
>>>> |                                        [, 9]|
>>>> +---------------------------------------------+
>>>>
>>>> Please help and thanks in advance.
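
For anyone reading this thread later, here is a rough sketch of why a schema mismatch alone could produce "[, 9]". It hand-rolls Avro's binary encoding in plain Scala (no Spark or Avro dependency) and rests on one assumption: that to_avro wrote the value column using the column's own schema, in which name is nullable, i.e. the union ["string","null"] shown in the toAvroType output quoted above. The object and method names (AvroWire, encodeNullableName, decodeRequiredName, and so on) are made up for illustration and are not part of any library.

```scala
object AvroWire {
  // Avro encodes int values as zigzag varints.
  def zigZag(n: Int): Array[Byte] = {
    var v = (n << 1) ^ (n >> 31)
    val out = scala.collection.mutable.ArrayBuffer[Byte]()
    while ((v & ~0x7F) != 0) {
      out += ((v & 0x7F) | 0x80).toByte
      v = v >>> 7
    }
    out += v.toByte
    out.toArray
  }

  // Minimal cursor over an Avro-encoded byte array.
  final class Reader(bytes: Array[Byte]) {
    private var pos = 0

    def readInt(): Int = {
      var acc = 0
      var shift = 0
      var more = true
      while (more) {
        val b = bytes(pos) & 0xFF
        pos += 1
        acc |= (b & 0x7F) << shift
        shift += 7
        more = (b & 0x80) != 0
      }
      (acc >>> 1) ^ -(acc & 1) // undo zigzag
    }

    // A string is its byte length (as an int) followed by UTF-8 bytes.
    def readString(): String = {
      val len = readInt()
      val s = new String(bytes, pos, len, "UTF-8")
      pos += len
      s
    }
  }

  // Writer side (assumed): name is the union ["string","null"], so the
  // record starts with the union branch index 0 before the string itself.
  def encodeNullableName(name: String, age: Int): Array[Byte] = {
    val utf8 = name.getBytes("UTF-8")
    zigZag(0) ++ zigZag(utf8.length) ++ utf8 ++ zigZag(age)
  }

  // Reader that also treats name as a union: consumes the branch index first.
  def decodeNullableName(bytes: Array[Byte]): (String, Int) = {
    val r = new Reader(bytes)
    val branch = r.readInt()
    val name = if (branch == 0) r.readString() else null
    val age = r.readInt()
    (name, age)
  }

  // Reader that treats name as a plain (non-nullable) string: the branch
  // index byte is misread as a string length of 0, and the real string
  // length is then misread as the age.
  def decodeRequiredName(bytes: Array[Byte]): (String, Int) = {
    val r = new Reader(bytes)
    val name = r.readString()
    val age = r.readInt()
    (name, age)
  }
}
```

With these helpers, AvroWire.decodeRequiredName(AvroWire.encodeNullableName("Mary Jane", 25)) yields an empty name and an age of 9, because 9 is the byte length of "Mary Jane". Decoding the same bytes with decodeNullableName gives back ("Mary Jane", 25). Note that this only accounts for the "[, 9]" values; it does not explain why both rows show the same record in either output above.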