What you suggested works in Spark 2.3, but in the version that I am using (2.1) it produces the following exception:
found   : org.apache.spark.sql.types.ArrayType
required: org.apache.spark.sql.types.StructType

ds.select(from_json($"news", schema) as "news_parsed").show(false)

Is it viable/possible to backport the function from 2.3 to 2.1? What other options do I have?

Thank you.

From: Magnus Nilsson <ma...@kth.se>
Sent: Saturday, February 23, 2019 3:43 PM
Cc: user@spark.apache.org
Subject: Re: How can I parse an "unnamed" json array present in a column?

Use spark.sql.types.ArrayType instead of a Scala Array as the root type when you define the schema and it will work.

Regards,

Magnus

On Fri, Feb 22, 2019 at 11:15 PM Yeikel <em...@yeikel.com> wrote:

I have an "unnamed" json array stored in a *column*. The format is the following:

column name : news

Data :

[
  {
    "source": "source1",
    "name": "News site1"
  },
  {
    "source": "source2",
    "name": "News site2"
  }
]

Ideally, I'd like to parse it as:

news ARRAY<struct<source:string, name:string>>

I've tried the following:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val entry = scala.io.Source.fromFile("1.txt").mkString

val ds = Seq(entry).toDF("news")

val schema = Array(new StructType().add("name", StringType).add("source", StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)

But this is not allowed:

found   : Array[org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.types.StructType

I also tried passing the following schema:

val schema = StructType(new StructType().add("name", StringType).add("source", StringType))

But this only parsed the first record:

+--------------------+
|news_parsed         |
+--------------------+
|[News site1,source1]|
+--------------------+

I am aware that if I fix the JSON like this:

{
  "news": [
    {
      "source": "source1",
      "name": "News site1"
    },
    {
      "source": "source2",
      "name": "News site2"
    }
  ]
}

the parsing works as expected, but I would like to avoid doing that if possible.
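One workaround I am considering is to apply that same fix inside Spark instead of in the source data: wrap the array column in a single-field object with concat, and declare that field as an ArrayType inside a StructType root, which from_json in 2.1 does accept. A sketch, untested on 2.1; the "arr" field name and "wrapped" column are arbitrary names I made up:

```scala
import org.apache.spark.sql.functions.{concat, from_json, lit}
import org.apache.spark.sql.types._

// Schema of one element of the array.
val itemSchema = new StructType()
  .add("name", StringType)
  .add("source", StringType)

// from_json in 2.1 only accepts a StructType as the root,
// so declare a single hypothetical field "arr" holding the array.
val wrappedSchema = new StructType()
  .add("arr", ArrayType(itemSchema))

// Turn `[...]` into `{"arr":[...]}` on the fly, parse it,
// then pull the array back out of the wrapper field.
val parsed = ds
  .withColumn("wrapped", concat(lit("""{"arr":"""), $"news", lit("}")))
  .select(from_json($"wrapped", wrappedSchema).getField("arr") as "news_parsed")
```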
Another approach that I can think of is to map over the column and parse it with a third-party library like Gson, but I am not sure that is any better than fixing the JSON beforehand.

I am running Spark 2.1

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
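For reference, the Gson-based mapping mentioned above might look roughly like the sketch below. It is untested and assumes Gson is on the classpath and that Gson's reflection-based binding works with the case class; "NewsItem" and "parseNews" are hypothetical names:

```scala
import com.google.gson.Gson
import org.apache.spark.sql.functions.udf

// Target shape for one array element.
case class NewsItem(source: String, name: String)

// Parse the raw JSON array string into a Seq of case classes,
// which Spark encodes as an array of structs.
val parseNews = udf { json: String =>
  new Gson().fromJson(json, classOf[Array[NewsItem]]).toSeq
}

val parsed = ds.select(parseNews($"news") as "news_parsed")
```

The downside is that the parsing happens in a UDF, so Spark cannot push the schema down or optimize it the way a native from_json call can.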