I have an "unnamed" JSON array stored in a *column*. The format is the following:
column name: news

Data:

[
  { "source": "source1", "name": "News site1" },
  { "source": "source2", "name": "News site2" }
]

Ideally, I'd like to parse it as:

news ARRAY<struct<source:string, name:string>>

I've tried the following:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types._

val entry = scala.io.Source.fromFile("1.txt").mkString
val ds = Seq(entry).toDF("news")
val schema = Array(new StructType().add("name", StringType).add("source", StringType))
ds.select(from_json($"news", schema) as "news_parsed").show(false)

But this is not allowed:

found   : Array[org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.types.StructType

I also tried passing the following schema:

val schema = StructType(new StructType().add("name", StringType).add("source", StringType))

But this only parsed the first record:

+--------------------+
|news_parsed         |
+--------------------+
|[News site1,source1]|
+--------------------+

I am aware that if I fix the JSON like this:

{
  "news": [
    { "source": "source1", "name": "News site1" },
    { "source": "source2", "name": "News site2" }
  ]
}

the parsing works as expected, but I would like to avoid doing that if possible. Another approach I can think of is to map over the column and parse it with a third-party library such as Gson, but I am not sure that is any better than fixing the JSON beforehand.

I am running Spark 2.1.

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
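One way to get the "fixed" JSON shape without editing the file by hand is to wrap the bare array in a named object in memory, then parse with a StructType whose "news" field is an ArrayType of structs. This is only a sketch: the `wrapNews` helper below is hypothetical (not part of any Spark API), and the Spark calls are shown as comments since they need a SparkSession. (Newer Spark versions, 2.2+ if I recall correctly, also let from_json take an ArrayType schema directly, which would avoid the wrapping on an upgrade.)

```scala
// Hypothetical helper: wrap a bare JSON array string in an object with a
// "news" key, so the string matches a StructType schema whose "news" field
// is an ArrayType of structs. Pure string manipulation, no file edits.
def wrapNews(bareArray: String): String =
  s"""{"news": ${bareArray.trim}}"""

val entry =
  """[{"source":"source1","name":"News site1"},{"source":"source2","name":"News site2"}]"""
val wrapped = wrapNews(entry)

// With Spark in scope, the parse would then look like (untested sketch):
//   import org.apache.spark.sql.functions.from_json
//   import org.apache.spark.sql.types._
//   val schema = new StructType().add("news",
//     ArrayType(new StructType().add("source", StringType).add("name", StringType)))
//   Seq(wrapped).toDF("news")
//     .select(from_json($"news", schema) as "news_parsed")
//     .show(false)
```

This keeps the source file untouched and sidesteps the "only the first record" behaviour, at the cost of one extra string allocation per row.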