I have an "unnamed" json array stored in a *column*.  

The format is the following : 

column name : news

Data : 

[
  {
    "source": "source1",
    "name": "News site1"
  },
  {
    "source": "source2",
    "name": "News site2"
  }
]


Ideally, I'd like to parse it as:

news ARRAY<STRUCT<source:string, name:string>>
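
Expressed with Spark's type API, that should be the following (targetSchema is just my name for it):

import org.apache.spark.sql.types._

// An array of structs, each with two string fields
val targetSchema = ArrayType(
  new StructType()
    .add("source", StringType)
    .add("name", StringType)
)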

I've tried the following:

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._  // for .toDF; the shell imports this already

// Load the raw JSON array into a single-row DataFrame
val entry = scala.io.Source.fromFile("1.txt").mkString
val ds = Seq(entry).toDF("news")

// Attempt 1: an array of struct schemas
val schema = Array(new StructType().add("name", StringType).add("source", StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)

But this is not allowed:

found   : Array[org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.types.StructType
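
As far as I can tell, the from_json overloads in 2.1 only accept a StructType. If I read the release notes right, 2.2+ added an overload taking an arbitrary DataType, where the targetSchema above would work as-is:

// Spark 2.2+ only, if I'm not mistaken: from_json(e: Column, schema: DataType)
ds.select(from_json($"news", targetSchema) as "news_parsed").show(false)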


I also tried passing the following schema:

// Attempt 2: a single struct schema
val schema = StructType(
  new StructType().add("name", StringType).add("source", StringType)
)

But this only parsed the first record (presumably because a StructType schema makes from_json expect a single object):

+--------------------+
|news_parsed         |
+--------------------+
|[News site1,source1]|
+--------------------+


I am aware that if I fix the JSON like this:

{
  "news": [
    {
      "source": "source1",
      "name": "News site1"
    },
    {
      "source": "source2",
      "name": "News site2"
    }
  ]
}

the parsing works as expected, but I would like to avoid doing that if
possible.
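
The closest workaround I can think of is to do that wrapping on the fly inside Spark instead of in the source data. A rough sketch (the wrapped / arr names are mine):

import org.apache.spark.sql.functions.{concat, lit}

// Wrap the bare array in an object on the fly: {"arr": [...]}
val wrapped = ds.select(concat(lit("""{"arr":"""), $"news", lit("}")) as "news")

// Schema for the wrapper object
val wrapperSchema = new StructType()
  .add("arr", ArrayType(
    new StructType().add("source", StringType).add("name", StringType)))

wrapped
  .select(from_json($"news", wrapperSchema) as "parsed")
  .select($"parsed.arr" as "news_parsed")
  .show(false)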

Another approach I can think of is to map over the column and parse it
with a third-party library like Gson, but I am not sure whether that is
any better than fixing the JSON beforehand.
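
For reference, a rough sketch of that route, using Jackson rather than Gson since Jackson already ships with Spark (the NewsEntry / parseNews names are mine):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.functions.udf

case class NewsEntry(source: String, name: String)

// Parse the raw string into Array[NewsEntry]; Spark encodes the result
// as array<struct<source:string,name:string>>. A real version should
// reuse the ObjectMapper rather than building one per row.
val parseNews = udf { (json: String) =>
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue(json, classOf[Array[NewsEntry]])
}

ds.select(parseNews($"news") as "news_parsed").show(false)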

I am running Spark 2.1.


