That's a bummer. If you're unable to upgrade to Spark 2.3+, your best bet is probably to prepend/append to the JSON-array string so that the array becomes the value of a root attribute in a JSON document (as in your first workaround). It's such an easy and safe fix that you can still do it even if you stream the file.
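Something like the following is what I have in mind. This is a rough, untested sketch for 2.1: the wrapper field name "arr" is my own invention, and I assume spark.implicits._ is in scope as in your snippet.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// from_json in Spark 2.1 only accepts a StructType as the schema root,
// so wrap the bare array under a root attribute before parsing.
val wrapped = ds.withColumn("news",
  concat(lit("""{"arr":"""), $"news", lit("}")))

// The array now sits under the (arbitrarily named) "arr" field.
val schema = new StructType().add("arr",
  ArrayType(new StructType()
    .add("source", StringType)
    .add("name", StringType)))

val parsed = wrapped
  .select(from_json($"news", schema) as "doc")
  .select($"doc.arr" as "news_parsed")

That should give you news_parsed as array<struct<source:string,name:string>>, the shape you asked for.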
Even better, make the source system create a JSON-lines file instead of a JSON array, if possible. When I use Datasets (Tungsten) I basically try to stay there and use the available column functions, unless I have no choice but to serialize and run custom advanced calculations/parsing. In your case, just modifying the string and using the tested from_json function beats the available alternatives, if you ask me.

On Sun, Feb 24, 2019 at 1:13 AM <em...@yeikel.com> wrote:

> What you suggested works in Spark 2.3, but in the version that I am using
> (2.1) it produces the following exception:
>
> found   : org.apache.spark.sql.types.ArrayType
> required: org.apache.spark.sql.types.StructType
>
> ds.select(from_json($"news", schema) as "news_parsed").show(false)
>
> Is it viable/possible to backport a function from 2.3 to 2.1? What other
> options do I have?
>
> Thank you.
>
> *From:* Magnus Nilsson <ma...@kth.se>
> *Sent:* Saturday, February 23, 2019 3:43 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: How can I parse an "unnamed" json array present in a
> column?
>
> Use spark.sql.types.ArrayType instead of a Scala Array as the root type
> when you define the schema and it will work.
>
> Regards,
>
> Magnus
>
> On Fri, Feb 22, 2019 at 11:15 PM Yeikel <em...@yeikel.com> wrote:
>
> I have an "unnamed" JSON array stored in a *column*.
>
> The format is the following:
>
> column name: news
>
> Data:
>
> [
>   {
>     "source": "source1",
>     "name": "News site1"
>   },
>   {
>     "source": "source2",
>     "name": "News site2"
>   }
> ]
>
> Ideally, I'd like to parse it as:
>
> news ARRAY<struct<source:string, name:string>>
>
> I've tried the following:
>
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
>
> val entry = scala.io.Source.fromFile("1.txt").mkString
>
> val ds = Seq(entry).toDF("news")
>
> val schema = Array(new StructType().add("name", StringType).add("source",
> StringType))
>
> ds.select(from_json($"news", schema) as "news_parsed").show(false)
>
> But this is not allowed:
>
> found   : Array[org.apache.spark.sql.types.StructType]
> required: org.apache.spark.sql.types.StructType
>
> I also tried passing the following schema:
>
> val schema = StructType(new StructType().add("name",
> StringType).add("source", StringType))
>
> But this only parsed the first record:
>
> +--------------------+
> |news_parsed         |
> +--------------------+
> |[News site1,source1]|
> +--------------------+
>
> I am aware that if I fix the JSON like this:
>
> {
>   "news": [
>     {
>       "source": "source1",
>       "name": "News site1"
>     },
>     {
>       "source": "source2",
>       "name": "News site2"
>     }
>   ]
> }
>
> the parsing works as expected, but I would like to avoid doing that if
> possible.
>
> Another approach that I can think of is to map over it and parse it using
> a third-party library like Gson, but I am not sure if this is any better
> than fixing the JSON beforehand.
>
> I am running Spark 2.1.
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
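P.S. For reference, once you can upgrade, the ArrayType root is all you need on 2.3+. A minimal sketch (untested, reusing the same column name as above):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Spark 2.3+ accepts an ArrayType as the schema root, so the bare
// JSON array parses without any wrapping.
val schema = ArrayType(new StructType()
  .add("source", StringType)
  .add("name", StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)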