That's a bummer. If you're unable to upgrade to Spark 2.3+, your best bet is
probably to prepend/append to the JSON array string so that the array becomes
the value of a root attribute in a JSON document (as in your first
workaround). It's such an easy and safe fix that you can still do it even
if you stream the file.

Even better, have the source system produce a JSON Lines file instead of a
JSON array, if possible.
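
For reference, JSON Lines just means one object per line, which
spark.read.json handles natively (untested, with a hypothetical file name and
the records from your example):

{"source": "source1", "name": "News site1"}
{"source": "source2", "name": "News site2"}

val news = spark.read.json("news.jsonl")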

When I use Datasets (Tungsten) I basically try to stay there and use the
available column functions unless I have no choice but to serialize and run
custom advanced calculations/parsing. In your case, just modifying the
string and using the tested from_json function beats the available
alternatives, if you ask me.
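
Something along these lines should work on 2.1 (an untested sketch, reusing
the ds DataFrame and column name from your snippet below):

import org.apache.spark.sql.functions.{concat, from_json, lit}
import org.apache.spark.sql.types._

// The root schema is a single-field struct; the field itself holds the array.
val schema = new StructType()
  .add("news", ArrayType(new StructType()
    .add("source", StringType)
    .add("name", StringType)))

// Wrap the bare array as the value of a "news" attribute...
val wrapped = ds.withColumn("news", concat(lit("{\"news\":"), $"news", lit("}")))

// ...then parse and pull the array back out.
wrapped
  .select(from_json($"news", schema).getField("news") as "news_parsed")
  .show(false)

from_json on 2.1 only accepts a StructType as the root schema, so wrapping
the array in a single-field struct sidesteps the type mismatch you hit.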


On Sun, Feb 24, 2019 at 1:13 AM <em...@yeikel.com> wrote:

> What you suggested works in Spark 2.3, but in the version that I am using
> (2.1) it produces the following compile error:
>
>
>
> found   : org.apache.spark.sql.types.ArrayType
>
> required: org.apache.spark.sql.types.StructType
>
>        ds.select(from_json($"news", schema) as "news_parsed").show(false)
>
>
>
> Is it viable/possible to backport that function from 2.3 to 2.1? What other
> options do I have?
>
>
>
> Thank you.
>
>
>
>
>
> *From:* Magnus Nilsson <ma...@kth.se>
> *Sent:* Saturday, February 23, 2019 3:43 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: How can I parse an "unnamed" json array present in a
> column?
>
>
>
> Use spark.sql.types.ArrayType instead of a Scala Array as the root type
> when you define the schema and it will work.
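>
> Roughly like this (an untested sketch, using the ds and column name from
> your example below):
>
> import org.apache.spark.sql.functions.from_json
> import org.apache.spark.sql.types._
>
> // The root of the schema is an ArrayType, not a StructType.
> val schema = ArrayType(new StructType()
>   .add("source", StringType)
>   .add("name", StringType))
>
> ds.select(from_json($"news", schema) as "news_parsed").show(false)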
>
>
>
> Regards,
>
>
>
> Magnus
>
>
>
> On Fri, Feb 22, 2019 at 11:15 PM Yeikel <em...@yeikel.com> wrote:
>
> I have an "unnamed" json array stored in a *column*.
>
> The format is the following:
>
> column name : news
>
> Data :
>
> [
>   {
>     "source": "source1",
>     "name": "News site1"
>   },
>    {
>     "source": "source2",
>     "name": "News site2"
>   }
> ]
>
>
> Ideally , I'd like to parse it as :
>
> news ARRAY<struct<source:string, name:string>>
>
> I've tried the following:
>
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
>
> val entry = scala.io.Source.fromFile("1.txt").mkString
>
> val ds = Seq(entry).toDF("news")
>
> val schema = Array(new StructType().add("name", StringType).add("source",
> StringType))
>
> ds.select(from_json($"news", schema) as "news_parsed").show(false)
>
> But this is not allowed:
>
> found   : Array[org.apache.spark.sql.types.StructType]
> required: org.apache.spark.sql.types.StructType
>
>
> I also tried passing the following schema:
>
> val schema = StructType(new StructType().add("name",
> StringType).add("source", StringType))
>
> But this only parsed the first record:
>
> +--------------------+
> |news_parsed         |
> +--------------------+
> |[News site1,source1]|
> +--------------------+
>
>
> I am aware that if I fix the JSON like this:
>
> {
>   "news": [
>     {
>       "source": "source1",
>       "name": "News site1"
>     },
>     {
>       "source": "source2",
>       "name": "News site2"
>     }
>   ]
> }
>
> The parsing works as expected, but I would like to avoid doing that if
> possible.
>
> Another approach that I can think of is to map over it and parse it using a
> third-party library like Gson, but I am not sure if this is any better
> than fixing the JSON beforehand.
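>
> For reference, a minimal untested sketch of that Gson route (the NewsItem
> case class is just something I made up here, and entry is the string read
> above):
>
> import com.google.gson.Gson
>
> case class NewsItem(source: String, name: String)
>
> // Parse the raw array string into typed objects on the driver.
> val items: Array[NewsItem] = new Gson().fromJson(entry, classOf[Array[NewsItem]])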
>
> I am running Spark 2.1
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>
