Hi, I'm trying to parse log files generated by Spark using SparkSQL.
In the JSON elements related to the StageCompleted event there is a nested
structure containing an array of elements with RDD Info (see the log below
as an example, with some parts omitted):
{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    ...
    "RDD Info": [
      {
        "RDD ID": 5,
        "Name": "5",
        "Storage Level": {
          "Use Disk": false,
          "Use Memory": false,
          "Use Tachyon": false,
          "Deserialized": false,
          "Replication": 1
        },
        "Number of Partitions": 2,
        "Number of Cached Partitions": 0,
        "Memory Size": 0,
        "Tachyon Size": 0,
        "Disk Size": 0
      },
      ...
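To make the flattening I'm after concrete, here is a minimal plain-Python sketch (no Spark involved) that parses a hand-trimmed copy of the event above and emits one flat row per "RDD Info" element, with the parent Stage ID repeated on each row:

```python
import json

# A hand-trimmed stand-in for one SparkListenerStageCompleted event
# (only a subset of the fields shown in the log above).
line = """{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    "RDD Info": [
      {"RDD ID": 5, "Name": "5", "Memory Size": 0, "Disk Size": 0}
    ]
  }
}"""

event = json.loads(line)
stage = event["Stage Info"]

# "Explode" the RDD Info array: one flat tuple per array element,
# keyed by the parent Stage ID.
rows = [
    (stage["Stage ID"], rdd["RDD ID"], rdd["Name"], rdd["Disk Size"])
    for rdd in stage["RDD Info"]
]
print(rows)  # one (Stage ID, RDD ID, Name, Disk Size) tuple per RDD
```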
When I register the log as a table, SparkSQL is able to generate the correct
schema, which for the RDD Info element looks like:
 |-- RDD Info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Disk Size: long (nullable = true)
 |    |    |-- Memory Size: long (nullable = true)
 |    |    |-- Name: string (nullable = true)
My problem is that if I try to query the table I can only get array buffers
out of it:
"SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info` FROM stageEndInfos"
Stage ID RDD Info
1 ArrayBuffer([0,0,...
0 ArrayBuffer([0,0,...
2 ArrayBuffer([0,0,...
or:
"SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos"
RDD ID
ArrayBuffer(5, 4, 3)
ArrayBuffer(2, 1, 0)
ArrayBuffer(9, 6,...
Is there a way to explode the arrays in the rows in order to build a single
table (knowing that the RDD ID is unique and can be used as a primary key)?
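For context, I was imagining something along the lines of HiveQL's LATERAL VIEW explode, assuming that is reachable from SparkSQL (e.g. via a HiveContext). The query below is my guess at the syntax, not something I have working:

```sql
SELECT `Stage Info`.`Stage ID`, rdd.`RDD ID`, rdd.`Name`
FROM stageEndInfos
LATERAL VIEW explode(`Stage Info`.`RDD Info`) rddInfo AS rdd
```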
Thanks!