Hi, I'm trying to parse log files generated by Spark using SparkSQL.

In the JSON elements related to the StageCompleted event there is a nested
structure containing an array of elements with RDD Info (see the log below
as an example, with some parts omitted):

{
    "Event": "SparkListenerStageCompleted",
    "Stage Info": {
      "Stage ID": 1,
      ...
      "RDD Info": [
        {
          "RDD ID": 5,
          "Name": "5",
          "Storage Level": {
            "Use Disk": false,
            "Use Memory": false,
            "Use Tachyon": false,
            "Deserialized": false,
            "Replication": 1
          },
          "Number of Partitions": 2,
          "Number of Cached Partitions": 0,
          "Memory Size": 0,
          "Tachyon Size": 0,
          "Disk Size": 0
        },
...

When I register the log as a table, SparkSQL is able to generate the correct
schema, which for the RDD Info element looks like:

 |-- RDD Info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Disk Size: long (nullable = true)
 |    |    |-- Memory Size: long (nullable = true)
 |    |    |-- Name: string (nullable = true)

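For reference, something like this is roughly how I load and register the
log (just a sketch: the path is a placeholder, and I'm glossing over
filtering the events down to the StageCompleted ones):

// Sketch of how the log is loaded and registered (Spark 1.x style).
// jsonFile infers the schema shown above from the JSON lines.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val stageEndInfos = sqlContext.jsonFile("/path/to/spark/event-log")  // placeholder path
stageEndInfos.registerTempTable("stageEndInfos")
stageEndInfos.printSchema()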
My problem is that when I query the table I can only get ArrayBuffers out
of it:

"SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD
Info` FROM stageEndInfos"
Stage ID RDD Info
1        ArrayBuffer([0,0,...
0        ArrayBuffer([0,0,...
2        ArrayBuffer([0,0,...

or:

"SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos"
RDD ID
ArrayBuffer(5, 4, 3)
ArrayBuffer(2, 1, 0)
ArrayBuffer(9, 6,...

Is there a way to explode the arrays into rows in order to build a single
flat table? (Knowing that the RDD ID is unique and could be used as a
primary key.)

Thanks!
