Hi, I'm trying to parse log files generated by Spark using SparkSQL. In the JSON elements for the StageCompleted event there is a nested structure containing an array of elements with RDD Info (see the log excerpt below, with some parts omitted).
    {
      "Event": "SparkListenerStageCompleted",
      "Stage Info": {
        "Stage ID": 1,
        ...
        "RDD Info": [
          {
            "RDD ID": 5,
            "Name": "5",
            "Storage Level": {
              "Use Disk": false,
              "Use Memory": false,
              "Use Tachyon": false,
              "Deserialized": false,
              "Replication": 1
            },
            "Number of Partitions": 2,
            "Number of Cached Partitions": 0,
            "Memory Size": 0,
            "Tachyon Size": 0,
            "Disk Size": 0
          },
          ...

When I register the log as a table, SparkSQL is able to generate the correct schema, which for the RDD Info element looks like this:

     |-- RDD Info: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- Disk Size: long (nullable = true)
     |    |    |-- Memory Size: long (nullable = true)
     |    |    |-- Name: string (nullable = true)

My problem is that if I try to query the table, I can only get array buffers out of it:

    "SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info` FROM stageEndInfos"

    Stage ID    RDD Info
    1           ArrayBuffer([0,0,...
    0           ArrayBuffer([0,0,...
    2           ArrayBuffer([0,0,...

or:

    "SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos"

    RDD ID
    ArrayBuffer(5, 4, 3)
    ArrayBuffer(2, 1, 0)
    ArrayBuffer(9, 6,...

Is there a way to explode the arrays in the rows in order to build a single table? (The RDD ID is unique and can be used as a primary key.)

Thanks!
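P.S. To make the result I'm after concrete, here is a minimal plain-Python sketch (not Spark, and not a proposed solution) of the flattening: one output row per RDD Info element, each tagged with its Stage ID. The function name explode_rdd_info is hypothetical, the event is trimmed to a few fields, and the second RDD entry (RDD ID 4) is made up for illustration because the log excerpt elides the remaining entries.

```python
import json

# One StageCompleted event, trimmed to the fields used here.
# The entry with RDD ID 4 is invented for illustration only.
event_line = json.dumps({
    "Event": "SparkListenerStageCompleted",
    "Stage Info": {
        "Stage ID": 1,
        "RDD Info": [
            {"RDD ID": 5, "Name": "5", "Memory Size": 0, "Disk Size": 0},
            {"RDD ID": 4, "Name": "4", "Memory Size": 0, "Disk Size": 0},
        ],
    },
})


def explode_rdd_info(line):
    """Return one flat dict per RDD Info element, tagged with its Stage ID."""
    stage = json.loads(line)["Stage Info"]
    rows = []
    # A missing or null "RDD Info" array simply yields no rows.
    for rdd in stage.get("RDD Info") or []:
        row = {"Stage ID": stage["Stage ID"]}
        row.update(rdd)
        rows.append(row)
    return rows


rows = explode_rdd_info(event_line)
for row in rows:
    print(row)
```

What I would like is for SparkSQL to give me the equivalent of these rows as a single table, with RDD ID as the key.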