SparkSQL Nested structure

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to parse log files generated by Spark using SparkSQL.

In the JSON elements related to the StageCompleted event we have a nested
structure containing an array of elements with RDD Info (see the log below
as an example, omitting some parts).

{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    ...
    "RDD Info": [
      {
        "RDD ID": 5,
        "Name": "5",
        "Storage Level": {
          "Use Disk": false,
          "Use Memory": false,
          "Use Tachyon": false,
          "Deserialized": false,
          "Replication": 1
        },
        "Number of Partitions": 2,
        "Number of Cached Partitions": 0,
        "Memory Size": 0,
        "Tachyon Size": 0,
        "Disk Size": 0
      },
      ...

When I register the log as a table, SparkSQL generates the correct schema,
which for the RDD Info element looks like:

 |-- RDD Info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Disk Size: long (nullable = true)
 |    |    |-- Memory Size: long (nullable = true)
 |    |    |-- Name: string (nullable = true)

My problem is that if I try to query the table I can only get array buffers
out of it:

SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info`
FROM stageEndInfos
Stage ID  RDD Info
1         ArrayBuffer([0,0,...
0         ArrayBuffer([0,0,...
2         ArrayBuffer([0,0,...

or:

SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos
RDD ID
ArrayBuffer(5, 4, 3)
ArrayBuffer(2, 1, 0)
ArrayBuffer(9, 6,...

Is there a way to explode the arrays in the rows in order to build a single
table? The RDD ID is unique and can be used as a primary key.

Thanks!



Re: SparkSQL Nested structure

2015-05-04 Thread Michael Armbrust
You are looking for LATERAL VIEW explode in HiveQL:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
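
Applied to the table from the question, a query along these lines might do
it. This is only a sketch and has not been run against the actual logs: the
aliases s, rddTable and rdd are made up for illustration, and quoting each
path component separately in backticks is an assumption that may need
adjusting for the Spark version in use. Since LATERAL VIEW is HiveQL, it
presumably has to go through a HiveContext rather than the plain SQLContext.

```sql
-- One output row per element of the "RDD Info" array.
-- s, rddTable and rdd are illustrative aliases, not names from the logs.
SELECT
  s.`Stage Info`.`Stage ID` AS stage_id,
  rdd.`RDD ID`              AS rdd_id,
  rdd.`Name`                AS rdd_name,
  rdd.`Memory Size`         AS memory_size,
  rdd.`Disk Size`           AS disk_size
FROM stageEndInfos s
LATERAL VIEW explode(s.`Stage Info`.`RDD Info`) rddTable AS rdd
```

Each row of the result should then hold a single RDD Info element next to
the Stage ID it came from, so RDD ID can act as the key of the flattened
table.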
