You are looking for LATERAL VIEW explode in HiveQL:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
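As a rough sketch of what that could look like for the table in the question: the query text below uses the table and field names from the quoted message, while the aliases `t`, `rdd`, `stage_id`, `rdd_id` and `name` are made up for illustration. It assumes the table is registered through a HiveContext (a plain SQLContext may not parse LATERAL VIEW), and I have not run it against this data. The small Python function only emulates what explode does to a single record, so the expected shape of the result is visible without a cluster; the sample record is a trimmed, illustrative version of the log excerpt.

```python
import json

# Hypothetical HiveQL text. Table and field names come from the question;
# the aliases (t, rdd, stage_id, rdd_id, name) are illustrative only.
# It would be handed to a HiveContext's sql(...) call, which is not run here.
QUERY = """
SELECT `Stage Info`.`Stage ID` AS stage_id,
       rdd.`RDD ID`            AS rdd_id,
       rdd.`Name`              AS name
FROM stageEndInfos
LATERAL VIEW explode(`Stage Info`.`RDD Info`) t AS rdd
"""


def explode_rdd_info(event):
    """Emulate LATERAL VIEW explode: one output row per array element.

    A missing or empty array yields no rows, like a non-OUTER lateral view.
    """
    stage = event["Stage Info"]
    return [
        (stage["Stage ID"], rdd["RDD ID"], rdd["Name"])
        for rdd in stage.get("RDD Info") or []
    ]


# Trimmed, illustrative record modeled on the log excerpt in the question.
sample = json.loads("""
{"Event": "SparkListenerStageCompleted",
 "Stage Info": {"Stage ID": 1,
                "RDD Info": [{"RDD ID": 5, "Name": "5"},
                             {"RDD ID": 4, "Name": "4"}]}}
""")

rows = explode_rdd_info(sample)
for row in rows:
    print(row)
```

Each array element becomes its own row with the parent Stage ID repeated next to it, so RDD ID can then serve as the key of the flattened table.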
On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco gibb...@gmail.com
wrote:
Hi, I'm trying to parse log files generated by Spark using SparkSQL.
In the JSON elements related to the StageCompleted event we have a nested
structure containing an array of elements with RDD Info (see the log below
as an example, with some parts omitted).
{
  "Event": "SparkListenerStageCompleted",
  "Stage Info": {
    "Stage ID": 1,
    ...
    "RDD Info": [
      {
        "RDD ID": 5,
        "Name": "5",
        "Storage Level": {
          "Use Disk": false,
          "Use Memory": false,
          "Use Tachyon": false,
          "Deserialized": false,
          "Replication": 1
        },
        "Number of Partitions": 2,
        "Number of Cached Partitions": 0,
        "Memory Size": 0,
        "Tachyon Size": 0,
        "Disk Size": 0
      },
      ...
When I register the log as a table, SparkSQL is able to generate the
correct schema, which for the RDD Info element looks like:
 |-- RDD Info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Disk Size: long (nullable = true)
 |    |    |-- Memory Size: long (nullable = true)
 |    |    |-- Name: string (nullable = true)
My problem is that if I try to query the table I can only get array
buffers out of it:
SELECT `stageEndInfos.Stage Info.Stage ID`, `stageEndInfos.Stage Info.RDD Info`
FROM stageEndInfos
Stage ID  RDD Info
1         ArrayBuffer([0,0,...
0         ArrayBuffer([0,0,...
2         ArrayBuffer([0,0,...
or:
SELECT `stageEndInfos.Stage Info.RDD Info.RDD ID` FROM stageEndInfos
RDD ID
ArrayBuffer(5, 4, 3)
ArrayBuffer(2, 1, 0)
ArrayBuffer(9, 6,...
Is there a way to explode the arrays in the rows in order to build a
single table, given that the RDD ID is unique and can be used as a primary
key?
Thanks!