Maybe you could try something like this using Spark SQL 1.4 and DataFrames:
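import org.apache.spark.sql.functions.{avg, sum}
// Left outer join keeps students that have no grades; column names are quoted strings
student.join(grade, grade("student_id") === student("id"), "left_outer")
  .groupBy(student("id"))
  .agg(sum(grade("Marks")), avg(grade("Marks")))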
You could refer to the following documentation:
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame
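If what you need is the array of marks per student, as array_agg gives you in Postgres, one option might be Hive's collect_list UDAF called through a HiveContext. The following is only a rough sketch under assumptions: the sample rows, the lowercase column names (id, name, student_id, marks) and the object name MarksPerStudent are made up for illustration, and it assumes a Spark 1.4 build with Hive support (Hive 0.13 or later, where collect_list is available).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical example object; names and data are illustrative only.
object MarksPerStudent {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("marks-per-student").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // collect_list is a Hive UDAF, so this sketch assumes a HiveContext
    // rather than a plain SQLContext.
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Made-up sample data standing in for the student and grade DataFrames.
    val student = sc.parallelize(Seq((1, "Ann"), (2, "Bob"))).toDF("id", "name")
    val grade = sc.parallelize(Seq((1, 90), (1, 85))).toDF("student_id", "marks")

    // Expose the DataFrames to HiveQL under temporary table names.
    student.registerTempTable("student")
    grade.registerTempTable("grade")

    // HiveQL counterpart of the Postgres query, with collect_list in place
    // of array_agg. Every non-aggregated column appears in GROUP BY.
    val marksPerStudent = hiveContext.sql(
      """SELECT s.id, s.name, collect_list(g.marks) AS marks
        |FROM student s
        |LEFT OUTER JOIN grade g ON g.student_id = s.id
        |GROUP BY s.id, s.name""".stripMargin)

    marksPerStudent.show()
    sc.stop()
  }
}
```

Note that s.name is added to the GROUP BY here because HiveQL, unlike Postgres, does not let you select a non-aggregated column that is not grouped. If duplicates should be dropped, collect_set could be used instead of collect_list.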
On Thu, Jun 25, 2015 at 5:16 PM, Richard Catlin richard.m.cat...@gmail.com
wrote:
I am looking to do something similar to this Postgres query in HiveQL. If
I have a DataFrame student and a DataFrame grade, is this possible?
I read in Learning Spark: Lightning-Fast Big Data Analysis that it should
be possible. It says in Chapter 9:
SchemaRDDs can store several basic types, as well as structures and
arrays of these types. They use the HiveQL syntax for type definitions.
(Table 9-1 is shown at this point in the book)
and it goes on to say:
The last type, structures, is simply represented as other Rows in Spark
SQL. All of these types can also be nested within each other; for example,
you can have arrays of structs, or maps that contain structs
Here is the URL of the page that describes the Postgres query:
http://stackoverflow.com/questions/10928210/postgresql-aggregate-array
SELECT s.name, array_agg(g.Mark) as marks
FROM student s
LEFT JOIN Grade g ON g.Student_id = s.Id
GROUP BY s.Id
Thank you.
Richard Catlin