Maybe you could try something like this, using Spark SQL 1.4 and DataFrames:

    import org.apache.spark.sql.functions.{avg, sum}

    // Left join keeps students that have no grades, then aggregates per student.
    student.join(grade, grade("student_id") === student("id"), "left")
      .groupBy(student("id"))
      .agg(sum(grade("Mark")), avg(grade("Mark")))

You can refer to the DataFrame API documentation:
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame

On Thu, Jun 25, 2015 at 5:16 PM, Richard Catlin <richard.m.cat...@gmail.com> wrote:

> I am looking to do something similar to this Postgres query in HiveQL. If
> I have a DataFrame student and a DataFrame grade, is this possible?
>
> I read in Learning Spark: Lightning-Fast Big Data Analysis that it should
> be possible. It says in Chapter 9
> "SchemaRDDs can store several basic types, as well as structures and
> arrays of these types. They use the HiveQL syntax for type definitions.
> Table-9-1 shown"
> and goes on to say
> "The last type, structures, is simply represented as other Rows in Spark
> SQL. All of these types can also be nested within each other; for example,
> you can have arrays of structs, or maps that contain structs"
>
> Here is the url of the page that describes the Postgres:
> http://stackoverflow.com/questions/10928210/postgresql-aggregate-array
>
> SELECT s.name, array_agg(g.Mark) as marks
> FROM student s
> LEFT JOIN Grade g ON g.Student_id = s.Id
> GROUP BY s.Id
>
> Thank you.
>
> Richard Catlin