May be you could try something like this using sparkSQL 1.4 and dataframes

student.join(Grade, Grade("student_id") === student("student_id"), "left")

       .groupBy("id")

       .agg(sum(grade("Marks")), avg(grade("Marks")))



You could refer to the following document :
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrame


On Thu, Jun 25, 2015 at 5:16 PM, Richard Catlin <richard.m.cat...@gmail.com>
wrote:

> I am looking to do something similar to this Postgres query in HiveQL.  If
> I have a DataFrame student and a DataFrame grade, is this possible?
>
> I read in Learning Spark: Lightning-Fast Big Data Analysis that it should
> be possible.  It says in Chapter 9
> "SchemaRDDs can store several basic types, as well as structures and
> arrays of these types.  They use the HiveQL syntax for type definitions.
> Table-9-1 shown"
> and goes on to say
> "The last type, structures, is simply represented as other Rows in Spark
> SQL.  All of these types can also be nested within each other; for example,
> you can have arrays of structs, or maps that contain structs"
>
> Here is the url of the page that describes the Postgres:
> http://stackoverflow.com/questions/10928210/postgresql-aggregate-array
>
> SELECT s.name,  array_agg(g.Mark) as marks
>
> FROM student s
>
> LEFT JOIN Grade g ON g.Student_id = s.Id
>
> GROUP BY s.Id
>
> Thank you.
>
> Richard Catlin
>
>
>
>

Reply via email to