Re: I need help mapping a PairRDD solution to Dataset

Michael Armbrust Wed, 20 Jan 2016 10:44:35 -0800

The analog to PairRDD is a GroupedDataset (created by calling groupBy),
which offers similar functionality, but doesn't require you to construct
new object that are in the form of key/value pairs.  It doesn't matter if
they are complex objects, as long as you can create an encoder for them
(currently supported for JavaBeans and case classes, but support for custom
encoders is on the roadmap).  These encoders are responsible for both fast
serialization and providing a view of your object that looks like a row.

Based on the description of your problem, it sounds like you can use
joinWith and just express the predicate as a column.

import org.apache.spark.sql.functions._
ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
school.region"))

The as operation is only required if you need to differentiate columns on
either side that have the same name.

Note that by defining the join condition as an expression instead of a
lambda function, we are giving Spark SQL more information about the join so
it can often do the comparison without needing to deserialize the object,
which overtime will let us put more optimizations into the engine.

You can also do this using lambda functions if you want though:

ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2) =>
  ...
}

On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> We have been working a large search problem which we have been solving in
> the following ways.
>
> We have two sets of objects, say children and schools. The object is to
> find the closest school to each child. There is a distance measure but it
> is relatively expensive and would be very costly to apply to all pairs.
>
> However the map can be divided into regions. If we assume that the closest
> school to a child is in his region of a neighboring region we need only
> compute the distance between a child and all schools in his region and
> neighboring regions.
>
> We currently use paired RDDs and a join to do this assigning children to
> one region and schools to their own region and neighboring regions and then
> creating a join and computing distances. Note the real problem is more
> complex.
>
> I can create Datasets of the two types of objects but see no Dataset
> analog for a PairRDD. How could I map my solution using PairRDDs to
> Datasets - assume the two objects are relatively complex data types and do
> not look like SQL dataset rows?
>
>
>

Re: I need help mapping a PairRDD solution to Dataset

Reply via email to