Thanks - this helps a lot except for the issue of looking at schools in
neighboring regions

On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> The analog to PairRDD is a GroupedDataset (created by calling groupBy),
> which offers similar functionality, but doesn't require you to construct
> new object that are in the form of key/value pairs.  It doesn't matter if
> they are complex objects, as long as you can create an encoder for them
> (currently supported for JavaBeans and case classes, but support for custom
> encoders is on the roadmap).  These encoders are responsible for both fast
> serialization and providing a view of your object that looks like a row.
>
> Based on the description of your problem, it sounds like you can use
> joinWith and just express the predicate as a column.
>
> import org.apache.spark.sql.functions._
> ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
> school.region"))
>
> The as operation is only required if you need to differentiate columns on
> either side that have the same name.
>
> Note that by defining the join condition as an expression instead of a
> lambda function, we are giving Spark SQL more information about the join so
> it can often do the comparison without needing to deserialize the object,
> which overtime will let us put more optimizations into the engine.
>
> You can also do this using lambda functions if you want though:
>
> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2)
> =>
>   ...
> }
>
>
> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com>
> wrote:
>
>> We have been working a large search problem which we have been solving in
>> the following ways.
>>
>> We have two sets of objects, say children and schools. The object is to
>> find the closest school to each child. There is a distance measure but it
>> is relatively expensive and would be very costly to apply to all pairs.
>>
>> However the map can be divided into regions. If we assume that the
>> closest school to a child is in his region of a neighboring region we need
>> only compute the distance between a child and all schools in his region and
>> neighboring regions.
>>
>> We currently use paired RDDs and a join to do this assigning children to
>> one region and schools to their own region and neighboring regions and then
>> creating a join and computing distances. Note the real problem is more
>> complex.
>>
>> I can create Datasets of the two types of objects but see no Dataset
>> analog for a PairRDD. How could I map my solution using PairRDDs to
>> Datasets - assume the two objects are relatively complex data types and do
>> not look like SQL dataset rows?
>>
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Reply via email to