Thanks - this helps a lot except for the issue of looking at schools in neighboring regions
On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust <mich...@databricks.com> wrote: > The analog to PairRDD is a GroupedDataset (created by calling groupBy), > which offers similar functionality, but doesn't require you to construct > new object that are in the form of key/value pairs. It doesn't matter if > they are complex objects, as long as you can create an encoder for them > (currently supported for JavaBeans and case classes, but support for custom > encoders is on the roadmap). These encoders are responsible for both fast > serialization and providing a view of your object that looks like a row. > > Based on the description of your problem, it sounds like you can use > joinWith and just express the predicate as a column. > > import org.apache.spark.sql.functions._ > ds1.as("child").joinWith(ds2.as("school"), expr("child.region = > school.region")) > > The as operation is only required if you need to differentiate columns on > either side that have the same name. > > Note that by defining the join condition as an expression instead of a > lambda function, we are giving Spark SQL more information about the join so > it can often do the comparison without needing to deserialize the object, > which overtime will let us put more optimizations into the engine. > > You can also do this using lambda functions if you want though: > > ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2) > => > ... > } > > > On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com> > wrote: > >> We have been working a large search problem which we have been solving in >> the following ways. >> >> We have two sets of objects, say children and schools. The object is to >> find the closest school to each child. There is a distance measure but it >> is relatively expensive and would be very costly to apply to all pairs. >> >> However the map can be divided into regions. If we assume that the >> closest school to a child is in his region of a neighboring region we need >> only compute the distance between a child and all schools in his region and >> neighboring regions. >> >> We currently use paired RDDs and a join to do this assigning children to >> one region and schools to their own region and neighboring regions and then >> creating a join and computing distances. Note the real problem is more >> complex. >> >> I can create Datasets of the two types of objects but see no Dataset >> analog for a PairRDD. How could I map my solution using PairRDDs to >> Datasets - assume the two objects are relatively complex data types and do >> not look like SQL dataset rows? >> >> >> > -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com