Yeah, that tough. Perhaps you could do something like a flatMap and emit multiple virtual copies of each student for each region that is neighboring their actual region.
On Wed, Jan 20, 2016 at 10:50 AM, Steve Lewis <lordjoe2...@gmail.com> wrote: > Thanks - this helps a lot except for the issue of looking at schools in > neighboring regions > > On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust <mich...@databricks.com > > wrote: > >> The analog to PairRDD is a GroupedDataset (created by calling groupBy), >> which offers similar functionality, but doesn't require you to construct >> new object that are in the form of key/value pairs. It doesn't matter if >> they are complex objects, as long as you can create an encoder for them >> (currently supported for JavaBeans and case classes, but support for custom >> encoders is on the roadmap). These encoders are responsible for both fast >> serialization and providing a view of your object that looks like a row. >> >> Based on the description of your problem, it sounds like you can use >> joinWith and just express the predicate as a column. >> >> import org.apache.spark.sql.functions._ >> ds1.as("child").joinWith(ds2.as("school"), expr("child.region = >> school.region")) >> >> The as operation is only required if you need to differentiate columns on >> either side that have the same name. >> >> Note that by defining the join condition as an expression instead of a >> lambda function, we are giving Spark SQL more information about the join so >> it can often do the comparison without needing to deserialize the object, >> which overtime will let us put more optimizations into the engine. >> >> You can also do this using lambda functions if you want though: >> >> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2) >> => >> ... >> } >> >> >> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com> >> wrote: >> >>> We have been working a large search problem which we have been solving >>> in the following ways. >>> >>> We have two sets of objects, say children and schools. The object is to >>> find the closest school to each child. There is a distance measure but it >>> is relatively expensive and would be very costly to apply to all pairs. >>> >>> However the map can be divided into regions. If we assume that the >>> closest school to a child is in his region of a neighboring region we need >>> only compute the distance between a child and all schools in his region and >>> neighboring regions. >>> >>> We currently use paired RDDs and a join to do this assigning children to >>> one region and schools to their own region and neighboring regions and then >>> creating a join and computing distances. Note the real problem is more >>> complex. >>> >>> I can create Datasets of the two types of objects but see no Dataset >>> analog for a PairRDD. How could I map my solution using PairRDDs to >>> Datasets - assume the two objects are relatively complex data types and do >>> not look like SQL dataset rows? >>> >>> >>> >> > > > -- > Steven M. Lewis PhD > 4221 105th Ave NE > Kirkland, WA 98033 > 206-384-1340 (cell) > Skype lordjoe_com > >