We have been working on a large search problem, which we currently solve as follows.
We have two sets of objects, say children and schools. The goal is to find the closest school to each child. There is a distance measure, but it is relatively expensive and would be very costly to apply to all pairs. However, the map can be divided into regions. If we assume that the closest school to a child is in the child's region or a neighboring region, we need only compute the distance between a child and the schools in those regions.

We currently do this with pair RDDs and a join: each child is keyed by its own region, each school is keyed by its own region and by every neighboring region, and we then join on the region key and compute distances for the joined pairs. Note that the real problem is more complex.

I can create Datasets of the two types of objects, but I see no Dataset analog of a PairRDD. How could I map my PairRDD solution to Datasets? Assume the two objects are relatively complex data types and do not look like SQL rows.
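To make the approach concrete, the region-keyed join described above can be sketched roughly as follows. This is plain Python emulating the join with a dictionary, not Spark code. `Child`, `School`, `CELL`, `region_of`, `regions_with_neighbors`, and the Euclidean `distance` are all hypothetical stand-ins; the real types and the real distance measure are more complex.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import floor, hypot

CELL = 1.0  # hypothetical region size: the map is cut into CELL x CELL squares


@dataclass(frozen=True)
class Child:  # hypothetical stand-in for the real child type
    name: str
    x: float
    y: float


@dataclass(frozen=True)
class School:  # hypothetical stand-in for the real school type
    name: str
    x: float
    y: float


def region_of(x, y):
    """Key of the region that contains the point (x, y)."""
    return (floor(x / CELL), floor(y / CELL))


def regions_with_neighbors(x, y):
    """Key of the point's own region plus its eight neighboring regions."""
    rx, ry = region_of(x, y)
    return [(rx + dx, ry + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]


def distance(child, school):
    """Stand-in for the expensive distance measure (plain Euclidean here)."""
    return hypot(child.x - school.x, child.y - school.y)


def closest_schools(children, schools):
    """Map each child's name to the name of the nearest candidate school."""
    # Schools side of the join: one entry per (region, school) pair,
    # emitted for the school's own region and every neighboring region.
    schools_by_region = defaultdict(list)
    for school in schools:
        for key in regions_with_neighbors(school.x, school.y):
            schools_by_region[key].append(school)

    # Children side of the join: each child is keyed by its own region only,
    # so distances are computed only against schools sharing that key.
    result = {}
    for child in children:
        candidates = schools_by_region.get(region_of(child.x, child.y), [])
        if candidates:  # a child with no nearby school gets no entry
            nearest = min(candidates, key=lambda s: distance(child, s))
            result[child.name] = nearest.name
    return result
```

In the Spark version, the dictionary lookup is what the join on the region key does, and the `min` per child is a reduction over the joined pairs. The question is how to express that keying and join when both sides are Datasets of non-tabular objects.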