We have been working on a large search problem, which we currently solve in
the following way.

We have two sets of objects, say children and schools. The goal is to
find the closest school to each child. There is a distance measure, but it
is relatively expensive and would be very costly to apply to all pairs.

However, the map can be divided into regions. If we assume that the closest
school to a child is in the child's region or a neighboring region, we need
only compute the distance between a child and all schools in that region
and its neighboring regions.

We currently use pair RDDs and a join to do this: children are assigned to
one region, schools are assigned to their own region and its neighboring
regions, and we then join on region and compute distances. Note that the
real problem is more complex.
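
To make the keying scheme concrete, here is a minimal sketch of the same
logic in plain Java collections, with no Spark involved. Everything in it
is an assumption made for illustration: the Place class, the square grid
with cell size CELL, and the Euclidean distance standing in for our
expensive measure. In the real job the map lookup is the pair RDD join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RegionJoinSketch {
    // Hypothetical stand-in for the real (much more complex) child and school types.
    static final class Place {
        final String name;
        final double x, y;
        Place(String name, double x, double y) { this.name = name; this.x = x; this.y = y; }
    }

    static final double CELL = 10.0; // assumed region size on a square grid

    static int cell(double v) { return (int) Math.floor(v / CELL); }

    static String key(int rx, int ry) { return rx + ":" + ry; }

    // Placeholder for the expensive distance measure.
    static double distance(Place a, Place b) { return Math.hypot(a.x - b.x, a.y - b.y); }

    static Map<String, String> closestSchool(List<Place> children, List<Place> schools) {
        // Key each school by its own region and by the eight neighboring regions.
        Map<String, List<Place>> schoolsByRegion = new HashMap<>();
        for (Place s : schools) {
            for (int dx = -1; dx <= 1; dx++) {
                for (int dy = -1; dy <= 1; dy++) {
                    String k = key(cell(s.x) + dx, cell(s.y) + dy);
                    schoolsByRegion.computeIfAbsent(k, unused -> new ArrayList<>()).add(s);
                }
            }
        }
        // Key each child by its single region, "join" on the key, keep the minimum.
        Map<String, String> result = new TreeMap<>();
        for (Place c : children) {
            List<Place> candidates =
                schoolsByRegion.getOrDefault(key(cell(c.x), cell(c.y)), new ArrayList<>());
            Place best = null;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (Place s : candidates) {
                double d = distance(c, s); // only evaluated for nearby schools
                if (d < bestDistance) { bestDistance = d; best = s; }
            }
            if (best != null) result.put(c.name, best.name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Place> children = new ArrayList<>();
        children.add(new Place("ann", 1, 1));
        children.add(new Place("bob", 19, 1));
        List<Place> schools = new ArrayList<>();
        schools.add(new Place("north", 2, 2));
        schools.add(new Place("east", 21, 1));
        System.out.println(closestSchool(children, schools));
    }
}
```

A child with no school in its own or a neighboring region gets no entry
in the result, which matches the behavior of an inner join on region.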

I can create Datasets of the two types of objects, but I see no Dataset
analog of a PairRDD. How could I map my PairRDD solution to Datasets?
Assume the two objects are relatively complex data types and do not look
like SQL dataset rows.
