Re: I need help mapping a PairRDD solution to Dataset

Michael Armbrust Wed, 20 Jan 2016 11:23:48 -0800

Yeah, that tough.  Perhaps you could do something like a flatMap and emit
multiple virtual copies of each student for each region that is neighboring
their actual region.


On Wed, Jan 20, 2016 at 10:50 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> Thanks - this helps a lot except for the issue of looking at schools in
> neighboring regions
>
> On Wed, Jan 20, 2016 at 10:43 AM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> The analog to PairRDD is a GroupedDataset (created by calling groupBy),
>> which offers similar functionality, but doesn't require you to construct
>> new object that are in the form of key/value pairs.  It doesn't matter if
>> they are complex objects, as long as you can create an encoder for them
>> (currently supported for JavaBeans and case classes, but support for custom
>> encoders is on the roadmap).  These encoders are responsible for both fast
>> serialization and providing a view of your object that looks like a row.
>>
>> Based on the description of your problem, it sounds like you can use
>> joinWith and just express the predicate as a column.
>>
>> import org.apache.spark.sql.functions._
>> ds1.as("child").joinWith(ds2.as("school"), expr("child.region =
>> school.region"))
>>
>> The as operation is only required if you need to differentiate columns on
>> either side that have the same name.
>>
>> Note that by defining the join condition as an expression instead of a
>> lambda function, we are giving Spark SQL more information about the join so
>> it can often do the comparison without needing to deserialize the object,
>> which overtime will let us put more optimizations into the engine.
>>
>> You can also do this using lambda functions if you want though:
>>
>> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2)
>> =>
>>   ...
>> }
>>
>>
>> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com>
>> wrote:
>>
>>> We have been working a large search problem which we have been solving
>>> in the following ways.
>>>
>>> We have two sets of objects, say children and schools. The object is to
>>> find the closest school to each child. There is a distance measure but it
>>> is relatively expensive and would be very costly to apply to all pairs.
>>>
>>> However the map can be divided into regions. If we assume that the
>>> closest school to a child is in his region of a neighboring region we need
>>> only compute the distance between a child and all schools in his region and
>>> neighboring regions.
>>>
>>> We currently use paired RDDs and a join to do this assigning children to
>>> one region and schools to their own region and neighboring regions and then
>>> creating a join and computing distances. Note the real problem is more
>>> complex.
>>>
>>> I can create Datasets of the two types of objects but see no Dataset
>>> analog for a PairRDD. How could I map my solution using PairRDDs to
>>> Datasets - assume the two objects are relatively complex data types and do
>>> not look like SQL dataset rows?
>>>
>>>
>>>
>>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>

Re: I need help mapping a PairRDD solution to Dataset

Reply via email to