Andrew,

Thank you very much for your feedback.  Unfortunately, the ranges are not
of predictable size, but you gave me an idea of how to handle that.  Here's
what I'm thinking:

1. Choose a number of partitions, n, over the IP space
2. Preprocess the IPRanges, splitting any of them that cross partition
boundaries
3. Partition ipToUrl and the new ipRangeToZip according to the partitioning
scheme from step 1
4. Join matching partitions of these two RDDs
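
To make steps 1 and 2 concrete, here is a rough sketch in plain Python (no
Spark involved).  It assumes IPv4 addresses represented as integers, ranges
given as inclusive (lo, hi) pairs, and n equal-width partitions.  The names
bucket_of, split_range and to_int are made up for illustration.

```python
import ipaddress

IP_SPACE = 2 ** 32  # IPv4 only, an assumption of this sketch


def to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(dotted))


def bucket_of(ip: int, n: int) -> int:
    """Step 1: map an IPv4 address (as an int) to one of n equal-width buckets."""
    width = -(-IP_SPACE // n)  # ceiling division so n buckets cover the space
    return ip // width


def split_range(lo: int, hi: int, n: int):
    """Step 2: cut the inclusive range [lo, hi] at every bucket boundary.

    Yields (bucket, piece_lo, piece_hi); each piece lies in exactly one bucket.
    """
    width = -(-IP_SPACE // n)
    while lo <= hi:
        b = lo // width
        piece_hi = min(hi, (b + 1) * width - 1)  # last address of bucket b
        yield b, lo, piece_hi
        lo = piece_hi + 1
```

With n = 4 the boundaries fall at 64.0.0.0, 128.0.0.0 and 192.0.0.0, so a
range from 10.0.0.0 to 70.0.0.0 would be split into two pieces, one per
bucket it touches.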

I still don't know how to do step 4, though.  I see that RDDs have a
mapPartitions() operation that lets you do whatever you want with a
partition.  What I need is a way to get my hands on two partitions at once,
one from each RDD.
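
One possibility that might sidestep the problem is to key both data sets by
partition number and let an ordinary grouping on that key bring matching
partitions together (in Spark that would presumably be a cogroup on the
bucket key).  The sketch below is plain Python with a dict standing in for
the grouping.  It assumes IPv4 addresses as integers and ranges as inclusive
(lo, hi, zip) tuples already split at bucket boundaries; join_by_bucket is a
made-up name.

```python
from collections import defaultdict

IP_SPACE = 2 ** 32  # IPv4 only, an assumption of this sketch


def join_by_bucket(ip_to_url, split_ranges, n):
    """Join (ip, url) records with (lo, hi, zip) ranges, bucket by bucket.

    split_ranges must already be cut at bucket boundaries (step 2), so each
    range lies entirely inside the bucket of its lower bound.
    """
    width = -(-IP_SPACE // n)  # ceiling division: n buckets cover the space

    # Step 3: key both inputs by bucket number (a stand-in for cogroup).
    buckets = defaultdict(lambda: ([], []))
    for ip, url in ip_to_url:
        buckets[ip // width][0].append((ip, url))
    for lo, hi, zip_code in split_ranges:
        buckets[lo // width][1].append((lo, hi, zip_code))

    # Step 4: only records that share a bucket are compared.
    joined = []
    for _, (ips, ranges) in sorted(buckets.items()):
        for ip, url in ips:
            for lo, hi, zip_code in ranges:
                if lo <= ip <= hi:  # plays the role of ipRange.matches(ip)
                    joined.append((ip, (url, zip_code)))
    return joined
```

The inner loop is still a brute-force scan, but only within one bucket, so a
larger n should shrink the number of comparisons per bucket.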

Any ideas?

Thanks,

Roger


On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <and...@andrewash.com> wrote:

> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
> 192.168.0.0/16 or 10.0.0.0/8?
>
> If the range is always an even subnet mask and not split across subnets,
> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
> joining the two RDDs.  The expansion would be at most 32x if all your
> ranges can be expressed in CIDR notation, and in practice would be much
> smaller than that (typically you don't need things bigger than a /8 and
> often not smaller than a /24).
>
> Hopefully you can use your knowledge of the ip ranges to make this
> feasible.
>
> Otherwise, you could additionally flatMap the ipRangeToZip out to a list
> of CIDR notations and do the join then, but at that point the cartesian
> product starts to work against you at scale.
>
> Andrew
>
>
> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to join two RDDs with different key types
>> and appreciate any suggestions.
>>
>> Say I have two RDDs:
>>     ipToUrl of type (IP, String)
>>     ipRangeToZip of type (IPRange, String)
>>
>> How can I join/cogroup these two RDDs together to produce a new RDD of
>> type (IP, (String, String)) where IP is the key and the values are the urls
>> and zipcodes?
>>
>> Say I have a method on the IPRange class called matches(ip: IP).  I want
>> the joined records to match when ipRange.matches(ip).
>>
>> Thanks,
>>
>> Roger
>>
>>
>
