I'm thinking of creating a union type for the key so that IPRange and IP
types can be joined.
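
Roughly, the union key might look like the sketch below. This is only an illustration under some assumptions: addresses are IPv4 and stored as a Long, range bounds are inclusive, and every name here (IpKey, SingleIp, IpRangeKey, IpKeys, split, partitionOf) is hypothetical rather than anything from Spark or from existing code. It also covers the range splitting from step 2 of the plan quoted below.

```scala
// Hypothetical union key: both a single IP and an IP range expose inclusive bounds.
sealed trait IpKey { def start: Long; def end: Long }

// A single address is a degenerate range whose start and end coincide.
final case class SingleIp(ip: Long) extends IpKey {
  val start: Long = ip
  val end: Long = ip
}

final case class IpRangeKey(start: Long, end: Long) extends IpKey {
  def matches(ip: SingleIp): Boolean = start <= ip.ip && ip.ip <= end
}

object IpKeys {
  // Assumption: IPv4 only, so the address space holds 2^32 values.
  val AddressSpace: Long = 1L << 32

  // Width of each of the n equal partitions over the address space (rounded up).
  def width(n: Int): Long = (AddressSpace + n - 1) / n

  // Parse a dotted quad such as "10.0.0.1" into its numeric value.
  def parse(dotted: String): Long =
    dotted.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Split a range at partition boundaries so that no piece crosses one.
  def split(r: IpRangeKey, n: Int): Seq[IpRangeKey] = {
    val w = width(n)
    ((r.start / w) to (r.end / w)).map { p =>
      IpRangeKey(math.max(r.start, p * w), math.min(r.end, (p + 1) * w - 1))
    }
  }

  // Partition index of a key; only meaningful for ranges that were split first.
  def partitionOf(key: IpKey, n: Int): Int = (key.start / width(n)).toInt
}
```

With both RDDs keyed by IpKey, a custom partitioner could call partitionOf for either kind of key, so an IP and any split range that contains it would land in the same partition index.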


On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:

> Andrew,
>
> Thank you very much for your feedback.  Unfortunately, the ranges are not
> of predictable size but you gave me an idea of how to handle it.  Here's
> what I'm thinking:
>
> 1. Choose number of partitions, n, over IP space
> 2. Preprocess the IPRanges, splitting any of them that cross partition
> boundaries
> 3. Partition ipToUrl and the new ipRangeToZip according to the
> partitioning scheme from step 1
> 4. Join matching partitions of these two RDDs
>
> I still don't know how to do step 4 though.  I see that RDDs have a
> mapPartitions() operation to let you do whatever you want with a partition.
> What I need is a way to get my hands on two partitions at once, one from
> each RDD.
>
> Any ideas?
>
> Thanks,
>
> Roger
>
>
> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>> 192.168.0.0/16 or 10.0.0.0/8?
>>
>> If the range is always an even subnet mask and not split across subnets,
>> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
>> joining the two RDDs.  The expansion would be at most 32x if all your
>> ranges can be expressed in CIDR notation, and in practice would be much
>> smaller than that (typically you don't need things bigger than a /8 and
>> often not smaller than a /24)
>>
>> Hopefully you can use your knowledge of the ip ranges to make this
>> feasible.
>>
>> Otherwise, you could additionally flatMap the ipRangeToZip out to a list
>> of CIDR notations and do the join then, but you're starting to have the
>> cartesian product work against you on scale at that point.
>>
>> Andrew
>>
>>
>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to figure out how to join two RDDs with different key types
>>> and appreciate any suggestions.
>>>
>>> Say I have two RDDs:
>>>     ipToUrl of type (IP, String)
>>>     ipRangeToZip of type (IPRange, String)
>>>
>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>> type (IP, (String, String)) where IP is the key and the values are the urls
>>> and zipcodes?
>>>
>>> Say I have a method on the IPRange class called matches(ip: IP), I want
>>> the joined records to match when ipRange.matches(ip).
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>
>
