Thanks for following up.  I hope to get some free time this afternoon to
get it working.  Will let you know.


On Wed, Apr 16, 2014 at 12:43 PM, Andrew Ash <and...@andrewash.com> wrote:

> Glad to hear you're making progress!  Do you have a working version of the
> join?  Is there anything else you need help with?
>
>
> On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
>> Ah, in case this helps others, looks like RDD.zipPartitions will
>> accomplish step 4.
>>
>>
>> On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>
>>> Andrew,
>>>
>>> Thank you very much for your feedback.  Unfortunately, the ranges are
>>> not of predictable size but you gave me an idea of how to handle it.
>>> Here's what I'm thinking:
>>>
>>> 1. Choose number of partitions, n, over IP space
>>> 2. Preprocess the IPRanges, splitting any of them that cross partition
>>> boundaries
>>> 3. Partition ipToUrl and the new ipRangeToZip according to the
>>> partitioning scheme from step 1
>>> 4. Join matching partitions of these two RDDs
>>>
>>> I still don't know how to do step 4 though.  I see that RDDs have a
>>> mapPartitions() operation to let you do whatever you want with a partition.
>>> What I need is a way to get my hands on two partitions at once, one from
>>> each RDD.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>>>> 192.168.0.0/16 or 10.0.0.0/8?
>>>>
>>>> If the range is always an even subnet mask and not split across
>>>> subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and
>>>> then joining the two RDDs.  The expansion would be at most 32x if all your
>>>> ranges can be expressed in CIDR notation, and in practice would be much
>>>> smaller than that (typically you don't need things bigger than a /8 and
>>>> often not smaller than a /24)
>>>>
>>>> Hopefully you can use your knowledge of the IP ranges to make this
>>>> feasible.
>>>>
>>>> Otherwise, you could additionally flatMap the ipRangeToZip out to a
>>>> list of CIDR notations and do the join then, but at that point the
>>>> cartesian product starts to work against you at scale.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to figure out how to join two RDDs with different key types
>>>>> and appreciate any suggestions.
>>>>>
>>>>> Say I have two RDDs:
>>>>>     ipToUrl of type (IP, String)
>>>>>     ipRangeToZip of type (IPRange, String)
>>>>>
>>>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>>>> type (IP, (String, String)) where IP is the key and the values are the
>>>>> URLs and zip codes?
>>>>>
>>>>> Say I have a method on the IPRange class called matches(ip: IP), I
>>>>> want the joined records to match when ipRange.matches(ip).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Roger
>>>>>
>>>>>
>>>>
>>>
>>
>
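
For anyone reading this later, here is a rough sketch of the four-step plan quoted above, written in plain Python rather than Spark so it can be run on its own. It is only an illustration under some assumptions: IPs are modelled as 32-bit integers, an IPRange is modelled as an inclusive (lo, hi) pair, and ordinary lists stand in for RDD partitions. All function names (ip_int, partition_of, split_range, partitioned_range_join) are made up for this example and are not part of any library. In Spark, the final per-partition loop is roughly the work you would do inside the function passed to RDD.zipPartitions.

```python
import ipaddress
from collections import defaultdict

SPACE = 2 ** 32  # size of the IPv4 address space


def ip_int(text):
    """Convert a dotted-quad string to its integer value."""
    return int(ipaddress.IPv4Address(text))


def partition_width(n):
    """Width of each of the n equal slices of the IP space (rounded up)."""
    return -(-SPACE // n)


def partition_of(ip, n):
    """Step 1: map an integer IP to one of n partitions."""
    return ip // partition_width(n)


def split_range(lo, hi, n):
    """Step 2: cut the inclusive range [lo, hi] at partition boundaries.

    Yields (partition, (piece_lo, piece_hi)) so that every piece lies
    entirely inside one partition.
    """
    width = partition_width(n)
    while lo <= hi:
        part = lo // width
        piece_hi = min(hi, (part + 1) * width - 1)
        yield part, (lo, piece_hi)
        lo = piece_hi + 1


def partitioned_range_join(ip_to_url, range_to_zip, n):
    """Steps 3 and 4: bucket both datasets, then join bucket by bucket."""
    ip_parts = defaultdict(list)
    for ip, url in ip_to_url:
        ip_parts[partition_of(ip, n)].append((ip, url))

    range_parts = defaultdict(list)
    for (lo, hi), zipcode in range_to_zip:
        for part, piece in split_range(lo, hi, n):
            range_parts[part].append((piece, zipcode))

    joined = []
    for part in sorted(ip_parts):
        # Only ranges in the same partition can possibly match.
        for ip, url in ip_parts[part]:
            for (lo, hi), zipcode in range_parts.get(part, []):
                if lo <= ip <= hi:
                    joined.append((ip, (url, zipcode)))
    return joined


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    urls = [(ip_int("10.0.0.5"), "a.example"),
            (ip_int("200.1.2.3"), "b.example")]
    zips = [((ip_int("10.0.0.0"), ip_int("10.255.255.255")), "94105"),
            ((ip_int("127.0.0.0"), ip_int("201.0.0.0")), "10001")]
    for ip, pair in partitioned_range_join(urls, zips, 4):
        print(ipaddress.IPv4Address(ip), pair)
```

The inner loop is a linear scan over the ranges in a partition. If the ranges within a partition were sorted and non-overlapping, a binary search would likely be a better fit, but that depends on the data.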
