Thanks for following up. I hope to get some free time this afternoon to get it working. Will let you know.

On Wed, Apr 16, 2014 at 12:43 PM, Andrew Ash <and...@andrewash.com> wrote:

> Glad to hear you're making progress! Do you have a working version of the
> join? Is there anything else you need help with?
>
>
> On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>
>> Ah, in case this helps others, looks like RDD.zipPartitions will
>> accomplish step 4.
>>
>>
>> On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>
>>> Andrew,
>>>
>>> Thank you very much for your feedback. Unfortunately, the ranges are
>>> not of predictable size but you gave me an idea of how to handle it.
>>> Here's what I'm thinking:
>>>
>>> 1. Choose number of partitions, n, over IP space
>>> 2. Preprocess the IPRanges, splitting any of them that cross partition
>>>    boundaries
>>> 3. Partition ipToUrl and the new ipRangeToZip according to the
>>>    partitioning scheme from step 1
>>> 4. Join matching partitions of these two RDDs
>>>
>>> I still don't know how to do step 4 though. I see that RDDs have a
>>> mapPartitions() operation to let you do whatever you want with a partition.
>>> What I need is a way to get my hands on two partitions at once, each from
>>> different RDDs.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>>
>>> Roger
>>>
>>>
>>> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>
>>>> Are your IPRanges all on nice, even CIDR-format ranges? E.g.
>>>> 192.168.0.0/16 or 10.0.0.0/8?
>>>>
>>>> If the range is always an even subnet mask and not split across
>>>> subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and
>>>> then joining the two RDDs. The expansion would be at most 32x if all your
>>>> ranges can be expressed in CIDR notation, and in practice would be much
>>>> smaller than that (typically you don't need things bigger than a /8 and
>>>> often not smaller than a /24)
>>>>
>>>> Hopefully you can use your knowledge of the ip ranges to make this
>>>> feasible.
>>>>
>>>> Otherwise, you could additionally flatmap the ipRangeToZip out to a
>>>> list of CIDR notations and do the join then, but you're starting to have
>>>> the cartesian product work against you on scale at that point.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to figure out how to join two RDDs with different key types
>>>>> and appreciate any suggestions.
>>>>>
>>>>> Say I have two RDDS:
>>>>> ipToUrl of type (IP, String)
>>>>> ipRangeToZip of type (IPRange, String)
>>>>>
>>>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>>>> type (IP, (String, String)) where IP is the key and the values are the
>>>>> urls and zipcodes?
>>>>>
>>>>> Say I have a method on the IPRange class called matches(ip: IP), I
>>>>> want the joined records to match when ipRange.matches(ip).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Roger
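
The four-step plan quoted above can be sketched in plain Python without Spark, just to make the partitioning logic concrete. This is only an illustration under stated assumptions: IPs are IPv4 addresses represented as integers, ranges are inclusive (start, end) pairs, partitions are equal-width slices of the address space, and the function names (partition_of, split_range, range_join) are hypothetical. In Spark, the per-partition loop in step 4 is the part that RDD.zipPartitions would take over, as noted in the thread.

```python
from collections import defaultdict

IP_SPACE = 2 ** 32  # assumption: IPv4 addresses represented as integers


def partition_width(n):
    """Width of each of the n equal partitions (ceiling division)."""
    return -(-IP_SPACE // n)


def partition_of(ip, n):
    """Step 1: map an integer IP to a partition index in 0..n-1."""
    return ip // partition_width(n)


def split_range(start, end, n):
    """Step 2: split the inclusive range [start, end] at partition boundaries.

    Returns a list of (partition, piece_start, piece_end) tuples, so that
    every piece lies entirely inside one partition.
    """
    width = partition_width(n)
    pieces = []
    while start <= end:
        p = start // width
        piece_end = min(end, (p + 1) * width - 1)
        pieces.append((p, start, piece_end))
        start = piece_end + 1
    return pieces


def range_join(ip_to_url, range_to_zip, n=4):
    """Steps 3 and 4: bucket both datasets by partition, then join per bucket."""
    # Step 3: partition both sides with the same scheme.
    ips_by_part = defaultdict(list)
    for ip, url in ip_to_url:
        ips_by_part[partition_of(ip, n)].append((ip, url))

    ranges_by_part = defaultdict(list)
    for (start, end), zipcode in range_to_zip:
        for p, s, e in split_range(start, end, n):
            ranges_by_part[p].append((s, e, zipcode))

    # Step 4: only compare IPs and ranges that share a partition.
    joined = []
    for p, ips in ips_by_part.items():
        for ip, url in ips:
            for s, e, zipcode in ranges_by_part.get(p, []):
                if s <= ip <= e:  # stands in for ipRange.matches(ip)
                    joined.append((ip, (url, zipcode)))
    return joined
```

Because ranges are split at partition boundaries first, each IP only needs to be checked against the ranges in its own partition, which is what avoids the full cartesian product. The inner loop is a simple linear scan; sorting the ranges within a partition would allow a faster lookup, but that is left out to keep the sketch short.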