Ah, in case this helps others, looks like RDD.zipPartitions will accomplish step 4.
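
For anyone finding this thread later, here is a rough sketch of the four steps quoted below. It uses plain Scala collections rather than Spark so it runs on its own. The IP and IPRange definitions, the equal-width partitioning, and every helper name (parseIp, partitionOf, split, bucket, joinPartition, rangeJoin) are my own assumptions for illustration, not anything from Spark or from the original code.

```scala
object RangeJoinSketch {
  // Hypothetical stand-ins for the IP and IPRange types in the thread:
  // an IPv4 address is held as an unsigned 32-bit value in a Long.
  type IP = Long
  final case class IPRange(start: IP, end: IP) { // both ends inclusive
    def matches(ip: IP): Boolean = start <= ip && ip <= end
  }

  val IpSpace: Long = 1L << 32

  def parseIp(s: String): IP =
    s.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Step 1: n equal-width partitions over the IPv4 space (width rounded up).
  def width(n: Int): Long = (IpSpace + n - 1) / n
  def partitionOf(ip: IP, n: Int): Int = (ip / width(n)).toInt

  // Step 2: split a range wherever it crosses a partition boundary,
  // tagging each piece with the partition it falls in.
  def split(r: IPRange, n: Int): Seq[(Int, IPRange)] = {
    val w = width(n)
    (partitionOf(r.start, n) to partitionOf(r.end, n)).map { p =>
      val lo = math.max(r.start, p * w)
      val hi = math.min(r.end, (p + 1) * w - 1)
      (p, IPRange(lo, hi))
    }
  }

  // Step 3: place tagged records into n buckets, one per partition.
  def bucket[A](items: Seq[(Int, A)], n: Int): Vector[Seq[A]] = {
    val byPartition = items.groupBy(_._1)
    Vector.tabulate(n)(p => byPartition.getOrElse(p, Seq.empty[(Int, A)]).map(_._2))
  }

  // Step 4: join one partition of IPs against the matching partition of
  // ranges. This has the shape of a function one could hand to zipPartitions.
  def joinPartition(
      ips: Iterator[(IP, String)],
      ranges: Iterator[(IPRange, String)]): Iterator[(IP, (String, String))] = {
    val rs = ranges.toVector // materialised because it is scanned once per IP
    ips.flatMap { case (ip, url) =>
      rs.collect { case (range, zip) if range.matches(ip) => (ip, (url, zip)) }
    }
  }

  def rangeJoin(
      ipToUrl: Seq[(IP, String)],
      ipRangeToZip: Seq[(IPRange, String)],
      n: Int): Seq[(IP, (String, String))] = {
    val ipParts = bucket(ipToUrl.map { case (ip, url) => (partitionOf(ip, n), (ip, url)) }, n)
    val rangeParts = bucket(
      ipRangeToZip.flatMap { case (r, zip) =>
        split(r, n).map { case (p, piece) => (p, (piece, zip)) }
      },
      n)
    ipParts.zip(rangeParts).flatMap { case (ips, ranges) =>
      joinPartition(ips.iterator, ranges.iterator)
    }
  }
}
```

With real RDDs, I believe the last step would look roughly like ipParts.zipPartitions(rangeParts)(joinPartition), given that both RDDs have the same number of partitions and were partitioned with the same scheme. The linear scan over the ranges in joinPartition is only the simplest thing that works; sorting the ranges within each partition and doing a binary search would be the obvious next step if partitions hold many ranges.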

On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:

> Andrew,
>
> Thank you very much for your feedback.  Unfortunately, the ranges are not
> of predictable size but you gave me an idea of how to handle it.  Here's
> what I'm thinking:
>
> 1. Choose number of partitions, n, over IP space
> 2. Preprocess the IPRanges, splitting any of them that cross partition
>    boundaries
> 3. Partition ipToUrl and the new ipRangeToZip according to the
>    partitioning scheme from step 1
> 4. Join matching partitions of these two RDDs
>
> I still don't know how to do step 4 though.  I see that RDDs have a
> mapPartitions() operation to let you do whatever you want with a partition.
> What I need is a way to get my hands on two partitions at once, each from
> different RDDs.
>
> Any ideas?
>
> Thanks,
>
> Roger
>
>
> On Mon, Apr 14, 2014 at 5:45 PM, Andrew Ash <and...@andrewash.com> wrote:
>
>> Are your IPRanges all on nice, even CIDR-format ranges?  E.g.
>> 192.168.0.0/16 or 10.0.0.0/8?
>>
>> If the range is always an even subnet mask and not split across subnets,
>> I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
>> joining the two RDDs.  The expansion would be at most 32x if all your
>> ranges can be expressed in CIDR notation, and in practice would be much
>> smaller than that (typically you don't need things bigger than a /8 and
>> often not smaller than a /24)
>>
>> Hopefully you can use your knowledge of the ip ranges to make this
>> feasible.
>>
>> Otherwise, you could additionally flatmap the ipRangeToZip out to a list
>> of CIDR notations and do the join then, but you're starting to have the
>> cartesian product work against you on scale at that point.
>>
>> Andrew
>>
>>
>> On Tue, Apr 15, 2014 at 1:07 AM, Roger Hoover <roger.hoo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to figure out how to join two RDDs with different key types
>>> and appreciate any suggestions.
>>>
>>> Say I have two RDDs:
>>>    ipToUrl of type (IP, String)
>>>    ipRangeToZip of type (IPRange, String)
>>>
>>> How can I join/cogroup these two RDDs together to produce a new RDD of
>>> type (IP, (String, String)) where IP is the key and the values are the urls
>>> and zipcodes?
>>>
>>> Say I have a method on the IPRange class called matches(ip: IP), I want
>>> the joined records to match when ipRange.matches(ip).
>>>
>>> Thanks,
>>>
>>> Roger