Great point. We don't need geo-diversity for websites with the IP address issue, so we could handle that case specially, on a one-off basis.
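To make the one-off handling concrete, here is a rough sketch of pinning each such site to a small, stable subset of egress IPs, along the lines of the 3-5 IPs per site suggested in the quoted message. The pool addresses are placeholders from the documentation ranges, the function name is made up, and rendezvous hashing is just one convenient way to get a deterministic subset; nothing in this thread prescribes it.

```python
import hashlib

# Hypothetical pool of NAT gateway egress IPs (RFC 5737 documentation
# addresses, not real gateways).
EGRESS_POOL = [
    "192.0.2.10", "192.0.2.11",
    "198.51.100.10", "198.51.100.11",
    "203.0.113.10", "203.0.113.11",
]


def egress_ips_for_site(site: str, pool: list[str], k: int = 3) -> list[str]:
    """Pick a stable subset of k egress IPs for one site.

    Uses rendezvous (highest-random-weight) hashing: every IP gets a score
    derived from (site, ip), and the k highest-scoring IPs win. The result
    depends only on the site name and the pool contents.
    """
    if not 0 < k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")

    def score(ip: str) -> int:
        digest = hashlib.sha256(f"{site}|{ip}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(pool, key=score, reverse=True)[:k]


if __name__ == "__main__":
    for site in ("example.com", "example.org"):
        print(site, egress_ips_for_site(site, EGRESS_POOL))
```

The list returned for a site is what we would hand to that site's owner for their allowlist. Adding a gateway to the pool changes at most one address in any site's list, so existing allowlists mostly stay valid.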
For throughput, where we're located shouldn't be an issue, but we often find websites serving different content based on the source IP of the traffic, so having a presence closer to the user is useful. Then again, that is a separate concern, orthogonal to the original question, because geo-IP doesn't make much sense with an anycast IP.

For those websites that need a stable IP for NACLs *and* serve different content based on source IP, we will have to use your suggestion of a predictable 3-5 IPs per site.

On Wed, Jul 28, 2021 at 11:27 AM Glenn McGurrin via NANOG <nanog@nanog.org> wrote:

> I'd had a similar thought/question, though keeping the geo diversity:
> you manage the crawlers, and are making contact individually with these
> sites from what you have stated (and so don't need a one-size-fits-all
> list for public posting), so why not have a restricted subset of the
> crawlers handle sites with these issues (which subset may be unique per
> site, which makes maintaining even load balancing not overly
> complex/limiting, especially as you are using NAT anyway, so multiple
> servers can be behind each IP and that number can vary). That lets you
> have geo diversity (or even multi-cloud diversity) for every site, but
> each site that needs this IP whitelisting only needs 3-5 IPs at any
> site, yet you can distribute load over a much larger overall set of
> machines and NAT gateways.
>
> As I understand it, even CDNs that anycast TCP (externally or internally
> [load balancing via routers and multipath]) do similar by spreading
> load over multiple IPs at the DNS layer first.
>
> As the transition to IPv6 happens you may have it easier, as getting a
> large enough allocation to allow for splitting it out into multiple
> subnets advertised from different locations, without providers dropping
> the route as too long a prefix, is much easier on the v6 side, so you
> could give one /36 or /40 or even /44 out to whitelist but have /48s at
> each location. For sites with IPv6 support that may help now, but it
> won't help all sites for quite some time, though the number that support
> v6 is slowly getting better. For the foreseeable future you still need
> to handle the v4 side one way or another, though.
>
> On 7/28/2021 10:21 AM, William Herrin wrote:
> > On Wed, Jul 28, 2021 at 6:04 AM Vimal <j.vi...@gmail.com> wrote:
> >> My intention is to run a web-crawling service on a public cloud. This
> >> service is geographically distributed, and therefore will run in
> >> multiple regions around the world inside AWS... this means there will
> >> be multiple AWS VPCs, each with their own NAT gateway, and traffic
> >> destined to websites that we crawl will appear to come from this NAT
> >> gateway's IP address.
> >
> > Hello,
> >
> > AWS does not provide the ability to attach anycasted IP addresses to a
> > NAT gateway, regardless of whether it would work, so that's the end of
> > your quest.
> >
> >> The reason I want a predictable IP is to communicate this IP to website
> >> owners so they can allow access from these IPs into their networks.
> >> I chose IP as an example; it can also be a subnet, but what I don't
> >> want to provide is a list of 100 different IP addresses without any
> >> predictability.
> >
> > If you bring your own IP addresses, you can attach a separate /24 of
> > them to your VPCs in each region, providing you with a single
> > predictable range of source addresses. You will find it difficult and
> > expensive to acquire that many IP addresses from the regional
> > registries for the purpose you describe.
> >
> > Silly question but: for a web crawler, why do you care whether it has
> > the limited geographic distribution that a cloud service provides?
> > It's a parallel batch task. It doesn't exactly matter whether you have
> > minimum latency.
> >
> > Regards,
> > Bill Herrin

--
Vimal
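
P.S. A quick sketch of the IPv6 idea quoted above, where site owners allowlist one aggregate and each location announces its own /48. The aggregate comes from the 2001:db8::/32 documentation range, and the region names and helper functions are invented for illustration. It only shows the prefix arithmetic, not the BGP announcements or any cloud provider API.

```python
import ipaddress

# Hypothetical aggregate handed to site owners for allowlisting
# (carved from the IPv6 documentation prefix 2001:db8::/32).
AGGREGATE = ipaddress.ip_network("2001:db8::/36")

# Hypothetical crawler locations.
REGIONS = ["us-east", "eu-west", "ap-south"]


def per_region_prefixes(aggregate, regions, new_prefix=48):
    """Assign each region the next unused /48 inside the aggregate."""
    subnets = aggregate.subnets(new_prefix=new_prefix)
    return dict(zip(regions, subnets))


def is_allowed(source_ip, aggregate):
    """What a site owner's single allowlist rule effectively checks."""
    return ipaddress.ip_address(source_ip) in aggregate


if __name__ == "__main__":
    prefixes = per_region_prefixes(AGGREGATE, REGIONS)
    for region, prefix in prefixes.items():
        print(f"{region}: announce {prefix}")
    # A crawler address in any regional /48 matches the one aggregate rule.
    print(is_allowed("2001:db8:2::1", AGGREGATE))
```

A /36 holds 4096 /48s, so the site owner's rule never has to change as locations are added.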