Tahoe-LAFS is currently very "grid"-centric. When you set up a client, you join one specific cluster of machines (via an Introducer), and all your uploads and downloads only consider those nodes.
IPFS, Dat, and other projects in this space have a "one true grid" property that I envy. "ipfs cat /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme" will get you the README from any IPFS node in the world, without any pre-arranged server setup or membership steps. In contrast, "tahoe get URI:CHK:yqy7u7jbhh7crxjrzkgda72hhq:ezi5mm3xax6zdqaoxxadanno46st72re7zwfb6pi7krr6emxmspq:3:10:3364" will only work if you run it on a Tahoe client that's connected to the same grid where I uploaded it.

There are a bunch of good reasons why Tahoe is grid-centric, which mostly come down to durability and reliability:

* as a backup tool, you want your data to be on reliable servers, which in practice means you either have to pay for storage or provide some reciprocal benefit (tit-for-tat, but for storage). If you don't know who your servers are, it's hard to believe they'll be reliable.

* direct interaction with an unbounded number of servers will incur considerable overhead: scaling is all about hierarchy, and your "local grid" is the lowest level of this hierarchy

So if we aren't able to get to a "one true grid" in Tahoe any time soon, is there a way we could still enable access *among* grids? I really want to be able to email you the output of "tahoe put" and have you be able to download the contents, without first needing to get the two of us into the same grid.

Last week we spent some time sketching this out. It builds upon the "grid manager" idea I wrote about previously. The scheme looks like this:

* Every server belongs to exactly one "home grid", run by a grid manager

* Every client belongs to exactly one "home grid" too

* We extend filecaps to include an optional grid-id. This is just like an area code on a phone number: omitting it means the file is stored in your local grid, or in the same grid as the parent directory, or something. Filecaps with a grid-id that doesn't match yours are called "long-distance filecaps".
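To make the area-code idea concrete, here's a toy sketch in Python. The "grid:GRIDID:" prefix, the function name, and the return shape are all my invention for illustration; the real filecap grammar is entirely undecided:

```python
def parse_filecap(cap: str, home_grid: str):
    """Split a (hypothetical) long-distance filecap into its grid-id
    and the ordinary filecap it wraps.

    Toy format: "grid:GRIDID:URI:CHK:..." for long-distance caps,
    plain "URI:CHK:..." for local ones.  Returns (grid_id, filecap,
    is_long_distance).
    """
    if cap.startswith("grid:"):
        _, grid_id, rest = cap.split(":", 2)
        return grid_id, rest, grid_id != home_grid
    # No area code: like a local phone number, resolve against the
    # caller's home grid (or the parent directory's grid, etc.)
    return home_grid, cap, False

# A cap without an area code resolves against the home grid:
grid, cap, remote = parse_filecap("URI:CHK:aaa:bbb:3:10:3364", "g1")
assert (grid, remote) == ("g1", False)

# A cap carrying a foreign area code is a "long-distance filecap":
grid, cap, remote = parse_filecap("grid:g7:URI:CHK:aaa:bbb:3:10:3364", "g1")
assert (grid, cap, remote) == ("g7", "URI:CHK:aaa:bbb:3:10:3364", True)
```

The only design point the sketch captures is that the grid-id is optional and defaults to "local", exactly like omitting an area code when dialing.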
* The grid manager publishes a pointer (FURL or URL) to a "gateway", which is like the long-distance operator. Clients that need to retrieve a long-distance filecap will attenuate it into a verifycap, then ask the gateway to retrieve the desired range of ciphertext. The client will verify the ciphertext blocks and finish the decryption themselves.

* Gateways "charge" clients just like local servers do: they require the client's account pubkey to be in the approved list (from the grid manager), and they report byte counts to the manager for billing.

* Grids that are willing to have their files accessed remotely will select one or more nodes to act as external contact points (need a better name for this). These behave as decoding proxies. When a request arrives (for a verifycap), they'll contact the necessary local servers for the blocks, verify them (switching to different servers if they're bad), undo the erasure-coding, then send the ciphertext back to the requesting gateway.

* We build a "clearinghouse": a service that helps gateways connect to these external contact points. It needs to manage the table of area codes, and it needs to help with billing/quotas. Gateways send the area code to this service and are told a FURL or URL for the foreign grid's contact points. These contact points check that the gateway's public key is approved by the clearinghouse, turn verifycaps into ciphertext, then bill the clearinghouse for each byte fetched.

* If/when billing is involved, the payments go through the clearinghouse. Grids can publish their prices, and the local gateway can decide whether it will perform the transfer depending upon policy and how much it's going to cost. But at the end of the day, the clearinghouse keeps track of who owes what to whom, and it's the grid managers who exchange payment with the clearinghouse to settle the bill. Grids that export a lot of data will get paid, and grids whose gateways pull down a lot of data will pay.
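Here's a minimal sketch of the contact-point side of that flow: check the requesting gateway against the clearinghouse's approved list, serve ciphertext, and tally bytes for billing. Every class, method, and key name here is hypothetical; none of this is Tahoe API, and the local decode step is a placeholder:

```python
class ContactPoint:
    """Toy model of a grid's external contact point: serves ciphertext
    for verifycaps, but only to gateways whose public keys the
    clearinghouse has approved, and tallies bytes served per gateway
    so the clearinghouse can be billed later.
    """

    def __init__(self, approved_gateway_keys):
        self.approved = set(approved_gateway_keys)
        self.bytes_served = {}  # gateway pubkey -> running byte count

    def fetch_ciphertext(self, gateway_key, verifycap, offset, length):
        if gateway_key not in self.approved:
            raise PermissionError("gateway not approved by clearinghouse")
        # Stand-in for the real work: fetch >= k shares from local
        # servers, verify them (retrying bad servers), and undo the
        # erasure coding to recover this ciphertext range.
        ciphertext = self._decode_locally(verifycap, offset, length)
        self.bytes_served[gateway_key] = (
            self.bytes_served.get(gateway_key, 0) + len(ciphertext))
        return ciphertext

    def _decode_locally(self, verifycap, offset, length):
        return b"\x00" * length  # placeholder ciphertext

cp = ContactPoint(["gw-key-1"])
data = cp.fetch_ciphertext("gw-key-1", "verify:...", 0, 4096)
assert cp.bytes_served["gw-key-1"] == 4096

try:
    cp.fetch_ciphertext("gw-key-2", "verify:...", 0, 10)
except PermissionError:
    pass  # unapproved gateways are refused
```

The point of the sketch is the symmetry noted later in this message: the approval check and byte-count reporting are the same accounting pattern the local client/server/grid-manager trio already uses, just one level up.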
* The "area codes" could either be short (sitting in the "centralized" corner of Zooko's Triangle), or full-sized pubkeys. If they're short, then long-distance filecaps are easier to work with, but all grids must choose somebody (hopefully the same somebody) to decide on the mapping from area code to pubkey and FURL. We could use a distributed table like Namecoin for this, but in either case you might want some kind of rate-limiting to discourage squatting. Long area codes avoid centralized allocation, but make cut-and-paste less pleasant. In all cases you need someone to distribute the routing pointers, which change over time as gateways/contact-points move around.

The gateway and "contact points" look a lot like the two halves of the as-yet-unwritten Download Helper, separated by a connection that's mediated by the clearinghouse. The relationship between gateway, contact-point, and clearinghouse looks just like the one between client, server, and grid-manager in the local grid (same signatures, same payment flow, same authority-delegation patterns).

This ought to scale decently:

* clients only ever talk to their local servers, and their local gateway

* servers only ever talk to their local clients, and the local contact-point

* gateways/contact-points talk to each other, as the second level of the hierarchy

* local payment arrangements are independent of inter-grid payments

* each grid has a payment relationship with only the clearinghouse, not with each individual remote grid, which minimizes the number of payments and makes them more efficient

And it should maintain the reliability targets we've had in the past:

* your data is uploaded to your local servers

* you've delegated your durability to your grid manager

* .. who you might pay, or have other incentives in place

* downloading your own data only depends upon your local grid

* downloading remote data is less reliable than downloading local data, and depends upon your local gateway (to exist), your local grid-manager (to pay for membership in the clearinghouse), the clearinghouse, the remote contact-point, and some k-sized subset of the remote servers

The development steps we'll need look something like:

* build all the grid-manager stuff from my other email

* finally implement the mutable/immutable Download Helper

* then split it into two pieces, connected by a socket

* add accounting controls to it, so only approved clients can connect on the downstream side, and only approved servers are used on the upstream side

* rename the two pieces, bundle them into standalone nodes

* enhance the grid-manager to publish the gateway to local clients

* decide on what area codes should look like (short vs long)

* build the clearinghouse, which looks like a meta-grid-manager, with one account key per grid

* enhance the grid-manager to publish the contact-point to the clearinghouse, and to tell the gateway how to ask the clearinghouse for remote contact-points

* figure out the inter-grid billing/accounting stuff

There's still a bunch of questions, but I think this is a promising direction.

cheers,
 -Brian
_______________________________________________
tahoe-dev mailing list
[email protected]
https://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
