On Sat, Apr 29, 2017 at 9:11 PM, lee <l...@yagibdah.de> wrote:
>
> "Poison BL." <poiso...@gmail.com> writes:
> > Half petabyte datasets aren't really something I'd personally *ever* trust
> > ftp with in the first place.
>
> Why not? (12GB are nowhere close to half a petabyte ...)
Ah... I completely misread that "or over 50k files in 12GB" as 50k files
*at* 12GB each... which works out to 0.6 PB, incidentally.

> The data would come in from suppliers. There isn't really anything
> going on atm but fetching data once a month which can be like 100MB or
> 12GB or more. That's because ppl don't use ftp ...

Really, if you're pulling it in from third-party suppliers, you tend to be
tied to whatever method they offer for pulling it from them (or for them
pushing it out to you), unless you're in the unique position to dictate
the decision for them.

From there, assuming you can push your choice of tooling on them, it
becomes a question of how often the same dataset will need to be updated
from the same sources, how much it changes between updates, how secure it
needs to be in transit, how much you need to be able to trust that the
source is still legitimately who you think it is, and how much
verification you need that there wasn't any corruption during the
transfer. Plain FTP has been losing favor over time because it was built
in an era when many of those questions weren't near the top of anyone's
list of concerns.

SFTP (or SCP), as long as keys are handled properly, gives you pretty
solid certainty that a) both ends of the connection are who they say they
are, b) those two ends are the only ones reading the data in transit, and
c) the data that was sent is the same data that was received (simply as a
side benefit of the encryption/decryption).

Rsync over SSH gives the same set of benefits, reduces the bandwidth used
for updating a dataset (when it's the same dataset, at least), and will
also verify that the data on both ends, as it exists on disk, matches. If
you're particularly lucky, the data might even hit just the right mark to
benefit from the in-line compression you can turn on with SSH, too,
cutting down the actual amount of bandwidth you burn through on each
transfer.
If your suppliers all have *nix-based systems available, those are also
standard tools they'll have on hand. If they're strictly Windows shops,
SCP/SFTP clients are still readily available, though they aren't built
into the OS... rsync gets a bit trickier.

> > How often does it need moved in/out of your facility, and is there no way
> > to break up the processing into smaller chunks than a 0.6PB mass of files?
> > Distribute out the smaller pieces with rsync, scp, or the like, operate on
> > them, and pull back in the results, rather than trying to shift around the
> > entire set. There's a reason Amazon will send a physical truck to a site to
> > import large datasets into glacier... ;)
>
> Amazon has trucks? Perhaps they do in other countries. Here, amazon is
> just another web shop. They might have some delivery vans, but I've
> never seen one, so I doubt it. And why would anyone give them their
> data? There's no telling what they would do with it.

Amazon is also one of the best-known cloud computing providers on the
planet (AWS = Amazon Web Services). They have everything from pure compute
offerings to cloud storage geared towards *large* data archival. The
latter offering is named "Glacier", and they offer a service for importing
data into it (usually just for the first pass; incremental changes are
generally done over the wire) that consists of a shipping truck with a
rather nifty storage system in the back, which they hook right into your
network. You fill it with data, and then they drive it back to one of
their data centers to load it into place.

--
Poison [BLX]
Joshua M. Murphy