On Sat, Apr 29, 2017 at 9:11 PM, lee <l...@yagibdah.de> wrote:
>
> "Poison BL." <poiso...@gmail.com> writes:
> > Half petabyte datasets aren't really something I'd personally *ever*
> > trust ftp with in the first place.
>
> Why not?  (12GB are nowhere close to half a petabyte ...)

Ah... I completely misread that "or over 50k files in 12GB" as 50k files
*at* 12GB each... which works out to 0.6 PB, incidentally.

> The data would come in from suppliers.  There isn't really anything
> going on atm but fetching data once a month which can be like 100MB or
> 12GB or more.  That's because ppl don't use ftp ...

Really, if you're pulling it in from third-party suppliers, you tend to
be tied to whatever they offer as a method of pulling it from them (or of
them pushing it out to you), unless you're in the unique position of being
able to dictate the decision for them. From there, assuming you can push
your choice of product on them, it becomes a question of how often the
same dataset will need to be updated from the same sources, how much it
changes between updates, how secure it needs to be in transit, how much
you need to be able to trust that the source is still legitimately who you
think it is, and how much verification you need that there wasn't any
corruption during the transfer. Generic FTP has been losing favor over
time because it was built at a time when many of those questions weren't
really at the top of anyone's list of concerns.
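
To make the corruption point concrete: plain FTP gives you no integrity
check of its own, so verification tends to get bolted on by hand, e.g.
comparing what landed on disk against a checksum the supplier publishes
alongside the file. A rough Python sketch, assuming a made-up
"dataset.tar" with a supplier-provided "dataset.tar.sha256" next to it:

    import hashlib
    import sys

    fname = "dataset.tar"  # hypothetical file names, not from a real setup
    expected = open(fname + ".sha256").read().split()[0]

    h = hashlib.sha256()
    with open(fname, "rb") as f:
        # hash the file in 1 MB chunks so a 12GB file doesn't sit in RAM
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)

    if h.hexdigest() != expected:
        sys.exit("checksum mismatch -- the transfer was corrupted")

With the SSH-based options below, that sort of check becomes a
belt-and-braces extra rather than the only line of defense.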

SFTP (or SCP), as long as keys are handled properly, gives you pretty
solid certainty that a) both ends of the connection are who they say they
are, b) those two ends are the only ones reading the data in transit, and
c) the data that was sent is the same data that was received (simply as a
side benefit of the encryption/decryption). Rsync over SSH gives the same
set of benefits, reduces the bandwidth used for updating the dataset (when
it's the same dataset, at least), and will also verify that the data on
both ends (as it exists on disk) matches. If you're particularly lucky,
the data might even be the kind that benefits from the in-line compression
you can turn on with SSH, too, cutting down the actual amount of bandwidth
you burn through for each transfer.
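
For what it's worth, a pull along those lines (rsync over SSH, compression
turned on) is a one-liner on the command line; here's a rough Python
wrapper around it, with the host and paths as made-up placeholders:

    import subprocess

    source = "supplier@example.com:/exports/dataset/"  # hypothetical host/path
    dest = "/srv/incoming/dataset/"                    # hypothetical local dir

    subprocess.run(
        [
            "rsync",
            "-a",         # archive mode: recurse, keep times/perms/links
            "-z",         # rsync's own in-transit compression
            "-e", "ssh",  # carry it over SSH (or "ssh -C" for SSH-level
                          # compression instead of -z; no point using both)
            "--partial",  # keep partial files so a dropped 12GB pull resumes
            source,
            dest,
        ],
        check=True,       # blow up if rsync exits non-zero
    )

Whether -z (or ssh -C) actually helps depends entirely on how compressible
the data is; for already-compressed formats it mostly just burns CPU.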

If your suppliers all have *nix-based systems available, those are also
standard tools that they'll have on hand. If they're strictly Windows
shops, SCP/SFTP clients are still readily available, though they aren't
built into the OS... rsync gets a bit trickier.

> > How often does it need moved in/out of your facility, and is there no
> > way to break up the processing into smaller chunks than a 0.6PB mass
> > of files? Distribute out the smaller pieces with rsync, scp, or the
> > like, operate on them, and pull back in the results, rather than
> > trying to shift around the entire set. There's a reason Amazon will
> > send a physical truck to a site to import large datasets into
> > glacier... ;)
>
> Amazon has trucks?  Perhaps they do in other countries.  Here, amazon is
> just another web shop.  They might have some delivery vans, but I've
> never seen one, so I doubt it.  And why would anyone give them their
> data?  There's no telling what they would do with it.

Amazon's also one of the best-known cloud computing suppliers on the
planet (AWS = Amazon Web Services). They have everything from pure compute
offerings to cloud storage geared towards *large* data archival. The
latter offering is named "Glacier", and they offer a service for the
import of data into it (usually for the "first pass"; incremental changes
are generally done over the wire) that consists of a shipping truck with a
rather nifty storage system in the back that they hook right into your
network. You fill it with data, and then they drive it back to one of
their data centers to load it into place.

--
Poison [BLX]
Joshua M. Murphy
