This was actually an AMPLab bucket.
On Feb 27, 2018, 6:04 PM +1300, Holden Karau <holden.ka...@gmail.com>, wrote:

> Thanks Nick, we deprecated this during the rollover to the new release managers. I assume this bucket was maintained by someone at Databricks, so maybe they can chime in.
>
> > On Feb 26, 2018 8:57 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:
> >
> > > If you go to the Downloads page and download Spark 2.2.1, you’ll get a link to an Apache mirror. It didn’t use to be this way. As recently as Spark 2.2.0, downloads were served via CloudFront, which was backed by an S3 bucket named spark-related-packages.
> > >
> > > It seems that we’ve stopped using CloudFront, and the S3 bucket behind it has stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing this is part of an effort to use the Apache mirror network, like other Apache projects do.
> > >
> > > From a user perspective, the Apache mirror network is several steps down from using a modern CDN. Let me summarize why:
> > >
> > > 1. Apache mirrors are often slow. Apache does not impose any performance requirements on its mirrors. The difference between getting a good mirror and a bad one is downloading Spark in less than a minute vs. 20 minutes. The problem is so bad that I’ve thought about adding an Apache mirror blacklist to Flintrock to avoid getting one of these dud mirrors.
> > > 2. Apache mirrors are inconvenient to use. When you download something from an Apache mirror, you get a link like this one. Instead of automatically redirecting you to your download, though, you need to process the results you get back to find your download target. And you need to handle the high download failure rate, since sometimes the mirror you get doesn’t have the file it claims to have.
> > > 3. Apache mirrors are incomplete. Apache mirrors only keep around the latest releases, save for a few “archive” mirrors, which are often slow. So if you want to download anything but the latest version of Spark, you are out of luck.
> > >
> > > Some of these problems can be mitigated by picking a specific mirror that works well and hardcoding it in your scripts, but that defeats the purpose of dynamically selecting a mirror and makes you a “bad” user of the mirror network.
> > >
> > > I raised some of these issues over on INFRA-10999. The ticket sat for a year before I heard anything back, and the bottom line was that none of the above problems have a solution on the horizon. It’s fine. I understand that Apache is a volunteer organization and that the infrastructure team has a lot to manage as it is. I still find it disappointing that an organization of Apache’s stature doesn’t have a better solution for this in collaboration with a third party. Python serves PyPI downloads using Fastly, and Homebrew serves packages using Bintray. They both work really, really well. Why don’t we have something as good for Apache projects? Anyway, that’s a separate discussion.
> > >
> > > What I want to say is this:
> > >
> > > Dear whoever owns the spark-related-packages S3 bucket,
> > >
> > > Please keep the bucket up-to-date with the latest Spark releases, alongside the past releases that are already on there. It’s a huge help to the Flintrock project, and it’s an equally big help to those of us writing infrastructure automation scripts that deploy Spark in other contexts.
> > > I understand that hosting this stuff is not free, and that I am not paying anything for this service. If it needs to go, so be it. But I wanted to take this opportunity to lay out the benefits I’ve enjoyed thanks to having this bucket around, and to make sure that if it did die, it didn’t die a quiet death.
> > >
> > > Sincerely,
> > > Nick
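
For anyone who needs to script around the mirrors in the meantime, the chooser-then-fallback dance Nick describes in points 2 and 3 looks roughly like the sketch below. This is a minimal sketch based on my own assumptions (the as_json form of the closer.lua mirror chooser and archive.apache.org as the fallback), not anything Apache infra has blessed, so adjust the path for your release and Hadoop build:

    import json
    import urllib.error
    import urllib.request

    # Assumed endpoints; nothing in this thread confirms these details.
    APACHE_CHOOSER = 'https://www.apache.org/dyn/closer.lua'
    APACHE_ARCHIVE = 'https://archive.apache.org/dist/'

    def spark_download_url(version, package):
        """Ask the Apache mirror chooser for a URL serving the given Spark
        release, falling back to the (slow) archive if the chosen mirror
        doesn't actually have the file."""
        path = 'spark/spark-{v}/spark-{v}-bin-{p}.tgz'.format(v=version, p=package)
        with urllib.request.urlopen(
                '{}?path={}&as_json=1'.format(APACHE_CHOOSER, path)) as response:
            suggestion = json.loads(response.read().decode('utf-8'))
        # The chooser hands back a preferred mirror plus the path to append,
        # rather than redirecting you straight to the download.
        mirror_url = suggestion['preferred'] + suggestion['path_info']
        try:
            # Mirrors only carry recent releases, so verify before trusting it.
            urllib.request.urlopen(
                urllib.request.Request(mirror_url, method='HEAD'))
            return mirror_url
        except urllib.error.HTTPError:
            return APACHE_ARCHIVE + path

    # Example: spark_download_url('2.2.1', 'hadoop2.7')

It works, but it is exactly the kind of per-project boilerplate that a CDN-backed bucket makes unnecessary.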