Re: Please keep s3://spark-related-packages/ alive

Matei Zaharia Tue, 27 Feb 2018 18:32:38 -0800

For Flintrock, have you considered using a Requester Pays bucket? That way 
you’d get the availability of S3 without having to foot the bill for bandwidth 
yourself (which was the bulk of the cost for the old bucket).
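
A rough sketch of what that could look like from the downloader's side, assuming boto3 is installed and the caller has AWS credentials configured. The bucket name comes from this thread; the object key and local path in the usage note are hypothetical. `RequestPayer="requester"` is the S3 GetObject parameter that acknowledges the caller will pay the transfer cost.

```python
def requester_pays_args(bucket: str, key: str) -> dict:
    """Build GetObject keyword arguments for a Requester Pays bucket."""
    # Without RequestPayer, S3 rejects requests to a Requester Pays bucket.
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket: str, key: str, dest: str) -> None:
    """Stream one object to a local file, billing the transfer to the caller."""
    # boto3 is a third-party package; imported here so the helper above
    # stays usable without it. Credentials are assumed to be configured.
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_args(bucket, key))
    with open(dest, "wb") as f:
        for chunk in response["Body"].iter_chunks():
            f.write(chunk)
```

With a hypothetical key, usage would be something like `download("spark-related-packages", "spark-2.2.0-bin-hadoop2.7.tgz", "/tmp/spark.tgz")`. The catch is that anonymous downloads stop working, since every request has to be signed by an AWS account.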


Matei

> On Feb 27, 2018, at 4:35 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> 
> So is there no hope for this S3 bucket, or room to replace it with a bucket 
> owned by some organization other than AMPLab (which is technically now 
> defunct, I guess)? Sorry to persist, but I just have to ask.
> 
> On Tue, Feb 27, 2018 at 10:36 AM Michael Heuer <heue...@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 8:17 AM, Sean Owen <sro...@gmail.com> wrote:
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-d3kbcqa49mib13-cloudfront-net-td22427.html
>  -- it was 'retired', yes.
> 
> Agree with all that, though the mirrors are intended for occasional 
> individual use and not a case where performance and uptime matter. For that, 
> I think you'd want to just host your own copy of the bits you need. 
> 
> The notional problem was that the S3 bucket wasn't obviously 
> controlled/blessed by the ASF and yet was a source of official bits. It was 
> another set of third-party credentials to hand around to release managers, 
> which was IIRC a little problematic.
> 
> Homebrew does host distributions of ASF projects, like Spark, FWIW. 
> 
> To clarify, the apache-spark.rb formula in Homebrew uses the Apache mirror 
> closer.lua script
> 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4
> 
>    michael
> 
>  
> On Mon, Feb 26, 2018 at 10:57 PM Nicholas Chammas 
> <nicholas.cham...@gmail.com> wrote:
> If you go to the Downloads page and download Spark 2.2.1, you’ll get a link 
> to an Apache mirror. It didn’t use to be this way. As recently as Spark 
> 2.2.0, downloads were served via CloudFront, which was backed by an S3 bucket 
> named spark-related-packages.
> 
> It seems that we’ve stopped using CloudFront, and the S3 bucket behind it has 
> stopped receiving updates (e.g. Spark 2.2.1 isn’t there). I’m guessing this 
> is part of an effort to use the Apache mirror network, like other Apache 
> projects do.
> 
> From a user perspective, the Apache mirror network is several steps down from 
> using a modern CDN. Let me summarize why:
> 
> • Apache mirrors are often slow. Apache does not impose any performance 
>   requirements on its mirrors. The difference between getting a good mirror 
>   and a bad one means downloading Spark in less than a minute vs. 20 
>   minutes. The problem is so bad that I’ve thought about adding an Apache 
>   mirror blacklist to Flintrock to avoid getting one of these dud mirrors.
> • Apache mirrors are inconvenient to use. When you download something from 
>   an Apache mirror, you get a link like this one. Instead of automatically 
>   redirecting you to your download, though, you need to process the results 
>   you get back to find your download target. And you need to handle the high 
>   download failure rate, since sometimes the mirror you get doesn’t have the 
>   file it claims to have.
> • Apache mirrors are incomplete. Apache mirrors only keep around the latest 
>   releases, save for a few “archive” mirrors, which are often slow. So if 
>   you want to download anything but the latest version of Spark, you are out 
>   of luck.
> 
> Some of these problems can be mitigated by picking a specific mirror that 
> works well and hardcoding it in your scripts, but that defeats the purpose of 
> dynamically selecting a mirror and makes you a “bad” user of the mirror 
> network.
> 
> I raised some of these issues over on INFRA-10999. The ticket sat for a year 
> before I heard anything back, and the bottom line was that none of the above 
> problems have a solution on the horizon. It’s fine. I understand that Apache 
> is a volunteer organization and that the infrastructure team has a lot to 
> manage as it is. I still find it disappointing that an organization of 
> Apache’s stature doesn’t have a better solution for this in collaboration 
> with a third party. Python serves PyPI downloads using Fastly and Homebrew 
> serves packages using Bintray. They both work really, really well. Why don’t 
> we have something as good for Apache projects? Anyway, that’s a separate 
> discussion.
> 
> What I want to say is this:
> 
> Dear whoever owns the spark-related-packages S3 bucket,
> 
> Please keep the bucket up-to-date with the latest Spark releases, alongside 
> the past releases that are already on there. It’s a huge help to the 
> Flintrock project, and it’s an equally big help to those of us writing 
> infrastructure automation scripts that deploy Spark in other contexts.
> 
> I understand that hosting this stuff is not free, and that I am not paying 
> anything for this service. If it needs to go, so be it. But I wanted to take 
> this opportunity to lay out the benefits I’ve enjoyed thanks to having this 
> bucket around, and to make sure that if it did die, it didn’t die a quiet 
> death.
> 
> Sincerely,
> Nick
> 
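
For reference, the mirror handling Nick describes above (process the mirror response to find the download target, cope with mirrors that lack the file, fall back to the archive for older releases) can be sketched roughly as follows. This is a minimal illustration, not Flintrock's actual code. It assumes the closer.lua script returns JSON with `preferred` and `path_info` fields when asked for JSON output, and that archive.apache.org/dist/ mirrors the same path layout. The function names are made up for this example.

```python
import json
import urllib.request
from typing import Callable, List, Tuple

# Assumed fallback location for releases that mirrors no longer carry.
ARCHIVE_BASE = "https://archive.apache.org/dist/"


def candidate_urls(closer_json: str) -> List[str]:
    """Turn a closer.lua JSON response into an ordered list of URLs to try."""
    info = json.loads(closer_json)
    path = info["path_info"].lstrip("/")
    preferred = info["preferred"].rstrip("/") + "/" + path
    # Try the suggested mirror first, then the (slower) archive.
    return [preferred, ARCHIVE_BASE + path]


def fetch_url(url: str) -> bytes:
    """Default fetcher: download the whole body into memory."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def first_working(
    urls: List[str], fetch: Callable[[str], bytes] = fetch_url
) -> Tuple[str, bytes]:
    """Return (url, body) for the first URL that downloads successfully."""
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            # urllib's URLError and HTTPError are OSError subclasses, so a
            # mirror that is down or missing the file lands here.
            continue
    raise OSError("no candidate URL succeeded")
```

The fetcher is passed in as a parameter so the fallback logic can be exercised without touching the network. A real script would likely stream to disk rather than read the whole tarball into memory.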



Re: Please keep s3://spark-related-packages/ alive