Here are some specific questions I'd recommend the Apache Spark PMC bring to ASF legal counsel:

1) Does the philosophy described on LEGAL-270 still represent a sanctioned approach to publishing releases via container image?
2) If the transitive closure of pulled-in licenses on each of these images is limited to licenses defined as compatible with Apache-2.0 <https://www.apache.org/legal/resolved.html>, does that satisfy ASF licensing and legal guidelines?
3) What form of documentation/auditing for (2) should be provided to meet legal requirements?

I would define the proposed action this way: to include, as part of the Apache Spark official release process, publishing a "spark-base" image, tagged with the specific release, that consists of a build of the Spark code for that release installed on a base image (currently alpine, but possibly some other alternative like centos), combined with the JVM and Python (and any of their transitive deps).

Additionally, some number of images derived from "spark-base" would be built, consisting of spark-base plus a small layer of bash scripting for ENTRYPOINT and CMD, to support the kubernetes back-end. Optionally, similar images targeted at mesos or yarn might also be created. A sketch of this layering follows below.
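For concreteness, here is a minimal sketch of what that layering might look like. The base image tag, package names, and paths below are illustrative assumptions, not a finalized proposal:

    # spark-base: hypothetical Dockerfile for the release-tagged base image
    FROM alpine:3.6
    # JRE, Python, and bash for the entrypoint scripts (package names assumed)
    RUN apk add --no-cache openjdk8-jre python bash
    # assumes the built Spark distribution for this release is in the build context
    COPY spark-2.3.0-bin/ /opt/spark/
    ENV SPARK_HOME /opt/spark
    WORKDIR /opt/spark

A kubernetes-specific image would then be just the thin entrypoint layer on top:

    # spark-kubernetes: derived image adding only the CMD/ENTRYPOINT scripting
    FROM spark-base:v2.3.0
    COPY kubernetes/entrypoint.sh /opt/entrypoint.sh
    ENTRYPOINT ["/opt/entrypoint.sh"]

A mesos- or yarn-targeted image would add its own specialization layers on top of "spark-base" in the same way.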
On Tue, Dec 19, 2017 at 1:28 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> Reasoning by analogy to other Apache projects is generally not sufficient when it comes to securing legally permissible form or behavior -- that another project is doing something is not a guarantee that they are doing it right. If we have issues or legal questions, we need to formulate them and our proposed actions as clearly and concretely as possible so that the PMC can take those issues, questions, and proposed actions to Apache counsel for advice or guidance.
>
> On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson <eerla...@redhat.com> wrote:
>
>> I've been looking a bit more into the ASF legal posture on licensing and container images. What I have found indicates that the ASF considers container images to be just another variety of distribution channel. As such, it is acceptable to publish official releases; for example, an image such as spark:v2.3.0 built from the v2.3.0 source is fine. It is not acceptable to do something like regularly publish spark:latest built from the head of master.
>>
>> More detail here:
>> https://issues.apache.org/jira/browse/LEGAL-270
>>
>> So as I understand it, making a release-tagged public image as part of each official release does not pose any problems.
>>
>> With respect to considering the licenses of other ancillary dependencies that are also installed on such container images, I noticed this clause in the legal boilerplate for the Flink images <https://hub.docker.com/r/library/flink/>:
>>
>>> As with all Docker images, these likely also contain other software which may be under other licenses (such as Bash, etc. from the base distribution, along with any direct or indirect dependencies of the primary software being contained).
>>
>> So it may be sufficient to resolve this via disclaimer.
>>
>> -Erik
>>
>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <eerla...@redhat.com> wrote:
>>
>>> Currently the containers are based off alpine, which pulls in BSD2 and MIT licensing:
>>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>>
>>> To the best of my understanding, neither of those poses a problem. If we based the image off of centos, I'd also expect the licensing of any image deps to be compatible.
>>>
>>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>
>>>> What licensing issues come into play?
>>>>
>>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <eerla...@redhat.com> wrote:
>>>>
>>>>> We've been discussing the topic of container images a bit more. The kubernetes back-end operates by executing some specific CMD and ENTRYPOINT logic, which is different from mesos, and which is probably not practical to unify at this level.
>>>>>
>>>>> However: these CMD and ENTRYPOINT configurations are essentially just a thin skin on top of an image that is just an install of a spark distro. We feel that a single "spark-base" image should be publishable, one that is consumable by kube-spark images, mesos-spark images, and likely any other community image whose primary purpose is running spark components. The kube-specific dockerfiles would be written "FROM spark-base" and just add the small command and entrypoint layers. Likewise, the mesos images could add any specialization layers that are necessary on top of the "spark-base" image.
>>>>>
>>>>> Does this factorization sound reasonable to others?
>>>>> Cheers,
>>>>> Erik
>>>>>
>>>>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>>>>
>>>>>> We do support running on Apache Mesos via docker images - so this would not be restricted to k8s. But unlike mesos support, which has other modes of running, I believe k8s support more heavily depends on the availability of docker images.
>>>>>>
>>>>>> Regards,
>>>>>> Mridul
>>>>>>
>>>>>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> > Would it be logical to provide Docker-based distributions of other pieces of Spark? Or is this specific to K8S? The problem is that we wouldn't generally also provide a distribution of Spark, for the reasons you give -- because if we did that, then why not RPMs and so on?
>>>>>> >
>>>>>> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <ramanath...@google.com> wrote:
>>>>>> >>
>>>>>> >> In this context, I think the docker images are similar to the binaries rather than an extension. It's packaging the compiled distribution to save people the effort of building one themselves, akin to binaries or the python package.
>>>>>> >>
>>>>>> >> For reference, this is the base dockerfile for the main image that we intend to publish. It's not particularly complicated. The driver and executor images are based on said base image and only customize the CMD (any file/directory inclusions are extraneous and will be removed).
>>>>>> >>
>>>>>> >> Is there only one way to build it? That's a bit harder to reason about. The base image, I'd argue, is likely always going to be built that way. For the driver and executor images, there may be cases where people want to customize them (like putting all dependencies into them, for example). In those cases, as long as our images are bare-bones, they can use the spark-driver/spark-executor images we publish as the base, and build their customization as a layer on top.
>>>>>> >>
>>>>>> >> I think the composability of docker images makes this a bit different from, say, debian packages. We can publish canonical images that serve as both a complete image for most Spark applications and a stable substrate to build customization upon.
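As an aside, to illustrate the kind of downstream customization layer Anirudh describes (the published image name and jar path here are hypothetical):

    # hypothetical downstream layer on top of a published bare-bones image
    FROM spark-driver:v2.3.0
    # bake application dependencies directly into the image
    COPY my-app-assembly.jar /opt/spark/jars/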
>>>>>> >>
>>>>>> >> On Wed, Nov 29, 2017 at 7:38 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>> >>>
>>>>>> >>> It's probably also worth considering whether there is only one well-defined, correct way to create such an image, or whether this is a reasonable avenue for customization. Part of why we don't do something like maintain and publish canonical Debian packages for Spark is because different organizations doing packaging and distribution of infrastructures or operating systems can reasonably want to do this in a custom (or non-customary) way. If there is really only one reasonable way to do a docker image, then my bias starts to tend more toward the Spark PMC taking on the responsibility to maintain and publish that image. If there is more than one way to do it and publishing a particular image is more just a convenience, then my bias tends more away from maintaining and publishing it.
>>>>>> >>>
>>>>>> >>> On Wed, Nov 29, 2017 at 5:14 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> >>>>
>>>>>> >>>> Source code is the primary release; compiled binary releases are conveniences that are also released. A docker image sounds fairly different, though. To the extent it's the standard delivery mechanism for some artifact (think: pyspark on PyPI as well), that makes sense, but is that the situation? If it's more of an extension or alternate presentation of Spark components, that typically wouldn't be part of a Spark release. The ones the PMC takes responsibility for maintaining ought to be the core, critical means of distribution alone.
>>>>>> >>>>
>>>>>> >>>> On Wed, Nov 29, 2017 at 2:52 AM Anirudh Ramanathan <ramanath...@google.com.invalid> wrote:
>>>>>> >>>>>
>>>>>> >>>>> Hi all,
>>>>>> >>>>>
>>>>>> >>>>> We're all working towards the Kubernetes scheduler backend (full steam ahead!) that's targeted at Spark 2.3. One of the questions that comes up often is docker images.
>>>>>> >>>>>
>>>>>> >>>>> While we're making dockerfiles available to allow people to create their own docker images from source, ideally we'd want to publish official docker images as part of the release process.
>>>>>> >>>>>
>>>>>> >>>>> I understand that the ASF has procedure around this, and we would want to get that started to help us get these artifacts published by 2.3. I'd love to get a discussion around this started, and to hear the thoughts of the community regarding this.
>>>>>> >>>>>
>>>>>> >>>>> --
>>>>>> >>>>> Thanks,
>>>>>> >>>>> Anirudh Ramanathan
>>>>>> >>
>>>>>> >> --
>>>>>> >> Anirudh Ramanathan