Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-19 Thread zzc
Hi dev, the version of the latest Spark docs is still 2.2.0. When will the
2.2.1 docs be published?






Re: Spark error while trying to spark.read.json()

2017-12-19 Thread Michael Armbrust
- dev

java.lang.AbstractMethodError almost always means that you have different
libraries on the runtime classpath than were present at compilation time.
In this case I would check to make sure you have the correct version of
Scala (and only one version of Scala) on the classpath.
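
As a minimal sanity-check sketch (assuming an sbt build and Spark 2.2.0; the
versions below are illustrative): using %% and marking Spark as "provided"
keeps a single Scala binary version on the classpath and avoids bundling a
second copy of the Scala/Spark jars with your application.

    // build.sbt -- minimal sketch; versions are illustrative
    scalaVersion := "2.11.8"  // must match the Scala binary version of your Spark build
    libraryDependencies ++= Seq(
      // %% appends the Scala binary suffix (_2.11) so all artifacts agree
      "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"
    )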

On Tue, Dec 19, 2017 at 5:42 PM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

> Hi All,
>
> Can anyone help me with the error below?
>
> Exception in thread "main" java.lang.AbstractMethodError
> at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
> at org.apache.spark.sql.types.StructType.filterNot(StructType.scala:98)
> at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:386)
> at org.spark.jsonDF.StructStreamKafkaToDF$.getValueSchema(StructStreamKafkaToDF.scala:22)
> at org.spark.jsonDF.StructStreaming$.createRowDF(StructStreaming.scala:21)
> at SparkEntry$.delayedEndpoint$SparkEntry$1(SparkEntry.scala:22)
> at SparkEntry$delayedInit$body.apply(SparkEntry.scala:7)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.App$class.main(App.scala:76)
> at SparkEntry$.main(SparkEntry.scala:7)
> at SparkEntry.main(SparkEntry.scala)
>
> This happens when I try to pass a Dataset[String] containing JSON records to
> spark.read.json(Records).
>
> Regards,
> Satyajit.
>


Spark error while trying to spark.read.json()

2017-12-19 Thread satyajit vegesna
Hi All,

Can anyone help me with the error below?

Exception in thread "main" java.lang.AbstractMethodError
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:278)
at org.apache.spark.sql.types.StructType.filterNot(StructType.scala:98)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:386)
at org.spark.jsonDF.StructStreamKafkaToDF$.getValueSchema(StructStreamKafkaToDF.scala:22)
at org.spark.jsonDF.StructStreaming$.createRowDF(StructStreaming.scala:21)
at SparkEntry$.delayedEndpoint$SparkEntry$1(SparkEntry.scala:22)
at SparkEntry$delayedInit$body.apply(SparkEntry.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at SparkEntry$.main(SparkEntry.scala:7)
at SparkEntry.main(SparkEntry.scala)

This happens when I try to pass a Dataset[String] containing JSON records to
spark.read.json(Records).
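
For reference, here is a minimal, self-contained sketch of the call path in
question (assuming Spark 2.2.x, where DataFrameReader.json accepts a
Dataset[String]; the sample records and names below are only illustrative):

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder()
      .appName("json-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // each element of the Dataset[String] is one JSON document
    val records: Dataset[String] =
      Seq("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}""").toDS()

    val df = spark.read.json(records)  // the call that triggers the error above
    df.printSchema()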

Regards,
Satyajit.


Re: Timeline for Spark 2.3

2017-12-19 Thread Michael Armbrust
Do people really need to be around for the branch cut (modulo the person
cutting the branch)?

1st or 2nd doesn't really matter to me, but I am +1 kicking this off as
soon as we enter the new year :)

Michael

On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau  wrote:

> Sounds reasonable, although I'd choose the 2nd perhaps just since lots of
> folks are off on the 1st?
>
> On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal 
> wrote:
>
>> Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that
>> (i.e., week of 8th Jan)?
>>
>>
>> On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau 
>> wrote:
>>
> >>> So personally I’d be in favour of pushing to early January; doing a
> >>> release over the holidays is a little rough with herding all of the
> >>> people to vote.
>>>
>>> On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson 
>>> wrote:
>>>
 I wanted to check in on the state of the 2.3 freeze schedule.  Original
 proposal was "late Dec", which is a bit open to interpretation.

 We are working to get some refactoring done on the integration testing
 for the Kubernetes back-end in preparation for testing upcoming release
 candidates; however, holiday vacation time is about to begin taking its toll
 both on upstream reviewing and on the "downstream" spark-on-kube fork.

 If the freeze is pushed into January, that would take some of the pressure
 off the kube back-end upstreaming. However, regardless, I was wondering if
 the dates could be clarified.
 Cheers,
 Erik


 On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
 wrote:

> Hi,
>
> What is the process to request an issue/fix to be included in the next
> release? Is there a place to vote for features?
> I am interested in https://issues.apache.org/jira/browse/SPARK-13127,
> to see
> if we can get Spark to upgrade Parquet to 1.9.0, which addresses
> https://issues.apache.org/jira/browse/PARQUET-686.
> Can we include the fix in Spark 2.3 release?
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
 --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Sameer Agarwal
>> Software Engineer | Databricks Inc.
>> http://cs.berkeley.edu/~sameerag
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Timeline for Spark 2.3

2017-12-19 Thread Holden Karau
Sounds reasonable, although I'd choose the 2nd perhaps just since lots of
folks are off on the 1st?

On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal 
wrote:

> Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that
> (i.e., week of 8th Jan)?
>
>
> On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau 
> wrote:
>
>> So personally I’d be in favour of pushing to early January; doing a
>> release over the holidays is a little rough with herding all of the
>> people to vote.
>>
>> On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson 
>> wrote:
>>
>>> I wanted to check in on the state of the 2.3 freeze schedule.  Original
>>> proposal was "late Dec", which is a bit open to interpretation.
>>>
>>> We are working to get some refactoring done on the integration testing
>>> for the Kubernetes back-end in preparation for testing upcoming release
>>> candidates; however, holiday vacation time is about to begin taking its toll
>>> both on upstream reviewing and on the "downstream" spark-on-kube fork.
>>>
>>> If the freeze is pushed into January, that would take some of the pressure
>>> off the kube back-end upstreaming. However, regardless, I was wondering if
>>> the dates could be clarified.
>>> Cheers,
>>> Erik
>>>
>>>
>>> On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
>>> wrote:
>>>
 Hi,

 What is the process to request an issue/fix to be included in the next
 release? Is there a place to vote for features?
 I am interested in https://issues.apache.org/jira/browse/SPARK-13127,
 to see
 if we can get Spark to upgrade Parquet to 1.9.0, which addresses
 https://issues.apache.org/jira/browse/PARQUET-686.
 Can we include the fix in Spark 2.3 release?

 Thanks,

 Dong



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Timeline for Spark 2.3

2017-12-19 Thread Sameer Agarwal
Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that
(i.e., week of 8th Jan)?


On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau  wrote:

> So personally I’d be in favour of pushing to early January; doing a
> release over the holidays is a little rough with herding all of the
> people to vote.
>
> On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson 
> wrote:
>
>> I wanted to check in on the state of the 2.3 freeze schedule.  Original
>> proposal was "late Dec", which is a bit open to interpretation.
>>
>> We are working to get some refactoring done on the integration testing
>> for the Kubernetes back-end in preparation for testing upcoming release
>> candidates; however, holiday vacation time is about to begin taking its toll
>> both on upstream reviewing and on the "downstream" spark-on-kube fork.
>>
>> If the freeze is pushed into January, that would take some of the pressure
>> off the kube back-end upstreaming. However, regardless, I was wondering if
>> the dates could be clarified.
>> Cheers,
>> Erik
>>
>>
>> On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
>> wrote:
>>
>>> Hi,
>>>
>>> What is the process to request an issue/fix to be included in the next
>>> release? Is there a place to vote for features?
>>> I am interested in https://issues.apache.org/jira/browse/SPARK-13127,
>>> to see
>>> if we can get Spark to upgrade Parquet to 1.9.0, which addresses
>>> https://issues.apache.org/jira/browse/PARQUET-686.
>>> Can we include the fix in Spark 2.3 release?
>>>
>>> Thanks,
>>>
>>> Dong
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag


Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Sean Owen
I'd follow LEGAL-270, yes.  The best resource on licensing is
https://www.apache.org/legal/resolved.html ; it doesn't all have to be AL2,
but needs to be compatible (sometimes with additional conditions). Auditing
is basically entrusted to the PMC when voting on releases. I'll look at it
with you.

Only bits that are redistributed officially matter. That is, a Dockerfile
itself has no licensing issues. Images with copies of software would be the
issue. Distributing a whole JVM and Python distro is probably going to
bring in far too much.

On Tue, Dec 19, 2017 at 2:59 PM Erik Erlandson  wrote:

>
> Here are some specific questions I'd recommend for the Apache Spark PMC to
> bring to ASF legal counsel:
>
> 1) Does the philosophy described on LEGAL-270 still represent a sanctioned
> approach to publishing releases via container image?
> 2) If the transitive closure of pulled-in licenses on each of these images
> is limited to licenses that are defined as compatible with Apache-2, does
> that satisfy ASF
> licensing and legal guidelines?
> 3) What form of documentation/auditing for (2) should be provided to meet
> legal requirements?
>
> I would define the proposed action this way: to include, as part of the
> Apache Spark official release process, publishing a "spark-base" image, to
> be tagged with the specific release, that consists of a build of the spark
> code for that release installed on a base-image (currently alpine, but
> possibly some other alternative like centos), combined with the jvm and
> python (and any of their transitive deps).  Additionally, some number of
> images derived from "spark-base" would be built, which consist of
> spark-base and a small layer of bash scripting for ENTRYPOINT and CMD, to
> support the kubernetes back-end.  Optionally, similar images targeted for
> mesos or yarn might also be created.
>
>
> On Tue, Dec 19, 2017 at 1:28 PM, Mark Hamstra 
> wrote:
>
>> Reasoning by analogy to other Apache projects is generally not sufficient
>> when it comes to securing legally permissible form or behavior -- that
>> another project is doing something is not a guarantee that they are doing
>> it right. If we have issues or legal questions, we need to formulate them
>> and our proposed actions as clearly and concretely as possible so that the
>> PMC can take those issues, questions and proposed actions to Apache counsel
>> for advice or guidance.
>>
>> On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson 
>> wrote:
>>
>>> I've been looking a bit more into ASF legal posture on licensing and
>>> container images. What I have found indicates that ASF considers container
>>> images to be just another variety of distribution channel.  As such, it is
>>> acceptable to publish official releases; for example an image such as
>>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>>> do something like regularly publish spark:latest built from the head of
>>> master.
>>>
>>> More detail here:
>>> https://issues.apache.org/jira/browse/LEGAL-270
>>>
>>> So as I understand it, making a release-tagged public image as part of
>>> each official release does not pose any problems.
>>>
>>> With respect to considering the licenses of other ancillary dependencies
>>> that are also installed on such container images, I noticed this clause in
>>> the legal boilerplate for the Flink images
>>> :
>>>
>>> As with all Docker images, these likely also contain other software
 which may be under other licenses (such as Bash, etc from the base
 distribution, along with any direct or indirect dependencies of the primary
 software being contained).

>>>
>>> So it may be sufficient to resolve this via disclaimer.
>>>
>>> -Erik
>>>
>>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson 
>>> wrote:
>>>
 Currently the containers are based off alpine, which pulls in BSD2 and
 MIT licensing:
 https://github.com/apache/spark/pull/19717#discussion_r154502824

 to the best of my understanding, neither of those poses a problem.  If
 we based the image off of centos I'd also expect the licensing of any image
 deps to be compatible.

 On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
 wrote:

> What licensing issues come into play?
>
> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
> wrote:
>
>> We've been discussing the topic of container images a bit more.  The
>> kubernetes back-end operates by executing some specific CMD and 
>> ENTRYPOINT
>> logic, which is different than mesos, and which is probably not practical
>> to unify at this level.
>>
>> However: These CMD and ENTRYPOINT configurations are essentially just
>> a thin skin on top of an image which is just an install of a 

Fwd: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
Here are some specific questions I'd recommend for the Apache Spark PMC to
bring to ASF legal counsel:

1) Does the philosophy described on LEGAL-270 still represent a sanctioned
approach to publishing releases via container image?
2) If the transitive closure of pulled-in licenses on each of these images
is limited to licenses that are defined as compatible with Apache-2, does
that satisfy ASF
licensing and legal guidelines?
3) What form of documentation/auditing for (2) should be provided to meet
legal requirements?

I would define the proposed action this way: to include, as part of the
Apache Spark official release process, publishing a "spark-base" image, to
be tagged with the specific release, that consists of a build of the spark
code for that release installed on a base-image (currently alpine, but
possibly some other alternative like centos), combined with the jvm and
python (and any of their transitive deps).  Additionally, some number of
images derived from "spark-base" would be built, which consist of
spark-base and a small layer of bash scripting for ENTRYPOINT and CMD, to
support the kubernetes back-end.  Optionally, similar images targeted for
mesos or yarn might also be created.


On Tue, Dec 19, 2017 at 1:28 PM, Mark Hamstra 
wrote:

> Reasoning by analogy to other Apache projects is generally not sufficient
> when it comes to securing legally permissible form or behavior -- that
> another project is doing something is not a guarantee that they are doing
> it right. If we have issues or legal questions, we need to formulate them
> and our proposed actions as clearly and concretely as possible so that the
> PMC can take those issues, questions and proposed actions to Apache counsel
> for advice or guidance.
>
> On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson 
> wrote:
>
>> I've been looking a bit more into ASF legal posture on licensing and
>> container images. What I have found indicates that ASF considers container
>> images to be just another variety of distribution channel.  As such, it is
>> acceptable to publish official releases; for example an image such as
>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>> do something like regularly publish spark:latest built from the head of
>> master.
>>
>> More detail here:
>> https://issues.apache.org/jira/browse/LEGAL-270
>>
>> So as I understand it, making a release-tagged public image as part of
>> each official release does not pose any problems.
>>
>> With respect to considering the licenses of other ancillary dependencies
>> that are also installed on such container images, I noticed this clause in
>> the legal boilerplate for the Flink images
>> :
>>
>> As with all Docker images, these likely also contain other software which
>>> may be under other licenses (such as Bash, etc from the base distribution,
>>> along with any direct or indirect dependencies of the primary software
>>> being contained).
>>>
>>
>> So it may be sufficient to resolve this via disclaimer.
>>
>> -Erik
>>
>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson 
>> wrote:
>>
>>> Currently the containers are based off alpine, which pulls in BSD2 and
>>> MIT licensing:
>>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>>
>>> to the best of my understanding, neither of those poses a problem.  If
>>> we based the image off of centos I'd also expect the licensing of any image
>>> deps to be compatible.
>>>
>>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
>>> wrote:
>>>
 What licensing issues come into play?

 On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
 wrote:

> We've been discussing the topic of container images a bit more.  The
> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
> logic, which is different than mesos, and which is probably not practical
> to unify at this level.
>
> However: These CMD and ENTRYPOINT configurations are essentially just
> a thin skin on top of an image which is just an install of a spark distro.
> We feel that a single "spark-base" image should be publishable, that is
> consumable by kube-spark images, and mesos-spark images, and likely any
> other community image whose primary purpose is running spark components.
> The kube-specific dockerfiles would be written "FROM spark-base" and just
> add the small command and entrypoint layers.  Likewise, the mesos images
> could add any specialization layers that are necessary on top of the
> "spark-base" image.
>
> Does this factorization sound reasonable to others?
> Cheers,
> Erik
>
>
> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <
> mri...@gmail.com> wrote:
>
>> We do support running on Apache 

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Mark Hamstra
Reasoning by analogy to other Apache projects is generally not sufficient
when it comes to securing legally permissible form or behavior -- that
another project is doing something is not a guarantee that they are doing
it right. If we have issues or legal questions, we need to formulate them
and our proposed actions as clearly and concretely as possible so that the
PMC can take those issues, questions and proposed actions to Apache counsel
for advice or guidance.

On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson 
wrote:

> I've been looking a bit more into ASF legal posture on licensing and
> container images. What I have found indicates that ASF considers container
> images to be just another variety of distribution channel.  As such, it is
> acceptable to publish official releases; for example an image such as
> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
> do something like regularly publish spark:latest built from the head of
> master.
>
> More detail here:
> https://issues.apache.org/jira/browse/LEGAL-270
>
> So as I understand it, making a release-tagged public image as part of
> each official release does not pose any problems.
>
> With respect to considering the licenses of other ancillary dependencies
> that are also installed on such container images, I noticed this clause in
> the legal boilerplate for the Flink images
> :
>
> As with all Docker images, these likely also contain other software which
>> may be under other licenses (such as Bash, etc from the base distribution,
>> along with any direct or indirect dependencies of the primary software
>> being contained).
>>
>
> So it may be sufficient to resolve this via disclaimer.
>
> -Erik
>
> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson 
> wrote:
>
>> Currently the containers are based off alpine, which pulls in BSD2 and
>> MIT licensing:
>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>
>> to the best of my understanding, neither of those poses a problem.  If we
>> based the image off of centos I'd also expect the licensing of any image
>> deps to be compatible.
>>
>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
>> wrote:
>>
>>> What licensing issues come into play?
>>>
>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
>>> wrote:
>>>
 We've been discussing the topic of container images a bit more.  The
 kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
 logic, which is different than mesos, and which is probably not practical
 to unify at this level.

 However: These CMD and ENTRYPOINT configurations are essentially just a
 thin skin on top of an image which is just an install of a spark distro.
 We feel that a single "spark-base" image should be publishable, that is
 consumable by kube-spark images, and mesos-spark images, and likely any
 other community image whose primary purpose is running spark components.
 The kube-specific dockerfiles would be written "FROM spark-base" and just
 add the small command and entrypoint layers.  Likewise, the mesos images
 could add any specialization layers that are necessary on top of the
 "spark-base" image.

 Does this factorization sound reasonable to others?
 Cheers,
 Erik


 On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan  wrote:

> We do support running on Apache Mesos via docker images - so this
> would not be restricted to k8s.
> But unlike mesos support, which has other modes of running, I believe
> k8s support more heavily depends on availability of docker images.
>
>
> Regards,
> Mridul
>
>
> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen  wrote:
> > Would it be logical to provide Docker-based distributions of other
> pieces of
> > Spark? or is this specific to K8S?
> > The problem is we wouldn't generally also provide a distribution of
> Spark
> > for the reasons you give, because if that, then why not RPMs and so
> on.
> >
> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
> ramanath...@google.com>
> > wrote:
> >>
> >> In this context, I think the docker images are similar to the
> binaries
> >> rather than an extension.
> >> It's packaging the compiled distribution to save people the effort
> of
> >> building one themselves, akin to binaries or the python package.
> >>
> >> For reference, this is the base dockerfile for the main image that
> we
> >> intend to publish. It's not particularly complicated.
> >> The driver and executor images are based on said base image and only
> >> customize the CMD (any file/directory inclusions are extraneous and
> will be
> >> removed).
> >>
> >> Is there 

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Sean Owen
Unfortunately you'll need to chase down the licenses of all the bits that
are distributed directly by the project. This was a big job back in the day
for the Maven artifacts and some work to maintain. Most of the work is
one-time, at least.

On Tue, Dec 19, 2017 at 12:53 PM Erik Erlandson  wrote:

> Agreed that the GPL family would be "toxic."
>
> The current images have been at least informally confirmed to use licenses
> that are ASF compatible.  Is there an officially sanctioned method of
> license auditing that can be applied here?
>
> On Tue, Dec 19, 2017 at 11:45 AM, Sean Owen  wrote:
>
>> I think that's all correct, though the license of third party
>> dependencies is actually a difficult and sticky part. The ASF couldn't make
>> a software release including any GPL software for example, and it's not
>> just a matter of adding a disclaimer. Any actual bits distributed by the
>> PMC would have to follow all the license rules.
>>
>> On Tue, Dec 19, 2017 at 12:34 PM Erik Erlandson 
>> wrote:
>>
>>> I've been looking a bit more into ASF legal posture on licensing and
>>> container images. What I have found indicates that ASF considers container
>>> images to be just another variety of distribution channel.  As such, it is
>>> acceptable to publish official releases; for example an image such as
>>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>>> do something like regularly publish spark:latest built from the head of
>>> master.
>>>
>>> More detail here:
>>> https://issues.apache.org/jira/browse/LEGAL-270
>>>
>>> So as I understand it, making a release-tagged public image as part of
>>> each official release does not pose any problems.
>>>
>>> With respect to considering the licenses of other ancillary dependencies
>>> that are also installed on such container images, I noticed this clause in
>>> the legal boilerplate for the Flink images
>>> :
>>>
>>> As with all Docker images, these likely also contain other software
 which may be under other licenses (such as Bash, etc from the base
 distribution, along with any direct or indirect dependencies of the primary
 software being contained).

>>>
>>> So it may be sufficient to resolve this via disclaimer.
>>>
>>> -Erik
>>>
>>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson 
>>> wrote:
>>>
 Currently the containers are based off alpine, which pulls in BSD2 and
 MIT licensing:
 https://github.com/apache/spark/pull/19717#discussion_r154502824

 to the best of my understanding, neither of those poses a problem.  If
 we based the image off of centos I'd also expect the licensing of any image
 deps to be compatible.

 On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
 wrote:

> What licensing issues come into play?
>
> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
> wrote:
>
>> We've been discussing the topic of container images a bit more.  The
>> kubernetes back-end operates by executing some specific CMD and 
>> ENTRYPOINT
>> logic, which is different than mesos, and which is probably not practical
>> to unify at this level.
>>
>> However: These CMD and ENTRYPOINT configurations are essentially just
>> a thin skin on top of an image which is just an install of a spark 
>> distro.
>> We feel that a single "spark-base" image should be publishable, that is
>> consumable by kube-spark images, and mesos-spark images, and likely any
>> other community image whose primary purpose is running spark components.
>> The kube-specific dockerfiles would be written "FROM spark-base" and just
>> add the small command and entrypoint layers.  Likewise, the mesos images
>> could add any specialization layers that are necessary on top of the
>> "spark-base" image.
>>
>> Does this factorization sound reasonable to others?
>> Cheers,
>> Erik
>>
>>
>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <
>> mri...@gmail.com> wrote:
>>
>>> We do support running on Apache Mesos via docker images - so this
>>> would not be restricted to k8s.
>>> But unlike mesos support, which has other modes of running, I believe
>>> k8s support more heavily depends on availability of docker images.
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen 
>>> wrote:
>>> > Would it be logical to provide Docker-based distributions of other
>>> pieces of
>>> > Spark? or is this specific to K8S?
>>> > The problem is we wouldn't generally also provide a distribution
>>> of Spark
>>> > for the reasons you give, because if that, then why not RPMs and
>>> so on.
>>> >
>>> > On Wed, Nov 29, 2017 at 

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
Agreed that the GPL family would be "toxic."

The current images have been at least informally confirmed to use licenses
that are ASF compatible.  Is there an officially sanctioned method of
license auditing that can be applied here?

On Tue, Dec 19, 2017 at 11:45 AM, Sean Owen  wrote:

> I think that's all correct, though the license of third party dependencies
> is actually a difficult and sticky part. The ASF couldn't make a software
> release including any GPL software for example, and it's not just a matter
> of adding a disclaimer. Any actual bits distributed by the PMC would have
> to follow all the license rules.
>
> On Tue, Dec 19, 2017 at 12:34 PM Erik Erlandson 
> wrote:
>
>> I've been looking a bit more into ASF legal posture on licensing and
>> container images. What I have found indicates that ASF considers container
>> images to be just another variety of distribution channel.  As such, it is
>> acceptable to publish official releases; for example an image such as
>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>> do something like regularly publish spark:latest built from the head of
>> master.
>>
>> More detail here:
>> https://issues.apache.org/jira/browse/LEGAL-270
>>
>> So as I understand it, making a release-tagged public image as part of
>> each official release does not pose any problems.
>>
>> With respect to considering the licenses of other ancillary dependencies
>> that are also installed on such container images, I noticed this clause in
>> the legal boilerplate for the Flink images
>> :
>>
>> As with all Docker images, these likely also contain other software which
>>> may be under other licenses (such as Bash, etc from the base distribution,
>>> along with any direct or indirect dependencies of the primary software
>>> being contained).
>>>
>>
>> So it may be sufficient to resolve this via disclaimer.
>>
>> -Erik
>>
>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson 
>> wrote:
>>
>>> Currently the containers are based off alpine, which pulls in BSD2 and
>>> MIT licensing:
>>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>>
>>> to the best of my understanding, neither of those poses a problem.  If
>>> we based the image off of centos I'd also expect the licensing of any image
>>> deps to be compatible.
>>>
>>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
>>> wrote:
>>>
 What licensing issues come into play?

 On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
 wrote:

> We've been discussing the topic of container images a bit more.  The
> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
> logic, which is different than mesos, and which is probably not practical
> to unify at this level.
>
> However: These CMD and ENTRYPOINT configurations are essentially just
> a thin skin on top of an image which is just an install of a spark distro.
> We feel that a single "spark-base" image should be publishable, that is
> consumable by kube-spark images, and mesos-spark images, and likely any
> other community image whose primary purpose is running spark components.
> The kube-specific dockerfiles would be written "FROM spark-base" and just
> add the small command and entrypoint layers.  Likewise, the mesos images
> could add any specialization layers that are necessary on top of the
> "spark-base" image.
>
> Does this factorization sound reasonable to others?
> Cheers,
> Erik
>
>
> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <
> mri...@gmail.com> wrote:
>
>> We do support running on Apache Mesos via docker images - so this
>> would not be restricted to k8s.
>> But unlike mesos support, which has other modes of running, I believe
>> k8s support more heavily depends on availability of docker images.
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen 
>> wrote:
>> > Would it be logical to provide Docker-based distributions of other
>> pieces of
>> > Spark? or is this specific to K8S?
>> > The problem is we wouldn't generally also provide a distribution of
>> Spark
>> > for the reasons you give, because if that, then why not RPMs and so
>> on.
>> >
>> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
>> ramanath...@google.com>
>> > wrote:
>> >>
>> >> In this context, I think the docker images are similar to the
>> binaries
>> >> rather than an extension.
>> >> It's packaging the compiled distribution to save people the effort
>> of
>> >> building one themselves, akin to binaries or the python package.
>> >>
>> >> For reference, this is the base dockerfile for the main 

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Sean Owen
I think that's all correct, though the license of third party dependencies
is actually a difficult and sticky part. The ASF couldn't make a software
release including any GPL software for example, and it's not just a matter
of adding a disclaimer. Any actual bits distributed by the PMC would have
to follow all the license rules.

On Tue, Dec 19, 2017 at 12:34 PM Erik Erlandson  wrote:

> I've been looking a bit more into ASF legal posture on licensing and
> container images. What I have found indicates that ASF considers container
> images to be just another variety of distribution channel.  As such, it is
> acceptable to publish official releases; for example an image such as
> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
> do something like regularly publish spark:latest built from the head of
> master.
>
> More detail here:
> https://issues.apache.org/jira/browse/LEGAL-270
>
> So as I understand it, making a release-tagged public image as part of
> each official release does not pose any problems.
>
> With respect to considering the licenses of other ancillary dependencies
> that are also installed on such container images, I noticed this clause in
> the legal boilerplate for the Flink images
> :
>
> As with all Docker images, these likely also contain other software which
>> may be under other licenses (such as Bash, etc from the base distribution,
>> along with any direct or indirect dependencies of the primary software
>> being contained).
>>
>
> So it may be sufficient to resolve this via disclaimer.
>
> -Erik
>
> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson 
> wrote:
>
>> Currently the containers are based off alpine, which pulls in BSD2 and
>> MIT licensing:
>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>
>> to the best of my understanding, neither of those poses a problem.  If we
>> based the image off of centos I'd also expect the licensing of any image
>> deps to be compatible.
>>
>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
>> wrote:
>>
>>> What licensing issues come into play?
>>>
>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
>>> wrote:
>>>
 We've been discussing the topic of container images a bit more.  The
 kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
 logic, which is different than mesos, and which is probably not practical
 to unify at this level.

 However: These CMD and ENTRYPOINT configurations are essentially just a
 thin skin on top of an image which is just an install of a spark distro.
 We feel that a single "spark-base" image should be publishable, that is
 consumable by kube-spark images, and mesos-spark images, and likely any
 other community image whose primary purpose is running spark components.
 The kube-specific dockerfiles would be written "FROM spark-base" and just
 add the small command and entrypoint layers.  Likewise, the mesos images
 could add any specialization layers that are necessary on top of the
 "spark-base" image.

 Does this factorization sound reasonable to others?
 Cheers,
 Erik


 On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan  wrote:

> We do support running on Apache Mesos via docker images - so this
> would not be restricted to k8s.
> But unlike mesos support, which has other modes of running, I believe
> k8s support more heavily depends on availability of docker images.
>
>
> Regards,
> Mridul
>
>
> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen  wrote:
> > Would it be logical to provide Docker-based distributions of other
> pieces of
> > Spark? or is this specific to K8S?
> > The problem is we wouldn't generally also provide a distribution of
> Spark
> > for the reasons you give, because if that, then why not RPMs and so
> on.
> >
> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
> ramanath...@google.com>
> > wrote:
> >>
> >> In this context, I think the docker images are similar to the
> binaries
> >> rather than an extension.
> >> It's packaging the compiled distribution to save people the effort
> of
> >> building one themselves, akin to binaries or the python package.
> >>
> >> For reference, this is the base dockerfile for the main image that
> we
> >> intend to publish. It's not particularly complicated.
> >> The driver and executor images are based on said base image and only
> >> customize the CMD (any file/directory inclusions are extraneous and
> will be
> >> removed).
> >>
> >> Is there only one way to build it? That's a bit harder to reason
> about.
> >> The base image I'd argue is likely going to always be 

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Erik Erlandson
I've been looking a bit more into ASF legal posture on licensing and
container images. What I have found indicates that ASF considers container
images to be just another variety of distribution channel.  As such, it is
acceptable to publish official releases; for example an image such as
spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
do something like regularly publish spark:latest built from the head of
master.

More detail here:
https://issues.apache.org/jira/browse/LEGAL-270

So as I understand it, making a release-tagged public image as part of each
official release does not pose any problems.

With respect to considering the licenses of other ancillary dependencies
that are also installed on such container images, I noticed this clause in
the legal boilerplate for the Flink images:

As with all Docker images, these likely also contain other software which
> may be under other licenses (such as Bash, etc from the base distribution,
> along with any direct or indirect dependencies of the primary software
> being contained).
>

So it may be sufficient to resolve this via disclaimer.

-Erik

On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson  wrote:

> Currently the containers are based off alpine, which pulls in BSD2 and MIT
> licensing:
> https://github.com/apache/spark/pull/19717#discussion_r154502824
>
> to the best of my understanding, neither of those poses a problem.  If we
> based the image off of centos I'd also expect the licensing of any image
> deps to be compatible.
>
> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra 
> wrote:
>
>> What licensing issues come into play?
>>
>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson 
>> wrote:
>>
>>> We've been discussing the topic of container images a bit more.  The
>>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>>> logic, which is different than mesos, and which is probably not practical
>>> to unify at this level.
>>>
>>> However: These CMD and ENTRYPOINT configurations are essentially just a
>>> thin skin on top of an image which is just an install of a spark distro.
>>> We feel that a single "spark-base" image should be publishable, that is
>>> consumable by kube-spark images, and mesos-spark images, and likely any
>>> other community image whose primary purpose is running spark components.
>>> The kube-specific dockerfiles would be written "FROM spark-base" and just
>>> add the small command and entrypoint layers.  Likewise, the mesos images
>>> could add any specialization layers that are necessary on top of the
>>> "spark-base" image.
>>>
>>> Does this factorization sound reasonable to others?
>>> Cheers,
>>> Erik
>>>
>>>
>>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan 
>>> wrote:
>>>
 We do support running on Apache Mesos via docker images - so this
 would not be restricted to k8s.
 But unlike mesos support, which has other modes of running, I believe
 k8s support more heavily depends on availability of docker images.


 Regards,
 Mridul


 On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen  wrote:
 > Would it be logical to provide Docker-based distributions of other
 pieces of
 > Spark? or is this specific to K8S?
 > The problem is we wouldn't generally also provide a distribution of
 Spark
 > for the reasons you give, because if that, then why not RPMs and so
 on.
 >
 > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
 ramanath...@google.com>
 > wrote:
 >>
 >> In this context, I think the docker images are similar to the
 binaries
 >> rather than an extension.
 >> It's packaging the compiled distribution to save people the effort of
 >> building one themselves, akin to binaries or the python package.
 >>
 >> For reference, this is the base dockerfile for the main image that we
 >> intend to publish. It's not particularly complicated.
 >> The driver and executor images are based on said base image and only
 >> customize the CMD (any file/directory inclusions are extraneous and
 will be
 >> removed).
 >>
 >> Is there only one way to build it? That's a bit harder to reason
 about.
 >> The base image I'd argue is likely going to always be built that
 way. The
 >> driver and executor images, there may be cases where people want to
 >> customize it - (like putting all dependencies into it for example).
 >> In those cases, as long as our images are bare bones, they can use
 the
 >> spark-driver/spark-executor images we publish as the base, and build
 their
 >> customization as a layer on top of it.
 >>
 >> I think the composability of docker images, makes this a bit
 different
 >> from say - debian packages.
 >> We can publish canonical 

Re: Decimals

2017-12-19 Thread Marco Gaido
Hello everybody,

I did some further research and I am now sharing my findings. I am sorry,
this is going to be quite a long e-mail, but I'd really appreciate some
feedback when you have time to read it.

Spark's current implementation of arithmetic operations on decimals was
"copied" from Hive. Thus, the initial goal of the implementation was to be
compliant with Hive, which itself aims to reproduce SQLServer behavior.
Therefore I compared these 3 DBs and of course I checked the SQL ANSI
standard 2011 (you can find it at
http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip)
and a late draft of the standard 2003 (
http://www.wiscorp.com/sql_2003_standard.zip). There are 3 main topics:

   1. how to determine the precision and scale of a result;
   2. how to behave when the result is a number which is not exactly
   representable with the result's precision and scale (i.e. it requires a
   precision loss);
   3. how to behave when the result is out of the range of the values
   representable with the result's precision and scale (i.e. it is bigger
   than the biggest representable number or smaller than the smallest one).

Currently, Spark behaves as follows (a short example follows the list):

   1. It follows some rules taken from the initial Hive implementation;
   2. it returns NULL;
   3. it returns NULL.
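
As a small spark-shell illustration of points 2 and 3 (a sketch of the
current behavior as of Spark 2.2; the values, the column alias and the
derived result type shown in the comments are only illustrative of how the
current rules play out):

    // decimal(38,18) * decimal(38,18): the unbounded result type would be
    // decimal(77,36), which gets capped at the maximum precision of 38
    val df = spark.sql(
      """SELECT CAST('10000000000' AS DECIMAL(38,18)) *
        |       CAST('10000000000' AS DECIMAL(38,18)) AS p""".stripMargin)
    df.printSchema()  // p: decimal(38,36)
    df.show()         // p is NULL here: 10^20 does not fit into decimal(38,36)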


The SQL ANSI standard is pretty clear about points 2 and 3, while it says
barely anything about point 1. I am citing SQL ANSI:2011, page 27:

If the result cannot be represented exactly in the result type, then
> whether it is rounded
> or truncated is implementation-defined. An exception condition is raised
> if the result is
> outside the range of numeric values of the result type, or if the
> arithmetic operation
> is not defined for the operands.


As you can see, Spark is not respecting the SQL standard for either
point 2 or point 3. Someone might then argue that we need compatibility with
Hive, so let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's
behavior is:

   1. The rules are slightly changed, to reflect the SQLServer implementation
   as described in this blog post (
   
https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/
   );
   2. It rounds the result;
   3. It returns NULL (HIVE-18291 is open to make it compliant with the SQL
   ANSI standard and throw an exception).

As far as the other DBs are concerned, there is little to say about Oracle
and Postgres, since they have nearly infinite precision, so it is also hard
to test the behavior under these conditions; SQLServer, however, has the same
precision as Hive and Spark. This is SQLServer's behavior:

   1. The rules should be the same as Hive's, as described in their post
   (tests of the behavior confirm this);
   2. It rounds the result;
   3. It throws an Exception.

Therefore, since I think that Spark should be compliant with the SQL ANSI
standard (first) and with Hive, I propose the following changes:

   1. Update the rules for deriving the result type so that they reflect
   Hive's new ones (which are SQLServer's);
   2. Change Spark's behavior to round the result, as done by Hive and
   SQLServer and as prescribed by the SQL standard;
   3. Change Spark's behavior by introducing a configuration parameter to
   determine whether to return null or throw an exception, as sketched below
   (by default I propose to throw an exception in order to be compliant with
   the SQL standard, which IMHO is more important than being compliant with
   Hive).
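
To make point 3 more concrete, here is a purely hypothetical sketch of how
such a switch could look from the user's side (the configuration key name
below is invented here for illustration and is not an existing Spark
setting):

    // hypothetical -- the key name is invented for illustration only
    spark.conf.set("spark.sql.decimalOperations.failOnOverflow", "true")
    // with the flag enabled, a decimal result that is out of range would
    // raise an exception (per the SQL standard) instead of becoming NULL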

For 1 and 2, I prepared a PR, which is
https://github.com/apache/spark/pull/20023. For 3, I'd love to get your
feedback in order to agree on what to do, and then I will eventually open a
PR reflecting what is decided here by the community.
I would really love to get your feedback either here or on the PR.

Thanks for your patience and your time reading this long email,
Best regards.
Marco


2017-12-13 9:08 GMT+01:00 Reynold Xin :

> Responses inline
>
> On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> In recent weeks I have seen a lot of problems related to decimal
>> values (SPARK-22036, SPARK-22755, for instance). Some are related to
>> historical choices, which I don't know, so please excuse me if I am
>> saying dumb things:
>>
>>  - why are we interpreting literal constants in queries as Decimal and
>> not as Double? I think it is very unlikely that a user would enter a number
>> which is beyond Double precision.
>>
>
> Probably just to be consistent with some popular databases.
>
>
>
>>  - why are we returning null in case of precision loss? Is this approach
>> better than just giving a result which might lose some accuracy?
>>
>
> The contract with decimal is that it should never lose precision (it is
> created for financial reports, accounting, etc). Returning null is at least
> telling the user the data type can no longer support the precision required.
>
>
>
>>
>> Thanks,
>> Marco
>>
>
>