Any way for users to help "stuck" JIRAs with pull requests for Spark 2.3 / future releases?

2017-12-21 Thread Ewan Leith
Hi all,

I was wondering, with Spark 2.3 approaching, if there's any way us "regular" 
users can help advance any of the JIRAs that could have made it into Spark 2.3 but 
are likely to miss it now, as their pull requests are awaiting detailed review.

For example:

https://issues.apache.org/jira/browse/SPARK-4502 - Spark SQL reads unnecessary 
nested fields from Parquet

This has had a pull request open since January 2017 with significant performance 
benefits for Parquet reads.
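
To illustrate the kind of query affected (the path, schema and column names here 
are hypothetical examples, not taken from the ticket): a job that selects a single 
nested field still ends up reading the whole struct from Parquet without the 
pruning in that PR.

import org.apache.spark.sql.SparkSession

// Hypothetical data: a Parquet file whose `event` column is a wide struct,
// e.g. event: struct<id: long, payload: string, ...many more fields...>.
val spark = SparkSession.builder().appName("nested-field-read").getOrCreate()
val df = spark.read.parquet("/tmp/events.parquet")

// Only event.id is needed, but without nested-schema pruning Spark still
// materialises the entire `event` struct when scanning the Parquet files.
df.select("event.id").show()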

https://issues.apache.org/jira/browse/SPARK-21657 - Spark has exponential time 
complexity to explode(array of structs)

This probably affects fewer users, but it will be a real help for those it does.
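
Again, a rough sketch of the workload shape that JIRA describes (the column names 
and path are made up for illustration): exploding an array of structs that has 
many fields.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical data: `items` is an array of structs with many fields each.
val spark = SparkSession.builder().appName("explode-structs").getOrCreate()
val df = spark.read.parquet("/tmp/orders.parquet")

// Exploding the array and then expanding the struct is where the reported
// blow-up in planning/execution time shows up as the field count grows.
df.select(explode(col("items")).as("item"))
  .select("item.*")
  .show()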

Both of these example tickets probably need more testing, but unless they are 
merged into the master branch and included in a release (behind a config setting 
that disables them by default), the testing will be pretty limited.

Is there anything us users can do to help out with these kinds of tickets, or do 
they need to wait for some additional core developer time to free up? (I know 
that's in huge demand everywhere in the project!)

Thanks,
Ewan







Re: Decimals

2017-12-21 Thread Xiao Li
Losing precision is not acceptable to financial customers. Thus, instead of
returning NULL, I have seen DB2 issue the following error message:

SQL0802N  Arithmetic overflow or other arithmetic exception occurred.
SQLSTATE=22003

DB2 on z/OS has been used by most of the biggest banks and financial institutions
since the 1980s. Either issuing exceptions (what DB2 does) or returning NULL
(what we are doing) looks fine to me. If they want to avoid getting NULLs or
exceptions, users should manually apply the round functions themselves.

Also see the technote of DB2 zOS:
http://www-01.ibm.com/support/docview.wss?uid=swg21161024
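
To make the two options concrete, here is a small sketch of the behavior under
discussion (the values and the cast-based workaround are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("decimal-overflow-sketch").getOrCreate()
import spark.implicits._

// Scala BigDecimals map to decimal(38,18), so the product's result type keeps
// a very large scale and the integer part of 100.0 * 10.0 no longer fits:
// with the current behavior the result comes back as NULL.
val df = Seq((BigDecimal("100.0"), BigDecimal("10.0"))).toDF("a", "b")
df.select(($"a" * $"b").as("ab")).show()

// The manual workaround mentioned above: narrow the operands first (via cast
// or round) so that the result type can represent the value exactly.
df.select(($"a".cast("decimal(20,2)") * $"b".cast("decimal(20,2)")).as("ab")).show()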






2017-12-19 8:41 GMT-08:00 Marco Gaido :

> Hello everybody,
>
> I did some further research and now I am sharing my findings. I am
> sorry, it is going to be quite a long e-mail, but I'd really appreciate
> some feedback when you have time to read it.
>
> Spark's current implementation of arithmetic operations on decimals was
> "copied" from Hive. Thus, the initial goal of the implementation was to be
> compliant with Hive, which itself aims to reproduce SQLServer behavior.
> Therefore I compared these 3 DBs and of course I checked the SQL ANSI
> standard 2011 (you can find it at
> http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip)
> and a late draft of the 2003 standard (http://www.wiscorp.com/sql_2003_standard.zip).
> There are three main topics:
>
>    1. how to determine the precision and scale of a result;
>    2. how to behave when the result is a number which is not
>    representable exactly with the result's precision and scale (i.e. it
>    requires precision loss);
>    3. how to behave when the result is out of the range of the
>    representable values with the result's precision and scale (i.e. it is
>    bigger than the biggest representable number or lower than the lowest one).
>
> Currently, Spark behaves as follows:
>
>    1. It follows some rules taken from the initial Hive implementation;
>    2. it returns NULL;
>    3. it returns NULL.
>
>
> The SQL ANSI standard is pretty clear about points 2 and 3, while it says
> barely anything about point 1. I am citing SQL ANSI:2011 page 27:
>
> If the result cannot be represented exactly in the result type, then
>> whether it is rounded
>> or truncated is implementation-defined. An exception condition is raised
>> if the result is
>> outside the range of numeric values of the result type, or if the
>> arithmetic operation
>> is not defined for the operands.
>
>
> Then, as you can see, Spark is not respecting the SQL standard for either
> point 2 or point 3. Someone might then argue that we need compatibility with
> Hive, so let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's
> behavior is:
>
>    1. Rules are a bit changed, to reflect the SQLServer implementation as
>    described in this blog post:
>    https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/;
>    2. It rounds the result;
>    3. It returns NULL (HIVE-18291 is open to be compliant with the SQL ANSI
>    standard and throw an Exception).
>
> As far as the other DBs are concerned, there is little to say about Oracle
> and Postgres, since they have a nearly infinite precision, thus it is also
> hard to test the behavior under these conditions, but SQLServer has the same
> precision as Hive and Spark. Thus, this is SQLServer's behavior:
>
>    1. The rules should be the same as Hive's, as described in their post
>    (tests of the behavior confirm this);
>    2. It rounds the result;
>    3. It throws an Exception.
>
> Therefore, since I think that Spark should be compliant with the SQL ANSI
> standard (first) and Hive, I propose the following changes:
>
>    1. Update the rules to derive the result type in order to reflect Hive's
>    new rules (which are SQLServer's);
>    2. Change Spark's behavior to round the result, as done by Hive and
>    SQLServer and prescribed by the SQL standard;
>    3. Change Spark's behavior, introducing a configuration parameter in
>    order to determine whether to return NULL or throw an Exception (by default
>    I propose to throw an exception in order to be compliant with the SQL
>    standard, which IMHO is more important than being compliant with Hive).
>
> For 1 and 2, I prepared a PR, which is
> https://github.com/apache/spark/pull/20023. For 3, I'd love to get your
> feedback in order to agree on what to do, and then I will eventually open a PR
> which reflects what is decided here by the community.
> I would really love to get your feedback either here or on the PR.
>
> Thanks for your patience and your time reading this long email,
> Best regards.
> Marco
>
>
> 2017-12-13 9:08 GMT+01:00 Reynold Xin :
>
>> Responses inline
>>
>> On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido 
>> wrote:
>>
>>> Hi all,
>>>
>>> I saw in these weeks that there are a lot of problems related to 

Anyone know how to bypass tools.jar problem in JDK9 when mvn clean install SPARK code

2017-12-21 Thread Zhang, Liyun
Hi all:
  Now I am using JDK9 to compile Spark (mvn clean install -DskipTests), but an 
exception was thrown:

[root@bdpe41 spark_source]# java -version
java version "9.0.1"
Java(TM) SE Runtime Environment (build 9.0.1+11)
Java HotSpot(TM) 64-Bit Server VM (build 9.0.1+11, mixed mode)

#mvn clean install -Pscala-2.12  -Pyarn -Pparquet-provided -DskipTests  
-X>log.mvn.clean.installed 2>&1

654189 [INFO] Spark Project Hive . SUCCESS [ 30.708 s]
654190 [INFO] Spark Project REPL . SUCCESS [  2.795 s]
654191 [INFO] Spark Project YARN Shuffle Service . SUCCESS [  6.411 s]
654192 [INFO] Spark Project YARN . FAILURE [  0.047 s]
654193 [INFO] Spark Project Assembly . SKIPPED
654194 [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
654195 [INFO] Kafka 0.10 Source for Structured Streaming . SKIPPED
654196 [INFO] Spark Project Examples . SKIPPED
654197 [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
654198 [INFO] 

654199 [INFO] BUILD FAILURE
654200 [INFO] 

654201 [INFO] Total time: 07:04 min
654202 [INFO] Finished at: 2017-12-21T03:38:40+08:00
654203 [INFO] Final Memory: 85M/280M
654204 [INFO] 

654205 [ERROR] Failed to execute goal on project spark-yarn_2.12: Could not 
resolve dependencies for project 
org.apache.spark:spark-yarn_2.12:jar:2.3.0-SNAPSHOT: Could not find 
artifact jdk.tools:jdk.tools:jar:1.6 at specified path 
/home/zly/prj/oss/jdk-9.0.1/../lib/tools.jar -> [Help 1]
654206 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to 
execute goal on project spark-yarn_2.12: Could not resolve dependencies for 
project org.apache.spark:spark-yarn_2.12:jar:2.3.0-SNAPSHOT: Could not 
find artifact jdk.tools:jdk.tools:jar:1.6 at specified path 
/home/zly/prj/oss/jdk-9.0.1/../lib/tools.jar


There is no tools.jar in JDK9. I need to generate a Spark 2.3-SNAPSHOT in my 
local mvn repository to build another component (Hive on Spark). Does anyone 
know how to bypass this problem?




Best Regards
Kelly Zhang/Zhang,Liyun



Re: Anyone know how to bypass tools.jar problem in JDK9 when mvn clean install SPARK code

2017-12-21 Thread Sean Owen
You need to run ./dev/change-scala-version.sh 2.12 first



FW: Anyone know how to bypass tools.jar problem in JDK9 when mvn clean install SPARK code

2017-12-21 Thread Zhang, Liyun
Hi Sean:

Thanks for your reply. You mentioned that "You need to run 
./dev/change-scala-version.sh 2.12 first". I have done that but still have this 
problem; currently the problem is that tools.jar does not exist in JDK9, but 
some dependencies still require it when running "mvn clean install -DskipTests".



Best Regards
Kelly Zhang/Zhang,Liyun


From: Sean Owen [mailto:so...@cloudera.com]
Sent: Friday, December 22, 2017 6:56 AM
To: Zhang, Liyun 
Cc: dev@spark.apache.org
Subject: Re: Anyone know how to bypass tools.jar problem in JDK9 when mvn clean 
install SPARK code

You need to run ./dev/change-scala-version.sh 2.12 first




R: Decimals

2017-12-21 Thread Marco Gaido
Thanks for your answer Xiao. The point is that behaving like this is against the 
SQL standard and is also different from Hive's behavior. I would therefore propose 
adding a configuration flag to switch between the two behaviors: either being SQL 
and Hive compliant, or behaving like now (as Hermann was suggesting in the PR). 
Do we agree on this approach? If so, is there any way to read a configuration 
property in the catalyst project?
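
For what it's worth, here is a minimal sketch of how such a flag could be defined
and read (SQLConf lives in the sql/catalyst module; the config name and default
below are hypothetical placeholders, not a proposal for the final name):

// Sketch only: an entry added inside `object SQLConf`
// (org.apache.spark.sql.internal.SQLConf, in sql/catalyst), following the
// pattern of the existing entries there.
val DECIMAL_OPERATIONS_NULL_ON_OVERFLOW =
  buildConf("spark.sql.decimalOperations.nullOnOverflow")   // hypothetical name
    .doc("When true, arithmetic between decimals returns NULL on overflow " +
      "(the current behavior) instead of throwing an exception.")
    .booleanConf
    .createWithDefault(false)

// Any analyzer/optimizer rule that already has a SQLConf instance in scope
// could then branch on it:
//   if (conf.getConf(SQLConf.DECIMAL_OPERATIONS_NULL_ON_OVERFLOW)) ... else ...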

Thank you,
Marco

- Original message -
From: "Xiao Li" 
Sent: 21/12/2017 22:46
To: "Marco Gaido" 
Cc: "Reynold Xin" ; "dev@spark.apache.org" 

Subject: Re: Decimals


Re: Timeline for Spark 2.3

2017-12-21 Thread Kazuaki Ishizaki
+1 for cutting a branch earlier.
In some Asian countries, the 1st, 2nd, and 3rd of January are days off: 
https://www.timeanddate.com/holidays/
How about the 4th or 5th?

Regards,
Kazuaki Ishizaki



From:   Felix Cheung 
To: Michael Armbrust , Holden Karau 

Cc: Sameer Agarwal , Erik Erlandson 
, dev 
Date:   2017/12/21 04:48
Subject:Re: Timeline for Spark 2.3



+1
I think the earlier we cut a branch the better.


From: Michael Armbrust 
Sent: Tuesday, December 19, 2017 4:41:44 PM
To: Holden Karau
Cc: Sameer Agarwal; Erik Erlandson; dev
Subject: Re: Timeline for Spark 2.3 
 
Do people really need to be around for the branch cut (modulo the person 
cutting the branch)? 

1st or 2nd doesn't really matter to me, but I am +1 kicking this off as 
soon as we enter the new year :)

Michael

On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau  
wrote:
Sounds reasonable, although I'd choose the 2nd perhaps just since lots of 
folks are off on the 1st?

On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal  
wrote:
Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that 
(i.e., week of 8th Jan)? 


On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau  
wrote:
So personally I’d be in favour of pushing to early January; doing a 
release over the holidays is a little rough with herding all of the people to 
vote. 

On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson  
wrote:
I wanted to check in on the state of the 2.3 freeze schedule.  Original 
proposal was "late Dec", which is a bit open to interpretation.

We are working to get some refactoring done on the integration testing for 
the Kubernetes back-end in preparation for testing upcoming release 
candidates, however holiday vacation time is about to begin taking its 
toll both on upstream reviewing and on the "downstream" spark-on-kube 
fork.

If the freeze is pushed into January, that would take some of the pressure 
off the kube back-end upstreaming. However, regardless, I was wondering if 
the dates could be clarified.
Cheers,
Erik


On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com  
wrote:
Hi,

What is the process to request an issue/fix to be included in the next
release? Is there a place to vote for features?
I am interested in https://issues.apache.org/jira/browse/SPARK-13127, to see
if we can get Spark to upgrade Parquet to 1.9.0, which addresses
https://issues.apache.org/jira/browse/PARQUET-686.
Can we include the fix in the Spark 2.3 release?

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/



-- 
Twitter: https://twitter.com/holdenkarau



-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag



-- 
Twitter: https://twitter.com/holdenkarau