[jira] [Resolved] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18617.
-
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.1.0

> Close "kryo auto pick" feature for Spark Streaming
> --
>
> Key: SPARK-18617
> URL: https://issues.apache.org/jira/browse/SPARK-18617
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Genmao Yu
> Fix For: 2.1.0
>
>
> [PR-15992| https://github.com/apache/spark/pull/15992] provided a fix for the 
> bug that {{receiver data can not be deserialized properly}}. As [~zsxwing] 
> noted, it is a critical bug, but we should not break APIs between maintenance 
> releases. As a first step, it may be a reasonable choice to disable the 
> {{auto pick kryo serializer}} feature for Spark Streaming.
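For context, a minimal sketch (not part of the linked PR; the app name and local 
master are placeholders) of how an application can pin the serializer explicitly 
instead of relying on the auto-pick behavior:

{code}
// Hypothetical example: set spark.serializer explicitly so the choice of
// serializer no longer depends on the auto-pick logic.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("explicit-serializer")  // placeholder app name
  .setMaster("local[2]")              // placeholder master for a local test
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val ssc = new StreamingContext(conf, Seconds(1))
{code}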






[jira] [Resolved] (SPARK-18622) Missing Reference in Multi Union Clauses Caused by TypeCoercion

2016-11-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18622.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.1.0

> Missing Reference in Multi Union Clauses Caused by TypeCoercion
> --
>
> Key: SPARK-18622
> URL: https://issues.apache.org/jira/browse/SPARK-18622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Yerui Sun
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> {code}
> spark-sql> explain extended
>  > select a
>  > from
>  > (
>  >   select 0 a, 0 b
>  > union all
>  >   select sum(1) a, cast(0 as bigint) b
>  > union all
>  >   select 0 a, 0 b
>  > )t;
>  
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'SubqueryAlias t
>+- 'Union
>   :- 'Union
>   :  :- Project [0 AS a#0, 0 AS b#1]
>   :  :  +- OneRowRelation$
>   :  +- 'Project ['sum(1) AS a#2, cast(0 as bigint) AS b#3L]
>   : +- OneRowRelation$
>   +- Project [0 AS a#4, 0 AS b#5]
>  +- OneRowRelation$
>  
> == Analyzed Logical Plan ==
> a: int
> Project [a#0]
> +- SubqueryAlias t
>+- Union
>   :- !Project [a#0, b#9L]
>   :  +- Union
>   : :- Project [cast(a#0 as bigint) AS a#11L, b#9L]
>   : :  +- Project [a#0, cast(b#1 as bigint) AS b#9L]
>   : : +- Project [0 AS a#0, 0 AS b#1]
>   : :+- OneRowRelation$
>   : +- Project [a#2L, b#3L]
>   :+- Project [a#2L, b#3L]
>   :   +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as 
> bigint) AS b#3L]
>   :  +- OneRowRelation$
>   +- Project [a#4, cast(b#5 as bigint) AS b#10L]
>  +- Project [0 AS a#4, 0 AS b#5]
> +- OneRowRelation$
>  
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: resolved attribute(s) a#0 missing 
> from a#11L,b#9L in operator !Project [a#0, b#9L];;
> Project [a#0]
> +- SubqueryAlias t
>+- Union
>   :- !Project [a#0, b#9L]
>   :  +- Union
>   : :- Project [cast(a#0 as bigint) AS a#11L, b#9L]
>   : :  +- Project [a#0, cast(b#1 as bigint) AS b#9L]
>   : : +- Project [0 AS a#0, 0 AS b#1]
>   : :+- OneRowRelation$
>   : +- Project [a#2L, b#3L]
>   :+- Project [a#2L, b#3L]
>   :   +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as 
> bigint) AS b#3L]
>   :  +- OneRowRelation$
>   +- Project [a#4, cast(b#5 as bigint) AS b#10L]
>  +- Project [0 AS a#4, 0 AS b#5]
> +- OneRowRelation$
>  
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: resolved attribute(s) a#0 missing 
> from a#11L,b#9L in operator !Project [a#0, b#9L];;
> Project [a#0]
> +- SubqueryAlias t
>+- Union
>   :- !Project [a#0, b#9L]
>   :  +- Union
>   : :- Project [cast(a#0 as bigint) AS a#11L, b#9L]
>   : :  +- Project [a#0, cast(b#1 as bigint) AS b#9L]
>   : : +- Project [0 AS a#0, 0 AS b#1]
>   : :+- OneRowRelation$
>   : +- Project [a#2L, b#3L]
>   :+- Project [a#2L, b#3L]
>   :   +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as 
> bigint) AS b#3L]
>   :  +- OneRowRelation$
>   +- Project [a#4, cast(b#5 as bigint) AS b#10L]
>  +- Project [0 AS a#4, 0 AS b#5]
> +- OneRowRelation$
> {code}
> Key points to reproduce the issue:
> * 3 or more union clauses;
> * One column is a sum aggregate in one union clause and an Integer type in the 
> other union clauses;
> * Another column has different data types across the union clauses;
> The cause of the issue:
> - Step 1: TypeCoercion.WidenSetOperationTypes adds a project with casts because 
> the union clauses have different data types for one column; with 3 union 
> clauses, the inner union clause is also wrapped in a cast project;
> - Step 2: TypeCoercion.FunctionArgumentConversion widens the return type of 
> sum(int) to BigInt, so one column of the union clauses changes its data type;
> - Step 3: TypeCoercion.WidenSetOperationTypes is applied again and adds another 
> cast project on the inner union because the sum(int) data type changed; at this 
> point the references of the project ON the inner union go missing, since the 
> project IN the inner union is newly added (see the Analyzed Logical Plan);
> Possible fixes:
> * Since set operation type coercion should only be applied after the inner 
> clauses are stable, applying WidenSetOperationTypes last would fix the issue;
> * To avoid multiple levels of projects on set operation clauses, handle the 
> existing cast project carefully in

[jira] [Resolved] (SPARK-17680) Unicode Character Support for Column Names and Comments

2016-11-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17680.
-
Resolution: Fixed

> Unicode Character Support for Column Names and Comments
> ---
>
> Key: SPARK-17680
> URL: https://issues.apache.org/jira/browse/SPARK-17680
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Spark SQL supports Unicode characters in column names when they are specified 
> within backticks (`). When Hive support is enabled, the Hive metastore version 
> must be higher than 0.12; see HIVE-6013 
> (https://issues.apache.org/jira/browse/HIVE-6013): the Hive metastore supports 
> Unicode characters in column names since 0.13.
> In Spark SQL, table comments and view comments always allow Unicode 
> characters without backticks.
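A minimal illustrative sketch (the table and column names are made up; assumes a 
SparkSession named spark and, when Hive support is enabled, a metastore newer 
than 0.12):

{code}
// The Unicode column name must be quoted with backticks; the comment needs none.
spark.sql("CREATE TABLE unicode_demo (`列名` INT COMMENT 'comment 可以包含 Unicode') USING parquet")
spark.sql("SELECT `列名` FROM unicode_demo").show()
{code}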






[jira] [Updated] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18643:
-
Description: 
1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR (not from the sparkR shell)
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
backend.
{code}
Launching java with spark-submit command spark-submit   sparkr-shell 
/tmp/RtmpYbAYt5/backend_port5849dc2273
sh: 1: spark-submit: not found
{code}

If SparkR is running as a package and has previously downloaded the Spark jar, it 
should be able to run as before without having to set SPARK_HOME. With this bug, 
the auto-installed Spark only works in the first session.

This seems to be a regression from the earlier behavior.


  was:
1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR (not from the sparkR shell)
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
backend.

If SparkR is running as a package and has previously downloaded the Spark jar, it 
should be able to run as before without having to set SPARK_HOME. With this bug, 
the auto-installed Spark only works in the first session.

This seems to be a regression from the earlier behavior.



> SparkR hangs at session start when installed as a package without SPARK_HOME 
> set
> 
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> {code}
> Launching java with spark-submit command spark-submit   sparkr-shell 
> /tmp/RtmpYbAYt5/backend_port5849dc2273
> sh: 1: spark-submit: not found
> {code}
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Commented] (SPARK-17934) Support percentile scale in ml.feature

2016-11-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707710#comment-15707710
 ] 

yuhao yang commented on SPARK-17934:


We can probably implement something like Robust Scaler in sklearn.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
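A rough Scala sketch of the median/IQR idea behind sklearn's RobustScaler, built 
on DataFrame.stat.approxQuantile; this is not an existing or proposed Spark API, 
and the DataFrame and column name below are placeholders (assumes a SparkSession 
named spark):

{code}
import org.apache.spark.sql.functions.{col, lit}

// Placeholder data: a single numeric column named "value".
val df = spark.range(0, 100).toDF("value")

// Approximate 25th/50th/75th percentiles of the column.
val Array(q1, median, q3) = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.001)

// Center on the median and scale by the interquartile range, which is robust
// to anomalously large values.
val scaled = df.withColumn("scaled", (col("value") - lit(median)) / lit(q3 - q1))
scaled.show(5)
{code}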
 


> Support percentile scale in ml.feature
> --
>
> Key: SPARK-17934
> URL: https://issues.apache.org/jira/browse/SPARK-17934
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>
> Percentile scaling is often used for feature scaling.
> In my project, I need to use this scaler.
> Compared to MinMaxScaler, a PercentileScaler would not produce unstable results 
> due to anomalously large values.
> For background on percentile scaling, see https://en.wikipedia.org/wiki/Percentile_rank






[jira] [Commented] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707704#comment-15707704
 ] 

Felix Cheung commented on SPARK-18643:
--

A workaround is to start the session with sparkR.session(master="local"), but 
that is not always appropriate (not if the user intends to run Spark in 
non-local mode).

> SparkR hangs at session start when installed as a package without SPARK_HOME 
> set
> 
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Assigned] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18643:


Assignee: Apache Spark  (was: Felix Cheung)

> SparkR hangs at session start when installed as a package without SPARK_HOME 
> set
> 
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Assigned] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18643:


Assignee: Felix Cheung  (was: Apache Spark)

> SparkR hangs at session start when installed as a package without SPARK_HOME 
> set
> 
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Commented] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707701#comment-15707701
 ] 

Apache Spark commented on SPARK-18643:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16077

> SparkR hangs at session start when installed as a package without SPARK_HOME 
> set
> 
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Commented] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema

2016-11-29 Thread Pramod Anarase (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707699#comment-15707699
 ] 

Pramod Anarase commented on SPARK-16848:


+1

> Make jdbc() and read.format("jdbc") consistently throwing exception for 
> user-specified schema
> -
>
> Key: SPARK-16848
> URL: https://issues.apache.org/jira/browse/SPARK-16848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently,
> {code}
> spark.read.schema(StructType(Seq())).jdbc(...).show()
> {code}
> does not throw an exception, whereas
> {code}
> spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show()
> {code}
> does as below:
> {code}
> jdbc does not allow user-specified schemas.;
> org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified 
> schemas.;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at 
> org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351)
> {code}
> It would make sense to throw the exception consistently whenever the user specifies a schema.






[jira] [Updated] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18643:
-
Summary: SparkR hangs at session start when installed as a package without 
SPARK_HOME set  (was: SparkR hangs when installed as a package without 
SPARK_HOME set)

> SparkR hangs at session start when installed as a package without SPARK_HOME 
> set
> 
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Updated] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18643:
-
Description: 
1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR (not from the sparkR shell)
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
backend.

If SparkR is running as a package and has previously downloaded the Spark jar, it 
should be able to run as before without having to set SPARK_HOME. With this bug, 
the auto-installed Spark only works in the first session.

This seems to be a regression from the earlier behavior.


  was:
1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR (not from the sparkR shell)
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
backend.

If SparkR is running as a package and has previously downloaded the Spark jar, it 
should be able to run as before without having to set SPARK_HOME. This seems to 
be a regression from the earlier behavior.



> SparkR hangs when installed as a package without SPARK_HOME set
> ---
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. With this 
> bug, the auto-installed Spark only works in the first session.
> This seems to be a regression from the earlier behavior.






[jira] [Updated] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18643:
-
Description: 
1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR (not from the sparkR shell)
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
backend.

If SparkR is running as a package and has previously downloaded the Spark jar, it 
should be able to run as before without having to set SPARK_HOME. This seems to 
be a regression from the earlier behavior.


  was:
1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM backend


> SparkR hangs when installed as a package without SPARK_HOME set
> ---
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR (not from the sparkR shell)
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend.
> If SparkR is running as a package and has previously downloaded the Spark jar, 
> it should be able to run as before without having to set SPARK_HOME. This 
> seems to be a regression from the earlier behavior.






[jira] [Commented] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707675#comment-15707675
 ] 

Apache Spark commented on SPARK-18324:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16076

> ML, Graph 2.1 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-18324
> URL: https://issues.apache.org/jira/browse/SPARK-18324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")






[jira] [Created] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18643:


 Summary: SparkR hangs when installed as a package without 
SPARK_HOME set
 Key: SPARK-18643
 URL: https://issues.apache.org/jira/browse/SPARK-18643
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung
Priority: Critical


1) Install SparkR from the source package, i.e.
R CMD INSTALL SparkR_2.1.0.tar.gz

2) Start SparkR
library(SparkR)
sparkR.session()

Notice that SparkR hangs when it cannot find spark-submit to launch the JVM backend






[jira] [Commented] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set

2016-11-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707669#comment-15707669
 ] 

Felix Cheung commented on SPARK-18643:
--

Related PR: https://github.com/apache/spark/pull/15888


> SparkR hangs when installed as a package without SPARK_HOME set
> ---
>
> Key: SPARK-18643
> URL: https://issues.apache.org/jira/browse/SPARK-18643
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Critical
>
> 1) Install SparkR from the source package, i.e.
> R CMD INSTALL SparkR_2.1.0.tar.gz
> 2) Start SparkR
> library(SparkR)
> sparkR.session()
> Notice that SparkR hangs when it cannot find spark-submit to launch the JVM 
> backend






[jira] [Resolved] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-11-29 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17692.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
> Fix For: 2.1.0
>
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  
> * SPARK-17870: ChiSquareSelector uses pValue rather than the raw statistic for
> SelectKBest features.
> * SPARK-3261: KMeans returns potentially fewer than k cluster centers in 
> cases where k distinct centroids aren't available or aren't selected.






[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-11-29 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707665#comment-15707665
 ] 

Yanbo Liang commented on SPARK-17692:
-

All behavior changes have been documented in the PR for SPARK-18324, so I will 
close this one.

> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  
> * SPARK-17870: ChiSquareSelector uses pValue rather than the raw statistic for
> SelectKBest features.
> * SPARK-3261: KMeans returns potentially fewer than k cluster centers in 
> cases where k distinct centroids aren't available or aren't selected.






[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-11-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707663#comment-15707663
 ] 

yuhao yang commented on SPARK-18608:


Agreed. We can just add an extra handlePersistence: Boolean parameter to the 
train method in Predictor.

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.
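A minimal sketch of the corrected check (assumes Spark 2.1 or later, where 
{{Dataset.storageLevel}} is available, and a SparkSession named spark):

{code}
import org.apache.spark.storage.StorageLevel

// Placeholder input; in the ML code this would be the caller-supplied dataset.
val dataset = spark.range(10).toDF("num")

// True only if the caller did not cache the Dataset itself.
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)

// ... run the iterative training here ...

if (handlePersistence) dataset.unpersist()
{code}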






[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707659#comment-15707659
 ] 

yuhao yang commented on SPARK-18374:


Yes. Currently we're discussing whether we should put "wouldn't" (rather than 
"wouldnt") directly into MLlib's stop words list, because by default the 
Tokenizer in Spark does not split on apostrophes or quotes.
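A small sketch to check this behavior (assumes a SparkSession named spark; the 
sample sentence and column names are made up):

{code}
import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}

val df = spark.createDataFrame(Seq((0, "I won't go"))).toDF("id", "text")

// The default Tokenizer lowercases and splits on whitespace only, so "won't"
// stays a single token rather than being split at the apostrophe.
val tokens = new Tokenizer().setInputCol("text").setOutputCol("words").transform(df)

// The default English stop word list is matched against whole tokens, so whether
// the list contains "won't" (with apostrophe), "wont", or "won" matters.
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
remover.transform(tokens).select("words", "filtered").show(false)
{code}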

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double-checking english.txt for the list of stop words because I 
> felt it was removing valid tokens like 'won'. I think the issue is that the 
> english.txt list is missing the apostrophe character and all characters after 
> the apostrophe. So "won't" became "won" in that list; "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both won't and wont should 
> be part of english.txt, since some tokenizers might remove special characters. 
> But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt






[jira] [Updated] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-29 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-18278:
---
Attachment: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf

I attached a proposal outlining a potential long term plan for this feature. 
Any feedback about it would be appreciated.

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707642#comment-15707642
 ] 

Xiangrui Meng commented on SPARK-18374:
---

See the discussion here: https://github.com/nltk/nltk_data/issues/22. Including 
`won` is apparently a mistake.

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double-checking english.txt for the list of stop words because I 
> felt it was removing valid tokens like 'won'. I think the issue is that the 
> english.txt list is missing the apostrophe character and all characters after 
> the apostrophe. So "won't" became "won" in that list; "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both won't and wont should 
> be part of english.txt, since some tokenizers might remove special characters. 
> But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt






[jira] [Commented] (SPARK-17680) Unicode Character Support for Column Names and Comments

2016-11-29 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707629#comment-15707629
 ] 

Kazuaki Ishizaki commented on SPARK-17680:
--

Sorry, it is my mistake.

> Unicode Character Support for Column Names and Comments
> ---
>
> Key: SPARK-17680
> URL: https://issues.apache.org/jira/browse/SPARK-17680
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Spark SQL supports Unicode characters in column names when they are specified 
> within backticks (`). When Hive support is enabled, the Hive metastore version 
> must be higher than 0.12; see HIVE-6013 
> (https://issues.apache.org/jira/browse/HIVE-6013): the Hive metastore supports 
> Unicode characters in column names since 0.13.
> In Spark SQL, table comments and view comments always allow Unicode 
> characters without backticks.






[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Description: 
I've traced into the source code, and it seems that  of 
Sentry is not set when Spark SQL starts a session. This operation should be done 
in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is not 
called in Spark SQL.

Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
Spark's classpath.

Here is the stack:
===
16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
at 
org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


  was:
I've traced into the source code, and it seems that  of 
Sentry is not set when Spark SQL starts a session. This operation should be done 
in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is not 
called in Spark SQL.

Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
Spark's 

[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Priority: Major  (was: Minor)

> Show databases NullPointerException while Sentry turned on
> --
>
> Key: SPARK-18641
> URL: https://issues.apache.org/jira/browse/SPARK-18641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
>Reporter: zhangqw
>
> I've traced into the source code, and it seems that  of 
> Sentry is not set when Spark SQL starts a session. This operation should be 
> done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
> not called in Spark SQL.
> Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
> Spark's classpath.
> Here is the stack:
> ===
> 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
> at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
> at 
> org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
> at 
> 

[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Priority: Minor  (was: Major)

> Show databases NullPointerException while Sentry turned on
> --
>
> Key: SPARK-18641
> URL: https://issues.apache.org/jira/browse/SPARK-18641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
>Reporter: zhangqw
>Priority: Minor
>
> I've traced into the source code, and it seems that  of 
> Sentry is not set when Spark SQL starts a session. This operation should be 
> done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
> not called in Spark SQL.
> Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
> Spark's classpath.
> Here is the stack:
> ===
> 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
> at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
> at 
> org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> 

[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-11-29 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-18642:
--
Description: 
When doing a left-join between two tables, say A and B,  Catalyst has 
information about the projection required for table B. Only the required 
columns should be scanned.

Code snippet below explains the scenario:

scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]

scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]

scala> dfA.registerTempTable("A")
scala> dfB.registerTempTable("B")

scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
where B.bid<2").explain

== Physical Plan ==
Project [aid#15,bid#17]
+- Filter (bid#17 < 2)
   +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
  :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
  +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB

This is a watered-down example from a production issue which has a huge 
performance impact.
External reference: 
http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns

  was:
When doing a left-join between two tables, say A and B,  Catalyst has 
information about the projection required for table B. Only the required 
columns should be scanned.

Code snippet below explains the scenario:

scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]

scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]

scala> dfA.registerTempTable("A")
scala> dfB.registerTempTable("B")

scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
where B.bid<2").explain

== Physical Plan ==
Project [aid#15,bid#17]
+- Filter (bid#17 < 2)
   +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
  :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
  +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB

This is a watered-down example from a production issue which has a huge 
performance impact.


> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>  Labels: performance
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
> where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>   :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: 
> file:/home/mohit/ruleA
>   +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: 
> file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.
> External reference: 
> http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns
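A hypothetical workaround sketch (not from the report, and reusing the dfA/dfB 
defined above): prune B's columns explicitly before the join so only the needed 
column is read.

{code}
// Project only the required column of B before joining.
val prunedB = dfB.select("bid")
val joined = dfA.join(prunedB, dfA("aid") === prunedB("bid"), "left_outer")
joined.filter(prunedB("bid") < 2).select(dfA("aid"), prunedB("bid")).explain()
{code}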






[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-11-29 Thread Mohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit updated SPARK-18642:
--
Description: 
When doing a left-join between two tables, say A and B,  Catalyst has 
information about the projection required for table B. Only the required 
columns should be scanned.

Code snippet below explains the scenario:

scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]

scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]

scala> dfA.registerTempTable("A")
scala> dfB.registerTempTable("B")

scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
where B.bid<2").explain

== Physical Plan ==
Project [aid#15,bid#17]
+- Filter (bid#17 < 2)
   +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
  :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
  +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB

This is a watered-down example from a production issue which has a huge 
performance impact.

  was:
When doing a left-join between two tables, say A and B,  Catalyst has 
information about the projection required for table B. 

Code snippet below explains the scenario:

scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]

scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]

scala> dfA.registerTempTable("A")
scala> dfB.registerTempTable("B")

scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
where B.bid<2").explain

== Physical Plan ==
Project [aid#15,bid#17]
+- Filter (bid#17 < 2)
   +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
  :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
  +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB

This is a watered-down example from a production issue which has a huge 
performance impact.


> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>  Labels: performance
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
> where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>   :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: 
> file:/home/mohit/ruleA
>   +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: 
> file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-11-29 Thread Mohit (JIRA)
Mohit created SPARK-18642:
-

 Summary: Spark SQL: Catalyst is scanning undesired columns
 Key: SPARK-18642
 URL: https://issues.apache.org/jira/browse/SPARK-18642
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
 Environment: Ubuntu 14.04
Spark: Local Mode
Reporter: Mohit


When doing a left-join between two tables, say A and B,  Catalyst has 
information about the projection required for table B. 

Code snippet below explains the scenario:

scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]

scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]

scala> dfA.registerTempTable("A")
scala> dfB.registerTempTable("B")

scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
where B.bid<2").explain

== Physical Plan ==
Project [aid#15,bid#17]
+- Filter (bid#17 < 2)
   +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
  :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA
  +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB

This is a watered-down example from a production issue which has a huge 
performance impact.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707594#comment-15707594
 ] 

Apache Spark commented on SPARK-17732:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15987

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707560#comment-15707560
 ] 

yuhao yang commented on SPARK-18374:


cc [~mengxr] to see if he recalls any specific reason.

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double-checking english.txt's list of stopwords as I felt it was 
> taking out valid tokens like 'won'. I think the issue is that the english.txt 
> list is missing the apostrophe character and all characters after the 
> apostrophe. So "won't" became "won" in that list; "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both won't and wont 
> should be part of english.txt, as some tokenizers might remove special 
> characters. But 'won' obviously shouldn't be in this list.
> Here's list of snowball english stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
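
A minimal way to check this against the shipped list (a sketch, assuming the
spark.ml StopWordsRemover API in 2.0.x, which exposes the default English stop
words):

// Sketch: show that the default English list contains the truncated token
// "won", so a tokenizer that strips apostrophes ends up losing valid words.
import org.apache.spark.ml.feature.StopWordsRemover

val english = StopWordsRemover.loadDefaultStopWords("english")
println(english.contains("won"))     // expected: true, per this report
println(english.contains("won't"))   // expected: false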



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18531) Apache Spark FPGrowth algorithm implementation fails with java.lang.StackOverflowError

2016-11-29 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707542#comment-15707542
 ] 

yuhao yang commented on SPARK-18531:


[~tuxdna] Does it work for you?

> Apache Spark FPGrowth algorithm implementation fails with 
> java.lang.StackOverflowError
> --
>
> Key: SPARK-18531
> URL: https://issues.apache.org/jira/browse/SPARK-18531
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Saleem Ansari
>
> More details can be found here: 
> https://gist.github.com/tuxdna/37a69b53e6f9a9442fa3b1d5e53c2acb
> *Spark FPGrowth algorithm croaks with a small dataset as shown below*
> $ spark-shell --master "local[*]" --driver-memory 5g
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> import org.apache.spark.mllib.fpm.FPGrowth
> import org.apache.spark.mllib.fpm.FPGrowth
> scala> import org.apache.spark.rdd.RDD
> import org.apache.spark.rdd.RDD
> scala> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.SQLContext
> scala> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.{SparkConf, SparkContext}
> scala> val data = sc.textFile("bug.data")
> data: org.apache.spark.rdd.RDD[String] = bug.data MapPartitionsRDD[1] at 
> textFile at :31
> scala> val transactions: RDD[Array[String]] = data.map(l => 
> l.split(",").distinct)
> transactions: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] 
> at map at :33
> scala> transactions.cache()
> res0: transactions.type = MapPartitionsRDD[2] at map at :33
> scala> val fpg = new FPGrowth().setMinSupport(0.05).setNumPartitions(10)
> fpg: org.apache.spark.mllib.fpm.FPGrowth = 
> org.apache.spark.mllib.fpm.FPGrowth@66d62c59
> scala> val model = fpg.run(transactions)
> model: org.apache.spark.mllib.fpm.FPGrowthModel[String] = 
> org.apache.spark.mllib.fpm.FPGrowthModel@6e92f150
> scala> model.freqItemsets.take(1).foreach { i => i.items.mkString("[", ",", 
> "]") + ", " + i.freq }
> [Stage 3:>  (0 + 2) / 
> 2]16/11/21 23:56:14 ERROR Executor: Managed memory leak detected; size = 
> 18068980 bytes, TID = 14
> 16/11/21 23:56:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 14)
> java.lang.StackOverflowError
> at org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84)
> at 
> org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:273)
> at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:115)
> at 
> org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:202)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> *This failure is likely due to the size of the baskets, some of which contain 
> over a thousand items.*
> scala> val maxBasketSize = transactions.map(_.length).max()
> maxBasketSize: Int = 1171 
>   
> scala> transactions.filter(_.length == maxBasketSize).collect()
> res3: Array[Array[String]] = Array(Array(3858, 109, 5842, 2184, 2481, 534
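
A hedged aside (not from the reporter): the only tuning knobs the MLlib
FPGrowth API itself exposes are the two already used above. Raising the support
threshold prunes more items before the FP-trees are built, and more partitions
spreads the conditional trees across more tasks; whether either avoids the
overflow for this particular dataset is untested.

// Sketch only: same API as the report, with a stricter support threshold and
// more partitions, to shrink the per-task FP-trees that get serialized.
import org.apache.spark.mllib.fpm.FPGrowth

val tunedFpg = new FPGrowth()
  .setMinSupport(0.2)    // was 0.05 in the report
  .setNumPartitions(50)  // was 10 in the report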



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-11-29 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-15819.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Add KMeanSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 2.1.1, 2.2.0
>
>
> There's no corresponding Python API for KMeansSummary; it would be nice to 
> have one. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18145.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17861) Store data source partitions in metastore and push partition pruning into metastore

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17861.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Store data source partitions in metastore and push partition pruning into 
> metastore
> ---
>
> Key: SPARK-17861
> URL: https://issues.apache.org/jira/browse/SPARK-17861
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Eric Liang
>Priority: Critical
> Fix For: 2.1.0
>
>
> Spark SQL does not store any partition information in the catalog for data 
> source tables, because it was initially designed to work with arbitrary 
> files. This, however, has a few issues for catalog tables:
> 1. Listing partitions for a large table (with millions of partitions) can be 
> very slow during cold start.
> 2. Does not support heterogeneous partition naming schemes.
> 3. Cannot leverage pushing partition pruning into the metastore.
> This ticket tracks the work required to push the tracking of partitions into 
> the metastore. This change should be feature flagged.
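
For readers looking for the flag: in the 2.1 builds this work is gated by
spark.sql.hive.manageFilesourcePartitions (my reading of the 2.1 configuration,
not stated in this ticket). A sketch of checking or disabling it:

// Sketch, assuming Spark 2.1 and the config key named above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-management-check")
  .enableHiveSupport()
  // set to "false" to fall back to listing all partitions up front
  .config("spark.sql.hive.manageFilesourcePartitions", "true")
  .getOrCreate()

println(spark.conf.get("spark.sql.hive.manageFilesourcePartitions"))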



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18632.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> AggregateFunction should not ImplicitCastInputTypes
> ---
>
> Key: SPARK-18632
> URL: https://issues.apache.org/jira/browse/SPARK-18632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which 
> enables implicit input type casting). This can lead to unexpected results, 
> and should only be enabled when it is suitable for the function at hand. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-11-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707435#comment-15707435
 ] 

holdenk commented on SPARK-15369:
-

So I'm probably going to be busy until after the 2.1 release (also trying to 
finish a book and have some talks in the middle), but I'll take a look 
after that.

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-11-29 Thread Marius Van Niekerk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707428#comment-15707428
 ] 

Marius Van Niekerk edited comment on SPARK-15369 at 11/30/16 3:49 AM:
--

Oh yeah, once we have a pip-installable Spark it should be pretty easy to test 
this with some Docker pieces on Travis.

Basic idea is to convert the benchmarks into an integration test.

Feel free to open issues on that project.


was (Author: mariusvniekerk):
Oh yeah, once we have a pip installable spark it should be pretty easy testing 
this with some docker pieces with travis.

Basic idea is to convert the benchmarks into an integration test.

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-11-29 Thread Marius Van Niekerk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707428#comment-15707428
 ] 

Marius Van Niekerk commented on SPARK-15369:


Oh yeah, once we have a pip-installable Spark it should be pretty easy to test 
this with some Docker pieces on Travis.

Basic idea is to convert the benchmarks into an integration test.

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-11-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707421#comment-15707421
 ] 

holdenk commented on SPARK-15369:
-

That looks like a great start :) The packaging is probably going to be a bit 
tricky and it would make sense to have some testing as well, but thanks for 
getting started on making a Spark package for this.

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-11-29 Thread Marius Van Niekerk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707411#comment-15707411
 ] 

Marius Van Niekerk commented on SPARK-15369:


I'm taking an initial stab at turning this into a Spark package.

https://github.com/mariusvniekerk/spark-jython-udf

Feedback would be appreciated.

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707399#comment-15707399
 ] 

Apache Spark commented on SPARK-18516:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16075

> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.1.0
>
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about performance as you 
> make _progress_ in query processing.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in anytime they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from the listener bus.  
> Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
> you retrieve a status object from the query itself.
> I propose we make the follow changes:
>  - Make {{status}} only report instantaneous things, such as if data is 
> available or a human readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM-unique one.  Without this it's 
> hard to correlate performance across restarts.
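
A sketch of what the proposed split looks like from the user's side (my reading
of the proposal, assuming the StreamingQuery.status / lastProgress /
recentProgress shape that the linked PR moves towards):

// Sketch: instantaneous state vs. per-trigger progress, queried separately.
import org.apache.spark.sql.streaming.StreamingQuery

def report(query: StreamingQuery): Unit = {
  // instantaneous status: a human-readable message, whether data is available, etc.
  println(query.status)

  // per-trigger performance metrics; recentProgress keeps a bounded history
  Option(query.lastProgress).foreach(p => println(p.json))
  query.recentProgress.foreach(p => println(p.numInputRows))
}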



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Summary: Show databases NullPointerException while Sentry turned on  (was: 
Show databases NullPointerException while sentry turned on)

> Show databases NullPointerException while Sentry turned on
> --
>
> Key: SPARK-18641
> URL: https://issues.apache.org/jira/browse/SPARK-18641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
>Reporter: zhangqw
>
> I've traced into the source code, and it seems that the session-level state 
> Sentry needs is not set when Spark SQL starts a session. Setting it should be 
> done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
> not called in Spark SQL.
> Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
> Spark's classpath.
> Here is the stack:
> ===
> 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
> at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
> at 
> org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> 

[jira] [Updated] (SPARK-18641) Show databases NullPointerException while sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Description: 
I've traced into the source code, and it seems that the session-level state 
Sentry needs is not set when Spark SQL starts a session. Setting it should be 
done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
not called in Spark SQL.

Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
Spark's classpath.

Here is the stack:
===
16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
at 
org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


  was:
I've traced into source code, and it seems that  of 
Sentry not set when spark sql started a session. This operation should be done 
in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not 
called in spark sql.

Edit: I copyed hive-site.xml(which turns on Sentry) and all sentry jars into 
spark's 

[jira] [Updated] (SPARK-18641) Show databases NullPointerException while sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Affects Version/s: (was: 2.0.1)
   2.0.0

> Show databases NullPointerException while sentry turned on
> --
>
> Key: SPARK-18641
> URL: https://issues.apache.org/jira/browse/SPARK-18641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
>Reporter: zhangqw
>
> I've traced into the source code, and it seems that the session-level state 
> Sentry needs is not set when Spark SQL starts a session. Setting it should be 
> done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
> not called in Spark SQL.
> Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
> Spark's classpath.
> Here is the stack:
> 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
> java.lang.NullPointerException
> at 
> java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
> at 
> java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
> at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
> at 
> org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
> at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
> at 
> org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
> at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
> at 
> 

[jira] [Updated] (SPARK-18641) Show databases NullPointerException while sentry turned on

2016-11-29 Thread zhangqw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhangqw updated SPARK-18641:

Description: 
I've traced into the source code, and it seems that the session-level state 
Sentry needs is not set when Spark SQL starts a session. Setting it should be 
done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
not called in Spark SQL.

Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into 
Spark's classpath.

Here is the stack:

16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
at 
org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


  was:
I've traced into source code, and it seems that  of 
Sentry not set when spark sql started a session. This operation should be done 
in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not 
called in spark sql.

Here is stack:

16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
at 

[jira] [Created] (SPARK-18641) Show databases NullPointerException while sentry turned on

2016-11-29 Thread zhangqw (JIRA)
zhangqw created SPARK-18641:
---

 Summary: Show databases NullPointerException while sentry turned on
 Key: SPARK-18641
 URL: https://issues.apache.org/jira/browse/SPARK-18641
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
 Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1
Reporter: zhangqw


I've traced into the source code, and it seems that the session-level state 
Sentry needs is not set when Spark SQL starts a session. Setting it should be 
done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is 
not called in Spark SQL.

Here is the stack:

16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
at 
java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
at 
org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
at 
org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
at 
org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs

2016-11-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707320#comment-15707320
 ] 

Joseph K. Bradley commented on SPARK-18613:
---

I can take it after the 2.1 QA, but feel free to go ahead if you'd like.

> spark.ml LDA classes should not expose spark.mllib in APIs
> --
>
> Key: SPARK-18613
> URL: https://issues.apache.org/jira/browse/SPARK-18613
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it 
> should not:
> * {{def oldLocalModel: OldLocalLDAModel}}
> * {{def getModel: OldLDAModel}}
> This task is to deprecate those methods.  I recommend creating 
> {{private[ml]}} versions of the methods to be used internally, in order to 
> avoid deprecation warnings.
> Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.
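
A generic sketch of the deprecate-plus-private[ml]-twin pattern suggested above
(stand-in names only, not Spark's actual LDA code):

// Illustrative only: a public accessor kept for compatibility but deprecated,
// with a private[ml] twin that internal callers use to avoid warnings.
package org.apache.spark.ml.example

class OldStyleModel  // stand-in for the old spark.mllib model type

class NewStyleModel(private val old: OldStyleModel) {
  @deprecated("Do not expose the spark.mllib model in the spark.ml API", "2.2.0")
  def oldLocalModel: OldStyleModel = internalOldLocalModel

  private[ml] def internalOldLocalModel: OldStyleModel = old
}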



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit

2016-11-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18319.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 15972
[https://github.com/apache/spark/pull/15972]

> ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-18319
> URL: https://issues.apache.org/jira/browse/SPARK-18319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18145:


Assignee: (was: Apache Spark)

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18145:


Assignee: Apache Spark

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707288#comment-15707288
 ] 

Apache Spark commented on SPARK-18145:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/16074

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707265#comment-15707265
 ] 

Apache Spark commented on SPARK-18640:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16073

> Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
> 
>
> Key: SPARK-18640
> URL: https://issues.apache.org/jira/browse/SPARK-18640
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
> executorIdToRunningTaskIds map without proper synchronization. We should fix 
> this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-11-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-18640:
--

 Summary: Fix minor synchronization issue in 
TaskSchedulerImpl.runningTasksByExecutors
 Key: SPARK-18640
 URL: https://issues.apache.org/jira/browse/SPARK-18640
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Reporter: Josh Rosen
Priority: Minor


The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
executorIdToRunningTaskIds map without proper synchronization. We should fix 
this.
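
A minimal sketch of the kind of fix implied here (a stand-in class, not the
actual Spark patch): take the scheduler's lock for both writes and reads and
return an immutable snapshot, so callers never observe the map mid-update.

import scala.collection.mutable

class RunningTaskTracker {
  private val executorIdToRunningTaskIds =
    mutable.HashMap[String, mutable.HashSet[Long]]()

  def taskStarted(executorId: String, taskId: Long): Unit = synchronized {
    executorIdToRunningTaskIds.getOrElseUpdate(executorId, mutable.HashSet.empty[Long]) += taskId
  }

  // read under the same lock and copy out, so callers get a consistent snapshot
  def runningTasksByExecutors(): Map[String, Int] = synchronized {
    executorIdToRunningTaskIds.map { case (id, tasks) => id -> tasks.size }.toMap
  }
}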



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors

2016-11-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-18640:
--

Assignee: Josh Rosen

> Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
> 
>
> Key: SPARK-18640
> URL: https://issues.apache.org/jira/browse/SPARK-18640
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable 
> executorIdToRunningTaskIds map without proper synchronization. We should fix 
> this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18639) Build only a single pip package

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18639:


Assignee: Reynold Xin  (was: Apache Spark)

> Build only a single pip package
> ---
>
> Key: SPARK-18639
> URL: https://issues.apache.org/jira/browse/SPARK-18639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently build 5 separate pip binary tarballs, doubling the release script 
> runtime. It'd be better to build one, especially for use cases that are just 
> using Spark locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18639) Build only a single pip package

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707258#comment-15707258
 ] 

Apache Spark commented on SPARK-18639:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16072

> Build only a single pip package
> ---
>
> Key: SPARK-18639
> URL: https://issues.apache.org/jira/browse/SPARK-18639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently build 5 separate pip binary tarballs, doubling the release script 
> runtime. It'd be better to build one, especially for use cases that are just 
> using Spark locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18639) Build only a single pip package

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18639:


Assignee: Apache Spark  (was: Reynold Xin)

> Build only a single pip package
> ---
>
> Key: SPARK-18639
> URL: https://issues.apache.org/jira/browse/SPARK-18639
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently build 5 separate pip binary tarballs, doubling the release script 
> runtime. It'd be better to build only one, especially for use cases that are 
> just using Spark locally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18639) Build only a single pip package

2016-11-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18639:
---

 Summary: Build only a single pip package
 Key: SPARK-18639
 URL: https://issues.apache.org/jira/browse/SPARK-18639
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently build 5 separate pip binary tarballs, doubling the release script 
runtime. It'd be better to build only one, especially for use cases that are just 
using Spark locally.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18635:


Assignee: Apache Spark

> Partition name/values not escaped correctly in some cases
> -
>
> Key: SPARK-18635
> URL: https://issues.apache.org/jira/browse/SPARK-18635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Critical
>
> For example, the following command does not insert data properly into the 
> table
> {code}
> spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
> B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
> {code}
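
A quick round-trip check, sketched under the assumption that the example above has been run in the same session (the table name is reused from it):

{code}
// Every partition value written above should come back unchanged when the
// table is read; on affected versions the escaped directory names are wrong,
// so the per-partition counts (or the listed partitions) do not line up.
spark.table("testy").groupBy("A", "B").count().show(truncate = false)
spark.sql("SHOW PARTITIONS testy").show(truncate = false)
{code}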



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18635:


Assignee: (was: Apache Spark)

> Partition name/values not escaped correctly in some cases
> -
>
> Key: SPARK-18635
> URL: https://issues.apache.org/jira/browse/SPARK-18635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Priority: Critical
>
> For example, the following command does not insert data properly into the 
> table
> {code}
> spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
> B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707211#comment-15707211
 ] 

Apache Spark commented on SPARK-18635:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/16071

> Partition name/values not escaped correctly in some cases
> -
>
> Key: SPARK-18635
> URL: https://issues.apache.org/jira/browse/SPARK-18635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Priority: Critical
>
> For example, the following command does not insert data properly into the 
> table
> {code}
> spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
> B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-11-29 Thread Alex Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707201#comment-15707201
 ] 

Alex Jiang commented on SPARK-14437:


[~hogeland] Did you get your issue resolved in 2.0.0? We are seeing a similar 
issue when we run our app in IntelliJ. If we run the app from the command line, 
e.g. "java -jar app.jar -c app.conf", everything is fine. However, it fails when 
running in IntelliJ:

Caused by: java.lang.RuntimeException: Stream '/jars/classes' was not found.
    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)

> Spark using Netty RPC gets wrong address in some setups
> ---
>
> Key: SPARK-14437
> URL: https://issues.apache.org/jira/browse/SPARK-14437
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: AWS, Docker, Flannel
>Reporter: Kevin Hogeland
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Netty can't get the correct origin address in certain network setups. Spark 
> should handle this, as relying on Netty correctly reporting all addresses 
> leads to incompatible and unpredictable network states. We're currently using 
> Docker with Flannel on AWS. Container communication looks something like: 
> {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
> -> Container 2 (4.5.6.1)}}
> If the client in that setup is Container 1 (1.2.3.4), Netty channels from 
> there to Container 2 will have a client address of 1.2.3.0.
> The {{RequestMessage}} object that is sent over the wire already contains a 
> {{senderAddress}} field that the sender can use to specify their address. In 
> {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client 
> socket address when null. {{senderAddress}} in the messages sent from the 
> executors is currently always null, meaning all messages will have these 
> incorrect addresses (we've switched back to Akka as a temporary workaround 
> for this). The executor should send its address explicitly so that the driver 
> doesn't attempt to infer addresses based on possibly incorrect information 
> from Netty.
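A sketch of the proposed direction only, with the message shape simplified and assumed rather than copied from the codebase:

{code}
// Have the executor-side RPC env always fill in senderAddress with its own
// advertised address, so the driver never has to infer it from the Netty
// channel (which reports the wrong host behind NAT/overlay networks).
val request = RequestMessage(
  senderAddress = nettyEnv.address,   // never null on the sending side
  receiver      = receiverRef,
  content       = message)
{code}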



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)

2016-11-29 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707196#comment-15707196
 ] 

Takeshi Yamamuro commented on SPARK-18502:
--

Currently, AFAIK, no.
However, the SQL standard 
(http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier) escapes a 
double quotation mark (") inside a delimited identifier by doubling it, and I 
feel we need a general approach for escaping these metacharacters in Spark.
Other databases can certainly use backquotes in column names, e.g. PostgreSQL:
{code}
postgres=# create table test_table("i`d" INT, "value" VARCHAR);
CREATE TABLE
postgres=# \d test_table
   Table "public.test_table"
 Column |   Type| Modifiers 
+---+---
 i`d| integer   | 
 value  | character varying | 

postgres=# insert into test_table values(1, 'aa');
INSERT 0 1
postgres=# select "i`d" from test_table; 
  i`d 
-
   1
(1 row)

{code}
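
One possible direction, offered only as an assumption to explore (whether the attribute-name parser accepts it is exactly what this issue asks): mirror the standard's doubled-quote rule with a doubled backtick inside backquoted identifiers.

{code}
// Hypothetical Spark SQL analogue of the doubled-quote escape; `i``d` would
// denote a column literally named i`d if the identifier parser supports it.
spark.sql("CREATE TABLE test_table (`i``d` INT, value STRING)")
spark.sql("SELECT `i``d` FROM test_table").show()
{code}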


> Spark does not handle columns that contain backquote (`)
> 
>
> Key: SPARK-18502
> URL: https://issues.apache.org/jira/browse/SPARK-18502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Barry Becker
>Priority: Minor
>
> I know that if a column contains dots or hyphens we can put 
> backquotes/backticks around it, but what if the column contains a backtick 
> (`)? Can the backtick be escaped by some means?
> Here is an example of the sort of error I see
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: 
> `Invoice`Date`;org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
>  org.apache.spark.sql.Column.<init>(Column.scala:113) 
> org.apache.spark.sql.Column$.apply(Column.scala:36) 
> org.apache.spark.sql.functions$.min(functions.scala:407) 
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
>  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18516) Separate instantaneous state from progress performance statistics

2016-11-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18516.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15954
[https://github.com/apache/spark/pull/15954]

> Separate instantaneous state from progress performance statistics
> -
>
> Key: SPARK-18516
> URL: https://issues.apache.org/jira/browse/SPARK-18516
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 2.1.0
>
>
> There are two types of information that you want to be able to extract from a 
> running query: instantaneous _status_ and metrics about performance as you 
> make _progress_ in query processing.
> Today, these are conflated in a single {{StreamingQueryStatus}} object.  The 
> downside to this approach is that a user now needs to reason about what state 
> the query is in anytime they retrieve a status object.  Fields like 
> {{statusMessage}} don't appear in updates that come from the listener bus.  
> Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when 
> you retrieve a status object from the query itself.
> I propose we make the following changes:
>  - Make {{status}} only report instantaneous things, such as if data is 
> available or a human readable message about what phase we are currently in.
>  - Have a separate {{progress}} message that we report for each trigger with 
> the other performance information that lives in status today.  You should be 
> able to easily retrieve a configurable number of the most recent progress 
> messages instead of just the most recent.
> While we are making these changes, I propose that we also change {{id}} to be 
> a globally unique identifier, rather than a JVM-unique one.  Without this, it's 
> hard to correlate performance across restarts.
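
A sketch of what the split could look like from the user's side, assuming a running {{StreamingQuery}} named {{query}} (method and field names follow the proposal, not a released API):

{code}
// Instantaneous state: small, human-readable, meaningful at any moment.
println(query.status)                    // e.g. message, isDataAvailable, isTriggerActive
// Per-trigger performance: one record per completed trigger, keep the last N.
query.recentProgress.foreach(p => println(p.json))
println(query.lastProgress.inputRowsPerSecond)
{code}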



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18553) Executor loss may cause TaskSetManager to be leaked

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707105#comment-15707105
 ] 

Apache Spark commented on SPARK-18553:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16070

> Executor loss may cause TaskSetManager to be leaked
> ---
>
> Key: SPARK-18553
> URL: https://issues.apache.org/jira/browse/SPARK-18553
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 2.0.3, 2.1.0, 2.2.0
>
>
> Due to a bug in TaskSchedulerImpl, the complete sudden loss of an executor 
> may cause a TaskSetManager to be leaked, causing ShuffleDependencies and 
> other data structures to be kept alive indefinitely, leading to various types 
> of resource leaks (including shuffle file leaks).
> In a nutshell, the problem is that TaskSchedulerImpl did not maintain its own 
> mapping from executorId to running task ids, leaving it unable to clean up 
> taskId to taskSetManager maps when an executor is totally lost.
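
A rough sketch of the idea behind the fix, with member and helper names assumed for illustration rather than taken from the actual patch:

{code}
import scala.collection.mutable.{HashMap, HashSet}

// Track which tasks run on which executor so that a sudden, total executor loss
// can clean up the taskId -> TaskSetManager bookkeeping for every affected task.
private val executorIdToRunningTaskIds = new HashMap[String, HashSet[Long]]

private def removeExecutor(executorId: String): Unit = synchronized {
  executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
    taskIds.foreach(taskId => taskIdToTaskSetManager.remove(taskId))
  }
}
{code}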



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18638) Upgrade sbt to 0.13.13

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707063#comment-15707063
 ] 

Apache Spark commented on SPARK-18638:
--

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16069

> Upgrade sbt to 0.13.13
> --
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18638) Upgrade sbt to 0.13.13

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18638:


Assignee: (was: Apache Spark)

> Upgrade sbt to 0.13.13
> --
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Minor
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18638) Upgrade sbt to 0.13.13

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18638:


Assignee: Apache Spark

> Upgrade sbt to 0.13.13
> --
>
> Key: SPARK-18638
> URL: https://issues.apache.org/jira/browse/SPARK-18638
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Assignee: Apache Spark
>Priority: Minor
>
> v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and 
> upgrade it from 0.13.11 to 0.13.13. The release notes since the last version 
> we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
> https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
> regression fixes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18553) Executor loss may cause TaskSetManager to be leaked

2016-11-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18553:
---
Fix Version/s: 2.2.0
   2.1.0

> Executor loss may cause TaskSetManager to be leaked
> ---
>
> Key: SPARK-18553
> URL: https://issues.apache.org/jira/browse/SPARK-18553
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 2.0.3, 2.1.0, 2.2.0
>
>
> Due to a bug in TaskSchedulerImpl, the complete sudden loss of an executor 
> may cause a TaskSetManager to be leaked, causing ShuffleDependencies and 
> other data structures to be kept alive indefinitely, leading to various types 
> of resource leaks (including shuffle file leaks).
> In a nutshell, the problem is that TaskSchedulerImpl did not maintain its own 
> mapping from executorId to running task ids, leaving it unable to clean up 
> taskId to taskSetManager maps when an executor is totally lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18638) Upgrade sbt to 0.13.13

2016-11-29 Thread Weiqing Yang (JIRA)
Weiqing Yang created SPARK-18638:


 Summary: Upgrade sbt to 0.13.13
 Key: SPARK-18638
 URL: https://issues.apache.org/jira/browse/SPARK-18638
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Weiqing Yang
Priority: Minor


v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date and 
upgrade it from 0.13.11 to 0.13.13. The release notes since the last version we 
used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and 
https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some 
regression fixes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961
 ] 

Zhan Zhang commented on SPARK-18637:


[~hvanhovell] It is an annotation.

/**
 * UDFType annotations are used to describe properties of a UDF. This gives
 * important information to the optimizer.
 * If the UDF is not deterministic, or if it is stateful, it is necessary to
 * annotate it as such for correctness.
 *
 */
@Public
@Evolving
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@Inherited
public @interface UDFType {
  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * @return true if the UDF is deterministic
   */
  boolean deterministic() default true;

  /**
   * If a UDF stores state based on the sequence of records it has processed, it
   * is stateful. A stateful UDF cannot be used in certain expressions such as
   * case statement and certain optimizations such as AND/OR short circuiting
   * don't apply for such UDFs, as they need to be invoked for each record.
   * row_sequence is an example of stateful UDF. A stateful UDF is considered to
   * be non-deterministic, irrespective of what deterministic() returns.
   *
   * @return true
   */
  boolean stateful() default false;

  /**
   * A UDF is considered distinctLike if the UDF can be evaluated on just the
   * distinct values of a column. Examples include min and max UDFs. This
   * information is used by metadata-only optimizer.
   *
   * @return true if UDF is distinctLike
   */
  boolean distinctLike() default false;

  /**
   * Using in analytical functions to specify that UDF implies an ordering
   *
   * @return true if the function implies order
   */
  boolean impliesOrder() default false;
}
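
A sketch of how Catalyst could honour the flag, assuming the Hive UDF wrapper exposes the underlying function instance as {{function}} (names simplified; not the actual patch):

{code}
// Read the Hive UDFType annotation from the wrapped function's class and treat
// a stateful UDF as non-deterministic regardless of its deterministic() value.
import org.apache.hadoop.hive.ql.udf.UDFType

override def deterministic: Boolean = {
  val udfType = function.getClass.getAnnotation(classOf[UDFType])
  udfType == null || (udfType.deterministic() && !udfType.stateful())
}
{code}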


> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961
 ] 

Zhan Zhang edited comment on SPARK-18637 at 11/29/16 11:52 PM:
---

[~hvanhovell] It is an annotation.

/**
 * UDFType annotations are used to describe properties of a UDF. This gives
 * important information to the optimizer.
 * If the UDF is not deterministic, or if it is stateful, it is necessary to
 * annotate it as such for correctness.
 *
 */


was (Author: zhzhan):
[~hvanhovell] It is an annotation.

/**
 * UDFType annotations are used to describe properties of a UDF. This gives
 * important information to the optimizer.
 * If the UDF is not deterministic, or if it is stateful, it is necessary to
 * annotate it as such for correctness.
 *
 */
@Public
@Evolving
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@Inherited
public @interface UDFType {
  /**
   * Certain optimizations should not be applied if UDF is not deterministic.
   * Deterministic UDF returns same result each time it is invoked with a
   * particular input. This determinism just needs to hold within the context of
   * a query.
   *
   * @return true if the UDF is deterministic
   */
  boolean deterministic() default true;

  /**
   * If a UDF stores state based on the sequence of records it has processed, it
   * is stateful. A stateful UDF cannot be used in certain expressions such as
   * case statement and certain optimizations such as AND/OR short circuiting
   * don't apply for such UDFs, as they need to be invoked for each record.
   * row_sequence is an example of stateful UDF. A stateful UDF is considered to
   * be non-deterministic, irrespective of what deterministic() returns.
   *
   * @return true
   */
  boolean stateful() default false;

  /**
   * A UDF is considered distinctLike if the UDF can be evaluated on just the
   * distinct values of a column. Examples include min and max UDFs. This
   * information is used by metadata-only optimizer.
   *
   * @return true if UDF is distinctLike
   */
  boolean distinctLike() default false;

  /**
   * Using in analytical functions to specify that UDF implies an ordering
   *
   * @return true if the function implies order
   */
  boolean impliesOrder() default false;
}


> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18637:


Assignee: Apache Spark

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>Assignee: Apache Spark
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-18637:
---
Component/s: SQL

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706935#comment-15706935
 ] 

Apache Spark commented on SPARK-18637:
--

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16068

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18637:


Assignee: (was: Apache Spark)

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706928#comment-15706928
 ] 

Herman van Hovell edited comment on SPARK-18637 at 11/29/16 11:35 PM:
--

{{UDFType}} is a Hive construct right?


was (Author: hvanhovell):
{{UDFType}} is a Hive contruct right?

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706928#comment-15706928
 ] 

Herman van Hovell commented on SPARK-18637:
---

{{UDFType}} is a Hive contruct right?

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18614) Incorrect predicate pushdown from ExistenceJoin

2016-11-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18614.
---
   Resolution: Fixed
 Assignee: Nattavut Sutyanyong
Fix Version/s: 2.1.0

> Incorrect predicate pushdown from ExistenceJoin
> ---
>
> Key: SPARK-18614
> URL: https://issues.apache.org/jira/browse/SPARK-18614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Nattavut Sutyanyong
>Priority: Minor
> Fix For: 2.1.0
>
>
> This is follow-up work from SPARK-18597 to close a potentially incorrect 
> rewrite in the {{PushPredicateThroughJoin}} rule of the Optimizer phase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706905#comment-15706905
 ] 

Zhan Zhang commented on SPARK-18637:


Here are the comments from UDFType:
  /**
   * If a UDF stores state based on the sequence of records it has processed, it
   * is stateful. A stateful UDF cannot be used in certain expressions such as
   * case statement and certain optimizations such as AND/OR short circuiting
   * don't apply for such UDFs, as they need to be invoked for each record.
   * row_sequence is an example of stateful UDF. A stateful UDF is considered to
   * be non-deterministic, irrespective of what deterministic() returns.
   *
   * @return true
   */
  boolean stateful() default false;

> Stateful UDF should be considered as nondeterministic
> -
>
> Key: SPARK-18637
> URL: https://issues.apache.org/jira/browse/SPARK-18637
> Project: Spark
>  Issue Type: Bug
>Reporter: Zhan Zhang
>
> If the UDFType annotation of a UDF is stateful, it should be considered 
> non-deterministic. Otherwise, Catalyst may optimize the plan and return 
> the wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18637) Stateful UDF should be considered as nondeterministic

2016-11-29 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-18637:
--

 Summary: Stateful UDF should be considered as nondeterministic
 Key: SPARK-18637
 URL: https://issues.apache.org/jira/browse/SPARK-18637
 Project: Spark
  Issue Type: Bug
Reporter: Zhan Zhang


If the UDFType annotation of a UDF is stateful, it should be considered 
non-deterministic. Otherwise, Catalyst may optimize the plan and return the 
wrong result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18631) Avoid making data skew worse in ExchangeCoordinator

2016-11-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18631.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16065
[https://github.com/apache/spark/pull/16065]

> Avoid making data skew worse in ExchangeCoordinator
> ---
>
> Key: SPARK-18631
> URL: https://issues.apache.org/jira/browse/SPARK-18631
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
> Fix For: 2.2.0
>
>
> The logic to resize partitions in the ExchangeCoordinator is to not start a 
> new partition until the targetPostShuffleInputSize is equalled or exceeded.  
> This can make data skew problems worse since a number of small partitions can 
> first be combined as long as the combined size remains smaller than the 
> targetPostShuffleInputSize, and then a large, data-skewed partition can be 
> further combined, making it even bigger than it already was.
> It's fairly simple to change the logic to create a new partition if adding 
> a new piece would exceed the targetPostShuffleInputSize instead of only 
> creating a new partition after the targetPostShuffleInputSize has already 
> been exceeded.  This results in a few more partitions being created by the 
> ExchangeCoordinator, but data skew problems are at least not made worse even 
> though they are not made any better.
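
A compact sketch of the adjusted rule, with the inputs ({{pieceSizes}}, a sequence of per-piece byte sizes, and {{targetPostShuffleInputSize}}) assumed for illustration rather than taken from the ExchangeCoordinator code:

{code}
// Start a new post-shuffle partition when adding the next shuffle piece WOULD
// exceed the target size, rather than only after the target has been exceeded.
val splits  = scala.collection.mutable.ArrayBuffer[Int]()
var current = 0L
pieceSizes.zipWithIndex.foreach { case (size, i) =>
  if (current > 0 && current + size > targetPostShuffleInputSize) {
    splits += i      // close the current partition before adding this piece
    current = 0L
  }
  current += size
}
{code}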



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18636) UnsafeShuffleWriter and DiskBlockObjectWriter do not consider encryption / compression in metrics

2016-11-29 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-18636:
--

 Summary: UnsafeShuffleWriter and DiskBlockObjectWriter do not 
consider encryption / compression in metrics
 Key: SPARK-18636
 URL: https://issues.apache.org/jira/browse/SPARK-18636
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


The code in {{UnsafeShuffleWriter}} and {{DiskBlockObjectWriter}} wraps only the 
raw file output stream when collecting metrics, so it does not count the time it 
takes to compress and/or encrypt the data. This makes the metrics a little less 
accurate than they should be.
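
A sketch of the difference, with wrapper names borrowed from Spark's shuffle write path but the exact wiring assumed:

{code}
// Current shape: only the raw file stream is timed, so compression/encryption
// work done inside wrapStream() is invisible to the write-time metric.
val timedFile  = new TimeTrackingOutputStream(writeMetrics, fileOutputStream)
val currentOut = serializerManager.wrapStream(blockId, timedFile)

// More accurate shape: time the fully wrapped stream so that compression and
// encryption cost is included in the recorded write time.
val betterOut = new TimeTrackingOutputStream(
  writeMetrics, serializerManager.wrapStream(blockId, fileOutputStream))
{code}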



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706814#comment-15706814
 ] 

Cody Koeninger commented on SPARK-18475:


Glad you agree it shouldn't be enabled by default.

If you're in an organization where you are responsible for shit that other 
people broke, but have no power to actually fix it correctly...  I'm not sure 
there's anything useful I can say there.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This will mean that we won't be able to use the "CachedKafkaConsumer" for its 
> intended purpose (being cached) in this use case, but the extra overhead is 
> worth it for handling data skew and increasing parallelism, especially in ETL 
> use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706788#comment-15706788
 ] 

Burak Yavuz commented on SPARK-18475:
-

I'd be happy to share performance results. You're right, I never tried it with 
SSL on. One thing to note is that I was never planning to have this enabled by 
default, because there is no way to think of a sane default parallelism value.

What I was hoping to achieve was to provide Spark users, who may not be Kafka 
experts, a "Break in case of emergency" way out. It's easy to say "Partition 
your data properly" to people, until someone upstream in your organization 
changes one thing and the data engineer has to deal with the mess of skewed 
data.

You may want to tell people, "hey increase your Kafka partitions" if you want 
to increase Kafka parallelism, but is that a viable operation when your queues 
are already messed up and the damage has already been done? Are you going to 
have them empty the queue, delete the topic, create a topic with increased 
number of partitions and re-consume everything so that it is properly 
partitioned again?

It's easy to talk about what needs to be done, and what is the proper way to do 
things until shit hits the fan in production with something that is/was totally 
out of your control and you have to clean up the mess.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This will mean that we won't be able to use the "CachedKafkaConsumer" for its 
> intended purpose (being cached) in this use case, but the extra overhead is 
> worth it for handling data skew and increasing parallelism, especially in ETL 
> use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16551) Accumulator Examples should demonstrate different use case from UDAFs

2016-11-29 Thread Ruiming Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706764#comment-15706764
 ] 

Ruiming Zhou commented on SPARK-16551:
--

I can look at this issue.

> Accumulator Examples should demonstrate different use case from UDAFs
> -
>
> Key: SPARK-16551
> URL: https://issues.apache.org/jira/browse/SPARK-16551
> Project: Spark
>  Issue Type: Documentation
>Reporter: Vladimir Feinberg
>Priority: Minor
>
> Currently, the Spark programming guide demonstrates Accumulators 
> (http://spark.apache.org/docs/latest/programming-guide.html#accumulators) by 
> taking the sum of an RDD.
> This example makes new users think that Accumulators serve the role that 
> UDAFs do, which they don't. They're meant to be out-of-band, small values 
> that don't break pipelining. Documentation examples and notes should reflect 
> this (and warn that they may cause driver bottlenecks).
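
A sketch of the kind of example the guide could use instead, assuming a SparkContext named {{sc}} and a user-supplied {{parse()}} function (both placeholders):

{code}
// Accumulators: small, out-of-band side statistics that don't break pipelining.
// The "real" aggregate still goes through normal RDD/DataFrame operations.
val badRecords = sc.longAccumulator("badRecords")
val events = sc.textFile("events.log").flatMap { line =>
  parse(line) match {                 // parse() is a stand-in for user code
    case Some(event) => Some(event)
    case None        => badRecords.add(1); None
  }
}
println(events.count())               // the aggregate result
println(badRecords.value)             // side metric, read on the driver after an action
{code}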



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706758#comment-15706758
 ] 

Cody Koeninger commented on SPARK-18475:


Burak hasn't empirically shown that it is of benefit for a properly 
partitioned, non-skewed Kafka topic, especially if SSL is enabled (because of 
the effect on consumer caching).

Any output operation can tell the difference in ordering.

People are welcome to convince you that this is a worthwhile option, but there 
is no way it should be on by default.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This will mean that we won't be able to use the "CachedKafkaConsumer" for its 
> intended purpose (being cached) in this use case, but the extra overhead is 
> worth it for handling data skew and increasing parallelism, especially in ETL 
> use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-18475:
--

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This will mean that we won't be able to use the "CachedKafkaConsumer" for its 
> intended purpose (being cached) in this use case, but the extra overhead is 
> worth it for handling data skew and increasing parallelism, especially in ETL 
> use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706692#comment-15706692
 ] 

Michael Armbrust commented on SPARK-18475:
--

I think that this suggestion was closed prematurely.  While I don't think that 
we want to always perform this optimization, I think that for a large subset of 
the {{DataFrame}} operations that we support this is valid.  Furthermore, Burak 
has already shown empirically that it significantly increases throughput, and I 
don't think that should be dismissed.  Spark users are not always the same 
people who are configuring Kafka, and I don't see a reason to tie their hands.

To unpack some of the specific concerns:
 - *Violation of Kafka's Ordering* - The proposal doesn't change the order of 
data presented by an iterator.  It just subdivides further than the existing 
batching mechanism and parallelizes.  For an operation like {{mapPartitions}}, 
running two correctly ordered partitions in parallel is indistinguishable from 
running them serially at batch boundaries. That is, unless your computation is 
non-deterministic as a result of communication with an external store.  Here, 
it should be noted that non-deterministic computation violates our recovery 
semantics, and should be avoided anyway.  That said, there certainly are cases 
where people may choose to give up correctness during recovery and that is why 
I agree this optimization should be optional.  Perhaps even off by default.
 - *Partitions are the answer* - Sufficient partitions are helpful, but this 
optimization would allow you to increase throughput through the use of replicas 
as well.  And again, Spark users are not always Kafka administrators.

Now, there is an operation, {{mapWithState}}, where this optimization could 
change the result.  I do think we will want to support this operation 
eventually (maybe in 2.2). I haven't really figured out the specifics, but I 
would imagine we can use existing mechanisms in the query planner, such as 
{{requiredChildOrdering}} or {{requiredChildDistribution}} to make sure that we 
only turn this on when it can't change the answer.
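
For concreteness, a purely hypothetical user-facing shape for such an optional knob (the option name and its reader-side behaviour are assumptions, not an existing API):

{code}
// Ask the Kafka source for more Spark tasks than there are TopicPartitions;
// off unless explicitly set, per the discussion above.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("minPartitions", "64")     // hypothetical option; splits offset ranges further
  .load()
{code}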

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This will mean that we won't be able to use the "CachedKafkaConsumer" for its 
> intended purpose (being cached) in this use case, but the extra overhead is 
> worth it for handling data skew and increasing parallelism, especially in ETL 
> use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull

2016-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706606#comment-15706606
 ] 

Apache Spark commented on SPARK-17897:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16067

> not isnotnull is converted to the always false condition isnotnull && not 
> isnotnull
> ---
>
> Key: SPARK-17897
> URL: https://issues.apache.org/jira/browse/SPARK-17897
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jordan Halterman
>  Labels: correctness
>
> When a logical plan is built containing the following somewhat nonsensical 
> filter:
> {{Filter (NOT isnotnull($f0#212))}}
> During optimization the filter is converted into a condition that will always 
> fail:
> {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}}
> This appears to be caused by the following check for {{NullIntolerant}}:
> https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63
> This check recurses through the expression and extracts nested {{IsNotNull}} 
> calls, converting them to {{IsNotNull}} calls on the attribute at the root 
> level:
> https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49
> This results in the nonsensical condition above.
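
A minimal repro sketch, assuming a SparkSession named {{spark}}; the exact plan and result depend on the affected version:

{code}
// The outer NOT should keep only the null rows, but the inferred IsNotNull
// constraint turns the filter into an always-false condition.
val df = spark.range(2).selectExpr("IF(id = 0, CAST(NULL AS BIGINT), id) AS f0")
df.filter(!df("f0").isNotNull).explain(true)  // optimized filter: isnotnull(f0) && NOT isnotnull(f0)
df.filter(!df("f0").isNotNull).count()        // expected 1 (the null row); affected versions return 0
{code}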



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18635:
---
Target Version/s: 2.1.0
Priority: Critical  (was: Major)

> Partition name/values not escaped correctly in some cases
> -
>
> Key: SPARK-18635
> URL: https://issues.apache.org/jira/browse/SPARK-18635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Priority: Critical
>
> For example, the following command does not insert data properly into the 
> table
> {code}
> spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
> B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18635:
--

 Summary: Partition name/values not escaped correctly in some cases
 Key: SPARK-18635
 URL: https://issues.apache.org/jira/browse/SPARK-18635
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang


For example, the following command does not insert data properly into the table

{code}
spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18545) Verify number of hive client RPCs in PartitionedTablePerfStatsSuite

2016-11-29 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18545:
---
Issue Type: Sub-task  (was: Test)
Parent: SPARK-17861

> Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
> ---
>
> Key: SPARK-18545
> URL: https://issues.apache.org/jira/browse/SPARK-18545
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> To avoid performance regressions like 
> https://issues.apache.org/jira/browse/SPARK-18507 in the future, we should 
> add a metric for the number of Hive client RPCs issued and check it in the 
> perf stats suite.
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18507) Major performance regression in SHOW PARTITIONS on partitioned Hive tables

2016-11-29 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18507:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Major performance regression in SHOW PARTITIONS on partitioned Hive tables
> --
>
> Key: SPARK-18507
> URL: https://issues.apache.org/jira/browse/SPARK-18507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Allman
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.1.0
>
>
> Commit {{ccb11543048dccd4cc590a8db1df1d9d5847d112}} 
> (https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112)
>  appears to have introduced a major regression in the performance of the Hive 
> {{SHOW PARTITIONS}} command. Running that command on a Hive table with 17,337 
> partitions in the {{spark-sql}} shell with the parent commit of {{ccb1154}} 
> takes approximately 7.3 seconds. Running the same command with commit 
> {{ccb1154}} takes approximately 250 seconds.
> I have not had the opportunity to complete a thorough investigation, but I 
> suspect the problem lies in the diff hunk beginning at 
> https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112#diff-159191585e10542f013cb3a714f26075L675.
>  If that's the case, this performance issue should manifest itself in other 
> areas as well, since the same programming pattern was used elsewhere in this commit.
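
For reference, a rough way to reproduce the timing comparison from a spark-shell session (a sketch; the table name {{partitioned_table}} is a placeholder, and {{spark}} is assumed to be a SparkSession with Hive support):

{code}
// Crude wall-clock timing of SHOW PARTITIONS; run once per build
// (the parent commit vs. ccb1154) and compare the reported durations.
def timeIt[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1000000
  println(s"$label took $elapsedMs ms")
  result
}

timeIt("SHOW PARTITIONS") {
  spark.sql("SHOW PARTITIONS partitioned_table").collect()
}
{code}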






[jira] [Updated] (SPARK-18429) SQL aggregate function for CountMinSketch

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18429:

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-16026

> SQL aggregate function for CountMinSketch
> -
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Implement a new Aggregate that generates a count-min sketch, as a wrapper 
> around the existing CountMinSketch implementation.
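
For context, a small standalone sketch of the existing {{org.apache.spark.util.sketch.CountMinSketch}} utility that such an aggregate would wrap (the depth/width/seed values are arbitrary; assumes the spark-sketch module is on the classpath):

{code}
import org.apache.spark.util.sketch.CountMinSketch

// Build a count-min sketch directly; the proposed SQL aggregate would do the
// equivalent over a column's values.
val sketch = CountMinSketch.create(10, 2000, 42)  // depth, width, seed
Seq("a", "b", "a", "c", "a").foreach(x => sketch.add(x))
println(sketch.estimateCount("a"))  // approximate count of "a", at least 3
{code}

The DataFrame API already exposes a similar helper ({{df.stat.countMinSketch}}); this issue is about surfacing the same capability as a SQL aggregate function.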






[jira] [Resolved] (SPARK-18429) SQL aggregate function for CountMinSketch

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18429.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.2.0

> SQL aggregate function for CountMinSketch
> -
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Implement a new Aggregate that generates a count-min sketch, as a wrapper 
> around the existing CountMinSketch implementation.






[jira] [Updated] (SPARK-18429) SQL aggregate function for CountMinSketch

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18429:

Summary: SQL aggregate function for CountMinSketch  (was: implement a new 
Aggregate for CountMinSketch)

> SQL aggregate function for CountMinSketch
> -
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Implement a new Aggregate that generates a count-min sketch, as a wrapper 
> around the existing CountMinSketch implementation.






[jira] [Updated] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes

2016-11-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18632:

Target Version/s: 2.2.0

> AggregateFunction should not ImplicitCastInputTypes
> ---
>
> Key: SPARK-18632
> URL: https://issues.apache.org/jira/browse/SPARK-18632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which 
> enables implicit casting of input types). This can lead to unexpected results; 
> implicit casting should only be enabled when it is suitable for the function at hand.
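
A small illustration of the kind of surprise implicit input casting can produce (a sketch; the exact behavior depends on the Spark version and function, and {{spark}} is assumed to be a SparkSession):

{code}
// With implicit input casts enabled on aggregates, summing a string column is
// accepted: the strings are coerced (e.g. to double), values that do not parse
// become NULL, and the query returns NULL instead of failing analysis.
val df = spark.range(3).selectExpr("concat('x', id) AS s")
df.selectExpr("sum(s)").show()   // likely NULL rather than a type error
{code}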






[jira] [Assigned] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes

2016-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18632:


Assignee: Herman van Hovell  (was: Apache Spark)

> AggregateFunction should not ImplicitCastInputTypes
> ---
>
> Key: SPARK-18632
> URL: https://issues.apache.org/jira/browse/SPARK-18632
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which 
> enables implicit casting of input types). This can lead to unexpected results; 
> implicit casting should only be enabled when it is suitable for the function at hand.





