[jira] [Resolved] (SPARK-18617) Close "kryo auto pick" feature for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-18617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18617. - Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.1.0 > Close "kryo auto pick" feature for Spark Streaming > -- > > Key: SPARK-18617 > URL: https://issues.apache.org/jira/browse/SPARK-18617 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Genmao Yu >Assignee: Genmao Yu > Fix For: 2.1.0 > > > [PR-15992| https://github.com/apache/spark/pull/15992] provided a solution to > fix the bug, i.e. {{receiver data can not be deserialized properly}}. As > [~zsxwing] said, it is a critical bug, but we should not break APIs between > maintenance releases. It may be a rational choice to close {{auto pick kryo > serializer}} for Spark Streaming in the first step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18622) Missing Reference in Multi Union Clauses Caused by TypeCoercion
[ https://issues.apache.org/jira/browse/SPARK-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18622. - Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.1.0 > Missing Reference in Multi Union Clauses Cause by TypeCoercion > -- > > Key: SPARK-18622 > URL: https://issues.apache.org/jira/browse/SPARK-18622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2 >Reporter: Yerui Sun >Assignee: Herman van Hovell > Fix For: 2.1.0 > > > {code} > spark-sql> explain extended > > select a > > from > > ( > > select 0 a, 0 b > > union all > > select sum(1) a, cast(0 as bigint) b > > union all > > select 0 a, 0 b > > )t; > > == Parsed Logical Plan == > 'Project ['a] > +- 'SubqueryAlias t >+- 'Union > :- 'Union > : :- Project [0 AS a#0, 0 AS b#1] > : : +- OneRowRelation$ > : +- 'Project ['sum(1) AS a#2, cast(0 as bigint) AS b#3L] > : +- OneRowRelation$ > +- Project [0 AS a#4, 0 AS b#5] > +- OneRowRelation$ > > == Analyzed Logical Plan == > a: int > Project [a#0] > +- SubqueryAlias t >+- Union > :- !Project [a#0, b#9L] > : +- Union > : :- Project [cast(a#0 as bigint) AS a#11L, b#9L] > : : +- Project [a#0, cast(b#1 as bigint) AS b#9L] > : : +- Project [0 AS a#0, 0 AS b#1] > : :+- OneRowRelation$ > : +- Project [a#2L, b#3L] > :+- Project [a#2L, b#3L] > : +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as > bigint) AS b#3L] > : +- OneRowRelation$ > +- Project [a#4, cast(b#5 as bigint) AS b#10L] > +- Project [0 AS a#4, 0 AS b#5] > +- OneRowRelation$ > > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: resolved attribute(s) a#0 missing > from a#11L,b#9L in operator !Project [a#0, b#9L];; > Project [a#0] > +- SubqueryAlias t >+- Union > :- !Project [a#0, b#9L] > : +- Union > : :- Project [cast(a#0 as bigint) AS a#11L, b#9L] > : : +- Project [a#0, cast(b#1 as bigint) AS b#9L] > : : +- Project [0 AS a#0, 0 AS b#1] > : :+- OneRowRelation$ > : +- Project [a#2L, b#3L] > 
:+- Project [a#2L, b#3L] > : +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as > bigint) AS b#3L] > : +- OneRowRelation$ > +- Project [a#4, cast(b#5 as bigint) AS b#10L] > +- Project [0 AS a#4, 0 AS b#5] > +- OneRowRelation$ > > == Physical Plan == > org.apache.spark.sql.AnalysisException: resolved attribute(s) a#0 missing > from a#11L,b#9L in operator !Project [a#0, b#9L];; > Project [a#0] > +- SubqueryAlias t >+- Union > :- !Project [a#0, b#9L] > : +- Union > : :- Project [cast(a#0 as bigint) AS a#11L, b#9L] > : : +- Project [a#0, cast(b#1 as bigint) AS b#9L] > : : +- Project [0 AS a#0, 0 AS b#1] > : :+- OneRowRelation$ > : +- Project [a#2L, b#3L] > :+- Project [a#2L, b#3L] > : +- Aggregate [sum(cast(1 as bigint)) AS a#2L, cast(0 as > bigint) AS b#3L] > : +- OneRowRelation$ > +- Project [a#4, cast(b#5 as bigint) AS b#10L] > +- Project [0 AS a#4, 0 AS b#5] > +- OneRowRelation$ > {code} > Key points to reproduce the issue: > * 3 or more union clauses; > * One column is a sum aggregate in one union clause, and has Integer type in > the other union clauses; > * Another column has different data types across the union clauses; > The reason for the issue: > - Step 1: TypeCoercion.WidenSetOperationTypes adds a project with casts, > since the union clauses have different data types for one column; with 3 union > clauses, the inner union clause is also projected with casts; > - Step 2: TypeCoercion.FunctionArgumentConversion widens the return type of > sum(int) to BigInt, meaning one column in the union clauses changes data type; > - Step 3: TypeCoercion.WidenSetOperationTypes is applied again, and another cast > project is added in the inner union clause since the sum(int) data type changed; at this > point, the reference of the project ON the inner union goes missing, since the > project IN the inner union is newly added; see the Analyzed Logical Plan; > Solutions to fix: > * Since set operation type coercion should only be applied after the inner clauses are > stable, applying WidenSetOperationTypes last will fix the issue; > * To avoid multi-level projects on set operation clauses, handle the > existing cast projects carefully in
[jira] [Resolved] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17680. - Resolution: Fixed > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when they are specified within > backticks (`). When Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, since the Hive metastore has supported > Unicode characters for column names only since 0.13; see the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 > In Spark SQL, table comments and view comments always allow Unicode > characters without backticks.
[jira] [Updated] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18643: - Description: 1) Install SparkR from source package, ie. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR (not from sparkR shell) library(SparkR) sparkR.session() Notice SparkR hangs when it couldn't find spark-submit to launch the JVM backend. {code} Launching java with spark-submit command spark-submit sparkr-shell /tmp/RtmpYbAYt5/backend_port5849dc2273 sh: 1: spark-submit: not found {code} If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session. This seems to be a regression on the earlier behavior. was: 1) Install SparkR from source package, ie. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR (not from sparkR shell) library(SparkR) sparkR.session() Notice SparkR hangs when it couldn't find spark-submit to launch the JVM backend. If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session. This seems to be a regression on the earlier behavior. > SparkR hangs at session start when installed as a package without SPARK_HOME > set > > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. 
> {code} > Launching java with spark-submit command spark-submit sparkr-shell > /tmp/RtmpYbAYt5/backend_port5849dc2273 > sh: 1: spark-submit: not found > {code} > If SparkR is running as a package and it has previously downloaded the Spark jar, > it should be able to run as before without having to set SPARK_HOME. > Basically, with this bug the auto-installed Spark only works in the first > session. > This seems to be a regression from the earlier behavior.
[jira] [Commented] (SPARK-17934) Support percentile scale in ml.feature
[ https://issues.apache.org/jira/browse/SPARK-17934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707710#comment-15707710 ] yuhao yang commented on SPARK-17934: We can probably implement something like the RobustScaler in scikit-learn. http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html > Support percentile scale in ml.feature > -- > > Key: SPARK-17934 > URL: https://issues.apache.org/jira/browse/SPARK-17934 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Lei Wang > > Percentile scaling is often used in feature scaling. > In my project, I need to use this scaler. > Compared to MinMaxScaler, a PercentileScaler will not produce unstable results > due to anomalously large values. > About percentile scaling, refer to https://en.wikipedia.org/wiki/Percentile_rank
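The robustness argument above can be sketched outside Spark with a tiny pure-Python percentile-rank transform (a hypothetical helper for illustration, not part of ml.feature): one anomalously large value dominates min-max scaling but leaves percentile ranks untouched.

```python
def percentile_rank_scale(values):
    """Map each value to the fraction of values less than or equal to it.

    A simplified percentile-rank transform in [0, 1]; the rank of the bulk
    of the data is unaffected by a single huge outlier.
    """
    n = len(values)
    sorted_vals = sorted(values)
    return [sum(1 for v in sorted_vals if v <= x) / n for x in values]


def min_max_scale(values):
    """Min-max scaling for comparison; one outlier compresses the rest."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


data = [1.0, 2.0, 3.0, 1000.0]  # one anomalously large value
print(percentile_rank_scale(data))  # -> [0.25, 0.5, 0.75, 1.0]
print(min_max_scale(data))          # first three values squeezed near 0
```

With min-max scaling the first three points all land below 0.003, while their percentile ranks remain evenly spread, which is the stability property the reporter is after.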
[jira] [Commented] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707704#comment-15707704 ] Felix Cheung commented on SPARK-18643: -- A workaround is to start as sparkR.session(master="local") - but it might not be always correct (not if the user is going to run Spark in non-local mode) > SparkR hangs at session start when installed as a package without SPARK_HOME > set > > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. > Basically with this bug the auto install Spark will only work in the first > session. > This seems to be a regression on the earlier behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18643: Assignee: Apache Spark (was: Felix Cheung) > SparkR hangs at session start when installed as a package without SPARK_HOME > set > > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. > Basically with this bug the auto install Spark will only work in the first > session. > This seems to be a regression on the earlier behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18643: Assignee: Felix Cheung (was: Apache Spark) > SparkR hangs at session start when installed as a package without SPARK_HOME > set > > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. > Basically with this bug the auto install Spark will only work in the first > session. > This seems to be a regression on the earlier behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707701#comment-15707701 ] Apache Spark commented on SPARK-18643: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16077 > SparkR hangs at session start when installed as a package without SPARK_HOME > set > > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. > Basically with this bug the auto install Spark will only work in the first > session. > This seems to be a regression on the earlier behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16848) Make jdbc() and read.format("jdbc") consistently throwing exception for user-specified schema
[ https://issues.apache.org/jira/browse/SPARK-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707699#comment-15707699 ] Pramod Anarase commented on SPARK-16848: +1 > Make jdbc() and read.format("jdbc") consistently throwing exception for > user-specified schema > - > > Key: SPARK-16848 > URL: https://issues.apache.org/jira/browse/SPARK-16848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > Currently, > {code} > spark.read.schema(StructType(Seq())).jdbc(...).show() > {code} > does not throw an exception, whereas > {code} > spark.read.schema(StructType(Seq())).option(...).format("jdbc").load().show() > {code} > does, as below: > {code} > jdbc does not allow user-specified schemas.; > org.apache.spark.sql.AnalysisException: jdbc does not allow user-specified > schemas.; > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at > org.apache.spark.sql.jdbc.JDBCSuite$$anonfun$17.apply$mcV$sp(JDBCSuite.scala:351) > {code} > It'd make sense to throw the exception identically when the user specifies a schema.
[jira] [Updated] (SPARK-18643) SparkR hangs at session start when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18643: - Summary: SparkR hangs at session start when installed as a package without SPARK_HOME set (was: SparkR hangs when installed as a package without SPARK_HOME set) > SparkR hangs at session start when installed as a package without SPARK_HOME > set > > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. > Basically with this bug the auto install Spark will only work in the first > session. > This seems to be a regression on the earlier behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18643: - Description: 1) Install SparkR from source package, ie. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR (not from sparkR shell) library(SparkR) sparkR.session() Notice SparkR hangs when it couldn't find spark-submit to launch the JVM backend. If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session. This seems to be a regression on the earlier behavior. was: 1) Install SparkR from source package, ie. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR (not from sparkR shell) library(SparkR) sparkR.session() Notice SparkR hangs when it couldn't find spark-submit to launch the JVM backend. If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. This seems to be a regression on the earlier behavior. > SparkR hangs when installed as a package without SPARK_HOME set > --- > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. > Basically with this bug the auto install Spark will only work in the first > session. > This seems to be a regression on the earlier behavior. 
[jira] [Updated] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18643: - Description: 1) Install SparkR from source package, ie. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR (not from sparkR shell) library(SparkR) sparkR.session() Notice SparkR hangs when it couldn't find spark-submit to launch the JVM backend. If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. This seems to be a regression on the earlier behavior. was: 1) Install SparkR from source package, ie. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR library(SparkR) sparkR.session() Notice SparkR hangs when it couldn't find spark-submit to launch the JVM backend > SparkR hangs when installed as a package without SPARK_HOME set > --- > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR (not from sparkR shell) > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend. > If SparkR is running as a package and it has previously downloaded Spark Jar > it should be able to run as before without having to set SPARK_HOME. This > seems to be a regression on the earlier behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707675#comment-15707675 ] Apache Spark commented on SPARK-18324: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/16076 > ML, Graph 2.1 QA: Programming guide update and migration guide > -- > > Key: SPARK-18324 > URL: https://issues.apache.org/jira/browse/SPARK-18324 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang >Priority: Critical > > Before the release, we need to update the MLlib and GraphX Programming > Guides. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set
Felix Cheung created SPARK-18643: Summary: SparkR hangs when installed as a package without SPARK_HOME set Key: SPARK-18643 URL: https://issues.apache.org/jira/browse/SPARK-18643 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.0 Reporter: Felix Cheung Assignee: Felix Cheung Priority: Critical 1) Install SparkR from the source package, i.e. R CMD INSTALL SparkR_2.1.0.tar.gz 2) Start SparkR library(SparkR) sparkR.session() Notice SparkR hangs when it cannot find spark-submit to launch the JVM backend.
[jira] [Commented] (SPARK-18643) SparkR hangs when installed as a package without SPARK_HOME set
[ https://issues.apache.org/jira/browse/SPARK-18643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707669#comment-15707669 ] Felix Cheung commented on SPARK-18643: -- Related PR: https://github.com/apache/spark/pull/15888 > SparkR hangs when installed as a package without SPARK_HOME set > --- > > Key: SPARK-18643 > URL: https://issues.apache.org/jira/browse/SPARK-18643 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Critical > > 1) Install SparkR from source package, ie. > R CMD INSTALL SparkR_2.1.0.tar.gz > 2) Start SparkR > library(SparkR) > sparkR.session() > Notice SparkR hangs when it couldn't find spark-submit to launch the JVM > backend -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17692. - Resolution: Fixed Fix Version/s: 2.1.0 > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > Fix For: 2.1.0 > > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you found one, please comment below and link the corresponding JIRA here. > * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5. > * SPARK-17870: ChiSquareSelector use pValue rather than raw statistic for > SelectKBest features. > * SPARK-3261: KMeans returns potentially fewer than k cluster centers in > cases where k distinct centroids aren't available or aren't selected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707665#comment-15707665 ] Yanbo Liang commented on SPARK-17692: - All behavior changes have been documented in the PR for SPARK-18324, so I will close this one. > Document ML/MLlib behavior changes in Spark 2.1 > --- > > Key: SPARK-17692 > URL: https://issues.apache.org/jira/browse/SPARK-17692 > Project: Spark > Issue Type: Documentation > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Labels: 2.1.0 > > This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can > note those changes (if any) in the user guide's Migration Guide section. If > you find one, please comment below and link the corresponding JIRA here. > * SPARK-17389: Reduce the KMeans default k-means|| init steps from 5 to 2. > * SPARK-17870: ChiSquareSelector uses pValue rather than the raw statistic for > SelectKBest features. > * SPARK-3261: KMeans returns potentially fewer than k cluster centers in > cases where k distinct centroids aren't available or aren't selected.
[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707663#comment-15707663 ] yuhao yang commented on SPARK-18608: Agree. we can just add an extra parameter handlePersistence: Boolean to the train method in Predictor. > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath > > Some algorithms in Spark ML (e.g. {{LogisticRegression}}, > {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence > internally. They check whether the input dataset is cached, and if not they > cache it for performance. > However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. > This will actually always be true, since even if the dataset itself is > cached, the RDD returned by {{dataset.rdd}} will not be cached. > Hence if the input dataset is cached, the data will end up being cached > twice, which is wasteful. > To see this: > {code} > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val df = spark.range(10).toDF("num") > df: org.apache.spark.sql.DataFrame = [num: bigint] > scala> df.storageLevel == StorageLevel.NONE > res0: Boolean = true > scala> df.persist > res1: df.type = [num: bigint] > scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK > res2: Boolean = true > scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK > res3: Boolean = false > scala> df.rdd.getStorageLevel == StorageLevel.NONE > res4: Boolean = true > {code} > Before SPARK-16063, there was no way to check the storage level of the input > {{DataSet}}, but now we can, so the checks should be migrated to use > {{dataset.storageLevel}}. 
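The faulty check in SPARK-18608 can be illustrated without a Spark cluster by mocking the two storage-level views (the mock classes below are illustrative stand-ins, not Spark's API): `dataset.rdd` is a freshly converted RDD whose own storage level is always NONE, so the old check concludes "not cached" even for a persisted Dataset.

```python
class StorageLevel:
    NONE = "NONE"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"


class MockRDD:
    """Stands in for dataset.rdd: the Dataset-to-RDD conversion yields a new
    RDD whose storage level is NONE, even when the Dataset itself is cached."""
    storage_level = StorageLevel.NONE


class MockDataset:
    def __init__(self):
        self.storage_level = StorageLevel.NONE

    def persist(self):
        self.storage_level = StorageLevel.MEMORY_AND_DISK
        return self

    @property
    def rdd(self):
        return MockRDD()  # cache status is not inherited by the derived RDD


ds = MockDataset().persist()

# Buggy check (pre-fix): inspects the derived RDD, always sees NONE,
# so the algorithm caches the already-cached data a second time.
would_double_cache = ds.rdd.storage_level == StorageLevel.NONE   # True

# Fixed check (possible since SPARK-16063): consult the Dataset itself.
would_cache_again = ds.storage_level == StorageLevel.NONE        # False
```

The migration suggested in the issue is exactly the move from the first comparison to the second.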
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707659#comment-15707659 ] yuhao yang commented on SPARK-18374: Yes. Currently we're discussing whether we should put "wouldn't" (rather than "wouldnt") directly into MLlib's stop words list, because by default the Tokenizer in Spark does not split on apostrophes or quotes. > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel > > I was just double-checking english.txt's list of stopwords, as I felt it was > taking out valid tokens like 'won'. I think the issue is that the english.txt list is > missing the apostrophe character and all characters after the apostrophe. So "won't" > became "won" in that list; "wouldn't" is "wouldn". > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think the ideal list should have both styles, i.e. both won't and wont should be > part of english.txt, as some tokenizers might remove special characters. But > 'won' obviously shouldn't be in this list. > Here's the list of Snowball English stop words: > http://snowball.tartarus.org/algorithms/english/stop.txt
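The problem the comment describes can be demonstrated with a small sketch (the stop lists below are hypothetical excerpts; Spark's default Tokenizer lowercases and splits on whitespace, not on apostrophes):

```python
# Entries as they appear in the flawed list: apostrophe and tail chopped off.
truncated_stop_words = {"won", "wouldn", "mustn"}
# Apostrophe-preserving entries, matching what the tokenizer actually emits.
apostrophe_stop_words = {"won't", "wouldn't", "mustn't"}


def tokenize(text):
    # Approximates Spark's default Tokenizer: lowercase, split on whitespace.
    return text.lower().split()


tokens = tokenize("She wouldn't say who won")

# The truncated list wrongly removes the valid word "won" and misses the
# intact contraction "wouldn't":
print([t for t in tokens if t not in truncated_stop_words])
# -> ['she', "wouldn't", 'say', 'who']

# The apostrophe-preserving list removes the contraction and keeps "won":
print([t for t in tokens if t not in apostrophe_stop_words])
# -> ['she', 'say', 'who', 'won']
```

This is why the fix under discussion is to store the apostrophe form in the list rather than the truncated stems.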
[jira] [Updated] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-18278: --- Attachment: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf I attached a proposal outlining a potential long term plan for this feature. Any feedback about it would be appreciated. > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707642#comment-15707642 ] Xiangrui Meng commented on SPARK-18374: --- See the discussion here: https://github.com/nltk/nltk_data/issues/22. Including `won` is apparently a mistake. > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel > > I was just double checking english.txt for list of stopwords as I felt it was > taking out valid tokens like 'won'. I think issue is english.txt list is > missing apostrophe character and all character after apostrophe. So "won't" > becam "won" in that list; "wouldn't" is "wouldn" . > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think ideal list should have both style. i.e. won't and wont both should be > part of english.txt as some tokenizer might remove special characters. But > 'won' is obviously shouldn't be in this list. > Here's list of snowball english stop words: > http://snowball.tartarus.org/algorithms/english/stop.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17680) Unicode Character Support for Column Names and Comments
[ https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707629#comment-15707629 ] Kazuaki Ishizaki commented on SPARK-17680: -- Sorry, it is my mistake. > Unicode Character Support for Column Names and Comments > --- > > Key: SPARK-17680 > URL: https://issues.apache.org/jira/browse/SPARK-17680 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Spark SQL supports Unicode characters for column names when specified within > backticks(`). When the Hive support is enabled, the version of the Hive > metastore must be higher than 0.12, See the JIRA: > https://issues.apache.org/jira/browse/HIVE-6013 Hive metastore supports > Unicode characters for column names since 0.13. > In Spark SQL, table comments, and view comments always allow Unicode > characters without backticks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Description: I've traced into the source code, and it seems that the session hook of Sentry is not set when Spark SQL starts a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is not called in Spark SQL. Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into Spark's classpath. Here is the stack: === 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) at org.apache.hadoop.security.Groups.getGroups(Groups.java:162) at org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60) at org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956) at org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826) at org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131) at org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source) at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43) at org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) was: I've traced into source code, and it seems that of Sentry not set when spark sql started a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not called in spark sql. Edit: I copyed hive-site.xml(which turns on Sentry) and all sentry jars into spark's
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Priority: Major (was: Minor) > Show databases NullPointerException while Sentry turned on > -- > > Key: SPARK-18641 > URL: https://issues.apache.org/jira/browse/SPARK-18641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1 >Reporter: zhangqw > > I've traced into source code, and it seems that of > Sentry not set when spark sql started a session. This operation should be > done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is > not called in spark sql. > Edit: I copyed hive-site.xml(which turns on Sentry) and all sentry jars into > spark's classpath. > Here is thestack: > === > 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) > at > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) > at org.apache.hadoop.security.Groups.getGroups(Groups.java:162) > at > org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60) > at > org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956) > at > org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826) > at > org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131) > at > org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source) > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43) > at > org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) > at > 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) > at >
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Priority: Minor (was: Major) > Show databases NullPointerException while Sentry turned on > -- > > Key: SPARK-18641 > URL: https://issues.apache.org/jira/browse/SPARK-18641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1 >Reporter: zhangqw >Priority: Minor > > I've traced into source code, and it seems that of > Sentry not set when spark sql started a session. This operation should be > done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is > not called in spark sql. > Edit: I copyed hive-site.xml(which turns on Sentry) and all sentry jars into > spark's classpath. > Here is thestack: > === > 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) > at > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) > at org.apache.hadoop.security.Groups.getGroups(Groups.java:162) > at > org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60) > at > org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956) > at > org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826) > at > org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131) > at > org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source) > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43) > at > org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) > at > 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at >
[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-18642: -- Description: When doing a left-join between two tables, say A and B, Catalyst has information about the projection required for table B. Only the required columns should be scanned. Code snippet below explains the scenario: scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] scala> dfA.registerTempTable("A") scala> dfB.registerTempTable("B") scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain == Physical Plan == Project [aid#15,bid#17] +- Filter (bid#17 < 2) +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB This is a watered-down example from a production issue which has a huge performance impact. External reference: http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns was: When doing a left-join between two tables, say A and B, Catalyst has information about the projection required for table B. Only the required columns should be scanned. 
Code snippet below explains the scenario: scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] scala> dfA.registerTempTable("A") scala> dfB.registerTempTable("B") scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain == Physical Plan == Project [aid#15,bid#17] +- Filter (bid#17 < 2) +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB This is a watered-down example from a production issue which has a huge performance impact. > Spark SQL: Catalyst is scanning undesired columns > - > > Key: SPARK-18642 > URL: https://issues.apache.org/jira/browse/SPARK-18642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 > Environment: Ubuntu 14.04 > Spark: Local Mode >Reporter: Mohit > Labels: performance > > When doing a left-join between two tables, say A and B, Catalyst has > information about the projection required for table B. Only the required > columns should be scanned. 
> Code snippet below explains the scenario: > scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") > dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] > scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") > dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] > scala> dfA.registerTempTable("A") > scala> dfB.registerTempTable("B") > scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid > where B.bid<2").explain > == Physical Plan == > Project [aid#15,bid#17] > +- Filter (bid#17 < 2) >+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None > :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: > file:/home/mohit/ruleA > +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: > file:/home/mohit/ruleB > This is a watered-down example from a production issue which has a huge > performance impact. > External reference: > http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
[ https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit updated SPARK-18642: -- Description: When doing a left-join between two tables, say A and B, Catalyst has information about the projection required for table B. Only the required columns should be scanned. Code snippet below explains the scenario: scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] scala> dfA.registerTempTable("A") scala> dfB.registerTempTable("B") scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain == Physical Plan == Project [aid#15,bid#17] +- Filter (bid#17 < 2) +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB This is a watered-down example from a production issue which has a huge performance impact. was: When doing a left-join between two tables, say A and B, Catalyst has information about the projection required for table B. 
Code snippet below explains the scenario: scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] scala> dfA.registerTempTable("A") scala> dfB.registerTempTable("B") scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain == Physical Plan == Project [aid#15,bid#17] +- Filter (bid#17 < 2) +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB This is a watered-down example from a production issue which has a huge performance impact. > Spark SQL: Catalyst is scanning undesired columns > - > > Key: SPARK-18642 > URL: https://issues.apache.org/jira/browse/SPARK-18642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 > Environment: Ubuntu 14.04 > Spark: Local Mode >Reporter: Mohit > Labels: performance > > When doing a left-join between two tables, say A and B, Catalyst has > information about the projection required for table B. Only the required > columns should be scanned. 
> Code snippet below explains the scenario: > scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") > dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] > scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") > dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] > scala> dfA.registerTempTable("A") > scala> dfB.registerTempTable("B") > scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid > where B.bid<2").explain > == Physical Plan == > Project [aid#15,bid#17] > +- Filter (bid#17 < 2) >+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None > :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: > file:/home/mohit/ruleA > +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: > file:/home/mohit/ruleB > This is a watered-down example from a production issue which has a huge > performance impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns
Mohit created SPARK-18642: - Summary: Spark SQL: Catalyst is scanning undesired columns Key: SPARK-18642 URL: https://issues.apache.org/jira/browse/SPARK-18642 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2 Environment: Ubuntu 14.04 Spark: Local Mode Reporter: Mohit When doing a left-join between two tables, say A and B, Catalyst has information about the projection required for table B. Code snippet below explains the scenario: scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA") dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string] scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB") dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string] scala> dfA.registerTempTable("A") scala> dfB.registerTempTable("B") scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid where B.bid<2").explain == Physical Plan == Project [aid#15,bid#17] +- Filter (bid#17 < 2) +- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: file:/home/mohit/ruleA +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: file:/home/mohit/ruleB This is a watered-down example from a production issue which has a huge performance impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators
[ https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707594#comment-15707594 ] Apache Spark commented on SPARK-17732: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15987 > ALTER TABLE DROP PARTITION should support comparators > - > > Key: SPARK-17732 > URL: https://issues.apache.org/jira/browse/SPARK-17732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > > This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in > Apache Spark 2.0 for backward compatibility. > *Spark 1.6.2* > {code} > scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, > quarter STRING)") > res0: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')") > res1: org.apache.spark.sql.DataFrame = [result: string] > {code} > *Spark 2.0* > {code} > scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, > quarter STRING)") > res0: org.apache.spark.sql.DataFrame = [] > scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting {')', ','}(line 1, pos 42) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt
[ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707560#comment-15707560 ] yuhao yang commented on SPARK-18374: cc [~mengxr] to see if he recalls any specific reason. > Incorrect words in StopWords/english.txt > > > Key: SPARK-18374 > URL: https://issues.apache.org/jira/browse/SPARK-18374 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: nirav patel > > I was just double checking english.txt for list of stopwords as I felt it was > taking out valid tokens like 'won'. I think issue is english.txt list is > missing apostrophe character and all character after apostrophe. So "won't" > becam "won" in that list; "wouldn't" is "wouldn" . > Here are some incorrect tokens in this list: > won > wouldn > ma > mightn > mustn > needn > shan > shouldn > wasn > weren > I think ideal list should have both style. i.e. won't and wont both should be > part of english.txt as some tokenizer might remove special characters. But > 'won' is obviously shouldn't be in this list. > Here's list of snowball english stop words: > http://snowball.tartarus.org/algorithms/english/stop.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18531) Apache Spark FPGrowth algorithm implementation fails with java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-18531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707542#comment-15707542 ] yuhao yang commented on SPARK-18531: [~tuxdna] Does it work for you? > Apache Spark FPGrowth algorithm implementation fails with > java.lang.StackOverflowError > -- > > Key: SPARK-18531 > URL: https://issues.apache.org/jira/browse/SPARK-18531 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: Saleem Ansari > > More details can be found here: > https://gist.github.com/tuxdna/37a69b53e6f9a9442fa3b1d5e53c2acb > *Spark FPGrowth algorithm croaks with a small dataset as shown below* > $ spark-shell --master "local[*]" --driver-memory 5g > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_102) > Spark context available as sc. > SQL context available as sqlContext. > scala> import org.apache.spark.mllib.fpm.FPGrowth > import org.apache.spark.mllib.fpm.FPGrowth > scala> import org.apache.spark.rdd.RDD > import org.apache.spark.rdd.RDD > scala> import org.apache.spark.sql.SQLContext > import org.apache.spark.sql.SQLContext > scala> import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.{SparkConf, SparkContext} > scala> val data = sc.textFile("bug.data") > data: org.apache.spark.rdd.RDD[String] = bug.data MapPartitionsRDD[1] at > textFile at :31 > scala> val transactions: RDD[Array[String]] = data.map(l => > l.split(",").distinct) > transactions: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] > at map at :33 > scala> transactions.cache() > res0: transactions.type = MapPartitionsRDD[2] at map at :33 > scala> val fpg = new FPGrowth().setMinSupport(0.05).setNumPartitions(10) > fpg: org.apache.spark.mllib.fpm.FPGrowth = > org.apache.spark.mllib.fpm.FPGrowth@66d62c59 > scala> val model = fpg.run(transactions) > model: 
org.apache.spark.mllib.fpm.FPGrowthModel[String] = > org.apache.spark.mllib.fpm.FPGrowthModel@6e92f150 > scala> model.freqItemsets.take(1).foreach { i => i.items.mkString("[", ",", > "]") + ", " + i.freq } > [Stage 3:> (0 + 2) / > 2]16/11/21 23:56:14 ERROR Executor: Managed memory leak detected; size = > 18068980 bytes, TID = 14 > 16/11/21 23:56:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 14) > java.lang.StackOverflowError > at org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > at > org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:273) > at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:115) > at > org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:202) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > *This failure is likely due to the size of baskets which contains over > thousands of items.* > scala> val maxBasketSize = transactions.map(_.length).max() > maxBasketSize: Int = 1171 > > scala> transactions.filter(_.length == maxBasketSize).collect() > res3: Array[Array[String]] = Array(Array(3858, 109, 5842, 2184, 2481, 534 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
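One hedged mitigation for the oversized baskets identified at the end of the report is to filter them out before mining. The following pure-Python sketch shows only that pre-processing step (the size threshold is an assumption to tune per dataset, not a documented Spark limit, and the sample lines are illustrative):

```python
# Hypothetical pre-filtering: drop unusually large baskets before running
# FP-growth, since very deep conditional trees built from 1000+ item baskets
# can exhaust the JVM stack during serialization, as in the trace above.
MAX_BASKET_SIZE = 500  # assumed threshold; tune for your data

raw_lines = [
    "3858,109,5842",
    "109,2184",
    ",".join(str(i) for i in range(1200)),  # stand-in for the 1171-item basket
]
# Mirror the reporter's l.split(",").distinct preprocessing.
transactions = [sorted(set(line.split(","))) for line in raw_lines]
small_enough = [t for t in transactions if len(t) <= MAX_BASKET_SIZE]

assert len(small_enough) == 2  # the oversized basket is excluded
```

In the actual job this filter would be a `transactions.filter(_.length <= maxBasketSize)` step on the RDD before calling `fpg.run`, at the cost of ignoring the largest baskets.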
[jira] [Resolved] (SPARK-15819) Add KMeanSummary in KMeans of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-15819. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > Add KMeanSummary in KMeans of PySpark > - > > Key: SPARK-15819 > URL: https://issues.apache.org/jira/browse/SPARK-15819 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang > Fix For: 2.1.1, 2.2.0 > > > There's no corresponding python api for KMeansSummary, it would be nice to > have it.
[jira] [Resolved] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18145. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > >
[jira] [Resolved] (SPARK-17861) Store data source partitions in metastore and push partition pruning into metastore
[ https://issues.apache.org/jira/browse/SPARK-17861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17861. - Resolution: Fixed Fix Version/s: 2.1.0 > Store data source partitions in metastore and push partition pruning into > metastore > --- > > Key: SPARK-17861 > URL: https://issues.apache.org/jira/browse/SPARK-17861 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Eric Liang >Priority: Critical > Fix For: 2.1.0 > > > Initially, Spark SQL does not store any partition information in the catalog > for data source tables, because initially it was designed to work with > arbitrary files. This, however, has a few issues for catalog tables: > 1. Listing partitions for a large table (with millions of partitions) can be > very slow during cold start. > 2. Does not support heterogeneous partition naming schemes. > 3. Cannot leverage pushing partition pruning into the metastore. > This ticket tracks the work required to push the tracking of partitions into > the metastore. This change should be feature flagged.
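The pruning-pushdown idea in SPARK-17861 can be sketched with a toy catalog: instead of listing every partition and filtering on the client, the query hands the catalog a predicate and only matching partitions come back. The class and method names below are illustrative, not Spark's or Hive's actual API:

```python
# Toy model of pushing partition pruning into the catalog/metastore.
# With millions of partitions, returning only the matching subset avoids the
# slow full listing described in the issue.

class ToyCatalog:
    def __init__(self):
        # partition spec (as a frozenset of key/value pairs) -> storage location
        self.partitions = {}

    def add(self, spec, location):
        self.partitions[frozenset(spec.items())] = location

    def list_partitions_by_filter(self, predicate):
        """Return only locations whose partition spec satisfies the predicate,
        so the caller never materializes the full partition list."""
        return [loc for spec, loc in self.partitions.items()
                if predicate(dict(spec))]

catalog = ToyCatalog()
for day in ("2016-11-28", "2016-11-29", "2016-11-30"):
    catalog.add({"ds": day}, f"/data/ds={day}")

pruned = catalog.list_partitions_by_filter(lambda p: p["ds"] >= "2016-11-30")
print(pruned)  # ['/data/ds=2016-11-30']
```

In the real system the predicate is translated into a metastore filter expression rather than a Python callable, which is why the change is feature-flagged and pushed down only for supported predicate shapes.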
[jira] [Resolved] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes
[ https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18632. - Resolution: Fixed Fix Version/s: 2.2.0 > AggregateFunction should not ImplicitCastInputTypes > --- > > Key: SPARK-18632 > URL: https://issues.apache.org/jira/browse/SPARK-18632 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > > {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which > enables implicit input type casting). This can lead to unexpected results, > and should only be enabled when it is suitable for the function at hand.
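Why implicit input casting in an aggregate can surprise users can be shown with a toy model. This mimics SQL-style null-on-failed-cast semantics in plain Python; it is a sketch of the failure mode, not Spark's analyzer or its cast rules:

```python
# Toy illustration of why an implicit input cast inside an aggregate is risky:
# unparseable inputs silently become nulls, which SUM then skips, so bad data
# vanishes without an error.

def implicit_cast_to_double(v):
    """Mimic an implicit cast: values that cannot be parsed become None."""
    try:
        return float(v)
    except (TypeError, ValueError):
        return None

def sum_with_implicit_cast(values):
    # Nulls produced by failed casts are silently ignored, SQL-SUM style.
    casted = (implicit_cast_to_double(v) for v in values)
    return sum(x for x in casted if x is not None)

def sum_strict(values):
    # Without the implicit cast, the type mismatch surfaces as an error
    # at analysis time instead of quietly changing the result.
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("SUM expects numeric input")
    return sum(values)

print(sum_with_implicit_cast(["1", "2", "oops"]))  # 3.0 -- "oops" vanished
```

This is the kind of silent result change the issue argues should be opt-in per function rather than inherited by every `AggregateFunction`.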
[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707435#comment-15707435 ] holdenk commented on SPARK-15369: - So I'm probably going to be busy until after the 2.1 release (also trying to finish a book and have some talks in the middle), but I'll take a look after that. > Investigate selectively using Jython for parts of PySpark > - > > Key: SPARK-15369 > URL: https://issues.apache.org/jira/browse/SPARK-15369 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk >Priority: Minor > > Transferring data from the JVM to the Python executor can be a substantial > bottleneck. While Jython is not suitable for all UDFs or map functions, it > may be suitable for some simple ones. We should investigate the option of > using Jython to accelerate these small functions.
[jira] [Comment Edited] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707428#comment-15707428 ] Marius Van Niekerk edited comment on SPARK-15369 at 11/30/16 3:49 AM: -- Oh yeah, once we have a pip installable spark it should be pretty easy testing this with some docker pieces with travis. Basic idea is to convert the benchmarks into an integration test. Feel free to open issues on that project. was (Author: mariusvniekerk): Oh yeah, once we have a pip installable spark it should be pretty easy testing this with some docker pieces with travis. Basic idea is to convert the benchmarks into an integration test.
[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707428#comment-15707428 ] Marius Van Niekerk commented on SPARK-15369: Oh yeah, once we have a pip installable spark it should be pretty easy testing this with some docker pieces with travis. Basic idea is to convert the benchmarks into an integration test.
[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707421#comment-15707421 ] holdenk commented on SPARK-15369: - That looks like a great start :) Probably the packaging is going to be a bit tricky, and it would probably make sense to have some testing as well, but thanks for getting started making a spark package for this.
[jira] [Commented] (SPARK-15369) Investigate selectively using Jython for parts of PySpark
[ https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707411#comment-15707411 ] Marius Van Niekerk commented on SPARK-15369: I'm in the process of an initial stab at turning this into a spark package. https://github.com/mariusvniekerk/spark-jython-udf Feedback would be appreciated.
[jira] [Commented] (SPARK-18516) Separate instantaneous state from progress performance statistics
[ https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707399#comment-15707399 ] Apache Spark commented on SPARK-18516: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16075 > Separate instantaneous state from progress performance statistics > - > > Key: SPARK-18516 > URL: https://issues.apache.org/jira/browse/SPARK-18516 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 2.1.0 > > > There are two types of information that you want to be able to extract from a > running query: instantaneous _status_ and metrics about the performance as > you make _progress_ in query processing. > Today, these are conflated in a single {{StreamingQueryStatus}} object. The > downside to this approach is that a user now needs to reason about what state > the query is in any time they retrieve a status object. Fields like > {{statusMessage}} don't appear in updates that come from the listener bus. > Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when > you retrieve a status object from the query itself. > I propose we make the following changes: > - Make {{status}} only report instantaneous things, such as whether data is > available or a human-readable message about what phase we are currently in. > - Have a separate {{progress}} message that we report for each trigger with > the other performance information that lives in status today. You should be > able to easily retrieve a configurable number of the most recent progress > messages instead of just the most recent. > While we are making these changes, I propose that we also change {{id}} to be > a globally unique identifier, rather than a JVM-unique one. Without this it's > hard to correlate performance across restarts.
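The proposed status/progress split can be sketched as a toy class: instantaneous state on one side, a bounded history of per-trigger progress records on the other, and a globally unique id. All names here are illustrative, not the final Spark API:

```python
# Toy model of separating instantaneous status from per-trigger progress,
# as proposed in SPARK-18516. Field and method names are illustrative.

from collections import deque
from uuid import uuid4

class ToyQuery:
    def __init__(self, retained=100):
        self.id = str(uuid4())      # globally unique, survives correlation across restarts
        self.message = "Initializing"
        self.data_available = False
        self._progress = deque(maxlen=retained)  # bounded history of progress records

    def status(self):
        # Only instantaneous state -- no rates that go stale between triggers.
        return {"message": self.message, "isDataAvailable": self.data_available}

    def record_trigger(self, input_rate, processing_rate):
        # One progress record per trigger; old records age out of the deque.
        self._progress.append({"inputRate": input_rate,
                               "processingRate": processing_rate})

    def recent_progress(self):
        return list(self._progress)

q = ToyQuery(retained=2)
for i in range(3):
    q.record_trigger(input_rate=10 * i, processing_rate=9 * i)
print(len(q.recent_progress()))  # 2 -- only the most recent records retained
```

The bounded deque captures the "configurable number of the most recent progress messages" idea, while `status()` never reports trigger-scoped rates, so there is no state to reason about when reading it.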
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while Sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Summary: Show databases NullPointerException while Sentry turned on (was: Show databases NullPointerException while sentry turned on) > Show databases NullPointerException while Sentry turned on > -- > > Key: SPARK-18641 > URL: https://issues.apache.org/jira/browse/SPARK-18641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1 >Reporter: zhangqw > > I've traced into the source code, and it seems that the session hook of > Sentry is not set when Spark SQL starts a session. This operation should be > done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is > not called in spark sql. > Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into > spark's classpath. > Here is the stack: > === > 16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333) > at > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988) > at org.apache.hadoop.security.Groups.getGroups(Groups.java:162) > at > org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60) > at > org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956) > at > org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826) > at > org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131) > at > org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031) > at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156) > at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source) > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) > at > org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) > at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:170) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43) > at > org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62) > at > 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at >
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Description: I've traced into the source code, and it seems that the session hook of Sentry is not set when Spark SQL starts a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not called in spark sql. Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into spark's classpath. Here is the stack: === (stack trace identical to the one above) was: I've traced into the source code, and it seems that the session hook of Sentry is not set when Spark SQL starts a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not called in spark sql. Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into spark's
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Affects Version/s: (was: 2.0.1) 2.0.0 > Show databases NullPointerException while sentry turned on > -- > > Key: SPARK-18641 > URL: https://issues.apache.org/jira/browse/SPARK-18641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1 >Reporter: zhangqw > > I've traced into the source code, and it seems that the session hook of > Sentry is not set when Spark SQL starts a session. This operation should be > done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is > not called in spark sql. > Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into > spark's classpath. > Here is the stack: > (stack trace identical to the one above)
[jira] [Updated] (SPARK-18641) Show databases NullPointerException while sentry turned on
[ https://issues.apache.org/jira/browse/SPARK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangqw updated SPARK-18641: Description: I've traced into the source code, and it seems that the session hook of Sentry is not set when Spark SQL starts a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not called in spark sql. Edit: I copied hive-site.xml (which turns on Sentry) and all Sentry jars into spark's classpath. Here is the stack: (stack trace identical to the one above) was: I've traced into the source code, and it seems that the session hook of Sentry is not set when Spark SQL starts a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook which is not called in spark sql. Here is the stack: (stack trace identical to the one above)
[jira] [Created] (SPARK-18641) Show databases NullPointerException while sentry turned on
zhangqw created SPARK-18641: --- Summary: Show databases NullPointerException while Sentry turned on Key: SPARK-18641 URL: https://issues.apache.org/jira/browse/SPARK-18641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Environment: CentOS 6.5 / Hive 1.1.0 / Sentry 1.5.1 Reporter: zhangqw I've traced into the source code, and it seems that the Sentry binding is not set up when Spark SQL starts a session. This operation should be done in org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook, which is not called in Spark SQL. Here is the stack trace:

16/11/30 10:54:50 WARN SentryMetaStoreFilterHook: Error getting DB list
java.lang.NullPointerException
 at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
 at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:988)
 at org.apache.hadoop.security.Groups.getGroups(Groups.java:162)
 at org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
 at org.apache.sentry.binding.hive.HiveAuthzBindingHook.getHiveBindingWithPrivilegeCache(HiveAuthzBindingHook.java:956)
 at org.apache.sentry.binding.hive.HiveAuthzBindingHook.filterShowDatabases(HiveAuthzBindingHook.java:826)
 at org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDb(SentryMetaStoreFilterHook.java:131)
 at org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook.filterDatabases(SentryMetaStoreFilterHook.java:59)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1031)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
 at com.sun.proxy.$Proxy38.getAllDatabases(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
 at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
 at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
 at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
 at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:170)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
 at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
 at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
 at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
 at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
 at org.apache.spark.sql.hive.HiveSessionState.metadataHive$lzycompute(HiveSessionState.scala:43)
 at org.apache.spark.sql.hive.HiveSessionState.metadataHive(HiveSessionState.scala:43)
 at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:62)
 at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
 at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-18613) spark.ml LDA classes should not expose spark.mllib in APIs
[ https://issues.apache.org/jira/browse/SPARK-18613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707320#comment-15707320 ] Joseph K. Bradley commented on SPARK-18613: --- I can take this after the 2.1 QA, but feel free to go ahead if you'd like. > spark.ml LDA classes should not expose spark.mllib in APIs > -- > > Key: SPARK-18613 > URL: https://issues.apache.org/jira/browse/SPARK-18613 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > spark.ml.LDAModel exposes dependencies on spark.mllib in 2 methods, but it > should not: > * {{def oldLocalModel: OldLocalLDAModel}} > * {{def getModel: OldLDAModel}} > This task is to deprecate those methods. I recommend creating > {{private[ml]}} versions of the methods which are used internally in order to > avoid deprecation warnings. > Setting target for 2.2, but I'm OK with getting it into 2.1 if we have time.
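The {{private[ml]}} pattern the ticket recommends might look roughly like this — a hedged sketch with placeholder type names, not the actual spark.ml code:

```scala
package org.apache.spark.ml

// Stand-in for the spark.mllib model type the public API currently leaks.
trait OldLocalLDAModel

class LDAModelSketch(private val model: OldLocalLDAModel) {
  // Internal accessor: spark.ml code calls this one, so it compiles without
  // deprecation warnings once the public method below is deprecated.
  private[ml] def oldLocalModelInternal: OldLocalLDAModel = model

  @deprecated("spark.mllib types should not be exposed in spark.ml APIs", "2.2.0")
  def oldLocalModel: OldLocalLDAModel = oldLocalModelInternal
}
```

Internal callers migrate to the {{private[ml]}} accessor, so only user code hitting the deprecated public method sees a warning.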
[jira] [Resolved] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18319. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 15972 [https://github.com/apache/spark/pull/15972] > ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit > -- > > Key: SPARK-18319 > URL: https://issues.apache.org/jira/browse/SPARK-18319 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Blocker > Fix For: 2.1.1, 2.2.0 > > > We should make a pass through the items marked as Experimental or > DeveloperApi and see if any are stable enough to be unmarked. > We should also check for items marked final or sealed to see if they are > stable enough to be opened up as APIs.
[jira] [Assigned] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18145: Assignee: (was: Apache Spark) > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >
[jira] [Assigned] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18145: Assignee: Apache Spark > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark >
[jira] [Commented] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707288#comment-15707288 ] Apache Spark commented on SPARK-18145: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/16074 > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >
[jira] [Commented] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
[ https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707265#comment-15707265 ] Apache Spark commented on SPARK-18640: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/16073 > Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors > > > Key: SPARK-18640 > URL: https://issues.apache.org/jira/browse/SPARK-18640 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable > executorIdToRunningTaskIds map without proper synchronization. We should fix > this.
[jira] [Created] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
Josh Rosen created SPARK-18640: -- Summary: Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors Key: SPARK-18640 URL: https://issues.apache.org/jira/browse/SPARK-18640 Project: Spark Issue Type: Bug Components: Scheduler Reporter: Josh Rosen Priority: Minor The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable executorIdToRunningTaskIds map without proper synchronization. We should fix this.
[jira] [Assigned] (SPARK-18640) Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors
[ https://issues.apache.org/jira/browse/SPARK-18640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-18640: -- Assignee: Josh Rosen > Fix minor synchronization issue in TaskSchedulerImpl.runningTasksByExecutors > > > Key: SPARK-18640 > URL: https://issues.apache.org/jira/browse/SPARK-18640 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > The method TaskSchedulerImpl.runningTasksByExecutors() accesses the mutable > executorIdToRunningTaskIds map without proper synchronization. We should fix > this.
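The shape of the SPARK-18640 problem and a plausible remedy can be sketched as follows — a hedged, simplified model, not Spark's actual scheduler class:

```scala
import scala.collection.mutable

class TaskSchedulerSketch {
  // Mutated by task-start/task-end event handlers that hold the scheduler's
  // lock; an unguarded reader could observe it mid-update.
  private val executorIdToRunningTaskIds =
    mutable.HashMap.empty[String, mutable.HashSet[Long]]

  // Taking the same lock as the writers and returning an immutable snapshot
  // is the fix direction the ticket describes.
  def runningTasksByExecutors(): Map[String, Int] = synchronized {
    executorIdToRunningTaskIds.map { case (exec, tasks) => (exec, tasks.size) }.toMap
  }
}
```

Returning an immutable snapshot (rather than a view over the mutable map) also prevents callers from racing with later updates after the lock is released.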
[jira] [Assigned] (SPARK-18639) Build only a single pip package
[ https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18639: Assignee: Reynold Xin (was: Apache Spark) > Build only a single pip package > --- > > Key: SPARK-18639 > URL: https://issues.apache.org/jira/browse/SPARK-18639 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently build 5 separate pip binary tarballs, doubling the release script > runtime. It'd be better to build one, especially for use cases that are just > using Spark locally.
[jira] [Commented] (SPARK-18639) Build only a single pip package
[ https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707258#comment-15707258 ] Apache Spark commented on SPARK-18639: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16072 > Build only a single pip package > --- > > Key: SPARK-18639 > URL: https://issues.apache.org/jira/browse/SPARK-18639 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently build 5 separate pip binary tarballs, doubling the release script > runtime. It'd be better to build one, especially for use cases that are just > using Spark locally.
[jira] [Assigned] (SPARK-18639) Build only a single pip package
[ https://issues.apache.org/jira/browse/SPARK-18639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18639: Assignee: Apache Spark (was: Reynold Xin) > Build only a single pip package > --- > > Key: SPARK-18639 > URL: https://issues.apache.org/jira/browse/SPARK-18639 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Reynold Xin >Assignee: Apache Spark > > We currently build 5 separate pip binary tarballs, doubling the release script > runtime. It'd be better to build one, especially for use cases that are just > using Spark locally.
[jira] [Created] (SPARK-18639) Build only a single pip package
Reynold Xin created SPARK-18639: --- Summary: Build only a single pip package Key: SPARK-18639 URL: https://issues.apache.org/jira/browse/SPARK-18639 Project: Spark Issue Type: Sub-task Components: Build Reporter: Reynold Xin Assignee: Reynold Xin We currently build 5 separate pip binary tarballs, doubling the release script runtime. It'd be better to build one, especially for use cases that are just using Spark locally.
[jira] [Assigned] (SPARK-18635) Partition name/values not escaped correctly in some cases
[ https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18635: Assignee: Apache Spark > Partition name/values not escaped correctly in some cases > - > > Key: SPARK-18635 > URL: https://issues.apache.org/jira/browse/SPARK-18635 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Critical > > For example, the following command does not insert data properly into the > table > {code} > spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as > B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy") > {code}
[jira] [Assigned] (SPARK-18635) Partition name/values not escaped correctly in some cases
[ https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18635: Assignee: (was: Apache Spark) > Partition name/values not escaped correctly in some cases > - > > Key: SPARK-18635 > URL: https://issues.apache.org/jira/browse/SPARK-18635 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Priority: Critical > > For example, the following command does not insert data properly into the > table > {code} > spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as > B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy") > {code}
[jira] [Commented] (SPARK-18635) Partition name/values not escaped correctly in some cases
[ https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707211#comment-15707211 ] Apache Spark commented on SPARK-18635: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/16071 > Partition name/values not escaped correctly in some cases > - > > Key: SPARK-18635 > URL: https://issues.apache.org/jira/browse/SPARK-18635 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Priority: Critical > > For example, the following command does not insert data properly into the > table > {code} > spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as > B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy") > {code}
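For context on SPARK-18635: partition values end up in filesystem paths, so characters such as {{=}}, {{/}}, {{\}} and {{%}} have to be percent-encoded Hive-style. A rough sketch of that escaping follows; the exact special-character set here is an assumption (the authoritative list lives in Hive's FileUtils and Spark's catalog utilities), not Spark's actual code.

```python
def escape_path_name(value: str) -> str:
    """Percent-encode characters that are unsafe in a Hive-style partition path."""
    # Approximation of Hive's special-character set (hypothetical subset).
    special = set('"#%\'*/:=?\\{[]^')
    out = []
    for ch in value:
        if ch in special or ord(ch) < 32 or ord(ch) == 127:
            out.append('%{:02X}'.format(ord(ch)))
        else:
            out.append(ch)
    return ''.join(out)

print(escape_path_name('A$\\=%'))  # -> A$%5C%3D%25
```

The ticket's example value {{A$\=%}} contains three of these characters, which is why an unescaped write produces a malformed partition directory name.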
[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups
[ https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707201#comment-15707201 ] Alex Jiang commented on SPARK-14437: [~hogeland] Did you get your issue resolved in 2.0.0? We are seeing a similar issue if we run our app in IntelliJ. If we run our app via the command line, like "java -jar app.jar -c app.conf", everything is fine. However, it fails when running in IntelliJ:

Caused by: java.lang.RuntimeException: Stream '/jars/classes' was not found.
 at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:222)
 at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
 at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
 at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
 at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
 at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
 at java.lang.Thread.run(Thread.java:745)
> Spark using Netty RPC gets wrong address in some setups > --- > > Key: SPARK-14437 > URL: https://issues.apache.org/jira/browse/SPARK-14437 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.6.0, 1.6.1 > Environment: AWS, Docker, Flannel >Reporter: Kevin Hogeland >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > Netty can't get the correct origin address in certain network setups. Spark > should handle this, as relying on Netty correctly reporting all addresses > leads to incompatible and unpredictable network states. We're currently using > Docker with Flannel on AWS. Container communication looks something like: > {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) > -> Container 2 (4.5.6.1)}} > If the client in that setup is Container 1 (1.2.3.1), Netty channels from > there to Container 2 will have a client address of 1.2.3.0. > The {{RequestMessage}} object that is sent over the wire already contains a > {{senderAddress}} field that the sender can use to specify their address. In > {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client > socket address when null. {{senderAddress}} in the messages sent from the > executors is currently always null, meaning all messages will have these > incorrect addresses (we've switched back to Akka as a temporary workaround > for this). The executor should send its address explicitly so that the driver > doesn't attempt to infer addresses based on possibly incorrect information > from Netty.
[jira] [Commented] (SPARK-18502) Spark does not handle columns that contain backquote (`)
[ https://issues.apache.org/jira/browse/SPARK-18502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707196#comment-15707196 ] Takeshi Yamamuro commented on SPARK-18502: -- Currently, AFAIK no. However, the SQL standard (http://savage.net.au/SQL/sql-99.bnf.html#delimited%20identifier) specifies the double quotation mark (") as the delimiter for identifiers, with doubling as its escape, and I feel we need a general approach to escaping these metacharacters in Spark. Certainly, other databases can use backquotes in column names. ex) PostgreSQL
{code}
postgres=# create table test_table("i`d" INT, "value" VARCHAR);
CREATE TABLE
postgres=# \d test_table
     Table "public.test_table"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 i`d    | integer           |
 value  | character varying |

postgres=# insert into test_table values(1, 'aa');
INSERT 0 1
postgres=# select "i`d" from test_table;
 i`d
-----
   1
(1 row)
{code}
> Spark does not handle columns that contain backquote (`) > > > Key: SPARK-18502 > URL: https://issues.apache.org/jira/browse/SPARK-18502 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Barry Becker >Priority: Minor > > I know that if a column contains dots or hyphens we can put > backquotes/backticks around it, but what if the column contains a backtick > (`)? Can the backtick be escaped by some means?
> Here is an example of the sort of error I see
> {code}
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: `Invoice`Date`;
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:109)
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.quotedString(unresolved.scala:90)
> org.apache.spark.sql.Column.<init>(Column.scala:113)
> org.apache.spark.sql.Column$.apply(Column.scala:36)
> org.apache.spark.sql.functions$.min(functions.scala:407)
> com.mineset.spark.vizagg.vizbin.strategies.DateBinStrategy.getDateExtent(DateBinStrategy.scala:158)
> {code}
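As a side note on the escaping direction discussed above: the usual convention for embedding the quote character inside a delimited identifier is to double it, just as the SQL standard does for {{"}}. A minimal helper following that convention for backquoted identifiers — hedged, since whether Spark's parser accepted this at the time is exactly what the ticket questions:

```python
def quote_identifier(name: str) -> str:
    """Backquote an identifier, doubling any embedded backtick."""
    return '`' + name.replace('`', '``') + '`'

print(quote_identifier('Invoice`Date'))  # -> `Invoice``Date`
```

A caller building a SQL string would then write the quoted form instead of passing the raw name through.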
[jira] [Resolved] (SPARK-18516) Separate instantaneous state from progress performance statistics
[ https://issues.apache.org/jira/browse/SPARK-18516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18516. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15954 [https://github.com/apache/spark/pull/15954] > Separate instantaneous state from progress performance statistics > - > > Key: SPARK-18516 > URL: https://issues.apache.org/jira/browse/SPARK-18516 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 2.1.0 > > > There are two types of information that you want to be able to extract from a > running query: instantaneous _status_ and metrics about performance as you > make _progress_ in query processing. > Today, these are conflated in a single {{StreamingQueryStatus}} object. The > downside to this approach is that a user now needs to reason about what state > the query is in anytime they retrieve a status object. Fields like > {{statusMessage}} don't appear in updates that come from the listener bus. > Similarly, {{inputRate}}/{{processingRate}} statistics are usually {{0}} when > you retrieve a status object from the query itself. > I propose we make the following changes: > - Make {{status}} only report instantaneous things, such as if data is > available or a human readable message about what phase we are currently in. > - Have a separate {{progress}} message that we report for each trigger with > the other performance information that lives in status today. You should be > able to easily retrieve a configurable number of the most recent progress > messages instead of just the most recent. > While we are making these changes, I propose that we also change {{id}} to be > a globally unique identifier, rather than a JVM unique one. Without this, it's > hard to correlate performance across restarts. 
[jira] [Commented] (SPARK-18553) Executor loss may cause TaskSetManager to be leaked
[ https://issues.apache.org/jira/browse/SPARK-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707105#comment-15707105 ] Apache Spark commented on SPARK-18553: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/16070 > Executor loss may cause TaskSetManager to be leaked > --- > > Key: SPARK-18553 > URL: https://issues.apache.org/jira/browse/SPARK-18553 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 2.0.3, 2.1.0, 2.2.0 > > > Due to a bug in TaskSchedulerImpl, the complete sudden loss of an executor > may cause a TaskSetManager to be leaked, causing ShuffleDependencies and > other data structures to be kept alive indefinitely, leading to various types > of resource leaks (including shuffle file leaks). > In a nutshell, the problem is that TaskSchedulerImpl did not maintain its own > mapping from executorId to running task ids, leaving it unable to clean up > taskId to taskSetManager maps when an executor is totally lost.
[jira] [Commented] (SPARK-18638) Upgrade sbt to 0.13.13
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15707063#comment-15707063 ] Apache Spark commented on SPARK-18638: -- User 'weiqingy' has created a pull request for this issue: https://github.com/apache/spark/pull/16069 > Upgrade sbt to 0.13.13 > -- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Minor > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and > upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes.
[jira] [Assigned] (SPARK-18638) Upgrade sbt to 0.13.13
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18638: Assignee: (was: Apache Spark) > Upgrade sbt to 0.13.13 > -- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Priority: Minor > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and > upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes.
[jira] [Assigned] (SPARK-18638) Upgrade sbt to 0.13.13
[ https://issues.apache.org/jira/browse/SPARK-18638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18638: Assignee: Apache Spark > Upgrade sbt to 0.13.13 > -- > > Key: SPARK-18638 > URL: https://issues.apache.org/jira/browse/SPARK-18638 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Weiqing Yang >Assignee: Apache Spark >Priority: Minor > > v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and > upgrade it from 0.13.11 to 0.13.13. The release notes since the last version > we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and > https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some > regression fixes.
[jira] [Updated] (SPARK-18553) Executor loss may cause TaskSetManager to be leaked
[ https://issues.apache.org/jira/browse/SPARK-18553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-18553: --- Fix Version/s: 2.2.0 2.1.0 > Executor loss may cause TaskSetManager to be leaked > --- > > Key: SPARK-18553 > URL: https://issues.apache.org/jira/browse/SPARK-18553 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 2.0.3, 2.1.0, 2.2.0 > > > Due to a bug in TaskSchedulerImpl, the complete sudden loss of an executor > may cause a TaskSetManager to be leaked, causing ShuffleDependencies and > other data structures to be kept alive indefinitely, leading to various types > of resource leaks (including shuffle file leaks). > In a nutshell, the problem is that TaskSchedulerImpl did not maintain its own > mapping from executorId to running task ids, leaving it unable to clean up > taskId to taskSetManager maps when an executor is totally lost.
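The bookkeeping gap behind SPARK-18553 can be sketched like this — a hedged, simplified model rather than Spark's actual scheduler code: with a per-executor set of running task ids, losing an executor lets the scheduler purge every related entry from its taskId-to-TaskSetManager map.

```scala
import scala.collection.mutable

class SchedulerBookkeepingSketch {
  // AnyRef stands in for TaskSetManager in this sketch.
  private val taskIdToTaskSetManager = mutable.HashMap.empty[Long, AnyRef]
  private val executorIdToRunningTaskIds =
    mutable.HashMap.empty[String, mutable.HashSet[Long]]

  def taskStarted(executorId: String, taskId: Long, tsm: AnyRef): Unit = synchronized {
    taskIdToTaskSetManager(taskId) = tsm
    executorIdToRunningTaskIds.getOrElseUpdate(executorId, mutable.HashSet.empty) += taskId
  }

  // Without the executor-to-tasks map, the taskIdToTaskSetManager entries for
  // a suddenly lost executor would be unreachable and leak indefinitely.
  def executorLost(executorId: String): Unit = synchronized {
    executorIdToRunningTaskIds.remove(executorId).foreach { taskIds =>
      taskIds.foreach(taskIdToTaskSetManager.remove)
    }
  }
}
```

Dropping the whole per-executor set in one step keeps the two maps consistent even when no task-end events ever arrive for the lost executor.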
[jira] [Created] (SPARK-18638) Upgrade sbt to 0.13.13
Weiqing Yang created SPARK-18638: Summary: Upgrade sbt to 0.13.13 Key: SPARK-18638 URL: https://issues.apache.org/jira/browse/SPARK-18638 Project: Spark Issue Type: Improvement Components: Build Reporter: Weiqing Yang Priority: Minor v2.1.0-rc1 has been out. For 2.2.x, it is better to keep sbt up-to-date, and upgrade it from 0.13.11 to 0.13.13. The release notes since the last version we used are: https://github.com/sbt/sbt/releases/tag/v0.13.12 and https://github.com/sbt/sbt/releases/tag/v0.13.13. Both releases include some regression fixes.
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961 ] Zhan Zhang commented on SPARK-18637: [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ @Public @Evolving @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) @Inherited public @interface UDFType { /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; /** * A UDF is considered distinctLike if the UDF can be evaluated on just the * distinct values of a column. Examples include min and max UDFs. This * information is used by metadata-only optimizer. 
* * @return true if UDF is distinctLike */ boolean distinctLike() default false; /** * Using in analytical functions to specify that UDF implies an ordering * * @return true if the function implies order */ boolean impliesOrder() default false; } > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
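The hazard described by the {{stateful}} javadoc above can be sketched without Hive at all. Below is a minimal, hypothetical row_sequence-style UDF in plain Java (the class and method names are illustrative, not Hive's API): because every call advances internal state, an optimizer that assumes determinism and deduplicates or short-circuits calls changes the result, which is why stateful must imply non-deterministic.

```java
// Hypothetical row_sequence-style stateful UDF (illustrative names, not Hive's API).
public class RowSequence {
    private long seq = 0L;

    // Each invocation advances internal state, so calls are NOT interchangeable.
    public long evaluate() {
        return ++seq;
    }

    public static void main(String[] args) {
        RowSequence udf = new RowSequence();
        long first = udf.evaluate();
        long second = udf.evaluate();
        // A deterministic-assuming optimizer could reuse `first` for the
        // second call, yielding (1, 1) instead of (1, 2).
        System.out.println(first + "," + second); // prints "1,2"
    }
}
```

The same reasoning explains why AND/OR short-circuiting is unsafe for such UDFs: skipping an invocation desynchronizes the sequence for every later row.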
[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706961#comment-15706961 ] Zhan Zhang edited comment on SPARK-18637 at 11/29/16 11:52 PM: --- [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ was (Author: zhzhan): [~hvanhovell] It is an annotation. /** * UDFType annotations are used to describe properties of a UDF. This gives * important information to the optimizer. * If the UDF is not deterministic, or if it is stateful, it is necessary to * annotate it as such for correctness. * */ @Public @Evolving @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) @Inherited public @interface UDFType { /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; /** * A UDF is considered distinctLike if the UDF can be evaluated on just the * distinct values of a column. Examples include min and max UDFs. This * information is used by metadata-only optimizer. 
* * @return true if UDF is distinctLike */ boolean distinctLike() default false; /** * Using in analytical functions to specify that UDF implies an ordering * * @return true if the function implies order */ boolean impliesOrder() default false; } > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18637: Assignee: Apache Spark > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang >Assignee: Apache Spark > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-18637: --- Component/s: SQL > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706935#comment-15706935 ] Apache Spark commented on SPARK-18637: -- User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/16068 > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18637: Assignee: (was: Apache Spark) > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706928#comment-15706928 ] Herman van Hovell edited comment on SPARK-18637 at 11/29/16 11:35 PM: -- {{UDFType}} is a Hive construct right? was (Author: hvanhovell): {{UDFType}} is a Hive contruct right? > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706928#comment-15706928 ] Herman van Hovell commented on SPARK-18637: --- {{UDFType}} is a Hive construct right? > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18614) Incorrect predicate pushdown from ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-18614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18614. --- Resolution: Fixed Assignee: Nattavut Sutyanyong Fix Version/s: 2.1.0 > Incorrect predicate pushdown from ExistenceJoin > --- > > Key: SPARK-18614 > URL: https://issues.apache.org/jira/browse/SPARK-18614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Nattavut Sutyanyong >Priority: Minor > Fix For: 2.1.0 > > > This is a follow-up work from SPARK-18597 to close a potential incorrect > rewrite in {{PushPredicateThroughJoin}} rule of the Optimizer phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18637) Stateful UDF should be considered as nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706905#comment-15706905 ] Zhan Zhang commented on SPARK-18637: Here are the comments from UDFType /** * If a UDF stores state based on the sequence of records it has processed, it * is stateful. A stateful UDF cannot be used in certain expressions such as * case statement and certain optimizations such as AND/OR short circuiting * don't apply for such UDFs, as they need to be invoked for each record. * row_sequence is an example of stateful UDF. A stateful UDF is considered to * be non-deterministic, irrespective of what deterministic() returns. * * @return true */ boolean stateful() default false; > Stateful UDF should be considered as nondeterministic > - > > Key: SPARK-18637 > URL: https://issues.apache.org/jira/browse/SPARK-18637 > Project: Spark > Issue Type: Bug >Reporter: Zhan Zhang > > If the annotation UDFType of a UDF is stateful, it should be considered as > non-deterministic. Otherwise, Catalyst may optimize the plan and return > the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18637) Stateful UDF should be considered as nondeterministic
Zhan Zhang created SPARK-18637: -- Summary: Stateful UDF should be considered as nondeterministic Key: SPARK-18637 URL: https://issues.apache.org/jira/browse/SPARK-18637 Project: Spark Issue Type: Bug Reporter: Zhan Zhang If the annotation UDFType of a UDF is stateful, it should be considered as non-deterministic. Otherwise, Catalyst may optimize the plan and return the wrong result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18631) Avoid making data skew worse in ExchangeCoordinator
[ https://issues.apache.org/jira/browse/SPARK-18631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18631. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16065 [https://github.com/apache/spark/pull/16065] > Avoid making data skew worse in ExchangeCoordinator > --- > > Key: SPARK-18631 > URL: https://issues.apache.org/jira/browse/SPARK-18631 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0 >Reporter: Mark Hamstra >Assignee: Mark Hamstra > Fix For: 2.2.0 > > > The logic to resize partitions in the ExchangeCoordinator is to not start a > new partition until the targetPostShuffleInputSize is equalled or exceeded. > This can make data skew problems worse since a number of small partitions can > first be combined as long as the combined size remains smaller than the > targetPostShuffleInputSize, and then a large, data-skewed partition can be > further combined, making it even bigger than it already was. > It's fairly simple to change the logic to create a new partition if adding > a new piece would exceed the targetPostShuffleInputSize instead of only > creating a new partition after the targetPostShuffleInputSize has already > been exceeded. This results in a few more partitions being created by the > ExchangeCoordinator, but data skew problems are at least not made worse even > though they are not made any better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
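The coalescing change described above can be sketched as follows. This is a simplified model, not Spark's actual ExchangeCoordinator code: the only change is checking whether adding the next piece would exceed the target before appending it, rather than closing a partition only after the target has already been exceeded.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the fixed coalescing rule (not Spark's actual code):
// start a new post-shuffle partition when adding the next piece WOULD exceed
// the target, instead of only after the target has already been exceeded.
public class Coalescer {
    public static List<Long> coalesce(long[] sizes, long target) {
        List<Long> merged = new ArrayList<>();
        long current = 0L;
        for (long s : sizes) {
            // New rule: check before appending, so a skewed piece is never
            // stacked onto an already-sizable partition.
            if (current > 0 && current + s > target) {
                merged.add(current);
                current = 0L;
            }
            current += s;
        }
        if (current > 0) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Target 64, pieces [10, 10, 100]: the two small pieces merge to 20 and
        // the skewed 100 gets its own partition. The old rule would have kept
        // appending until 64 was reached, producing a single 120-size partition.
        System.out.println(coalesce(new long[]{10, 10, 100}, 64)); // prints "[20, 100]"
    }
}
```

As the issue notes, this yields a few more partitions but never amplifies an already-skewed one.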
[jira] [Created] (SPARK-18636) UnsafeShuffleWriter and DiskBlockObjectWriter do not consider encryption / compression in metrics
Marcelo Vanzin created SPARK-18636: -- Summary: UnsafeShuffleWriter and DiskBlockObjectWriter do not consider encryption / compression in metrics Key: SPARK-18636 URL: https://issues.apache.org/jira/browse/SPARK-18636 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Minor The code in {{UnsafeShuffleWriter}} and {{DiskBlockObjectWriter}} only wraps the file output stream when collecting metrics, so it does not count the time it takes to compress and / or encrypt the data. This makes the metrics a little less accurate than they should be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
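The layering problem behind SPARK-18636 is easy to reproduce with stock JDK streams. In this hedged sketch (the class names are mine, not Spark's), a metrics wrapper sits below the compression layer, so it observes only post-compression bytes and, by the same token, only post-compression write time:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class MetricsDemo {
    // A metrics wrapper only sees what reaches the layer it wraps.
    static class CountingOutputStream extends FilterOutputStream {
        long bytes = 0;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            bytes++;
            out.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            bytes += len;
            out.write(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000]; // highly compressible zeros
        CountingOutputStream counting =
            new CountingOutputStream(new ByteArrayOutputStream());
        // Compression stacked ON TOP of the metrics wrapper, mirroring a metrics
        // layer that wraps only the raw file stream.
        try (GZIPOutputStream gzip = new GZIPOutputStream(counting)) {
            gzip.write(data);
        }
        // The wrapper saw far fewer bytes than were written, and none of the
        // CPU time spent compressing, so metrics taken at this layer understate
        // the true cost of the write path.
        System.out.println(counting.bytes < data.length); // prints "true"
    }
}
```

Wrapping the stream above the compression/encryption layers instead would capture the full cost.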
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706814#comment-15706814 ] Cody Koeninger commented on SPARK-18475: Glad you agree it shouldn't be enabled by default. If you're in an organization where you are responsible for shit that other people broke, but have no power to actually fix it correctly... I'm not sure there's anything useful I can say there. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706788#comment-15706788 ] Burak Yavuz commented on SPARK-18475: - I'd be happy to share performance results. You're right, I never tried it with SSL on. One thing to note is that I was never planning to have this enabled by default, because there is no way to think of a sane default parallelism value. What I was hoping to achieve was provide Spark users, who may not be Kafka experts, a "Break in case of emergency" way out. It's easy to say "Partition your data properly" to people, until someone upstream in your organization changes one thing and the data engineer has to deal with the mess of skewed data. You may want to tell people, "hey increase your Kafka partitions" if you want to increase Kafka parallelism, but is that a viable operation when your queues are already messed up, and the damage has already been done? Are you going to have them empty the queue, delete the topic, create a topic with increased number of partitions and re-consume everything so that it is properly partitioned again? It's easy to talk about what needs to be done, and what is the proper way to do things until shit hits the fan in production with something that is/was totally out of your control and you have to clean up the mess. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. 
have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16551) Accumulator Examples should demonstrate different use case from UDAFs
[ https://issues.apache.org/jira/browse/SPARK-16551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706764#comment-15706764 ] Ruiming Zhou commented on SPARK-16551: -- I can look at this issue. > Accumulator Examples should demonstrate different use case from UDAFs > - > > Key: SPARK-16551 > URL: https://issues.apache.org/jira/browse/SPARK-16551 > Project: Spark > Issue Type: Documentation >Reporter: Vladimir Feinberg >Priority: Minor > > Currently, the Spark programming guide demonstrates Accumulators > (http://spark.apache.org/docs/latest/programming-guide.html#accumulators) by > taking the sum of an RDD. > This example makes new users think that Accumulators serve the role that > UDAFs do, which they don't. They're meant to be out-of-band, small values > that don't break pipe-lining. Documentation examples and notes should reflect > this (and warn that they may cause driver bottlenecks). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
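A documentation example along the lines SPARK-16551 suggests might look like the following plain-Java sketch (this is not Spark's Accumulator API; the names are illustrative): the pipeline itself computes the aggregate, while the accumulator-like counter carries a small out-of-band statistic that does not break pipelining.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of the accumulator use case (plain Java, not Spark's API):
// the aggregate is the pipeline's own result; the "accumulator" carries a
// small out-of-band side statistic such as a malformed-record count.
public class AccumulatorDemo {
    // The "accumulator": updated as a side effect, never the main result.
    static final AtomicLong malformed = new AtomicLong();

    static long sumValid(List<String> records) {
        return records.stream()
            .mapToLong(r -> {
                try {
                    return Long.parseLong(r);
                } catch (NumberFormatException e) {
                    malformed.incrementAndGet(); // out-of-band bookkeeping
                    return 0L;
                }
            })
            .sum(); // the real aggregate, which is what a UDAF-style computation produces
    }

    public static void main(String[] args) {
        long sum = sumValid(List.of("1", "2", "oops", "4"));
        System.out.println("sum=" + sum + ", malformed=" + malformed.get());
    }
}
```

The counter stays tiny regardless of input size; using it to carry the aggregate itself is the misuse the issue warns about, since all values funnel back to one place (in Spark, the driver).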
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706758#comment-15706758 ] Cody Koeninger commented on SPARK-18475: Burak hasn't empirically shown that it is of benefit for a properly partitioned, non-skewed kafka topic, especially if SSL is enabled (because of the effect on consumer caching). Any output operation can tell the difference in ordering. People are welcome to convince you that this is a worthwhile option, but there is no way it should be on by default. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-18475: -- > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706692#comment-15706692 ] Michael Armbrust commented on SPARK-18475: -- I think that this suggestion was closed prematurely. While I don't think that we want to always perform this optimization, I think that for a large subset of the {{DataFrame}} operations that we support this is valid. Furthermore, Burak has already shown empirically that it significantly increases throughput, and I don't think that should be dismissed. Spark users are not always the same people who are configuring Kafka, and I don't see a reason to tie their hands. To unpack some of the specific concerns: - *Violation of Kafka's Ordering* - The proposal doesn't change the order of data presented by an iterator. It just subdivides further than the existing batching mechanism and parallelizes. For an operation like {{mapPartitions}}, running two correctly ordered partitions in parallel is indistinguishable from running them serially at batch boundaries. That is, unless your computation is non-deterministic as a result of communication with an external store. Here, it should be noted that non-deterministic computation violates our recovery semantics, and should be avoided anyway. That said, there certainly are cases where people may choose to give up correctness during recovery and that is why I agree this optimization should be optional. Perhaps even off by default. - *Partitions are the answer* - Sufficient partitions are helpful, but this optimization would allow you to increase throughput through the use of replicas as well. And again, Spark users are not always Kafka administrators. Now, there is an operation, {{mapWithState}}, where this optimization could change the result. I do think we will want to support this operation eventually (maybe in 2.2). 
I haven't really figured out the specifics, but I would imagine we can use existing mechanisms in the query planner, such as {{requiredChildOrdering}} or {{requiredChildDistribution}} to make sure that we only turn this on when it can't change the answer. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull
[ https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706606#comment-15706606 ] Apache Spark commented on SPARK-17897: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16067 > not isnotnull is converted to the always false condition isnotnull && not > isnotnull > --- > > Key: SPARK-17897 > URL: https://issues.apache.org/jira/browse/SPARK-17897 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.0, 2.0.1 >Reporter: Jordan Halterman > Labels: correctness > > When a logical plan is built containing the following somewhat nonsensical > filter: > {{Filter (NOT isnotnull($f0#212))}} > During optimization the filter is converted into a condition that will always > fail: > {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}} > This appears to be caused by the following check for {{NullIntolerant}}: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63 > Which recurses through the expression and extracts nested {{IsNotNull}} > calls, converting them to {{IsNotNull}} calls on the attribute at the root > level: > https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49 > This results in the nonsensical condition above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
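The effect of the faulty rewrite in SPARK-17897 can be seen by evaluating both conditions directly. A minimal sketch in plain Java (not Catalyst code; the method names are mine):

```java
// Minimal model of the SPARK-17897 rewrite (plain Java, not Catalyst code).
public class ConstraintDemo {
    // Original filter: NOT isnotnull(x) -- keeps exactly the rows where x IS null.
    static boolean original(Object x) {
        return !(x != null);
    }

    // After the faulty constraint inference: isnotnull(x) AND NOT isnotnull(x),
    // a contradiction that is false for every input.
    static boolean augmented(Object x) {
        return (x != null) && !(x != null);
    }

    public static void main(String[] args) {
        System.out.println(original(null));  // prints "true": null rows pass the original filter
        System.out.println(augmented(null)); // prints "false": the rewritten filter drops them
    }
}
```

The bug arises because constraint propagation infers {{IsNotNull}} for attributes referenced by null-intolerant expressions, without noticing that here the condition itself negates {{IsNotNull}}.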
[jira] [Updated] (SPARK-18635) Partition name/values not escaped correctly in some cases
[ https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18635: --- Target Version/s: 2.1.0 Priority: Critical (was: Major) > Partition name/values not escaped correctly in some cases > - > > Key: SPARK-18635 > URL: https://issues.apache.org/jira/browse/SPARK-18635 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Priority: Critical > > For example, the following command does not insert data properly into the > table > {code} > spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as > B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18635) Partition name/values not escaped correctly in some cases
Eric Liang created SPARK-18635: -- Summary: Partition name/values not escaped correctly in some cases Key: SPARK-18635 URL: https://issues.apache.org/jira/browse/SPARK-18635 Project: Spark Issue Type: Sub-task Reporter: Eric Liang For example, the following command does not insert data properly into the table {code} spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
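A fix for SPARK-18635 presumably needs Hive-style percent-escaping of special characters in partition directory names, so that a value like {{'A$\=%'}} cannot be misread as extra key/value separators. A hedged sketch of such escaping follows; the exact character set and the method name are assumptions for illustration, not Spark's actual implementation:

```java
// Hedged sketch of Hive-style partition-path escaping. The character set and
// method name are assumptions, not Spark's actual implementation.
public class PartitionPathEscaper {
    // Characters assumed unsafe in a partition directory name.
    private static final String SPECIALS = "\"#%'*/:=?\\{[]^";

    public static String escapePathName(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c < ' ' || SPECIALS.indexOf(c) >= 0) {
                // Percent-encode as %XX, e.g. '=' -> %3D, '%' -> %25.
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The value from the bug report: '=' and '%' must be escaped or the
        // resulting directory name would parse as extra key/value separators.
        System.out.println(escapePathName("A$\\=%")); // prints "A$%5C%3D%25"
    }
}
```

Decoding on read must mirror this exactly; the bug report suggests one of the two directions is inconsistent for these characters.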
[jira] [Updated] (SPARK-18545) Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
[ https://issues.apache.org/jira/browse/SPARK-18545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18545: --- Issue Type: Sub-task (was: Test) Parent: SPARK-17861 > Verify number of hive client RPCs in PartitionedTablePerfStatsSuite > --- > > Key: SPARK-18545 > URL: https://issues.apache.org/jira/browse/SPARK-18545 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > To avoid performance regressions like > https://issues.apache.org/jira/browse/SPARK-18507 in the future, we should > add a metric for the number of Hive client RPC issued and check it in the > perf stats suite. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18507) Major performance regression in SHOW PARTITIONS on partitioned Hive tables
[ https://issues.apache.org/jira/browse/SPARK-18507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18507: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > Major performance regression in SHOW PARTITIONS on partitioned Hive tables > -- > > Key: SPARK-18507 > URL: https://issues.apache.org/jira/browse/SPARK-18507 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Allman >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.1.0 > > > Commit {{ccb11543048dccd4cc590a8db1df1d9d5847d112}} > (https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112) > appears to have introduced a major regression in the performance of the Hive > {{SHOW PARTITIONS}} command. Running that command on a Hive table with 17,337 > partitions in the {{spark-sql}} shell with the parent commit of {{ccb1154}} > takes approximately 7.3 seconds. Running the same command with commit > {{ccb1154}} takes approximately 250 seconds. > I have not had the opportunity to complete a thorough investigation, but I > suspect the problem lies in the diff hunk beginning at > https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112#diff-159191585e10542f013cb3a714f26075L675. > If that's the case, this performance issue should manifest itself in other > areas as this programming pattern was used elsewhere in this commit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18429) SQL aggregate function for CountMinSketch
[ https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-18429:
--------------------------------
    Issue Type: Sub-task  (was: New Feature)
        Parent: SPARK-16026

> SQL aggregate function for CountMinSketch
> -----------------------------------------
>
>                 Key: SPARK-18429
>                 URL: https://issues.apache.org/jira/browse/SPARK-18429
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Zhenhua Wang
>
> Implement a new Aggregate to generate a count-min sketch, which is a wrapper
> of CountMinSketch.
[jira] [Resolved] (SPARK-18429) SQL aggregate function for CountMinSketch
[ https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-18429.
---------------------------------
       Resolution: Fixed
         Assignee: Zhenhua Wang
    Fix Version/s: 2.2.0

> SQL aggregate function for CountMinSketch
> -----------------------------------------
>
>                 Key: SPARK-18429
>                 URL: https://issues.apache.org/jira/browse/SPARK-18429
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Zhenhua Wang
>            Assignee: Zhenhua Wang
>             Fix For: 2.2.0
>
> Implement a new Aggregate to generate a count-min sketch, which is a wrapper
> of CountMinSketch.
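[Editor's note] The structure the new aggregate wraps can be illustrated with a minimal pure-Python count-min sketch; this is a sketch of the algorithm itself, not Spark's CountMinSketch API. A depth x width counter matrix: each update increments one cell per row, and an estimate takes the minimum over the rows, so counts can be over-estimated by hash collisions but never under-estimated.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth x width counters. Estimates may
    over-count due to collisions, but never under-count."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # One deterministic hash function per row, derived from sha256.
        h = hashlib.sha256(("%d:%s" % (row, item)).encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least-collided counter.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "a", "b", "a", "c"]:
    cms.add(word)

assert cms.estimate("a") >= 3   # exact or an over-count, never less
assert cms.estimate("b") >= 1
```

The appeal as a SQL aggregate is that the per-row updates are associative, so partial sketches built on different partitions can be merged by element-wise addition of the counter matrices.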
[jira] [Updated] (SPARK-18429) SQL aggregate function for CountMinSketch
[ https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-18429:
--------------------------------
    Summary: SQL aggregate function for CountMinSketch  (was: implement a new Aggregate for CountMinSketch)

> SQL aggregate function for CountMinSketch
> -----------------------------------------
>
>                 Key: SPARK-18429
>                 URL: https://issues.apache.org/jira/browse/SPARK-18429
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Zhenhua Wang
>
> Implement a new Aggregate to generate a count-min sketch, which is a wrapper
> of CountMinSketch.
[jira] [Updated] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes
[ https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-18632:
--------------------------------
    Target Version/s: 2.2.0

> AggregateFunction should not ImplicitCastInputTypes
> ---------------------------------------------------
>
>                 Key: SPARK-18632
>                 URL: https://issues.apache.org/jira/browse/SPARK-18632
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Herman van Hovell
>            Assignee: Herman van Hovell
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which
> enables implicit input type casting). This can lead to unexpected results,
> and should only be enabled when it is suitable for the function at hand.
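[Editor's note] The "unexpected results" concern can be pictured with a plain-Python analogy (this does not reproduce Spark's actual type-coercion rules): an aggregate that declares an integer input type and implicitly casts every input to it silently changes the answer, whereas a strict aggregate surfaces the type mismatch as an error the user can fix with an explicit cast.

```python
def sum_with_implicit_int_cast(values):
    # Implicit cast: every input is silently coerced to the declared
    # integer input type before aggregating.
    return sum(int(v) for v in values)

def sum_strict(values):
    # No implicit cast: a wrongly typed input is an error the user sees.
    if not all(isinstance(v, int) for v in values):
        raise TypeError("expected integer inputs; add an explicit cast")
    return sum(values)

doubles = [0.5, 0.5, 0.5]

# Implicit casting quietly truncates each 0.5 to 0: the user who expected
# 1.5 gets 0 with no warning.
assert sum_with_implicit_int_cast(doubles) == 0

# Strict typing surfaces the mismatch instead of guessing:
try:
    sum_strict(doubles)
    raised = False
except TypeError:
    raised = True
assert raised
```

This is why the ticket argues the behavior should be opt-in per function rather than inherited by every AggregateFunction.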
[jira] [Assigned] (SPARK-18632) AggregateFunction should not ImplicitCastInputTypes
[ https://issues.apache.org/jira/browse/SPARK-18632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18632:
------------------------------------
    Assignee: Herman van Hovell  (was: Apache Spark)

> AggregateFunction should not ImplicitCastInputTypes
> ---------------------------------------------------
>
>                 Key: SPARK-18632
>                 URL: https://issues.apache.org/jira/browse/SPARK-18632
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Herman van Hovell
>            Assignee: Herman van Hovell
>
> {{AggregateFunction}} currently implements {{ImplicitCastInputTypes}} (which
> enables implicit input type casting). This can lead to unexpected results,
> and should only be enabled when it is suitable for the function at hand.