[jira] [Created] (SPARK-23761) Dataframe filter(udf) followed by groupby in pyspark throws a casting error
Dhaniram Kshirsagar created SPARK-23761: --- Summary: Dataframe filter(udf) followed by groupby in pyspark throws a casting error Key: SPARK-23761 URL: https://issues.apache.org/jira/browse/SPARK-23761 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.6.0 Environment: pyspark 1.6.0 Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37) [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2 CentOS 6.7 Reporter: Dhaniram Kshirsagar On pyspark with a dataframe, we get the following exception when a filter (with a UDF) is followed by groupBy: # Snippet of error observed in pyspark {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o56.filter. : java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate{code} This looks similar to https://issues.apache.org/jira/browse/SPARK-12981, however it is not clear whether it is the same issue. Here is a gist with pyspark steps to reproduce this issue: [https://gist.github.com/dhaniram-kshirsagar/d72545620b6a05d145a1a6bece797b6d] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
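For readers who cannot follow the gist, here is a minimal sketch of the filter-then-groupBy pattern being reported. The column names, sample data, and UDF are illustrative assumptions; the linked gist remains the authoritative reproduction.

{code:python}
# Hypothetical reproduction sketch of the reported pattern on the
# Spark 1.6-era DataFrame API; "key"/"value" are made-up column names.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

sc = SparkContext(appName="spark-23761-repro")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(1, 10), (1, 20), (2, 30)], ["key", "value"])
is_big = udf(lambda v: v > 15, BooleanType())

# filter(UDF) followed by groupBy -- the combination reported to raise
# ClassCastException: Project cannot be cast to Aggregate.
df.filter(is_big(col("value"))).groupBy("key").count().show()
{code}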
[jira] [Resolved] (SPARK-23666) Undeterministic column name with UDFs
[ https://issues.apache.org/jira/browse/SPARK-23666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23666. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Undeterministic column name with UDFs > - > > Key: SPARK-23666 > URL: https://issues.apache.org/jira/browse/SPARK-23666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Daniel Darabos >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.4.0 > > > When you access structure fields in Spark SQL, the auto-generated result > column name includes an internal ID. > {code:java} > scala> import spark.implicits._ > scala> Seq(((1, 2), 3)).toDF("a", "b").createOrReplaceTempView("x") > scala> spark.udf.register("f", (a: Int) => a) > scala> spark.sql("select f(a._1) from x").show > +-+ > |UDF:f(a._1 AS _1#148)| > +-+ > |1| > +-+ > {code} > This ID ({{#148}}) is only included for UDFs. > {code:java} > scala> spark.sql("select factorial(a._1) from x").show > +---+ > |factorial(a._1 AS `_1`)| > +---+ > | 1| > +---+ > {code} > The internal ID is different on every invocation. The problem this causes for > us is that the schema of the SQL output is never the same: > {code:java} > scala> spark.sql("select f(a._1) from x").schema == >spark.sql("select f(a._1) from x").schema > Boolean = false > {code} > We rely on similar schema checks when reloading persisted data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
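Until the 2.4.0 fix is picked up, one way to sidestep the unstable auto-generated name is to alias the UDF expression explicitly: an alias pins the output column name, so the schema compares equal across invocations. A PySpark rendering of that workaround, assuming an active SparkSession named spark (the Scala repro above would alias the same way):

{code:python}
from pyspark.sql.types import IntegerType

# Same setup as the Scala example: a struct column "a" and a registered UDF "f".
spark.createDataFrame([((1, 2), 3)], ["a", "b"]).createOrReplaceTempView("x")
spark.udf.register("f", lambda a: a, IntegerType())

# With an explicit alias the output name no longer embeds an internal ID,
# so the schema is stable across invocations.
schema1 = spark.sql("select f(a._1) as f_a1 from x").schema
schema2 = spark.sql("select f(a._1) as f_a1 from x").schema
assert schema1 == schema2
{code}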
[jira] [Resolved] (SPARK-23234) ML python test failure due to default outputCol
[ https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23234. -- Resolution: Duplicate > ML python test failure due to default outputCol > --- > > Key: SPARK-23234 > URL: https://issues.apache.org/jira/browse/SPARK-23234 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Major > > SPARK-22799 and SPARK-22797 are causing valid Python test failures. The > reason is that Python is setting the default params with set(), so they are not > considered as defaults, but as params passed by the user. > This means that an outputCol is set not as a default but as a real value. > Anyway, this is a misbehavior of the Python API which can cause serious > problems, and I'd suggest rethinking the way this is done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
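The distinction at the heart of this report (a value that merely has a default versus one explicitly set by the user) can be inspected directly on any pyspark.ml stage. A small illustrative check, using Bucketizer only as a convenient example:

{code:python}
from pyspark.ml.feature import Bucketizer

b = Bucketizer()
b.hasDefault(b.outputCol)   # True  -- a default value exists
b.isSet(b.outputCol)        # False -- not explicitly set by the user
b.isDefined(b.outputCol)    # True  -- set OR has a default

b.setOutputCol("bucketed")
b.isSet(b.outputCol)        # True  -- now it counts as user-supplied
{code}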
[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers
[ https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407510#comment-16407510 ] Bryan Cutler commented on SPARK-23244: -- Just to clarify, the PySpark save/load is just a wrapper making the same calls in Java, so a fix there will also fix the root cause of the issue. > Incorrect handling of default values when deserializing python wrappers of > scala transformers > - > > Key: SPARK-23244 > URL: https://issues.apache.org/jira/browse/SPARK-23244 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Tomas Nykodym >Priority: Minor > > Default values are not handled properly when serializing/deserializing Python > transformers which are wrappers around Scala objects. It looks like, > after deserialization, the default values which were based on the uid do not get > properly restored, and values which were not set are set to their (original) > default values. > Here's a simple code example using Bucketizer: > {code:python} > >>> from pyspark.ml.feature import Bucketizer > >>> a = Bucketizer() > >>> a.save("bucketizer0") > >>> b = Bucketizer.load("bucketizer0") > >>> a._defaultParamMap[a.outputCol] > u'Bucketizer_440bb49206c148989db7__output' > >>> b._defaultParamMap[b.outputCol] > u'Bucketizer_41cf9afbc559ca2bfc9a__output' > >>> a.isSet(a.outputCol) > False > >>> b.isSet(b.outputCol) > True > >>> a.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > >>> b.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers
[ https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23244. -- Resolution: Duplicate > Incorrect handling of default values when deserializing python wrappers of > scala transformers > - > > Key: SPARK-23244 > URL: https://issues.apache.org/jira/browse/SPARK-23244 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Tomas Nykodym >Priority: Minor > > Default values are not handled properly when serializing/deserializing Python > transformers which are wrappers around Scala objects. It looks like, > after deserialization, the default values which were based on the uid do not get > properly restored, and values which were not set are set to their (original) > default values. > Here's a simple code example using Bucketizer: > {code:python} > >>> from pyspark.ml.feature import Bucketizer > >>> a = Bucketizer() > >>> a.save("bucketizer0") > >>> b = Bucketizer.load("bucketizer0") > >>> a._defaultParamMap[a.outputCol] > u'Bucketizer_440bb49206c148989db7__output' > >>> b._defaultParamMap[b.outputCol] > u'Bucketizer_41cf9afbc559ca2bfc9a__output' > >>> a.isSet(a.outputCol) > False > >>> b.isSet(b.outputCol) > True > >>> a.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > >>> b.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers
[ https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407507#comment-16407507 ] Bryan Cutler commented on SPARK-23244: -- I looked into this and it is a little bit different because with save/load, params are only transferred from Java to Python. So the actual problem is in Scala: {code:java} scala> import org.apache.spark.ml.feature.Bucketizer import org.apache.spark.ml.feature.Bucketizer scala> val a = new Bucketizer() a: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18 scala> a.isSet(a.outputCol) res2: Boolean = false scala> a.save("bucketizer0") scala> val b = Bucketizer.load("bucketizer0") b: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18 scala> b.isSet(b.outputCol) res4: Boolean = true{code} It seems this is being worked on in SPARK-23455, so I'll still close this as a duplicate. > Incorrect handling of default values when deserializing python wrappers of > scala transformers > - > > Key: SPARK-23244 > URL: https://issues.apache.org/jira/browse/SPARK-23244 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Tomas Nykodym >Priority: Minor > > Default values are not handled properly when serializing/deserializing Python > transformers which are wrappers around Scala objects. It looks like, > after deserialization, the default values which were based on the uid do not get > properly restored, and values which were not set are set to their (original) > default values. > Here's a simple code example using Bucketizer: > {code:python} > >>> from pyspark.ml.feature import Bucketizer > >>> a = Bucketizer() > >>> a.save("bucketizer0") > >>> b = Bucketizer.load("bucketizer0") > >>> a._defaultParamMap[a.outputCol] > u'Bucketizer_440bb49206c148989db7__output' > >>> b._defaultParamMap[b.outputCol] > u'Bucketizer_41cf9afbc559ca2bfc9a__output' > >>> a.isSet(a.outputCol) > False > >>> b.isSet(b.outputCol) > True > >>> a.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > >>> b.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
[ https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407466#comment-16407466 ] Apache Spark commented on SPARK-23760: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/20870 > CodegenContext.withSubExprEliminationExprs should save/restore CSE state > correctly > -- > > Key: SPARK-23760 > URL: https://issues.apache.org/jira/browse/SPARK-23760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kris Mok >Priority: Major > > There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes > it effectively always clear the subexpression elimination state after it's > called. > The original intent of this function was that it should save the old state, > set the given new state as current and perform codegen (invoke > {{Expression.genCode()}}), and at the end restore the subexpression > elimination state back to the old state. This ticket tracks a fix to actually > implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
[ https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23760: Assignee: (was: Apache Spark) > CodegenContext.withSubExprEliminationExprs should save/restore CSE state > correctly > -- > > Key: SPARK-23760 > URL: https://issues.apache.org/jira/browse/SPARK-23760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kris Mok >Priority: Major > > There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes > it effectively always clear the subexpression elimination state after it's > called. > The original intent of this function was that it should save the old state, > set the given new state as current and perform codegen (invoke > {{Expression.genCode()}}), and at the end restore the subexpression > elimination state back to the old state. This ticket tracks a fix to actually > implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
[ https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23760: Assignee: Apache Spark > CodegenContext.withSubExprEliminationExprs should save/restore CSE state > correctly > -- > > Key: SPARK-23760 > URL: https://issues.apache.org/jira/browse/SPARK-23760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Major > > There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes > it effectively always clear the subexpression elimination state after it's > called. > The original intent of this function was that it should save the old state, > set the given new state as current and perform codegen (invoke > {{Expression.genCode()}}), and at the end restore the subexpression > elimination state back to the old state. This ticket tracks a fix to actually > implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
Kris Mok created SPARK-23760: Summary: CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly Key: SPARK-23760 URL: https://issues.apache.org/jira/browse/SPARK-23760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1, 2.2.0 Reporter: Kris Mok There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes it effectively always clear the subexpression elimination state after it's called. The original intent of this function was that it should save the old state, set the given new state as current and perform codegen (invoke {{Expression.genCode()}}), and at the end restore the subexpression elimination state back to the old state. This ticket tracks a fix to actually implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
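In other words, the helper should follow the usual save/set/restore idiom rather than clearing the state on exit. A sketch of that contract in Python pseudocode; the names below are illustrative only and are not the actual CodegenContext API, which lives in Scala:

{code:python}
# Illustrative pseudocode of the intended save/set/restore contract.
def with_sub_expr_elimination_exprs(ctx, new_exprs, do_gen_code):
    saved = ctx.sub_expr_elimination_exprs       # save the old state
    ctx.sub_expr_elimination_exprs = new_exprs   # install the new state
    try:
        return do_gen_code()                     # run Expression.genCode()
    finally:
        ctx.sub_expr_elimination_exprs = saved   # restore, don't clear
{code}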
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407451#comment-16407451 ] Teng Peng edited comment on SPARK-19208 at 3/21/18 4:44 AM: [~timhunter] Has the Jira ticket been opened? I believe the new API for statistical info would be a great improvement. was (Author: teng peng): [~timhunter] Has the Jira ticket been opened? I believe this would be a great improvement. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Major > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407451#comment-16407451 ] Teng Peng commented on SPARK-19208: --- [~timhunter] Has the Jira ticket been opened? I believe this would be a great improvement. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Major > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
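To make the overhead concrete: for {{MaxAbsScaler}} the whole fit reduces to a single column-wise max(|x|) pass. A hedged PySpark sketch of that one statistic, assuming a DataFrame {{df}} with a dense "features" vector column of modest dimensionality (the KDD2010 data above is sparse and ~30M-dimensional, so the real fix works on sparse vectors, but the reduction is the same idea):

{code:python}
import numpy as np

# Column-wise maximum absolute value -- the only statistic MaxAbsScaler needs.
max_abs = (df.select("features").rdd
             .map(lambda row: np.abs(row["features"].toArray()))
             .reduce(np.maximum))
{code}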
[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389150#comment-16389150 ] Franck Tago edited comment on SPARK-23519 at 3/21/18 3:29 AM: -- Any updates on this ? Could someone assist with this? was (Author: tafra...@gmail.com): Any updates on this ? > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
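Not a fix, but a possible workaround while the issue is open: move the renaming into the SELECT itself, so the query output no longer contains two columns named col1 before the duplicate-name assertion in generateViewProperties runs. An untested sketch (the column-name-list form of CREATE VIEW still fails as reported):

{code:python}
# Possible workaround sketch: alias the repeated column inside the SELECT.
spark.sql("""
  create view default.aview as
  select col1 as int1, col1 as int2 from atable
""")
{code}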
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Description: 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] create table atable (col1 int) ; insert into atable values (10 ) , (100) ; 2. create a view from the table. [These actions were performed from a spark shell ] spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ") java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name. at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) was: 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] create table atable (col1 int) ; insert into atable values (10 ) , (100) ; 2. create a view form the table. [ I did this from a spark shell ] spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ") java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name. at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit error
[ https://issues.apache.org/jira/browse/SPARK-23513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407355#comment-16407355 ] abel-sun commented on SPARK-23513: -- Can you provide more of the error message? [~Fray] > java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit > error > --- > > Key: SPARK-23513 > URL: https://issues.apache.org/jira/browse/SPARK-23513 > Project: Spark > Issue Type: Bug > Components: EC2, Examples, Input/Output, Java API >Affects Versions: 1.4.0, 2.2.0 >Reporter: Rawia >Priority: Blocker > > Hello > I'm trying to run a spark application (distributedWekaSpark) but when I'm > using the spark-submit command I get this error > {quote}{quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.io.IOException: Expected 12 fields, but got 5 for row: > outlook,temperature,humidity,windy,play > {quote}{quote} > I tried with other datasets but the same error always appeared (always 12 > fields expected) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20709) spark-shell use proxy-user failed
[ https://issues.apache.org/jira/browse/SPARK-20709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407334#comment-16407334 ] KaiXinXIaoLei commented on SPARK-20709: --- [~ffbin] [~srowen] i also meet this problem. Can u tell me how to solve this > spark-shell use proxy-user failed > - > > Key: SPARK-20709 > URL: https://issues.apache.org/jira/browse/SPARK-20709 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: fangfengbin >Priority: Major > > cmd is : spark-shell --master yarn-client --proxy-user leoB > Throw Exception: failedto find any Kerberos tgt > Log is: > 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, > sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of > successful kerberos logins and latency (milliseconds)]) > 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, > sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of > failed kerberos logins and latency (milliseconds)]) > 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, > sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[GetGroups]) > 17/05/11 15:56:21 DEBUG MetricsSystemImpl: UgiMetrics, User and group related > metrics > 17/05/11 15:56:22 DEBUG Shell: setsid exited with exit code 0 > 17/05/11 15:56:22 DEBUG Groups: Creating new Groups object > 17/05/11 15:56:22 DEBUG NativeCodeLoader: Trying to load the custom-built > native-hadoop library... > 17/05/11 15:56:22 DEBUG NativeCodeLoader: Loaded the native-hadoop library > 17/05/11 15:56:22 DEBUG JniBasedUnixGroupsMapping: Using > JniBasedUnixGroupsMapping for Group resolution > 17/05/11 15:56:22 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping > impl=org.apache.hadoop.security.JniBasedUnixGroupsMapping > 17/05/11 15:56:22 DEBUG Groups: Group mapping > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; > cacheTimeout=30; warningDeltaMs=5000 > 17/05/11 15:56:22 DEBUG UserGroupInformation: hadoop login > 17/05/11 15:56:22 DEBUG UserGroupInformation: hadoop login commit > 17/05/11 15:56:22 DEBUG UserGroupInformation: using kerberos > user:sp...@hadoop.com > 17/05/11 15:56:22 DEBUG UserGroupInformation: Using user: "sp...@hadoop.com" > with name sp...@hadoop.com > 17/05/11 15:56:22 DEBUG UserGroupInformation: User entry: "sp...@hadoop.com" > 17/05/11 15:56:22 DEBUG UserGroupInformation: Assuming keytab is managed > externally since logged in from subject. 
> 17/05/11 15:56:22 DEBUG UserGroupInformation: UGI loginUser:sp...@hadoop.com > (auth:KERBEROS) > 17/05/11 15:56:22 DEBUG UserGroupInformation: Current time is 1494489382449 > 17/05/11 15:56:22 DEBUG UserGroupInformation: Next refresh is 1494541210600 > 17/05/11 15:56:22 DEBUG UserGroupInformation: PrivilegedAction as:leoB > (auth:PROXY) via sp...@hadoop.com (auth:KERBEROS) > from:org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > 17/05/11 15:56:29 WARN SparkConf: In Spark 1.0 and later spark.local.dir will > be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS > in mesos/standalone and LOCAL_DIRS in YARN). > 17/05/11 15:56:56 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR > env not found! > 17/05/11 15:56:56
[jira] [Commented] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407302#comment-16407302 ] Weichen Xu commented on SPARK-23751: I will work on this. :) > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
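For context, a sketch of roughly what the wrapper could look like once implemented, modeled on the DataFrame-based Scala API (org.apache.spark.ml.stat.KolmogorovSmirnovTest); the exact pyspark names and signature are assumptions until the work is merged:

{code:python}
from pyspark.ml.stat import KolmogorovSmirnovTest

df = spark.createDataFrame([(0.1,), (0.15,), (-0.2,), (2.3,), (-1.1,)], ["sample"])

# Test the "sample" column against a standard normal distribution.
result = KolmogorovSmirnovTest.test(df, "sample", "norm", 0.0, 1.0).first()
print(result.pValue, result.statistic)
{code}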
[jira] [Updated] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23455: -- Shepherd: Joseph K. Bradley > Default Params in ML should be saved separately > --- > > Key: SPARK-23455 > URL: https://issues.apache.org/jira/browse/SPARK-23455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We save ML's user-supplied params and default params as one entity in JSON. > During loading the saved models, we set all the loaded params into created ML > model instances as user-supplied params. > It causes some problems, e.g., if we strictly disallow some params to be set > at the same time, a default param can fail the param check because it is > treated as user-supplied param after loading. > The loaded default params should not be set as user-supplied params. We > should save ML default params separately in JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407288#comment-16407288 ] Apache Spark commented on SPARK-23759: -- User 'felixalbani' has created a pull request for this issue: https://github.com/apache/spark/pull/20867 > Unable to bind Spark2 history server to specific host name / IP > --- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Priority: Major > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23759: Assignee: Apache Spark > Unable to bind Spark2 history server to specific host name / IP > --- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Assignee: Apache Spark >Priority: Major > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23759: Assignee: (was: Apache Spark) > Unable to bind Spark2 history server to specific host name / IP > --- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Priority: Major > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23749) Avoid Hive.get() to compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-23749: Description: {noformat} 18/03/15 22:34:46 WARN Hive: Failed to register all functions. org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:273) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$1.apply$mcV$sp(HiveDelegationTokenProvider.scala:95) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$1.apply(HiveDelegationTokenProvider.scala:94) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$1.apply(HiveDelegationTokenProvider.scala:94) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anon$1.run(HiveDelegationTokenProvider.scala:131) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) at org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130) at org.apache.spark.deploy.security.HiveDelegationTokenProvider.obtainDelegationTokens(HiveDelegationTokenProvider.scala:94) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:132) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:130) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:130) at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.obtainDelegationTokens(YARNHadoopDelegationTokenManager.scala:56) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:388) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:869) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) at org.apache.spark.SparkContext.(SparkContext.scala:501) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2489) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:304) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:157) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.Java
[jira] [Commented] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407248#comment-16407248 ] Apache Spark commented on SPARK-23750: -- User 'ioana-delaney' has created a pull request for this issue: https://github.com/apache/spark/pull/20868 > [Performance] Inner Join Elimination based on Informational RI constraints > -- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ioana Delaney >Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23750: Assignee: Apache Spark > [Performance] Inner Join Elimination based on Informational RI constraints > -- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ioana Delaney >Assignee: Apache Spark >Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23750: Assignee: (was: Apache Spark) > [Performance] Inner Join Elimination based on Informational RI constraints > -- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ioana Delaney >Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
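A note on verification: once such a rewrite is in place, the simplest way to confirm that the parent/PK table was dropped is to compare optimized plans. A small PySpark sketch, assuming the TPC-DS tables are registered as tables or temp views in the current session:

{code:python}
plan = spark.sql("""
  select c_customer_id, c_first_name, c_last_name, i_product_name, ss_quantity
  from store_sales, date_dim, customer, item, store
  where d_date_sk = ss_sold_date_sk
    and c_customer_sk = ss_customer_sk
    and i_item_sk = ss_item_sk
    and s_store_sk = ss_store_sk
    and d_year = 2002
""")
# With the rewrite, 'store' should no longer appear in the Optimized Logical Plan.
plan.explain(True)
{code}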
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407247#comment-16407247 ] Darek commented on SPARK-23534: --- https://github.com/Azure/azure-storage-java 7.0 will only work with org.apache.hadoop/hadoop-azure/3.0.0. I am wary of using an older version of azure-storage because of all the security issues that have been found and fixed in the newer version, not to mention all the new features that Azure has added in the last 2 years. Using old software with a public cloud = bad idea. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already stepped into, or soon will step into, Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Priority: Critical (was: Major) > MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. > -- > > Key: SPARK-20697 > URL: https://issues.apache.org/jira/browse/SPARK-20697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0 >Reporter: Abhishek Madav >Priority: Critical > > MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table > does not restore the bucketing information to the storage descriptor in the > metastore. > Steps to reproduce: > 1) Create a paritioned+bucketed table in hive: CREATE TABLE partbucket(a int) > PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED > FIELDS TERMINATED BY ','; > 2) In Hive-CLI issue a desc formatted for the table. > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:8020/user/hive/warehouse/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > transient_lastDdlTime 1494437467 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: 10 > Bucket Columns: [a] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > 3) In spark-shell, > scala> spark.sql("MSCK REPAIR TABLE partbucket") > 4) Back to Hive-CLI > desc formatted partbucket; > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: > hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > spark.sql.partitionProvider catalog > transient_lastDdlTime 1494437647 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > Further inserts to this table cannot be made in bucketed fashion through > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Affects Version/s: 2.2.0 2.2.1 2.3.0 > MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. > -- > > Key: SPARK-20697 > URL: https://issues.apache.org/jira/browse/SPARK-20697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0 >Reporter: Abhishek Madav >Priority: Major > > MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table > does not restore the bucketing information to the storage descriptor in the > metastore. > Steps to reproduce: > 1) Create a paritioned+bucketed table in hive: CREATE TABLE partbucket(a int) > PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED > FIELDS TERMINATED BY ','; > 2) In Hive-CLI issue a desc formatted for the table. > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:8020/user/hive/warehouse/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > transient_lastDdlTime 1494437467 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: 10 > Bucket Columns: [a] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > 3) In spark-shell, > scala> spark.sql("MSCK REPAIR TABLE partbucket") > 4) Back to Hive-CLI > desc formatted partbucket; > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: > hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > spark.sql.partitionProvider catalog > transient_lastDdlTime 1494437647 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > Further inserts to this table cannot be made in bucketed fashion through > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
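For convenience, the same metadata check can be driven entirely from spark-shell. A minimal sketch, assuming a Hive-enabled SparkSession and the partitioned+bucketed {{partbucket}} table created as in step 1 of the reproduction:
{code:java}
// Minimal spark-shell sketch, assuming a Hive-enabled SparkSession and the
// partitioned+bucketed `partbucket` table from step 1 of the reproduction.
import spark.implicits._

spark.sql("USE sparkhivebucket")
spark.sql("MSCK REPAIR TABLE partbucket")

// Inspect the storage metadata after the repair; per this report the bucket
// count and bucketing columns are expected to be lost at this point.
spark.sql("DESCRIBE FORMATTED partbucket")
  .filter($"col_name".contains("Bucket"))
  .show(truncate = false)
{code}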
[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407205#comment-16407205 ] Joseph K. Bradley commented on SPARK-23686: --- This will be useful! Synced offline: we'll split this up into subtasks. > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
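To make the suggestion concrete, here is a sketch of the statistics mentioned above (number of training examples, mean/variance of the label) computed with plain DataFrame aggregations; the {{training}} DataFrame and its {{label}} column are made up for illustration, and the internal Instrumentation API itself is not shown since its exact methods are not quoted in this issue:
{code:java}
// Sketch of the statistics the issue proposes to log, computed with plain
// DataFrame aggregations. The `training` DataFrame and its "label" column
// are hypothetical; hooking the values into the Instrumentation class is
// left out because its exact methods are not quoted in this issue.
import org.apache.spark.sql.functions.{count, mean, variance}

val training = spark.range(0, 100).selectExpr("id % 2 AS label", "id AS feature")

val labelStats = training.agg(
  count("*").as("numTrainingExamples"),
  mean("label").as("labelMean"),
  variance("label").as("labelVariance")
).first()

// These values could then be emitted through the instrumentation logger
// instead of being dropped or sent to debug logging.
println(labelStats)
{code}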
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407163#comment-16407163 ] Joseph K. Bradley commented on SPARK-10884: --- I know a lot of people are watching this, so I'm just pinging to say I'm about to merge the PR. Please check out the changes in case you have comments & have not seen the updates. Thanks! > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Weichen Xu >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their sub > classes). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
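For readers following along, a sketch of the usage this issue enables, assuming {{predict(features: Vector)}} is exposed on PredictionModel/ClassificationModel subclasses as described; the tiny training set is made up for illustration:
{code:java}
// Sketch of single-instance prediction as proposed here, assuming
// `predict(features: Vector)` becomes public on PredictionModel subclasses.
// The tiny training set is made up for illustration.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)

// Predict on a single feature vector, without wrapping it in a DataFrame.
val prediction: Double = model.predict(Vectors.dense(0.0, 1.1, 0.1))
{code}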
[jira] [Created] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
Felix created SPARK-23759: - Summary: Unable to bind Spark2 history server to specific host name / IP Key: SPARK-23759 URL: https://issues.apache.org/jira/browse/SPARK-23759 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.2.0 Reporter: Felix Ideally, exporting SPARK_LOCAL_IP= in the spark2 environment should allow the Spark2 history server to bind to a private interface; however, this is not working in Spark 2.2.0. The Spark2 history server still listens on 0.0.0.0: {code:java} [root@sparknode1 ~]# netstat -tulapn|grep 18081 tcp0 0 0.0.0.0:18081 0.0.0.0:* LISTEN 21313/java tcp0 0 172.26.104.151:39126172.26.104.151:18081 TIME_WAIT - {code} On earlier versions this change was working fine: {code:java} [root@dwphive1 ~]# netstat -tulapn|grep 18081 tcp0 0 172.26.113.55:18081 0.0.0.0:* LISTEN 2565/java {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407112#comment-16407112 ] Marco Gaido commented on SPARK-23739: - Can you provide some more info about how you are getting this error and how to reproduce? Which command are you using to submit your application? May you also provide a sample to reproduce this? Thanks. > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(Kafk
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407108#comment-16407108 ] Joseph K. Bradley commented on SPARK-18813: --- I just linked the roadmap for 2.4 (since we did not have one for 2.3): [SPARK-23758] > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.2.0 > > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > * Vote for & watch issues which are important to you. > ** MLlib, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] > ** SparkR, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | [1 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Blocker | *must* | *must* | *must* | > | [2 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Critical | *must* | yes, unless small | *best effort* | > | [3 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Major | *must* | optional | *best effort* | > | [4 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Minor | optional | no | maybe | > | [5 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Trivial | optional | no | maybe | > | [6 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2
[jira] [Updated] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23758: -- Description: h1. Roadmap process This roadmap is a master list for MLlib improvements we are working on during this release. This includes ML-related changes in PySpark and SparkR. *What is planned for the next release?* * This roadmap lists issues which at least one Committer has prioritized. See details below in "Instructions for committers." * This roadmap only lists larger or more critical issues. *How can contributors influence this roadmap?* * If you believe an issue should be in this roadmap, please discuss the issue on JIRA and/or the dev mailing list. Make sure to ping Committers since at least one must agree to shepherd the issue. * For general discussions, use this JIRA or the dev mailing list. For specific issues, please comment on those issues or the mailing list. * Vote for & watch issues which are important to you. ** MLlib, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] ** SparkR, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] h2. Target Version and Priority This section describes the meaning of Target Version and Priority. || Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? 
|| | [1 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Blocker | *must* | *must* | *must* | | [2 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Critical | *must* | yes, unless small | *best effort* | | [3 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Major | *must* | optional | *best effort* | | [4 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Minor | optional | no | maybe | | [5 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Trivial | optional | no | maybe | | [6 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | yes | no | maybe | | [7 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(EMPTY)%20AND%20Shepherd%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | no | no | maybe | The *Category* in the table above has the following meaning: 1. A committer has promised to see this issue to completion for the next release. Contributions *will* receive attention. 2-3. A committer has promised to see this issue to completion for the next release. Contributions *will* receive attention. The issue ma
[jira] [Created] (SPARK-23758) MLlib 2.4 Roadmap
Joseph K. Bradley created SPARK-23758: - Summary: MLlib 2.4 Roadmap Key: SPARK-23758 URL: https://issues.apache.org/jira/browse/SPARK-23758 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.4.0 Reporter: Joseph K. Bradley h1. Roadmap process This roadmap is a master list for MLlib improvements we are working on during this release. This includes ML-related changes in PySpark and SparkR. *What is planned for the next release?* * This roadmap lists issues which at least one Committer has prioritized. See details below in "Instructions for committers." * This roadmap only lists larger or more critical issues. *How can contributors influence this roadmap?* * If you believe an issue should be in this roadmap, please discuss the issue on JIRA and/or the dev mailing list. Make sure to ping Committers since at least one must agree to shepherd the issue. * For general discussions, use this JIRA or the dev mailing list. For specific issues, please comment on those issues or the mailing list. * Vote for & watch issues which are important to you. ** MLlib, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] ** SparkR, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] h2. Target Version and Priority This section describes the meaning of Target Version and Priority. || Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? 
|| | [1 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Blocker | *must* | *must* | *must* | | [2 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Critical | *must* | yes, unless small | *best effort* | | [3 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Major | *must* | optional | *best effort* | | [4 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Minor | optional | no | maybe | | [5 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Trivial | optional | no | maybe | | [6 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | yes | no | maybe | | [7 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(EMPTY)%20AND%20Shepherd%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | no | no | maybe | The *Category* in the table above has the following meaning: 1. A committer has promised to see this issue to completion for the next release. Contributions *will
[jira] [Updated] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values
[ https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23690: -- Shepherd: Joseph K. Bradley > VectorAssembler should have handleInvalid to handle columns with null values > > > Key: SPARK-23690 > URL: https://issues.apache.org/jira/browse/SPARK-23690 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > > VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as > an input and returns the assembled vector. It currently throws an error if it > sees a null value in any column. This behavior also affects `RFormula` that > uses VectorAssembler to assemble numeric columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
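A sketch of the current failure mode and the proposed knob; the {{handleInvalid}} values shown are assumptions borrowed from existing params such as StringIndexer's, and the DataFrame is made up for illustration:
{code:java}
// Sketch of the failure mode and the proposed `handleInvalid` parameter.
// The "skip" value is an assumption borrowed from existing handleInvalid
// params (e.g. StringIndexer); the DataFrame is made up for illustration.
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

val df = Seq(
  (Some(1.0), Some(2.0)),
  (Some(3.0), None) // null in column "b"
).toDF("a", "b")

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")

// Today this fails once the row containing the null is assembled:
// assembler.transform(df).show()

// Proposed: let the caller choose what to do with such rows, e.g. skip them:
// assembler.setHandleInvalid("skip").transform(df).show()
{code}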
[jira] [Resolved] (SPARK-23500) Filters on named_structs could be pushed into scans
[ https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23500. - Resolution: Fixed Assignee: Henry Robinson Fix Version/s: 2.4.0 > Filters on named_structs could be pushed into scans > --- > > Key: SPARK-23500 > URL: https://issues.apache.org/jira/browse/SPARK-23500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Assignee: Henry Robinson >Priority: Major > Fix For: 2.4.0 > > > Simple filters on dataframes joined with {{joinWith()}} are missing an > opportunity to get pushed into the scan because they're written in terms of > {{named_struct}} that could be removed by the optimizer. > Given the following simple query over two dataframes: > {code:java} > scala> val df = spark.read.parquet("one_million") > df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = spark.read.parquet("one_million") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > > 30").explain > == Physical Plan == > *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight > :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94] > : +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > struct, false].id)) >+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95] > +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30) > +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and > is then pushed down. When the filter is just above the scan, the > wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be > removed. Then the filter can be pushed down to Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
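For contrast, a sketch of the same predicate written with a plain {{join}}, where the filter stays on a base column and is eligible for Parquet pushdown (visible as {{PushedFilters}} in the scan node); paths and column names are the ones from the description:
{code:java}
// Same predicate written with a plain join instead of joinWith, so the filter
// is expressed on a base column rather than on a named_struct. Paths and
// column names are taken from the description above.
import spark.implicits._

val left  = spark.read.parquet("one_million").alias("l")
val right = spark.read.parquet("one_million").alias("r")

// With the filter on a plain column it can reach the Parquet scan as a
// pushed filter, unlike the named_struct form produced by joinWith.
left.join(right, $"r.id" === $"l.id2")
  .filter($"r.id" > 30)
  .explain()
{code}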
[jira] [Created] (SPARK-23757) [Performance] Star schema detection improvements
Ioana Delaney created SPARK-23757: - Summary: [Performance] Star schema detection improvements Key: SPARK-23757 URL: https://issues.apache.org/jira/browse/SPARK-23757 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney Star schema consists of one or more fact tables referencing a number of dimension tables. Queries against star schema are expected to run fast because of the established RI constraints among the tables. In general, star schema joins are detected using the following conditions: 1. RI constraints (reliable detection) * Dimension contains a primary key that is being joined to the fact table. * Fact table contains foreign keys referencing multiple dimension tables. 2. Cardinality based heuristics * Usually, the table with the highest cardinality is the fact table. Existing SPARK-17791 uses a combination of the above two conditions to detect and optimize star joins. With support for informational RI constraints, the algorithm in SPARK-17791 can be improved with reliable RI detection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
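To make the targeted join shape concrete, here is an illustrative star-schema query over TPC-DS-style tables; the table and column names are assumed to be registered already, and nothing here is new API:
{code:java}
// Illustration of the join shape the detection logic targets: one fact table
// (store_sales) joined to several dimension tables on their keys. The TPC-DS
// style tables are assumed to be registered already.
val starJoin = spark.sql("""
  SELECT d.d_year, s.s_store_name, SUM(f.ss_net_paid) AS revenue
  FROM store_sales f
  JOIN date_dim d ON f.ss_sold_date_sk = d.d_date_sk
  JOIN store s ON f.ss_store_sk = s.s_store_sk
  GROUP BY d.d_year, s.s_store_name
""")
starJoin.explain()
{code}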
[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406909#comment-16406909 ] Ioana Delaney commented on SPARK-19842: --- I opened several performance JIRAs to show the benefits of the informational RI constraints. > Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Major > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. > The attached document covers constraint definition, metastore storage, > constraint validation, and maintenance. The document shows many examples of > query performance improvements that utilize referential integrity constraints > and can be implemented in Spark. > Link to the google doc: > [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23756) [Performance] Redundant join elimination
Ioana Delaney created SPARK-23756: - Summary: [Performance] Redundant join elimination Key: SPARK-23756 URL: https://issues.apache.org/jira/browse/SPARK-23756 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney This rewrite eliminates self-joins on unique keys. Self-joins may be introduced after view expansion. *User view:* {code} create view manager(mgrno, income) as select e.empno, e.salary + e.bonus from employee e, department d where e.empno = d.mgrno; {code} *User query:* {code} select e.empname, e.empno from employee e, manager m where e.empno = m.mgrno and m.income > 100K {code} *Internal query after view expansion:* {code} select e.lastname, e.empno from employee e, employee m, department d where e.empno = m.empno /* PK = PK */ and e.empno = d.mgrno and m.salary + m.bonus > 100K {code} *Internal query after join elimination:* {code} select e.lastname, e.empno from employee e, department d where e.empno = d.mgrno and e.salary + e.bonus > 100K {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23755) [Performance] Distinct elimination
Ioana Delaney created SPARK-23755: - Summary: [Performance] Distinct elimination Key: SPARK-23755 URL: https://issues.apache.org/jira/browse/SPARK-23755 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney The Distinct requirement can be removed if it is proved that the operation produces unique output. {code} select distinct d.deptno /* PK */, e.empname from employee e, department d where e.empno = d.mgrno /*PK = FK*/ {code} *Internal query after rewrite:* {code} select d.deptno, e.empname from employee e, department d where e.empno = d.mgrno {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-23754: --- Description: Reproduce: {code:java} df = spark.range(0, 1000) from pyspark.sql.functions import udf def foo(x): raise StopIteration() df.withColumn('v', udf(foo)).show() # Results # +---+---+ # | id| v| # +---+---+ # +---+---+{code} I think the task should fail in this case was: {code:java} df = spark.range(0, 1000) from pyspark.sql.functions import udf def foo(x): raise StopIteration() df.withColumn('v', udf(foo)).show() # Results # +---+---+ # | id| v| # +---+---+ # +---+---+ {code} > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-23754: --- Description: {code:java} df = spark.range(0, 1000) from pyspark.sql.functions import udf def foo(x): raise StopIteration() df.withColumn('v', udf(foo)).show() # Results # +---+---+ # | id| v| # +---+---+ # +---+---+ {code} > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23754) StopIterator exception in Python UDF results in partial result
Li Jin created SPARK-23754: -- Summary: StopIterator exception in Python UDF results in partial result Key: SPARK-23754 URL: https://issues.apache.org/jira/browse/SPARK-23754 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Li Jin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406875#comment-16406875 ] Matthew Porter commented on SPARK-6190: --- We are experiencing similar frustrations to Brian: we have well-partitioned datasets that are simply massive in size. Every once in a while Spark fails and we spend much longer than I would like doing nothing but tweaking partition values and crossing our fingers that the next run succeeds. Again, it is very hard to explain to higher management that despite having hundreds of GB of RAM at our disposal, we are limited to 2 GB during data shuffles. Are there any plans or intentions to resolve this bug? This and https://issues.apache.org/jira/browse/SPARK-5928 have been "In Progress" for more than 3 years now with no visible progress. > create LargeByteBuffer abstraction for eliminating 2GB limit on blocks > -- > > Key: SPARK-6190 > URL: https://issues.apache.org/jira/browse/SPARK-6190 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Attachments: LargeByteBuffer_v3.pdf > > > A key component in eliminating the 2GB limit on blocks is creating a proper > abstraction for storing more than 2GB. Currently Spark is limited by a > reliance on nio ByteBuffer and netty ByteBuf, both of which are limited to > 2GB. This task will introduce the new abstraction and the relevant > implementation and utilities, without affecting the existing implementation > at all. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23737. -- Resolution: Duplicate > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23574) SinglePartition in data source V2 scan
[ https://issues.apache.org/jira/browse/SPARK-23574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23574. - Resolution: Fixed Assignee: Jose Torres Fix Version/s: 2.4.0 > SinglePartition in data source V2 scan > -- > > Key: SPARK-23574 > URL: https://issues.apache.org/jira/browse/SPARK-23574 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 2.4.0 > > > DataSourceV2ScanExec currently reports UnknownPartitioning whenever the > reader doesn't mix in SupportsReportPartitioning. It can also report > SinglePartition in the case where there's a single reader factory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23753) [Performance] Group By Push Down through Join
Ioana Delaney created SPARK-23753: - Summary: [Performance] Group By Push Down through Join Key: SPARK-23753 URL: https://issues.apache.org/jira/browse/SPARK-23753 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney *Group By push down through Join* Another transformation that benefits from RI constraints is Group By push down through joins. The transformation interchanges the order of the group-by and join operations. The benefit of pushing down a group-by is that it may reduce the number of input rows to the join. On the other hand, if the join is very selective, it might make sense to execute the group by after the join. That is why this transformation is in general applied based on cost or selectivity estimates. However, if the join is an RI join, under certain conditions, it is safe to push down group by operation below the join. An example is shown below. {code} select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, sum(ss.ss_quantity) as store_sales_quantity from store_sales ss, date_dim, customer, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and s_store_sk = ss_store_sk and d_year between 2000 and 2002 group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name {code} The query computes the quantities sold grouped by _customer_ and _store_ tables. The tables are in a _star schema_ join. The grouping columns are a super set of the join keys. The aggregate columns come from the fact table _store_sales_. The group by operation can be pushed down to the fact table _store_sales_ through the join with the _customer_ and _store_ tables. The join will not affect the partitions nor the aggregates computed by the pushed down group-by since every tuple in _store_sales_ will join with a tuple in _customer_ and _store_ tables. {code} select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, v1.store_sales_quantity from customer, store, (select ss_customer_sk, ss_store_sk, sum(ss_quantity) as store_sales_quantity from store_sales, date_dim where d_date_sk = ss_sold_date_sk and d_year between 2000 and 2002 group by ss_customer_sk, ss_store_sk ) v1 where c_customer_sk = v1.ss_customer_sk and s_store_sk = v1.ss_store_sk {code} \\ When the query is run using a 1TB TPC-DS setup, the group by reduces the number of rows from 1.5 billion to 100 million rows and the query execution drops from about 70 secs to 30 secs, a 2x improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
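The rewritten form can also be expressed with the DataFrame API. A sketch, assuming DataFrames {{storeSales}}, {{dateDim}}, {{customer}} and {{store}} are already loaded from the corresponding TPC-DS tables; the early aggregate is only safe under the RI assumption described above:
{code:java}
// DataFrame-API sketch of the rewritten query, assuming DataFrames
// storeSales, dateDim, customer and store are already loaded from the
// corresponding TPC-DS tables. Pre-aggregating store_sales is only safe
// under the RI assumption above (every fact row joins exactly one customer
// row and one store row).
import org.apache.spark.sql.functions.sum

val preAgg = storeSales
  .join(dateDim, storeSales("ss_sold_date_sk") === dateDim("d_date_sk"))
  .filter(dateDim("d_year").between(2000, 2002))
  .groupBy(storeSales("ss_customer_sk"), storeSales("ss_store_sk"))
  .agg(sum(storeSales("ss_quantity")).as("store_sales_quantity"))

val result = preAgg
  .join(customer, preAgg("ss_customer_sk") === customer("c_customer_sk"))
  .join(store, preAgg("ss_store_sk") === store("s_store_sk"))
  .select(customer("c_customer_sk"), customer("c_first_name"), customer("c_last_name"),
    store("s_store_sk"), store("s_store_name"), preAgg("store_sales_quantity"))
{code}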
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact that the FromUTCTimestamp expression, despite its name, expects the input to be in the user's local timezone. There is some magic under the covers to make things work (mostly) as the user expects. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. 
When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018
[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406836#comment-16406836 ] Bruce Robbins commented on SPARK-23715: --- A fix to this requires some ugly hacking of the implicit casts rule and the Cast class. It doesn't seem worth the mess. New Timestamp types (with timezone awareness) might help with this issue. > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact the FromUTCTimestamp expression, despite its > name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. 
Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. > As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted
[jira] [Created] (SPARK-23752) [Performance] Existential Subquery to Inner Join
Ioana Delaney created SPARK-23752: - Summary: [Performance] Existential Subquery to Inner Join Key: SPARK-23752 URL: https://issues.apache.org/jira/browse/SPARK-23752 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney *+Existential Subquery to Inner Join+* Another enhancement that uses Informational Constraints is existential subquery to inner join. This rewrite converts an existential subquery to an inner join, and thus provides alternative join choices for the Optimizer based on the selectivity of the tables. An example using the TPC-DS schema is shown below. {code} select c_first_name, c_last_name, c_email_address from customer c where EXISTS (select * from store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3) {code} Spark uses a left semi-join to evaluate existential subqueries. A left semi-join will return a row from the outer table if there is at least one match in the inner. Semi-join is a commonly used technique to rewrite existential subqueries, but it has some limitations, as it imposes a certain order on the joined tables. In this case the large fact table _store_sales_ has to be on the inner side of the join. A more efficient execution can be obtained if the subquery is converted to a regular inner join, which allows the Optimizer to choose better join orders. Converting a subquery to an inner join is possible either if the subquery produces at most one row, or by introducing a _Distinct_ on the outer table’s row key to remove the duplicate rows that result from the inner join and thus enforce the semantics of the subquery. As a key for the outer table, we can use the primary key of the _customer_ table. *Internal query after rewrite:* {code} select distinct c_customer_sk /*PK */, c_first_name, c_last_name, c_email_address from customer c, store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3 {code} \\ *Example performance results using the 1TB TPC-DS benchmark:* \\ ||TPC-DS Query||spark-2.2||spark-2.2 w/ sub2join||Query speedup|| ||||(secs)||(secs)|| || |Q10|355|190|2x| |Q16|1394|706|2x| |Q35|462|285|1.5x| |Q69|327|173|1.5x| |Q94|603|307|2x| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact the FromUTCTimestamp expression, despite its name, expects the input to be in the user's local timezone. There is some magic under the covers to make things work (mostly) as the user expects. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. 
When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-1
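The shifting behaviour described in the report can be reproduced outside Spark with plain java.time, which may make the failure mode easier to see. This is only a sketch of the arithmetic, not Spark's internal code; it assumes the session time zone is America/Los_Angeles, as in the example above.
{code:scala}
import java.time._

val localZone = ZoneId.of("America/Los_Angeles")  // assumed local/session time zone

// No explicit zone: the string is parsed in the local zone, so the stored instant
// is shifted by the local offset, and reverse-shifting by that offset (what
// FromUTCTimestamp effectively does) recovers the intended wall-clock time.
val noZone = LocalDateTime.parse("2018-03-13T06:18:23").atZone(localZone).toInstant
val backToLocal = noZone.atZone(localZone).toLocalDateTime   // 2018-03-13T06:18:23

// Explicit zone: the string is parsed with its own +00:00 offset, so it is not
// shifted by the local offset. Reverse-shifting by the local offset anyway lands
// seven hours early, which (after adding GMT+1) is where 00:18:23 comes from.
val withZone = OffsetDateTime.parse("2018-03-13T06:18:23+00:00").toInstant
val wrongWallClock = withZone.atZone(localZone).toLocalDateTime // 2018-03-12T23:18:23
{code}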
[jira] [Updated] (SPARK-23519) Create View Command Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Component/s: SQL > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view form the table. [ I did this from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact the FromUTCTimestamp expression, despite its name, expects the input to be in the user's local timezone. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact the FromUTCTimestamp, despite its name, expects the input to be in the user's local timezone. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explici
[jira] [Created] (SPARK-23751) Kolmogorov-Smirnov test Python API in pyspark.ml
Joseph K. Bradley created SPARK-23751: - Summary: Kolmogorov-Smirnov test Python API in pyspark.ml Key: SPARK-23751 URL: https://issues.apache.org/jira/browse/SPARK-23751 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Joseph K. Bradley Python wrapper for the new DataFrame-based API for the KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
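For reference, the Scala side of this API already exists (see SPARK-21898 below), and the Python wrapper requested here would expose the same call. A rough spark-shell sketch, with an arbitrary column name and sample data:
{code:scala}
import org.apache.spark.ml.stat.KolmogorovSmirnovTest
import spark.implicits._

// Test a numeric column against a standard normal distribution.
val data = Seq(0.1, 0.15, 0.2, 0.3, 0.25, -0.1).toDF("sample")
val result = KolmogorovSmirnovTest.test(data, "sample", "norm", 0.0, 1.0)
result.select("pValue", "statistic").show()
{code}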
[jira] [Resolved] (SPARK-21898) Feature parity for KolmogorovSmirnovTest in MLlib
[ https://issues.apache.org/jira/browse/SPARK-21898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21898. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 19108 [https://github.com/apache/spark/pull/19108] > Feature parity for KolmogorovSmirnovTest in MLlib > - > > Key: SPARK-21898 > URL: https://issues.apache.org/jira/browse/SPARK-21898 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.4.0 > > > Feature parity for KolmogorovSmirnovTest in MLlib. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21898) Feature parity for KolmogorovSmirnovTest in MLlib
[ https://issues.apache.org/jira/browse/SPARK-21898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-21898: - Assignee: Weichen Xu > Feature parity for KolmogorovSmirnovTest in MLlib > - > > Key: SPARK-21898 > URL: https://issues.apache.org/jira/browse/SPARK-21898 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.4.0 > > > Feature parity for KolmogorovSmirnovTest in MLlib. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers
[ https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406796#comment-16406796 ] Pascal GILLET commented on SPARK-23499: --- [~susanxhuynh] Certainly, none of the proposed solutions above can prevent a "misuse" of the dispatcher, like keeping adding drivers in the URGENT queue. To understand you well, are you in favor of implementing one of the solutions above? * The solution not related to the Mesos weights? * The solution that maps the Mesos weights to priorities? * A solution that must allow the preemption of jobs at the dispatcher level? > Mesos Cluster Dispatcher should support priority queues to submit drivers > - > > Key: SPARK-23499 > URL: https://issues.apache.org/jira/browse/SPARK-23499 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Pascal GILLET >Priority: Major > Attachments: Screenshot from 2018-02-28 17-22-47.png > > > As for Yarn, Mesos users should be able to specify priority queues to define > a workload management policy for queued drivers in the Mesos Cluster > Dispatcher. > Submitted drivers are *currently* kept in order of their submission: the > first driver added to the queue will be the first one to be executed (FIFO). > Each driver could have a "priority" associated with it. A driver with high > priority is served (Mesos resources) before a driver with low priority. If > two drivers have the same priority, they are served according to their submit > date in the queue. > To set up such priority queues, the following changes are proposed: > * The Mesos Cluster Dispatcher can optionally be configured with the > _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a > float as value. This adds a new queue named _QueueName_ for submitted drivers > with the specified priority. > Higher numbers indicate higher priority. > The user can then specify multiple queues. > * A driver can be submitted to a specific queue with > _spark.mesos.dispatcher.queue_. This property takes the name of a queue > previously declared in the dispatcher as value. > By default, the dispatcher has a single "default" queue with 0.0 priority > (cannot be overridden). If none of the properties above are specified, the > behavior is the same as the current one (i.e. simple FIFO). > Additionaly, it is possible to implement a consistent and overall workload > management policy throughout the lifecycle of drivers by mapping these > priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in > the dispatcher to the final states in the Mesos cluster), and by specifying a > _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when > submitting an application. > For example, with the URGENT Mesos role: > {code:java} > # Conf on the dispatcher side > spark.mesos.dispatcher.queue.URGENT=1.0 > # Conf on the driver side > spark.mesos.dispatcher.queue=URGENT > spark.mesos.role=URGENT > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
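To make the proposed workflow concrete, a submission under this scheme might look like the sketch below. The queue and role properties are the ones proposed in this ticket (they are not existing Spark configuration keys), and the dispatcher URL and jar name are placeholders.
{code}
# Dispatcher side (proposed): declare a high-priority queue
spark.mesos.dispatcher.queue.URGENT=1.0

# Driver submission against the Mesos cluster dispatcher
spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.dispatcher.queue=URGENT \
  --conf spark.mesos.role=URGENT \
  my-app.jar
{code}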
[jira] [Created] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
Ioana Delaney created SPARK-23750: - Summary: [Performance] Inner Join Elimination based on Informational RI constraints Key: SPARK-23750 URL: https://issues.apache.org/jira/browse/SPARK-23750 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney +*Inner Join Elimination based on Informational RI constraints*+ This transformation detects RI joins and eliminates the parent/PK table if none of its columns, other than the PK columns, are referenced in the query. Typical examples that benefit from this rewrite are queries over complex views. *View using TPC-DS schema:* {code} create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity from store_sales, date_dim, customer, item, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and d_year = 2002 {code} The view returns customer purchases made in year 2002. It is a join between fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and _date_. The tables are joined using RI predicates. If we write a query that only selects a subset of columns from the view, for example, we are only interested in the items bought and not the stores, internally, the Optimizer, will first merge the view into the query, and then, based on the _primary key – foreign key_ join predicate analysis, it will decide that the join with the _store_ table is not needed, and therefore the _store_ table is removed. *Query:* {code} select id, first, last, product, quantity from customer_purchases_2002 where product like ‘bicycle%’ and month between 1 and 2 {code} *Internal query after view expansion:* {code} select c_customer_id as id, c_first_name as first, c_last_name as last, i_product_name as product,ss_quantity as quantity from store_sales, date_dim, customer, item, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and d_year = 2002 and month between 1 and 2 and product like ‘bicycle%’ {code} *Internal optimized query after join elimination:* {code:java} select c_customer_id as id, c_first_name as first, c_last_name as last, i_product_name as product,ss_quantity as quantity from store_sales, date_dim, customer, item where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and d_year = 2002 and month between 1 and 2 and product like ‘bicycle%’ {code} The join with _store_ table can be removed since no columns are retrieved from the table, and every row from the _store_sales_ fact table will find a match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
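A quick way to observe whether the rewrite applies is to explain the query over the view. This is only a sketch: it assumes the TPC-DS tables and the customer_purchases_2002 view are registered in the session, and without the informational RI constraints proposed here the join with store will still appear in the plan.
{code:scala}
spark.sql("""
  select id, first, last, product, quantity
  from customer_purchases_2002
  where product like 'bicycle%' and month between 1 and 2""").explain(true)
// Today the optimized plan still joins `store`; with the proposed RI-based
// elimination, that join should disappear since no store columns are referenced.
{code}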
[jira] [Assigned] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23749: Assignee: (was: Apache Spark) > Avoid Hive.get() to compatible with different Hive metastore > > > Key: SPARK-23749 > URL: https://issues.apache.org/jira/browse/SPARK-23749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406724#comment-16406724 ] Apache Spark commented on SPARK-23749: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20866 > Avoid Hive.get() to compatible with different Hive metastore > > > Key: SPARK-23749 > URL: https://issues.apache.org/jira/browse/SPARK-23749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23749: Assignee: Apache Spark > Avoid Hive.get() to compatible with different Hive metastore > > > Key: SPARK-23749 > URL: https://issues.apache.org/jira/browse/SPARK-23749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
Yuming Wang created SPARK-23749: --- Summary: Avoid Hive.get() to be compatible with different Hive metastore Key: SPARK-23749 URL: https://issues.apache.org/jira/browse/SPARK-23749 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23748) Support select from temp tables
Jose Torres created SPARK-23748: --- Summary: Support select from temp tables Key: SPARK-23748 URL: https://issues.apache.org/jira/browse/SPARK-23748 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres As reported in the dev list, the following currently fails: val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "join_test").option("startingOffsets", "earliest").load(); jdf.createOrReplaceTempView("table") val resultdf = spark.sql("select * from table") resultdf.writeStream.outputMode("append").format("console").option("truncate", false).trigger(Trigger.Continuous("1 second")).start() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
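For anyone trying to reproduce this, the reported snippet is shown below reformatted and with the import it needs; the broker address and topic are the placeholders from the report.
{code:scala}
import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

jdf.createOrReplaceTempView("table")

val resultdf = spark.sql("select * from table")

// Fails today when the temp view is read back under a continuous trigger.
resultdf.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.Continuous("1 second"))
  .start()
{code}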
[jira] [Created] (SPARK-23747) Add EpochCoordinator unit tests
Jose Torres created SPARK-23747: --- Summary: Add EpochCoordinator unit tests Key: SPARK-23747 URL: https://issues.apache.org/jira/browse/SPARK-23747 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406591#comment-16406591 ] Alexander Bessonov commented on SPARK-23737: Oh, thanks. Linked them. > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23746) HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF
Izhar Ahmed created SPARK-23746: --- Summary: HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF Key: SPARK-23746 URL: https://issues.apache.org/jira/browse/SPARK-23746 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.6.2 Reporter: Izhar Ahmed I am trying to use a custom HashMap implementation as UserDefinedType instead of MapType in spark. The code is *working fine in spark 1.5.2* but giving {{java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to org.apache.spark.sql.catalyst.util.MapData}} *exception in spark 1.6.2* The code:- {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction} import org.apache.spark.sql.types._ import scala.collection.immutable.HashMap class Test extends UserDefinedAggregateFunction { def inputSchema: StructType = StructType(Array(StructField("input", StringType))) def bufferSchema = StructType(Array(StructField("top_n", CustomHashMapType))) def dataType: DataType = CustomHashMapType def deterministic = true def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = HashMap.empty[String, Long] } def update(buffer: MutableAggregationBuffer, input: Row): Unit = { val buff0 = buffer.getAs[HashMap[String, Long]](0) buffer(0) = buff0.updated("test", buff0.getOrElse("test", 0L) + 1L) } def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { buffer1(0) = buffer1. getAs[HashMap[String, Long]](0) .merged(buffer2.getAs[HashMap[String, Long]](0))({ case ((k, v1), (_, v2)) => (k, v1 + v2) }) } def evaluate(buffer: Row): Any = { buffer(0) } } private case object CustomHashMapType extends UserDefinedType[HashMap[String, Long]] { override def sqlType: DataType = MapType(StringType, LongType) override def serialize(obj: Any): Map[String, Long] = obj.asInstanceOf[Map[String, Long]] override def deserialize(datum: Any): HashMap[String, Long] = { datum.asInstanceOf[Map[String, Long]] ++: HashMap.empty[String, Long] } override def userClass: Class[HashMap[String, Long]] = classOf[HashMap[String, Long]] } {code} The wrapper Class to run the UDAF:- {code:scala} import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} object TestJob { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster("local[4]").setAppName("DataStatsExecution") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(Seq(1,2,3,4)).toDF("col") val udaf = new Test() val outdf = df.agg(udaf(df("col"))) outdf.show } } {code} Stacktrace:- {code:java} Caused by: java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to org.apache.spark.sql.catalyst.util.MapData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getMap(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getMap(rows.scala:248) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getMap(JoinedRow.scala:115) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:345) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:344) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:154) 
at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issue
[jira] [Assigned] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23542: Assignee: Apache Spark > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Assignee: Apache Spark >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406172#comment-16406172 ] Apache Spark commented on SPARK-23542: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/20865 > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23542: Assignee: (was: Apache Spark) > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-23542: -- Description: The optimized logical plan of query '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*' is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} The `exists` action will be rewritten as semi jion. But i the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized logical plan is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} So i think the optimized logical plan of '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- Filter isnotnull(i#14) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] :- Filter isnotnull(i#16) +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} was: The optimized logical plan of query '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*' is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} The `exists` action will be rewritten as semi jion. But i the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized logical plan is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} So i think the optimized logical plan of '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. 
{code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16
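The comparison above can be reproduced directly by looking at the optimized plans of the two queries. A sketch, assuming Hive tables tt1 and tt2 with an integer column i already exist in the session:
{code:scala}
val existsPlan = spark.sql(
  "select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)")
  .queryExecution.optimizedPlan

val semiJoinPlan = spark.sql(
  "select * from tt1 left semi join tt2 on tt2.i = tt1.i")
  .queryExecution.optimizedPlan

// The explicit left-semi-join form carries a Filter isnotnull(i) on tt1 that
// the EXISTS form currently lacks, which is the gap this ticket points out.
println(existsPlan)
println(semiJoinPlan)
{code}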
[jira] [Commented] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row: Spark submit error
[ https://issues.apache.org/jira/browse/SPARK-23513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406164#comment-16406164 ] Narsireddy AVula commented on SPARK-23513: -- Seems provided information is not sufficient to proceed to validate > java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit > error > --- > > Key: SPARK-23513 > URL: https://issues.apache.org/jira/browse/SPARK-23513 > Project: Spark > Issue Type: Bug > Components: EC2, Examples, Input/Output, Java API >Affects Versions: 1.4.0, 2.2.0 >Reporter: Rawia >Priority: Blocker > > Hello > I'm trying to run a spark application (distributedWekaSpark) but when I'm > using the spark-submit command I get this error > {quote}{quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.io.IOException: Expected 12 fields, but got 5 for row: > outlook,temperature,humidity,windy,play > {quote}{quote} > I tried with other datasets but always the same error appeared, (always 12 > fields expected) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
[ https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406153#comment-16406153 ] Sujit Kumar Mahapatra commented on SPARK-16745: --- +1. Getting similar issue with standalone spark on Mac for spark 2.2.1 > Spark job completed however have to wait for 13 mins (data size is small) > - > > Key: SPARK-16745 > URL: https://issues.apache.org/jira/browse/SPARK-16745 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014 >Reporter: Joe Chong >Priority: Minor > > I submitted a job in scala spark shell to show a DataFrame. The data size is > about 43K. The job was successful in the end, but took more than 13 minutes > to resolve. Upon checking the log, there's multiple exception raised on > "Failed to check existence of class" with a java.net.connectionexpcetion > message indicating timeout trying to connect to the port 52067, the repl port > that Spark setup. Please assist to troubleshoot. Thanks. > Started Spark in standalone mode > $ spark-shell --driver-memory 5g --master local[*] > 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server > 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:30 INFO server.AbstractConnector: Started > SocketConnector@0.0.0.0:52067 > 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class > server' on port 52067. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) > Type in expressions to have them evaluated. > Type :help for more information. > 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1 > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' > on port 52072. > 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started > 16/07/26 21:05:35 INFO Remoting: Starting remoting > 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074] > 16/07/26 21:05:35 INFO util.Utils: Successfully started service > 'sparkDriverActorSystem' on port 52074. 
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster > 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at > /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6 > 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity > 3.4 GB > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator > 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:36 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on > port 4040. > 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at > http://10.199.29.218:4040 > 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host > localhost > 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: > http://10.199.29.218:52067 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075. > 16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on > 52075 > 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/07/26 21:05:36 INFO storage.BlockManagerMasterEndpoint: Registering block > manager localhost:52075 with 3.4 GB RAM, BlockManagerId(driver, localhost, > 52075) > 16/07/26 21:05:36 INF
[jira] [Commented] (SPARK-16872) Include Gaussian Naive Bayes Classifier
[ https://issues.apache.org/jira/browse/SPARK-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406128#comment-16406128 ] zhengruifeng commented on SPARK-16872: -- I think both 1) a new GNB estimator and 2) current NB includes Gaussian are OK. [~mlnick] [~josephkb] [~yanboliang] What are your thoughts? It has been a long time since my first PR, and I really hope to finish it in following months. Could you help shepherding this ? > Include Gaussian Naive Bayes Classifier > --- > > Key: SPARK-16872 > URL: https://issues.apache.org/jira/browse/SPARK-16872 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > I implemented Gaussian NB according to scikit-learn's {{GaussianNB}}. > In GaussianNB model, the {{theta}} matrix is used to store means and there is > a extra {{sigma}} matrix storing the variance of each feature. > GaussianNB in spark > {code} > scala> import org.apache.spark.ml.classification.GaussianNaiveBayes > import org.apache.spark.ml.classification.GaussianNaiveBayes > scala> val path = > "/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > path: String = > /Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt > scala> val data = spark.read.format("libsvm").load(path).persist() > data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: > double, features: vector] > scala> val gnb = new GaussianNaiveBayes() > gnb: org.apache.spark.ml.classification.GaussianNaiveBayes = gnb_54c50467306c > scala> val model = gnb.fit(data) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training: numPartitions=1 > storageLevel=StorageLevel(1 replicas) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numFeatures":4} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numClasses":3} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training finished > model: org.apache.spark.ml.classification.GaussianNaiveBayesModel = > GaussianNaiveBayesModel (uid=gnb_54c50467306c) with 3 classes > scala> model.pi > res0: org.apache.spark.ml.linalg.Vector = > [-1.0986122886681098,-1.0986122886681098,-1.0986122886681098] > scala> model.pi.toArray.map(math.exp) > res1: Array[Double] = Array(0., 0., > 0.) 
> scala> model.theta > res2: org.apache.spark.ml.linalg.Matrix = > 0.270067018001 -0.188540006 0.543050720001 0.60546 > -0.60779998 0.18172 -0.842711740006 > -0.88139998 > -0.091425964 -0.35858001 0.105084738 > 0.021666701507102017 > scala> model.sigma > res3: org.apache.spark.ml.linalg.Matrix = > 0.1223012510889361 0.07078051983960698 0.0343595243976 > 0.051336071297393815 > 0.03758145300924998 0.09880280046403413 0.003390296940069426 > 0.007822241779598893 > 0.08058763609659315 0.06701386661293329 0.024866409227781675 > 0.02661391644759426 > scala> model.transform(data).select("probability").take(10) > [rdd_68_0] > res4: Array[org.apache.spark.sql.Row] = > Array([[1.0627410543476422E-21,0.9938,6.2765233965353945E-15]], > [[7.254521422345374E-26,1.0,1.3849442153180895E-18]], > [[1.9629244119173135E-24,0.9998,1.9424765181237926E-16]], > [[6.061218297948492E-22,0.9902,9.853216073401884E-15]], > [[0.9972225671942837,8.844241161578932E-165,0.002777432805716399]], > [[5.361683970373604E-26,1.0,2.3004604508982183E-18]], > [[0.01062850630038623,3.3102617689978775E-100,0.9893714936996136]], > [[1.9297314618271785E-4,2.124922209137708E-71,0.9998070268538172]], > [[3.118816393732361E-27,1.0,6.5310299615983584E-21]], > [[0.926009854522,8.734773657627494E-206,7.399014547943611E-6]]) > scala> model.transform(data).select("prediction").take(10) > [rdd_68_0] > res5: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], > [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]) > {code} > GaussianNB in scikit-learn > {code} > import numpy as np > from sklearn.naive_bayes import GaussianNB > from sklearn.datasets import load_svmlight_file > path = > '/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt' > X, y = load_svmlight_file(path) > X = X.toarray() > clf = GaussianNB() > clf.fit(X, y) > >>> clf.class_prior_ > array([ 0., 0., 0.]) > >>> clf.theta_ > array([[ 0.2701, -0.1885, 0.54305072,
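As background for the theta/sigma matrices shown above, the per-class score of a Gaussian naive Bayes model is just the log prior plus a sum of per-feature Gaussian log-densities. The sketch below illustrates only that formula; it is not the implementation proposed in the PR.
{code:scala}
// Raw score for one class: log(prior) + sum_i log N(x_i | mean_i, var_i).
def gaussianClassScore(x: Array[Double],
                       logPrior: Double,
                       means: Array[Double],
                       variances: Array[Double]): Double = {
  val logDensity = x.indices.map { i =>
    val diff = x(i) - means(i)
    -0.5 * math.log(2.0 * math.Pi * variances(i)) - diff * diff / (2.0 * variances(i))
  }.sum
  logPrior + logDensity
}

// Prediction: argmax over classes k of gaussianClassScore(x, pi(k), theta row k, sigma row k).
{code}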
[jira] [Assigned] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23745: Assignee: Apache Spark > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Assignee: Apache Spark >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > When HiveThriftServer2 is started, we create some directories for > hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not > remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23745: Assignee: (was: Apache Spark) > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > When HiveThriftServer2 is started, we create some directories for > hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not > remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406050#comment-16406050 ] Apache Spark commented on SPARK-23745: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/20864 > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > When HiveThriftServer2 is started, we create some directories for > hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not > remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Description: !2018-03-20_164832.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories. The directories could accumulate a lot. was: !2018-03-20_164832.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories. The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Description: !2018-03-20_164832.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. was: when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories.The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Description: when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. was: !image-2018-03-20-16-49-00-175.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories.The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Attachment: 2018-03-20_164832.png > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !image-2018-03-20-16-49-00-175.png! > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories.The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
zuotingbing created SPARK-23745: --- Summary: Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped Key: SPARK-23745 URL: https://issues.apache.org/jira/browse/SPARK-23745 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: linux Reporter: zuotingbing !image-2018-03-20-16-49-00-175.png! When HiveThriftServer2 is started, we create some directories for hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
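The fix requested here follows a standard pattern: any scratch directory created for {{hive.downloaded.resources.dir}} at startup should be registered for removal at shutdown. The pull request linked above presumably does this on the JVM side inside the server code; as a hedged, language-neutral illustration of the same create-then-register-cleanup pattern (the names and the {{atexit}} mechanism are assumptions of this sketch, not Spark code):

{code:python}
import atexit
import shutil
import tempfile

def create_resources_dir(prefix="hive_resources_"):
    """Create a scratch directory and make sure it is removed at shutdown."""
    path = tempfile.mkdtemp(prefix=prefix)
    # Register cleanup so a normal stop does not leave the directory behind.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path

if __name__ == "__main__":
    resources_dir = create_resources_dir()
    print("downloaded resources go to", resources_dir)
    # ... server runs, downloads resources into resources_dir ...
    # On interpreter exit, the registered atexit hook removes resources_dir.
{code}

On the JVM the equivalent is a shutdown hook, or cleanup in the server's stop path, that recursively deletes the directories it created.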
[jira] [Updated] (SPARK-23691) Use sql_conf util in PySpark tests where possible
[ https://issues.apache.org/jira/browse/SPARK-23691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23691: - Fix Version/s: 2.3.1 > Use sql_conf util in PySpark tests where possible > - > > Key: SPARK-23691 > URL: https://issues.apache.org/jira/browse/SPARK-23691 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > https://github.com/apache/spark/commit/d6632d185e147fcbe6724545488ad80dce20277e > added a useful util > {code} > @contextmanager > def sql_conf(self, pairs): > ... > {code} > to allow configurations to be set and then unset within a block: > {code} > with self.sql_conf({"spark.blah.blah.blah": "blah"}): > # test code > {code} > It would be nicer if we used it where possible. > Note that there already appear to be a few places in the unittest classes that > change configurations for tests without restoring the original values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
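The real helper lives in the commit linked above as a method on the PySpark test base class; the following is only a rough sketch of the pattern it describes, written as a free function against a local {{SparkSession}} (the session setup and the unset-sentinel are assumptions of the sketch):

{code:python}
from contextlib import contextmanager

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("sql-conf-demo").getOrCreate()

@contextmanager
def sql_conf(pairs):
    """Temporarily set SQL confs for a block, then restore the previous values."""
    keys = list(pairs.keys())
    # Remember what each conf was before entering the block ("<undefined>" if unset).
    old_values = [spark.conf.get(key, "<undefined>") for key in keys]
    try:
        for key, value in pairs.items():
            spark.conf.set(key, value)
        yield
    finally:
        for key, old in zip(keys, old_values):
            if old == "<undefined>":
                spark.conf.unset(key)
            else:
                spark.conf.set(key, old)

# Usage, mirroring the snippet in the description:
with sql_conf({"spark.sql.shuffle.partitions": "4"}):
    assert spark.conf.get("spark.sql.shuffle.partitions") == "4"
{code}

Because the restore happens in a {{finally}} block, a failing test cannot leak a modified conf into later tests, which is exactly the problem the last note in the description points at.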
[jira] [Comment Edited] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
[ https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405962#comment-16405962 ] Alex Ott edited comment on SPARK-20964 at 3/20/18 8:36 AM: --- Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: {{SELECT state, count( * ) FROM user_addresses *where* group by state;}} In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression was (Author: alexott): Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: {{SELECT state, count(*) FROM user_addresses *where* group by state;}} In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression > Make some keywords reserved along with the ANSI/SQL standard > > > Key: SPARK-20964 > URL: https://issues.apache.org/jira/browse/SPARK-20964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current Spark has many non-reserved words that are essentially reserved > in the ANSI/SQL standard > (http://developer.mimer.se/validator/sql-reserved-words.tml). > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 > This is because there are many datasources (for instance twitter4j) that > unfortunately use reserved keywords for column names (See [~hvanhovell]'s > comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). > We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
[ https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405962#comment-16405962 ] Alex Ott edited comment on SPARK-20964 at 3/20/18 8:35 AM: --- Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: {{SELECT state, count(*) FROM user_addresses *where* group by state;}} In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression was (Author: alexott): Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: SELECT state, count(*) FROM user_addresses *where* group by state; In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression > Make some keywords reserved along with the ANSI/SQL standard > > > Key: SPARK-20964 > URL: https://issues.apache.org/jira/browse/SPARK-20964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current Spark has many non-reserved words that are essentially reserved > in the ANSI/SQL standard > (http://developer.mimer.se/validator/sql-reserved-words.tml). > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 > This is because there are many datasources (for instance twitter4j) that > unfortunately use reserved keywords for column names (See [~hvanhovell]'s > comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). > We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
[ https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405962#comment-16405962 ] Alex Ott commented on SPARK-20964: -- Just want to add another example of a query that is rejected by SQLite but works fine with Spark SQL: SELECT state, count(*) FROM user_addresses *where* group by state; In this case the *where* keyword is treated as a table alias. Although this matches the SQL specification, it just hides the mistake I made in this query by forgetting to add a condition expression. > Make some keywords reserved along with the ANSI/SQL standard > > > Key: SPARK-20964 > URL: https://issues.apache.org/jira/browse/SPARK-20964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current Spark has many non-reserved words that are essentially reserved > in the ANSI/SQL standard > (http://developer.mimer.se/validator/sql-reserved-words.tml). > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 > This is because there are many datasources (for instance twitter4j) that > unfortunately use reserved keywords for column names (See [~hvanhovell]'s > comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). > We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
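A minimal PySpark sketch of the behaviour the comment describes (assumptions: a local session and a tiny stand-in for {{user_addresses}}; whether the query still parses this way depends on the Spark version and its ANSI/reserved-keyword settings):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("reserved-keyword-demo").getOrCreate()

# A small stand-in for the user_addresses table from the comment.
spark.createDataFrame(
    [("CA", "1 Main St"), ("CA", "2 Oak Ave"), ("NY", "3 Elm St")],
    ["state", "address"],
).createOrReplaceTempView("user_addresses")

# The WHERE clause has no condition, yet the statement may still parse:
# `where` is then read as a table alias rather than rejected as a keyword.
df = spark.sql("SELECT state, count(*) FROM user_addresses where GROUP BY state")
df.show()

# The parsed/analyzed plans make the silent aliasing visible.
df.explain(True)
{code}

Against SQLite, or any engine that reserves WHERE, the same statement fails to parse, which is the stricter behaviour this ticket proposes.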