[jira] [Commented] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
[ https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551527#comment-16551527 ] Apache Spark commented on SPARK-24873: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/21784 > increase switch to shielding frequent interaction reports with yarn > --- > > Key: SPARK-24873 > URL: https://issues.apache.org/jira/browse/SPARK-24873 > Project: Spark > Issue Type: Bug > Components: Spark Shell, YARN >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Major > Attachments: pic.jpg > > > There are too many frequent interaction reports when I use the spark-shell > command, which affects my input, so I think a switch is needed to suppress the > frequent interaction reports from YARN > > !pic.jpg! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
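Until such a switch lands, the usual way to quiet the per-second YARN application reports in spark-shell is to raise the log level of the YARN client logger. This is a workaround under the assumption that the noise comes from INFO logging in `org.apache.spark.deploy.yarn`; it is not necessarily what the linked pull request implements.

```properties
# conf/log4j.properties — suppress the frequent YARN application-report lines
# (assumed workaround; the linked PR may add a dedicated switch instead)
log4j.logger.org.apache.spark.deploy.yarn=WARN
```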
[jira] [Resolved] (SPARK-24836) New option - ignoreExtension
[ https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24836. - Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 2.4.0 > New option - ignoreExtension > > > Key: SPARK-24836 > URL: https://issues.apache.org/jira/browse/SPARK-24836 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > We need to add a new option for the Avro datasource - *ignoreExtension*. It should > control whether the .avro file extension check is ignored. If it is set to *true* (by > default), files with and without .avro extensions should be loaded. Example > of usage: > {code:scala} > spark > .read > .option("ignoreExtension", false) > .avro("path to avro files") > {code} > The option duplicates Hadoop's config > avro.mapred.ignore.inputs.without.extension, which the Avro datasource currently takes > into account and which can be set like: > {code:scala} > spark > .sqlContext > .sparkContext > .hadoopConfiguration > .set("avro.mapred.ignore.inputs.without.extension", "true") > {code} > The ignoreExtension option must override > avro.mapred.ignore.inputs.without.extension.
[jira] [Resolved] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[ https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24879. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.2 > NPE in Hive partition filter pushdown for `partCol IN (NULL, )` > --- > > Key: SPARK-24879 > URL: https://issues.apache.org/jira/browse/SPARK-24879 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: William Sheu >Assignee: William Sheu >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > The following query triggers a NPE: > {code:java} > create table foo (col1 int) partitioned by (col2 int); > select * from foo where col2 in (1, NULL); > {code} > We try to push down the filter to Hive in order to do partition pruning, but > the filter converter breaks on a `null`. > Here's the stack: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) > at > 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) > at > 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189) >
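The root cause is visible in the first frame: `ExtractableLiteral` extracts a string from each literal in the IN list, and calling `toString` on a NULL literal's value throws. A minimal pure-Scala sketch of the failure mode and a guarded fix (illustrative names, not Spark's actual Catalyst/shim classes):

```scala
// Illustrative model of the shim's literal extraction (not Spark's real classes).
case class Literal(value: Any)

// Buggy version: NPE on Literal(null), mirroring the HiveShim failure.
object ExtractableLiteralBuggy {
  def unapply(l: Literal): Option[String] = Some(l.value.toString)
}

// Guarded version: a NULL literal is simply not extractable.
object ExtractableLiteral {
  def unapply(l: Literal): Option[String] = Option(l.value).map(_.toString)
}

// Filter conversion: extractable literals convert, NULL falls through to None.
def convert(lit: Literal): Option[String] = lit match {
  case ExtractableLiteral(s) => Some(s)
  case _                     => None
}
```

With the guard, a predicate containing a NULL literal is skipped for pushdown instead of crashing the whole query; `convert(Literal(1))` yields `Some("1")` while `convert(Literal(null))` yields `None`.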
[jira] [Assigned] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[ https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24879: --- Assignee: William Sheu > NPE in Hive partition filter pushdown for `partCol IN (NULL, )` > --- > > Key: SPARK-24879 > URL: https://issues.apache.org/jira/browse/SPARK-24879 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: William Sheu >Assignee: William Sheu >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > The following query triggers a NPE: > {code:java} > create table foo (col1 int) partitioned by (col2 int); > select * from foo where col2 in (1, NULL); > {code} > We try to push down the filter to Hive in order to do partition pruning, but > the filter converter breaks on a `null`. > Here's the stack: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) > at > org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189) > at >
[jira] [Comment Edited] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551457#comment-16551457 ] Liang-Chi Hsieh edited comment on SPARK-24875 at 7/21/18 12:21 AM: --- hmm, I think for calculation of precision, recall and true/false positive rate, we should only care about exact numbers, not approximate ones. Thus, is it reasonable to use countByValueApprox here? was (Author: viirya): hmm, I think for calculation of precision, recall and true/false positive rate, we should only care about exact calculation but approximate one. Thus is it reasonable to use countByValueApprox here? > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get the count by > class/label: > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, > I don't know how this could be ported there.
[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551457#comment-16551457 ] Liang-Chi Hsieh commented on SPARK-24875: - hmm, I think for calculation of precision, recall and true/false positive rate, we should only care about exact calculation, not an approximate one. Thus, is it reasonable to use countByValueApprox here? > MulticlassMetrics should offer a more efficient way to compute count by label > - > > Key: SPARK-24875 > URL: https://issues.apache.org/jira/browse/SPARK-24875 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Antoine Galataud >Priority: Minor > > Currently _MulticlassMetrics_ calls _countByValue_() to get the count by > class/label: > {code:java} > private lazy val labelCountByClass: Map[Double, Long] = > predictionAndLabels.values.countByValue() > {code} > If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large > test dataset), it will lead to poor execution performance. > One option could be to allow using _countByValueApprox_ (could require adding > an extra configuration param for MulticlassMetrics). > Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, > I don't know how this could be ported there.
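For context, the exact per-label count that `countByValue` produces can be sketched in plain Scala; `countByValueApprox` would instead return bounded estimates of these counts, which is the crux of the objection above. (Illustrative sketch only; `MulticlassMetrics` computes this distributed with `RDD.countByValue`.)

```scala
// Exact count of each true label among (prediction, label) pairs,
// mirroring what labelCountByClass computes (plain Scala, no Spark).
val predictionAndLabels: Seq[(Double, Double)] =
  Seq((1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0))

val labelCountByClass: Map[Double, Long] =
  predictionAndLabels
    .map(_._2)                 // keep only the true labels
    .groupBy(identity)
    .map { case (label, xs) => label -> xs.size.toLong }
// labelCountByClass == Map(1.0 -> 3, 0.0 -> 1)
```

Precision/recall are ratios of such counts, so replacing the exact counts with confidence-interval estimates would change the semantics of every derived metric.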
[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
[ https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551416#comment-16551416 ] Liang-Chi Hsieh commented on SPARK-24862: - Isn't it inconsistent between the schema and the ser/de? And for serializer, for example, how can we get the {{y}} from {{Multi}} objects? > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists > > > Key: SPARK-24862 > URL: https://issues.apache.org/jira/browse/SPARK-24862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Antonio Murgia >Priority: Major > > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists. > For example if I create a case class with multiple constructor argument lists: > {code:java} > case class Multi(x: String)(y: Int){code} > Scala creates a product with arity 1, while if I apply > {code:java} > Encoders.product[Multi].schema.printTreeString{code} > I get > {code:java} > root > |-- x: string (nullable = true) > |-- y: integer (nullable = false){code} > That is not consistent and leads to: > {code:java} > Error while encoding: java.lang.RuntimeException: Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > 
it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:296) > at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679) > at > org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at > 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692) > at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at >
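The arity mismatch above is observable without Spark at all: Scala's `Product` machinery covers only the first parameter list of a case class, so there is no accessor for `y` that an encoder could call. A quick check in plain Scala:

```scala
// Only the first parameter list participates in the case class's
// Product interface; y is a constructor-only parameter with no accessor.
case class Multi(x: String)(y: Int)

val m = Multi("a")(42)
// m.productArity == 1 and m.productElement(0) == "a"; m.y does not compile.
// Derived equality likewise uses only the first list:
// Multi("a")(1) == Multi("a")(2) is true.
```

This is why deriving the schema from all parameter lists while the object only exposes the first is internally inconsistent.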
[jira] [Resolved] (SPARK-24488) Analyzer throws when generator is aliased multiple times
[ https://issues.apache.org/jira/browse/SPARK-24488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-24488. --- Resolution: Fixed Assignee: Brandon Krieger Fix Version/s: 2.4.0 > Analyzer throws when generator is aliased multiple times > > > Key: SPARK-24488 > URL: https://issues.apache.org/jira/browse/SPARK-24488 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Brandon Krieger >Assignee: Brandon Krieger >Priority: Minor > Fix For: 2.4.0 > > > Currently, the Analyzer throws an exception if you try to nest a generator. > However, it special cases generators "nested" in an alias, and allows that. > If you try to alias a generator twice, it is not caught by the special case, > so an exception is thrown: > > {code:java} > scala> Seq(("a", "b")) > .toDF("col1","col2") > .select(functions.array('col1,'col2).as("arr")) > .select(functions.explode('arr).as("first").as("second")) > .collect() > org.apache.spark.sql.AnalysisException: Generators are not supported when > it's nested in expressions, but got: explode(arr) AS `first`; > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$23.applyOrElse(Analyzer.scala:1604) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$23.applyOrElse(Analyzer.scala:1601) > {code} > > In reality, aliasing twice is fine, so we can fix this by trimming non > top-level aliases.
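The fix described ("trimming non top-level aliases") can be sketched on a toy expression tree; the names below are illustrative stand-ins, not Spark's actual Catalyst classes:

```scala
// Toy expression tree modeling a doubly aliased generator.
sealed trait Expr
case class Explode(child: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Remove every alias below the root, keeping only the outermost name.
def stripAliases(e: Expr): Expr = e match {
  case Alias(child, _) => stripAliases(child)
  case other           => other
}

// The fix: a generator aliased N times reduces to a single alias over
// the generator, which the analyzer's special case already accepts.
def trimNonTopLevelAliases(e: Expr): Expr = e match {
  case Alias(child, name) => Alias(stripAliases(child), name)
  case other              => other
}
```

Applied to the example above, `explode(arr).as("first").as("second")` collapses to a single `explode(arr) AS second`, so no exception is needed.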
[jira] [Resolved] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-24880. -- Resolution: Fixed Fix Version/s: 2.4.0 > Fix the group id for spark-kubernetes-integration-tests > --- > > Key: SPARK-24880 > URL: https://issues.apache.org/jira/browse/SPARK-24880 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > The correct group id should be `org.apache.spark`. This is causing the > nightly build failure: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/2295/console > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on > project spark-kubernetes-integration-tests_2.11: Failed to deploy artifacts: > Could not transfer artifact > spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-20180720.101629-1 > from/to apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Access denied > to: > https://repository.apache.org/content/repositories/snapshots/spark-kubernetes-integration-tests/spark-kubernetes-integration-tests_2.11/2.4.0-SNAPSHOT/spark-kubernetes-integration-tests_2.11-2.4.0-20180720.101629-1.jar, > ReasonPhrase: Forbidden. -> [Help 1] > [ERROR] > {code}
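The access-denied URL shows the artifact being deployed under the group id `spark-kubernetes-integration-tests` rather than `org.apache.spark`. The fix presumably amounts to correcting the coordinates in the module's POM, along these lines (file path assumed, not confirmed by this message):

```xml
<!-- resource-managers/kubernetes/integration-tests/pom.xml (path assumed) -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-kubernetes-integration-tests_2.11</artifactId>
<version>2.4.0-SNAPSHOT</version>
```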
[jira] [Resolved] (SPARK-24876) Simplify schema serialization
[ https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24876. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Simplify schema serialization > - > > Key: SPARK-24876 > URL: https://issues.apache.org/jira/browse/SPARK-24876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Previously, in the refactoring of the Avro Serializer and Deserializer, a new > class SerializableSchema was created for serializing the Avro schema. > [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18] > > On second thought, we can use the `toString` method for serialization. After > that, the JSON-format schema can be parsed on the executor. This makes the code much > simpler.
[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
[ https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551312#comment-16551312 ] Antonio Murgia commented on SPARK-24862: Yeah, they are definitely not supported. Therefore I think the encoder generator should generate the schema based on the first parameter list and the ser/de based on all the param lists. I can put together a PR if you’d like. > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists > > > Key: SPARK-24862 > URL: https://issues.apache.org/jira/browse/SPARK-24862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Antonio Murgia >Priority: Major > > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists. > For example if I create a case class with multiple constructor argument lists: > {code:java} > case class Multi(x: String)(y: Int){code} > Scala creates a product with arity 1, while if I apply > {code:java} > Encoders.product[Multi].schema.printTreeString{code} > I get > {code:java} > root > |-- x: string (nullable = true) > |-- y: integer (nullable = false){code} > That is not consistent and leads to: > {code:java} > Error while encoding: java.lang.RuntimeException: Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, 
assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:296) > at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679) > at > org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at > 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692) > at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at >
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551299#comment-16551299 ] Thomas Graves commented on SPARK-23128: --- [~carsonwang] I'm curious whether you are still running with this and how it has been running. This is definitely interesting; it looks like it just stalled waiting on reviews, but I wasn't sure. > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However, there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from the EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > a 3-table join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges, but currently two separate ExchangeCoordinators will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly, like changing the execution plan and > handling skewed joins at runtime. 
> We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551296#comment-16551296 ] Abhishek Madav commented on SPARK-24864: Thanks for the reply. The views are currently created by the customer, and the spark-job hasn't been able to keep up with the upgrade from 1.6 -> 2.0+, hence they feel it is a regression. Is there anything that can be done to go back to the 1.6 way of column referencing? > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > *Steps to reproduce:* > 1. Create a simple table, say src1 > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2. Create a view, say with name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro case against > Spark 1.6 and that worked fine. Any inputs are appreciated.
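The AnalysisException above already shows the root cause: at analysis time the inner query's output columns are [id, upper(name)], so the outer reference `_c1` has nothing to bind to, whereas Hive materializes positional aliases like `_c1` for unaliased expressions. A toy illustration of that name-resolution mismatch (plain Python, not Spark's analyzer code; the `resolve` helper is hypothetical):

```python
def resolve(reference, available_columns):
    """Return the matching column name or raise, mimicking how an
    analyzer binds a column reference (toy sketch, not Spark's code)."""
    matches = [c for c in available_columns if c == reference]
    if not matches:
        raise ValueError(
            f"cannot resolve '{reference}' given input columns: {available_columns}")
    return matches[0]

# Hive names the unaliased expression `_c1`, so the view body binds:
print(resolve("_c1", ["id", "_c1"]))

# Spark 2.x exposes the pretty name `upper(name)` instead, so `_c1` fails:
try:
    resolve("_c1", ["id", "upper(name)"])
except ValueError as e:
    print(e)
```

This also suggests why aliasing the expression inside the subquery (so no auto-generated name is ever referenced) sidesteps the problem, though whether that is acceptable depends on being able to recreate the customer's views.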
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551240#comment-16551240 ] Thomas Graves commented on SPARK-24615: --- The other thing, which I think I mentioned above: could this handle a request like 1 node with 4GB and 4 cores, plus 3 nodes with 2 GPUs, 10GB, and 1 core each? > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Current Spark's scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # CPU cores are usually more numerous than accelerators on one node; using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. 
> > CC [~yanboliang] [~merlintang]
[jira] [Updated] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[ https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24879: Shepherd: Xiao Li (was: William Sheu) > NPE in Hive partition filter pushdown for `partCol IN (NULL, )` > --- > > Key: SPARK-24879 > URL: https://issues.apache.org/jira/browse/SPARK-24879 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: William Sheu >Priority: Major > > The following query triggers a NPE: > {code:java} > create table foo (col1 int) partitioned by (col2 int); > select * from foo where col2 in (1, NULL); > {code} > We try to push down the filter to Hive in order to do partition pruning, but > the filter converter breaks on a `null`. > Here's the stack: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) > at > org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at >
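The stack shows the NPE arising while `convertFilters` turns each IN-list element into a Hive filter literal via `ExtractableLiteral.unapply`: a null literal has no textual Hive form, and the extractor dereferences it. A hedged sketch of the general defensive idea in plain Python (this is illustrative, not Spark's Scala shim and not necessarily what the linked PR does): a predicate containing a null literal is simply reported as non-convertible, so it is not pushed down and pruning falls back to Spark:

```python
def to_hive_filter(column, values):
    """Build a Hive partition-pruning filter string for `column IN (values)`.
    Returns None when any literal is null, i.e. the predicate cannot be
    converted and should not be pushed down (sketch, not Spark's code)."""
    if any(v is None for v in values):
        return None  # null has no Hive filter literal; skip pushdown, no NPE
    literals = ", ".join(
        str(v) if isinstance(v, int) else f'"{v}"' for v in values)
    return f"({column}) IN ({literals})"

print(to_hive_filter("col2", [1, 2]))     # pushed down as a filter string
print(to_hive_filter("col2", [1, None]))  # None: not convertible, handled in Spark
```

The trade-off is that queries with null IN-list elements lose server-side pruning for that predicate, which is still strictly better than crashing.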
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551233#comment-16551233 ] Thomas Graves commented on SPARK-24615: --- OK, thinking about this a bit more, I slightly misread what it was applying to. Really you are associating it with the new RDD that will be created (val rddWithGPUResult = rdd.withResources.xxx), not the original, and to regenerate rddWithGPUResult you would want to know it was created using those resources. The thing that isn't clear to me is the scoping of this. For instance, say I have the code val rdd1 = {{sc.textFile("README.md")}} val rdd2 = rdd1.withResources.mapPartitions().collect(). Does the withResources apply to the entire line up to the action? What if I change it to val rdd2 = rdd1.withResources.mapPartitions() val res = rdd2.collect()? Does the withResources then apply only to the mapPartitions, which is really what I think you want for some of the ML algorithms? So we need to define what it applies to. Something similar but with more obvious scope: val rdd2 = withResources { rdd1.mapPartitions() } The scope of the above would be very obvious to the user. There are also things people could do like: val rdd1 = rdd.withResources(x).mapPartitions() val rdd2 = rdd.withResources(y).mapPartitions() val rdd3 = rdd1.join(rdd2) I think in this case rdd1 and rdd2 have to be individually materialized before you do the join for rdd3, so it's more like an implicit val rdd1 = rdd.withResources(x).mapPartitions().eval(). You end up putting in some stage boundaries. Have you thought about the scope, and do you have ideas around this? 
> Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Current Spark's scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # CPU cores are usually more numerous than accelerators on one node; using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
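The scoping question raised in the comment above can be made concrete with a toy model: treat the resource spec as a property captured by each transformation at creation time, so withResources applies to the RDDs derived from that point onward, and joining parents that carry different specs implies a stage boundary. All names here (RDD, with_resources, map_partitions) are hypothetical Python stand-ins for discussion, not Spark's API:

```python
class RDD:
    def __init__(self, name, resources=None, parents=()):
        self.name, self.resources, self.parents = name, resources, tuple(parents)

    def with_resources(self, spec):
        # Returns a new handle whose downstream transformations carry `spec`.
        return RDD(self.name, spec, self.parents)

    def map_partitions(self):
        # The child inherits the resource spec in force at creation time.
        return RDD(self.name + ".mapPartitions", self.resources, (self,))

    def join(self, other):
        # Parents with conflicting specs would have to be materialized
        # separately first, i.e. the join implies a stage boundary.
        boundary = self.resources != other.resources
        return RDD("join", None, (self, other)), boundary

rdd = RDD("input")
rdd1 = rdd.with_resources({"gpu": 2}).map_partitions()
rdd2 = rdd.with_resources({"gpu": 0}).map_partitions()
joined, needs_stage_boundary = rdd1.join(rdd2)
print(needs_stage_boundary)  # True: rdd1 and rdd2 were built under different specs
```

Under this model the answer to "does it apply to the whole line up to the action?" is no: the spec applies to each RDD derived after the with_resources call, which matches the per-mapPartitions behavior the comment argues ML algorithms actually want.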
[jira] [Assigned] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[ https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24879: Assignee: (was: Apache Spark) > NPE in Hive partition filter pushdown for `partCol IN (NULL, )` > --- > > Key: SPARK-24879 > URL: https://issues.apache.org/jira/browse/SPARK-24879 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: William Sheu >Priority: Major > > The following query triggers a NPE: > {code:java} > create table foo (col1 int) partitioned by (col2 int); > select * from foo where col2 in (1, NULL); > {code} > We try to push down the filter to Hive in order to do partition pruning, but > the filter converter breaks on a `null`. > Here's the stack: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) > at > org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) >
[jira] [Commented] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[ https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551200#comment-16551200 ] Apache Spark commented on SPARK-24879: -- User 'PenguinToast' has created a pull request for this issue: https://github.com/apache/spark/pull/21832 > NPE in Hive partition filter pushdown for `partCol IN (NULL, )` > --- > > Key: SPARK-24879 > URL: https://issues.apache.org/jira/browse/SPARK-24879 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: William Sheu >Priority: Major > > The following query triggers a NPE: > {code:java} > create table foo (col1 int) partitioned by (col2 int); > select * from foo where col2 in (1, NULL); > {code} > We try to push down the filter to Hive in order to do partition pruning, but > the filter converter breaks on a `null`. > Here's the stack: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) > at > 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) > at > 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) > at >
[jira] [Assigned] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[ https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24879: Assignee: Apache Spark > NPE in Hive partition filter pushdown for `partCol IN (NULL, )` > --- > > Key: SPARK-24879 > URL: https://issues.apache.org/jira/browse/SPARK-24879 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: William Sheu >Assignee: Apache Spark >Priority: Major > > The following query triggers a NPE: > {code:java} > create table foo (col1 int) partitioned by (col2 int); > select * from foo where col2 in (1, NULL); > {code} > We try to push down the filter to Hive in order to do partition pruning, but > the filter converter breaks on a `null`. > Here's the stack: > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) > at > org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) > at > org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:355) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) > at > 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189) > at >
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551198#comment-16551198 ] Mridul Muralidharan commented on SPARK-24615: - [~tgraves] This was indeed a recurring issue - the ability to modulate asks to the RM based on current requirements. What you bring up is an excellent point - changing resource requirements would be very useful, particularly for applications with heterogeneous resource needs. Even currently, when executor_memory/executor_cores does not align well with stage requirements, we end up with OOMs, which leads to over-provisioning memory and thus suboptimal use. A GPU/accelerator-aware scheduler is an extension of the same idea - we simply have other resources to consider. I agree with [~tgraves] that a more general way to model this would look at all resources (when declaratively specified, of course) and use the information to allocate resources (from the RM) and for task scheduling (within Spark). > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Current Spark's scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # CPU cores are usually more numerous than accelerators on one node; using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. 
> # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
[jira] [Resolved] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)
[ https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22880. - Resolution: Fixed Assignee: Daniel van der Ende Fix Version/s: 2.4.0 > Add option to cascade jdbc truncate if database supports this (PostgreSQL and > Oracle) > - > > Key: SPARK-22880 > URL: https://issues.apache.org/jira/browse/SPARK-22880 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel van der Ende >Assignee: Daniel van der Ende >Priority: Minor > Fix For: 2.4.0 > > > When truncating tables, PostgreSQL and Oracle support a `CASCADE` option on > `TRUNCATE`. This cascades the truncate to tables with foreign key constraints on a column > in the truncated table. It would be nice to be able to optionally > set this `CASCADE` option for PostgreSQL and Oracle.
[jira] [Updated] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-24880: - Description: The correct group id should be `org.apache.spark`. This is causing the nightly build failure: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/2295/console {code} [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on project spark-kubernetes-integration-tests_2.11: Failed to deploy artifacts: Could not transfer artifact spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-20180720.101629-1 from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): Access denied to: https://repository.apache.org/content/repositories/snapshots/spark-kubernetes-integration-tests/spark-kubernetes-integration-tests_2.11/2.4.0-SNAPSHOT/spark-kubernetes-integration-tests_2.11-2.4.0-20180720.101629-1.jar, ReasonPhrase: Forbidden. -> [Help 1] [ERROR] {code} > Fix the group id for spark-kubernetes-integration-tests > --- > > Key: SPARK-24880 > URL: https://issues.apache.org/jira/browse/SPARK-24880 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > The correct group id should be `org.apache.spark`. 
This is causing the > nightly build failure: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/2295/console > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on > project spark-kubernetes-integration-tests_2.11: Failed to deploy artifacts: > Could not transfer artifact > spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-20180720.101629-1 > from/to apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Access denied > to: > https://repository.apache.org/content/repositories/snapshots/spark-kubernetes-integration-tests/spark-kubernetes-integration-tests_2.11/2.4.0-SNAPSHOT/spark-kubernetes-integration-tests_2.11-2.4.0-20180720.101629-1.jar, > ReasonPhrase: Forbidden. -> [Help 1] > [ERROR] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
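The fix amounts to declaring proper Maven coordinates in the integration-tests module. A sketch of the relevant {{pom.xml}} fragment — the artifact and version values are taken from the error message above for illustration; the {{groupId}} line is the actual point of the fix:

{code:xml}
<!-- spark-kubernetes-integration-tests pom.xml (sketch) -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-kubernetes-integration-tests_2.11</artifactId>
<version>2.4.0-SNAPSHOT</version>
{code}

With the wrong group id, the deploy plugin tried to publish under {{snapshots/spark-kubernetes-integration-tests/...}} instead of {{snapshots/org/apache/spark/...}}, which the repository rejects with the 403 shown above.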
[jira] [Assigned] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24880: Assignee: Shixiong Zhu (was: Apache Spark) > Fix the group id for spark-kubernetes-integration-tests > --- > > Key: SPARK-24880 > URL: https://issues.apache.org/jira/browse/SPARK-24880 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major >
[jira] [Commented] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551191#comment-16551191 ] Apache Spark commented on SPARK-24880: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/21831 > Fix the group id for spark-kubernetes-integration-tests > --- > > Key: SPARK-24880 > URL: https://issues.apache.org/jira/browse/SPARK-24880 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major >
[jira] [Assigned] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests
[ https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24880: Assignee: Apache Spark (was: Shixiong Zhu) > Fix the group id for spark-kubernetes-integration-tests > --- > > Key: SPARK-24880 > URL: https://issues.apache.org/jira/browse/SPARK-24880 > Project: Spark > Issue Type: Bug > Components: Build, Kubernetes >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests
Shixiong Zhu created SPARK-24880: Summary: Fix the group id for spark-kubernetes-integration-tests Key: SPARK-24880 URL: https://issues.apache.org/jira/browse/SPARK-24880 Project: Spark Issue Type: Bug Components: Build, Kubernetes Affects Versions: 2.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu
[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists
[ https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551174#comment-16551174 ] Liang-Chi Hsieh commented on SPARK-24862: - Even we only retrieve the first parameter list at {{getConstructorParameters}}, when we need to deserialize {{Multi}}, we don't have the {{y}} in input columns because we only serialize {{x}}. I think the multiple parameter lists case class is not supported for Encoder. > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists > > > Key: SPARK-24862 > URL: https://issues.apache.org/jira/browse/SPARK-24862 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Antonio Murgia >Priority: Major > > Spark Encoder is not consistent to scala case class semantic for multiple > argument lists. > For example if I create a case class with multiple constructor argument lists: > {code:java} > case class Multi(x: String)(y: Int){code} > Scala creates a product with arity 1, while if I apply > {code:java} > Encoders.product[Multi].schema.printTreeString{code} > I get > {code:java} > root > |-- x: string (nullable = true) > |-- y: integer (nullable = false){code} > That is not consistent and leads to: > {code:java} > Error while encoding: java.lang.RuntimeException: Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > Couldn't find y on class > it.enel.next.platform.service.events.common.massive.immutable.Multi > staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).x, true) AS x#0 > assertnotnull(assertnotnull(input[0, > it.enel.next.platform.service.events.common.massive.immutable.Multi, > true])).y AS y#1 > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:296) > at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at > it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682) > at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) > at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679) > at > 
org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at > org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692) > at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:392) > at
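The mismatch can be seen directly from Scala's {{Product}} machinery; a small sketch:

{code:scala}
// Only the first parameter list takes part in the generated product:
case class Multi(x: String)(y: Int)

Multi("a")(1).productArity   // 1 -- `y` is a plain constructor argument, not a field
// Encoders.product[Multi] nevertheless derives a two-field schema (x, y),
// so serialization writes only `x` while deserialization looks for `y`,
// which produces the "Couldn't find y" error above.
{code}

This matches the comment above: the encoder's schema and Scala's own case-class semantics disagree, so rejecting multi-parameter-list case classes (or encoding only the first list) is the consistent behavior.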
[jira] [Resolved] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
[ https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24852. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21799 [https://github.com/apache/spark/pull/21799] > Have spark.ml training use updated `Instrumentation` APIs. > -- > > Key: SPARK-24852 > URL: https://issues.apache.org/jira/browse/SPARK-24852 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > Port spark.ml code to use the new methods on the `Instrumentation` class and > remove the old methods & constructor.
[jira] [Resolved] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24864. - Resolution: Won't Fix > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > *Steps to reproduce:* > 1: Create a simple table, say src > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with name 
vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551123#comment-16551123 ] Xiao Li commented on SPARK-24864: - Yeah, our generated alias names are different from the ones generated by Hive. Please explicitly specify the alias names in your query. > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > *Steps to reproduce:* > 1: Create a simple table, say src > 
{code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
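As the comment suggests, naming the column explicitly in the inner query sidesteps the auto-generated ordinal entirely; a sketch of the adjusted view definition (table and view names as in the repro steps):

{code:scala}
// Alias upper(name) inside the subquery instead of relying on the
// Hive-generated `_c1` name, which Spark names differently.
spark.sql("""
  CREATE VIEW vsrc1new AS
  SELECT id, uname
  FROM (SELECT id, upper(name) AS uname FROM src1) t
""")
spark.sql("SELECT * FROM vsrc1new").show()
{code}

Because the view text no longer references any engine-generated column name, it resolves the same way in hive-cli/beeline and in Spark.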
[jira] [Created] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
William Sheu created SPARK-24879: Summary: NPE in Hive partition filter pushdown for `partCol IN (NULL, )` Key: SPARK-24879 URL: https://issues.apache.org/jira/browse/SPARK-24879 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.3.0 Reporter: William Sheu The following query triggers a NPE: {code:java} create table foo (col1 int) partitioned by (col2 int); select * from foo where col2 in (1, NULL); {code} We try to push down the filter to Hive in order to do partition pruning, but the filter converter breaks on a `null`. Here's the stack: {code:java} java.lang.NullPointerException at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601) at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609) at org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671) at org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) at org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at 
scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:355) at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955) at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172) at org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164) at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190) at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418) at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at
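Until the filter converter handles the null literal, the predicate can be rewritten without it: since {{col2 = NULL}} never evaluates to true, dropping NULL from the IN list matches exactly the same rows. A sketch against the repro table above:

{code:scala}
// `col2 IN (1, NULL)` can only ever match rows where col2 = 1, so the NULL
// literal can simply be removed to avoid the partition-pruning NPE.
spark.sql("SELECT * FROM foo WHERE col2 IN (1)").show()
{code}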
[jira] [Commented] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550981#comment-16550981 ] Apache Spark commented on SPARK-24878: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/21830 > Fix reverse function for array type of primitive type containing null. > -- > > Key: SPARK-24878 > URL: https://issues.apache.org/jira/browse/SPARK-24878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Priority: Major > > If we use {{reverse}} function for array type of primitive type containing > {{null}} and the child array is {{UnsafeArrayData}}, the function returns a > wrong result because {{UnsafeArrayData}} doesn't define the behavior of > re-assignment, especially we can't set a valid value after we set {{null}}.
[jira] [Assigned] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24878: Assignee: Apache Spark > Fix reverse function for array type of primitive type containing null. > -- > > Key: SPARK-24878 > URL: https://issues.apache.org/jira/browse/SPARK-24878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > If we use {{reverse}} function for array type of primitive type containing > {{null}} and the child array is {{UnsafeArrayData}}, the function returns a > wrong result because {{UnsafeArrayData}} doesn't define the behavior of > re-assignment, especially we can't set a valid value after we set {{null}}.
[jira] [Assigned] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
[ https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24878: Assignee: (was: Apache Spark) > Fix reverse function for array type of primitive type containing null. > -- > > Key: SPARK-24878 > URL: https://issues.apache.org/jira/browse/SPARK-24878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Priority: Major > > If we use {{reverse}} function for array type of primitive type containing > {{null}} and the child array is {{UnsafeArrayData}}, the function returns a > wrong result because {{UnsafeArrayData}} doesn't define the behavior of > re-assignment, especially we can't set a valid value after we set {{null}}.
[jira] [Created] (SPARK-24878) Fix reverse function for array type of primitive type containing null.
Takuya Ueshin created SPARK-24878: - Summary: Fix reverse function for array type of primitive type containing null. Key: SPARK-24878 URL: https://issues.apache.org/jira/browse/SPARK-24878 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Takuya Ueshin If we use {{reverse}} function for array type of primitive type containing {{null}} and the child array is {{UnsafeArrayData}}, the function returns a wrong result because {{UnsafeArrayData}} doesn't define the behavior of re-assignment, especially we can't set a valid value after we set {{null}}.
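A minimal query that exercises the described path (sketch; whether the child array actually arrives as {{UnsafeArrayData}} depends on how it is produced):

{code:scala}
// reverse over a primitive-typed array containing null:
spark.sql("SELECT reverse(array(1, null, 3))").show()
// The correct result is [3, null, 1]; before the fix, writing a valid value
// into a slot that was previously set to null can leave a wrong element,
// because UnsafeArrayData does not support re-assignment.
{code}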
[jira] [Commented] (SPARK-24792) Add API `.avro` in DataFrameReader/DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-24792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550949#comment-16550949 ] Xiao Li commented on SPARK-24792: - Since Avro is an external module, it does not make sense to have this API. > Add API `.avro` in DataFrameReader/DataFrameWriter > -- > > Key: SPARK-24792 > URL: https://issues.apache.org/jira/browse/SPARK-24792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add API `.avro` in DataFrameReader/DataFrameWriter > remove the implicit class AvroDataFrameWriter/Reader > > https://github.com/apache/spark/pull/21742#pullrequestreview-136075421
[jira] [Resolved] (SPARK-24792) Add API `.avro` in DataFrameReader/DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-24792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24792. - Resolution: Won't Fix > Add API `.avro` in DataFrameReader/DataFrameWriter > -- > > Key: SPARK-24792 > URL: https://issues.apache.org/jira/browse/SPARK-24792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Add API `.avro` in DataFrameReader/DataFrameWriter > remove the implicit class AvroDataFrameWriter/Reader > > https://github.com/apache/spark/pull/21742#pullrequestreview-136075421
[jira] [Resolved] (SPARK-24811) Add function `from_avro` and `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24811. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Add function `from_avro` and `to_avro` > -- > > Key: SPARK-24811 > URL: https://issues.apache.org/jira/browse/SPARK-24811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. >
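A sketch of how the two functions are meant to be used; only the function names come from this issue — the import path and the Avro schema below are assumptions for illustration:

{code:scala}
import org.apache.spark.sql.avro._   // assumed location in the external avro module

val schemaJson = """{"type":"record","name":"r","fields":[{"name":"id","type":"long"}]}"""

// Column -> Avro binary, and Avro binary -> catalyst value:
val encoded = df.select(to_avro($"id").as("avro_id"))
val decoded = encoded.select(from_avro($"avro_id", schemaJson).as("id"))
{code}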
[jira] [Resolved] (SPARK-23451) Deprecate KMeans computeCost
[ https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-23451. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.4.0 > Deprecate KMeans computeCost > > > Key: SPARK-23451 > URL: https://issues.apache.org/jira/browse/SPARK-23451 > Project: Spark > Issue Type: Task > Components: ML >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.4.0 > > > SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of > proper cluster evaluators. Now SPARK-14516 introduces a proper > {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove > it in the next releases.
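The migration path for callers is short; a sketch of replacing {{computeCost}} with the evaluator introduced by SPARK-14516 ({{dataset}} here is a placeholder DataFrame of feature vectors):

{code:scala}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Silhouette score in [-1, 1]; higher is better. This replaces
// model.computeCost(dataset), which is being deprecated.
val silhouette = new ClusteringEvaluator().evaluate(predictions)
{code}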
[jira] [Created] (SPARK-24877) Ignore the task completion event from a zombie barrier task
Jiang Xingbo created SPARK-24877: Summary: Ignore the task completion event from a zombie barrier task Key: SPARK-24877 URL: https://issues.apache.org/jira/browse/SPARK-24877 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Jiang Xingbo Currently we abort the barrier stage if a zombie barrier task can't be killed, to prevent data correctness issues. We can improve the behavior to let a zombie barrier task continue running while preventing it from interacting with other barrier tasks (maybe from a different stage attempt), and ignore the task completion event from a zombie barrier task.
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550803#comment-16550803 ] Thomas Graves commented on SPARK-24615: --- Yes, if any requirement can't be satisfied it would use dynamic allocation to release and reacquire containers. I'm not saying we have to implement those parts right now; I'm saying we should keep them in mind during the design so they could be added later. I linked one old JIRA about dynamically changing things. It's been brought up many times since, in PRs and in conversations with customers; I'm not sure if there are other JIRAs as well. It's also somewhat related to SPARK-20589, where people just want to configure things per stage. I actually question whether this should be done at the RDD level at all. A set of partitions doesn't care what the resources are; it's generally the action you are taking on those RDD(s), and it could be more than one RDD. I could do ETL on an RDD whose resource needs would be totally different than if I ran TensorFlow on that same RDD, for example. I do realize this is being tied in with the barrier work, which hangs off mapPartitions. I'm not trying to be difficult, and I realize this JIRA is more specific to external ML algorithms, but I don't want many APIs for the same thing. Unfortunately I haven't thought through a good solution for this. A while back my initial thought was to pass a resource context into the API calls, but that obviously gets trickier, especially with pure SQL support. I need to think about it some more. The above proposal for .withResources is definitely closer, but I still wonder about tying it to the RDD. cc [~irashid] [~mridulm80], with whom I think this has been brought up before. 
> Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
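The core scheduling change proposed here, matching a task's accelerator requirements against what each executor actually has, can be illustrated with a small matching function. This is a conceptual sketch with invented data shapes; Spark's scheduler internals and any eventual `withResources`-style API may look quite different.

```python
# Sketch: a task may only be offered to an executor that satisfies every
# resource requirement (CPU cores and accelerators), instead of scheduling
# on CPU availability alone.

def can_run(task_req, executor):
    """True if the executor satisfies every resource the task requires."""
    return all(executor.get(res, 0) >= need for res, need in task_req.items())

executors = [
    {"cpus": 8, "gpus": 0},   # CPU-only node
    {"cpus": 4, "gpus": 2},   # GPU-equipped node
]

etl_task = {"cpus": 1}                # plain CPU task: either node works
train_task = {"cpus": 1, "gpus": 1}   # accelerator-required task

eligible_for_train = [i for i, e in enumerate(executors) if can_run(train_task, e)]
eligible_for_etl = [i for i, e in enumerate(executors) if can_run(etl_task, e)]
```

This also makes the mismatch in the issue description concrete: scheduling `train_task` by CPU availability alone would wrongly consider the CPU-only node eligible.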
[jira] [Commented] (SPARK-24876) Simplify schema serialization
[ https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550763#comment-16550763 ] Apache Spark commented on SPARK-24876: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21829 > Simplify schema serialization > - > > Key: SPARK-24876 > URL: https://issues.apache.org/jira/browse/SPARK-24876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Previously in the refactoring of Avro Serializer and Deserializer, a new > class SerializableSchema is created for serializing the avro schema. > [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18] > > On second thought, we can use `toString` method for serialization. After > that, parse the JSON format schema on executor. This makes the code much > simpler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24876) Simplify schema serialization
[ https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24876: Assignee: Apache Spark > Simplify schema serialization > - > > Key: SPARK-24876 > URL: https://issues.apache.org/jira/browse/SPARK-24876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Previously in the refactoring of Avro Serializer and Deserializer, a new > class SerializableSchema is created for serializing the avro schema. > [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18] > > On second thought, we can use `toString` method for serialization. After > that, parse the JSON format schema on executor. This makes the code much > simpler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24876) Simplify schema serialization
[ https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24876: Assignee: (was: Apache Spark) > Simplify schema serialization > - > > Key: SPARK-24876 > URL: https://issues.apache.org/jira/browse/SPARK-24876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Previously in the refactoring of Avro Serializer and Deserializer, a new > class SerializableSchema is created for serializing the avro schema. > [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18] > > On second thought, we can use `toString` method for serialization. After > that, parse the JSON format schema on executor. This makes the code much > simpler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24876) Simplify schema serialization
[ https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-24876: --- Summary: Simplify schema serialization (was: Remove SerializableSchema and use json format string schema) > Simplify schema serialization > - > > Key: SPARK-24876 > URL: https://issues.apache.org/jira/browse/SPARK-24876 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > Previously in the refactoring of Avro Serializer and Deserializer, a new > class SerializableSchema is created for serializing the avro schema. > [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18] > > On second thought, we can use `toString` method for serialization. After > that, parse the JSON format schema on executor. This makes the code much > simpler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24876) Remove SerializableSchema and use json format string schema
Gengliang Wang created SPARK-24876: -- Summary: Remove SerializableSchema and use json format string schema Key: SPARK-24876 URL: https://issues.apache.org/jira/browse/SPARK-24876 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang Previously in the refactoring of Avro Serializer and Deserializer, a new class SerializableSchema is created for serializing the avro schema. [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18] On second thought, we can use `toString` method for serialization. After that, parse the JSON format schema on executor. This makes the code much simpler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
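The simplification proposed for SPARK-24876 amounts to a plain string round-trip: an Avro schema is itself a JSON document, so serializing it as its JSON text and re-parsing on the executor removes the need for a custom serializable wrapper class. A minimal stand-in using Python's json module (this illustrates the round-trip idea only and does not use the actual Avro or Spark APIs):

```python
import json

# An Avro schema is a JSON document, so a string round-trip is enough to
# ship it to executors; no SerializableSchema-style wrapper is needed.
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "id", "type": "long"},
               {"name": "name", "type": "string"}],
}

wire_form = json.dumps(schema)   # driver side: analogous to schema.toString
parsed = json.loads(wire_form)   # executor side: re-parse the JSON text
```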
[jira] [Assigned] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23731: --- Assignee: Hyukjin Kwon > FileSourceScanExec throws NullPointerException in subexpression elimination > --- > > Key: SPARK-23731 > URL: https://issues.apache.org/jira/browse/SPARK-23731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0, 2.3.1 >Reporter: Jacek Laskowski >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > While working with a SQL with many {{CASE WHEN}} and {{ScalarSubqueries}} I > faced the following exception (in Spark 2.3.0): > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:167) > at > org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502) > at > org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257) > at > org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58) > at > 
org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36) > at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358) > at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40) > at > scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136) > at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132) > at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.get(HashMap.scala:70) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54) > at >
[jira] [Updated] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-24864: -- Fix Version/s: (was: 2.4.0) > Cannot resolve auto-generated column ordinals in a hive view > > > Key: SPARK-24864 > URL: https://issues.apache.org/jira/browse/SPARK-24864 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Abhishek Madav >Priority: Major > > Spark job reading from a hive-view fails with analysis exception when > resolving column ordinals which are autogenerated. > *Exception*: > {code:java} > scala> spark.sql("Select * from vsrc1new").show > org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given > input columns: [id, upper(name)]; line 1 pos 24; > 'Project [*] > +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new` > +- 'Project [id#634, 'vsrc1new._c1 AS uname#633] > +- SubqueryAlias vsrc1new > +- Project [id#634, upper(name#635) AS upper(name)#636] > +- MetastoreRelation default, src1 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > {code} > *Steps to reproduce:* > 1: Create a simple table, say src > {code:java} > CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' > {code} > 2: Create a view, say with 
name vsrc1new > {code:java} > CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, > upper(name) FROM src1) vsrc1new; > {code} > 3. Selecting data from this view in hive-cli/beeline doesn't cause any error. > 4. Creating a dataframe using: > {code:java} > spark.sql("Select * from vsrc1new").show //throws error > {code} > The auto-generated column names for the view are not resolved. Am I possibly > missing some spark-sql configuration here? I tried the repro-case against > spark 1.6 and that worked fine. Any inputs are appreciated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
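The hazard in the repro above is a view that refers to a column name the engine auto-generated for an unaliased expression. Aliasing the inner expression explicitly sidesteps the problem regardless of how any one engine names unaliased columns. A self-contained demonstration with SQLite (not Hive or Spark; the auto-generated name differs by engine, but the aliasing fix is the same idea):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src1 (id INTEGER, name TEXT)")
conn.execute("INSERT INTO src1 VALUES (1, 'ann')")

# Unaliased expression: the engine invents a column name from the expression
# text, which is fragile to reference from an outer query or a view.
cur = conn.execute("SELECT id, upper(name) FROM src1")
auto_name = cur.description[1][0]

# Explicit alias: the view exposes a stable, portable column name.
conn.execute("""CREATE VIEW vsrc1new AS
                SELECT id, upper(name) AS uname FROM src1""")
rows = conn.execute("SELECT uname FROM vsrc1new").fetchall()
```

Defining the Hive view with an explicit alias on the inner `upper(name)` rather than referring to `_c1` would likewise avoid depending on engine-specific generated names.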
[jira] [Resolved] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23731. - Resolution: Fixed Fix Version/s: 2.3.2 2.4.0 Issue resolved by pull request 21815 [https://github.com/apache/spark/pull/21815] > FileSourceScanExec throws NullPointerException in subexpression elimination > --- > > Key: SPARK-23731 > URL: https://issues.apache.org/jira/browse/SPARK-23731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0, 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > Fix For: 2.4.0, 2.3.2 > > > While working with a SQL with many {{CASE WHEN}} and {{ScalarSubqueries}} I > faced the following exception (in Spark 2.3.0): > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:167) > at > org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502) > at > org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257) > at > 
org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36) > at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358) > at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40) > at > scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136) > at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132) > at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.get(HashMap.scala:70) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54) > at >
[jira] [Assigned] (SPARK-24551) Add Integration tests for Secrets
[ https://issues.apache.org/jira/browse/SPARK-24551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-24551: - Assignee: Stavros Kontopoulos > Add Integration tests for Secrets > - > > Key: SPARK-24551 > URL: https://issues.apache.org/jira/browse/SPARK-24551 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Minor > Fix For: 2.4.0 > > > Current > [suite|https://github.com/apache/spark/blob/7703b46d2843db99e28110c4c7ccf60934412504/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala] > needs to be expanded covering secrets. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24551) Add Integration tests for Secrets
[ https://issues.apache.org/jira/browse/SPARK-24551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24551. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21652 [https://github.com/apache/spark/pull/21652] > Add Integration tests for Secrets > - > > Key: SPARK-24551 > URL: https://issues.apache.org/jira/browse/SPARK-24551 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.1 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Minor > Fix For: 2.4.0 > > > Current > [suite|https://github.com/apache/spark/blob/7703b46d2843db99e28110c4c7ccf60934412504/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala] > needs to be expanded covering secrets. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label
Antoine Galataud created SPARK-24875: Summary: MulticlassMetrics should offer a more efficient way to compute count by label Key: SPARK-24875 URL: https://issues.apache.org/jira/browse/SPARK-24875 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.3.1 Reporter: Antoine Galataud Currently _MulticlassMetrics_ calls _countByValue_() to get the count by class/label: {code:java} private lazy val labelCountByClass: Map[Double, Long] = predictionAndLabels.values.countByValue() {code} If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large test dataset), this leads to poor execution performance. One option could be to allow using _countByValueApprox_ (which could require adding an extra configuration param for MulticlassMetrics). Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, I don't know how this could be ported there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
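The exact-versus-approximate trade-off can be illustrated without Spark: counting labels over a small sample and scaling up by the sampling rate approximates the full count at a fraction of the cost. This is a hypothetical local sketch of the idea; Spark's `countByValueApprox` uses a bounded-time approximation mechanism rather than this simple sampling.

```python
import random
from collections import Counter

random.seed(42)
labels = [random.choice([0.0, 1.0, 2.0]) for _ in range(100_000)]

# Exact: one full pass over all labels (analogous to countByValue).
exact = Counter(labels)

# Approximate: count a 1% sample, then scale each count by 1/rate.
rate = 0.01
sample = random.sample(labels, int(len(labels) * rate))
approx = {k: v / rate for k, v in Counter(sample).items()}
```

On a huge test dataset, the sample pass touches 1% of the rows, at the cost of a small, quantifiable error in each per-label count.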
[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data
[ https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550738#comment-16550738 ] Takeshi Yamamuro commented on SPARK-24869: -- In the example above, cache() is not called explicitly, so you mean we cache data implicitly for saving data? > SaveIntoDataSourceCommand's input Dataset does not use Cached Data > -- > > Key: SPARK-24869 > URL: https://issues.apache.org/jira/browse/SPARK-24869 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > withTable("t") { > withTempPath { path => > var numTotalCachedHit = 0 > val listener = new QueryExecutionListener { > override def onFailure(f: String, qe: QueryExecution, e: > Exception):Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, > duration: Long): Unit = { > qe.withCachedData match { > case c: SaveIntoDataSourceCommand > if c.query.isInstanceOf[InMemoryRelation] => > numTotalCachedHit += 1 > case _ => > println(qe.withCachedData) > } > } > } > spark.listenerManager.register(listener) > val udf1 = udf({ (x: Int, y: Int) => x + y }) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1(col("a"), lit(10))) > df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", > properties) > assert(numTotalCachedHit == 1, "expected to be cached in jdbc") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
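The listener in the snippet above checks whether the planner substituted a cached relation into the write command's query. That substitution is a tree rewrite: replace any subplan that matches a cache entry with an in-memory marker node. A toy illustration with invented tuple-based plan nodes, not Spark's InMemoryRelation machinery:

```python
# Toy plan nodes: a plan is (operator_name, children). Cache substitution
# walks the tree and swaps any subtree that has a cache entry for a marker
# node, which is what the QueryExecutionListener in the report checks for.

def with_cached_data(plan, cache):
    if plan in cache:
        return ("InMemoryRelation", ())
    op, children = plan
    return (op, tuple(with_cached_data(c, cache) for c in children))

scan = ("Range", ())
project = ("Project", (scan,))
save = ("SaveIntoDataSource", (project,))

cache = {project}  # the user cached the projected DataFrame
rewritten = with_cached_data(save, cache)
```

The bug being reported is, in these terms, that the rewrite is not applied under the save command, so `rewritten` would still contain the original `Project` subtree instead of the cached marker.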
[jira] [Created] (SPARK-24874) Allow hybrid of both barrier tasks and regular tasks in a stage
Jiang Xingbo created SPARK-24874: Summary: Allow hybrid of both barrier tasks and regular tasks in a stage Key: SPARK-24874 URL: https://issues.apache.org/jira/browse/SPARK-24874 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Jiang Xingbo Currently we only allow barrier tasks in a barrier stage; however, consider the following query: {code} val sc = new SparkContext(conf) val rdd1 = sc.parallelize(1 to 100, 10) val rdd2 = sc.parallelize(1 to 1000, 20).barrier().mapPartitions((it, ctx) => it) val rdd = rdd1.union(rdd2).mapPartitions(t => t) {code} Today this requires 30 free slots to run `rdd.collect()`. In fact, we could launch regular tasks to collect data from rdd1's partitions; they are not required to be launched together. If we did that, we would only need 20 free slots to run `rdd.collect()`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
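The slot arithmetic in the example can be made concrete: barrier tasks must all start simultaneously, so they demand their full partition count in concurrent slots, while regular tasks can be scheduled one at a time through whatever slots exist. A small helper under that assumption (hypothetical; this is not Spark's scheduler logic):

```python
def required_slots(stages):
    """Minimum concurrent slots to run a stage whose tasks are a mix of
    barrier partitions (must start simultaneously) and regular partitions
    (can run sequentially, reusing slots as they free up)."""
    barrier = sum(n for n, is_barrier in stages if is_barrier)
    has_regular = any(not is_barrier for _, is_barrier in stages)
    # Barrier tasks need their full count at once; regular tasks need at
    # least one slot, but can reuse barrier slots once those finish.
    return max(barrier, 1 if has_regular else 0)

# rdd1: 10 regular partitions; rdd2: 20 barrier partitions.
hybrid = required_slots([(10, False), (20, True)])      # proposed behavior
all_barrier = required_slots([(10, True), (20, True)])  # current behavior
```

Under the proposed hybrid behavior the union stage needs 20 slots (the barrier partitions alone), not 30.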
[jira] [Assigned] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.
[ https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24871: --- Assignee: Takuya Ueshin > Refactor Concat and MapConcat to avoid creating concatenator object for each > row. > - > > Key: SPARK-24871 > URL: https://issues.apache.org/jira/browse/SPARK-24871 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > Refactor {{Concat}} and {{MapConcat}} to: > - avoid creating concatenator object for each row. > - make {{Concat}} handle {{containsNull}} properly. > - make {{Concat}} shortcut if {{null}} child is found. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.
[ https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24871. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21824 [https://github.com/apache/spark/pull/21824] > Refactor Concat and MapConcat to avoid creating concatenator object for each > row. > - > > Key: SPARK-24871 > URL: https://issues.apache.org/jira/browse/SPARK-24871 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > Refactor {{Concat}} and {{MapConcat}} to: > - avoid creating concatenator object for each row. > - make {{Concat}} handle {{containsNull}} properly. > - make {{Concat}} shortcut if {{null}} child is found. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
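The refactoring pattern behind SPARK-24871, creating the concatenator once per operator instead of once per row, is the classic hoisting of a loop-invariant allocation, combined with a null shortcut. A language-neutral sketch in Python (hypothetical classes; Spark's actual Concat code generation differs):

```python
class Concatenator:
    """Stands in for the per-type concat helper; construction is the cost
    we want to pay once per operator, not once per processed row."""
    instances = 0

    def __init__(self):
        Concatenator.instances += 1

    def concat(self, parts):
        # Shortcut: if any child is null, the whole result is null.
        if any(p is None for p in parts):
            return None
        return "".join(parts)

rows = [["a", "b"], ["c", None], ["d", "e"]]

# After the refactoring: one concatenator, reused across all rows.
c = Concatenator()
out = [c.concat(r) for r in rows]
```

Moving the construction out of the per-row path matters because the operator may process millions of rows, while the concatenator's configuration (element type, null handling) never changes between them.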
[jira] [Updated] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
[ https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JieFang.He updated SPARK-24873: --- Description: There are too many frequent interaction reports when I use the spark-shell command, which interfere with my input, so I think a switch should be added to suppress frequent interaction reports from YARN !pic.jpg! was:There is too much frequent interaction reports when i use spark shell commend which affect my input,so i think it need to increase a switch to shielding frequent interaction reports with yarn > increase switch to shielding frequent interaction reports with yarn > --- > > Key: SPARK-24873 > URL: https://issues.apache.org/jira/browse/SPARK-24873 > Project: Spark > Issue Type: Bug > Components: Spark Shell, YARN >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Major > Attachments: pic.jpg > > > There are too many frequent interaction reports when I use the spark-shell command, > which interfere with my input, so I think a switch should be added to suppress > frequent interaction reports from YARN > > !pic.jpg! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
[ https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JieFang.He updated SPARK-24873: --- Attachment: pic.jpg > increase switch to shielding frequent interaction reports with yarn > --- > > Key: SPARK-24873 > URL: https://issues.apache.org/jira/browse/SPARK-24873 > Project: Spark > Issue Type: Bug > Components: Spark Shell, YARN >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Major > Attachments: pic.jpg > > > There are too many frequent interaction reports when I use the spark-shell command, > which interfere with my input, so I think a switch should be added to suppress > frequent interaction reports from YARN -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24868) add sequence function in Python
[ https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24868: Assignee: Huaxin Gao > add sequence function in Python > --- > > Key: SPARK-24868 > URL: https://issues.apache.org/jira/browse/SPARK-24868 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > Seems the sequence function is only in functions.scala, not in functions.py. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24868) add sequence function in Python
[ https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24868. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21820 [https://github.com/apache/spark/pull/21820] > add sequence function in Python > --- > > Key: SPARK-24868 > URL: https://issues.apache.org/jira/browse/SPARK-24868 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > Seems the sequence function is only in functions.scala, not in functions.py. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
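For reference, Spark SQL's sequence(start, stop, step) returns a range that is inclusive on both ends, with a default step of 1 or -1 depending on direction. A pure-Python model of those documented semantics (a sketch of the behavior only, not the PySpark implementation being added here):

```python
def sequence(start, stop, step=None):
    """Inclusive range, modeled on Spark SQL's sequence(); the default step
    is 1 or -1 depending on whether stop is above or below start."""
    if step is None:
        step = 1 if stop >= start else -1
    out, x = [], start
    while (step > 0 and x <= stop) or (step < 0 and x >= stop):
        out.append(x)
        x += step
    return out

asc = sequence(1, 5)         # [1, 2, 3, 4, 5]
desc = sequence(5, 1)        # [5, 4, 3, 2, 1]
stepped = sequence(1, 9, 2)  # [1, 3, 5, 7, 9]
```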
[jira] [Assigned] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
[ https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24873: Assignee: (was: Apache Spark) > increase switch to shielding frequent interaction reports with yarn > --- > > Key: SPARK-24873 > URL: https://issues.apache.org/jira/browse/SPARK-24873 > Project: Spark > Issue Type: Bug > Components: Spark Shell, YARN >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Major > > There are too many frequent interaction reports when I use the spark-shell command, > which interfere with my input, so I think a switch should be added to suppress > frequent interaction reports from YARN -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
[ https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550583#comment-16550583 ] Apache Spark commented on SPARK-24873: -- User 'hejiefang' has created a pull request for this issue: https://github.com/apache/spark/pull/21827 > increase switch to shielding frequent interaction reports with yarn > --- > > Key: SPARK-24873 > URL: https://issues.apache.org/jira/browse/SPARK-24873 > Project: Spark > Issue Type: Bug > Components: Spark Shell, YARN >Affects Versions: 2.4.0 >Reporter: JieFang.He >Priority: Major > > When I use the spark-shell, frequent interaction reports from YARN interfere with my input, so I think we need to add a switch to suppress these frequent reports. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
[ https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24873: Assignee: Apache Spark > increase switch to shielding frequent interaction reports with yarn > --- > > Key: SPARK-24873 > URL: https://issues.apache.org/jira/browse/SPARK-24873 > Project: Spark > Issue Type: Bug > Components: Spark Shell, YARN >Affects Versions: 2.4.0 >Reporter: JieFang.He >Assignee: Apache Spark >Priority: Major > > When I use the spark-shell, frequent interaction reports from YARN interfere with my input, so I think we need to add a switch to suppress these frequent reports. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24872: Assignee: (was: Apache Spark) > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Minor > > “||” performs STRING concatenation, but it is also the symbol of > the "OR" operation. > When I want to use "||" as the "OR" operation, I find that it performs STRING > concatenation instead: > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24872: Assignee: Apache Spark > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: Apache Spark >Priority: Minor > > “||” performs STRING concatenation, but it is also the symbol of > the "OR" operation. > When I want to use "||" as the "OR" operation, I find that it performs STRING > concatenation instead: > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550567#comment-16550567 ] Apache Spark commented on SPARK-24872: -- User 'httfighter' has created a pull request for this issue: https://github.com/apache/spark/pull/21826 > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Minor > > “||” performs STRING concatenation, but it is also the symbol of > the "OR" operation. > When I want to use "||" as the "OR" operation, I find that it performs STRING > concatenation instead: > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hantiantian updated SPARK-24872: Description: “||” performs STRING concatenation, but it is also the symbol of the "OR" operation. When I want to use "||" as the "OR" operation, I find that it performs STRING concatenation instead: spark-sql> explain extended select * from aa where id==1 || id==2; == Parsed Logical Plan == 'Project [*] +- 'Filter (('id = concat(1, 'id)) = 2) +- 'UnresolvedRelation `aa` spark-sql> select "abc" || "DFF" ; And the result is "abcDFF". In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? was: “||” performs STRING concatenation, but it is also the symbol of the "OR" operation. spark-sql> select "abc" || "DFF" ; And the result is "abcDFF". In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Minor > > “||” performs STRING concatenation, but it is also the symbol of > the "OR" operation. > When I want to use "||" as the "OR" operation, I find that it performs STRING > concatenation instead: > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn
JieFang.He created SPARK-24873: -- Summary: increase switch to shielding frequent interaction reports with yarn Key: SPARK-24873 URL: https://issues.apache.org/jira/browse/SPARK-24873 Project: Spark Issue Type: Bug Components: Spark Shell, YARN Affects Versions: 2.4.0 Reporter: JieFang.He When I use the spark-shell, frequent interaction reports from YARN interfere with my input, so I think we need to add a switch to suppress these frequent reports. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
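One existing knob may already help here, rather than a new switch: the "Running on YARN" documentation lists spark.yarn.report.interval (default 1s), the interval between reports of the current application status. Raising it should reduce the chatter; whether it covers the exact spark-shell console reports described in this issue is an assumption on our part, not something the issue confirms.

```
# Hedged sketch: spark.yarn.report.interval is a documented YARN setting,
# but its effect on spark-shell's interactive output is an assumption.
spark-shell --master yarn --conf spark.yarn.report.interval=60s
```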
[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550551#comment-16550551 ] Szilard Nemeth commented on SPARK-20327: Pull request is updated with the latest fixes > Add CLI support for YARN custom resources, like GPUs > > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton >Priority: Major > Labels: newbie > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hantiantian updated SPARK-24872: Description: “||” performs STRING concatenation, but it is also the symbol of the "OR" operation. spark-sql> select "abc" || "DFF" ; And the result is "abcDFF". In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Minor > > “||” performs STRING concatenation, but it is also the symbol of > the "OR" operation. > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of the "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24872) Remove the symbol “||” of the “OR” operation
hantiantian created SPARK-24872: --- Summary: Remove the symbol “||” of the “OR” operation Key: SPARK-24872 URL: https://issues.apache.org/jira/browse/SPARK-24872 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: hantiantian -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
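Whatever the outcome of this issue, the ambiguity can be avoided in queries today: spell the logical operator as OR and string concatenation as concat(...), so nothing depends on how "||" is parsed. A sketch, reusing the table `aa` from the examples above:

```sql
-- Logical disjunction: spell it OR rather than ||.
select * from aa where id == 1 or id == 2;

-- String concatenation: concat() is unambiguous.
select concat('abc', 'DFF');  -- 'abcDFF'
```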
[jira] [Updated] (SPARK-24859) Predicates pushdown on outer joins
[ https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johannes Mayer updated SPARK-24859: --- Description: I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a common column called part_col. Now I want to join both tables on their id but only for some of partitions. If I use an inner join, everything works well: {code:java} select * from FA f join DI d on(f.id = d.id and f.part_col = d.part_col) where f.part_col = 'xyz' {code} In the sql explain plan I can see, that the predicate part_col = 'xyz' is also used in the DIm HiveTableScan. When I execute the same query using a left join the full dim table is scanned. There are some workarounds for this issue, but I wanted to report this as a bug, since it works on an inner join, and i think the behaviour should be the same for an outer join. Here is a self contained example (created in Zeppelin): {code:java} val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col") val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col") fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact") dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim") spark.sqlContext.sql("create table if not exists fact(id int) partitioned by (part_col int) stored as avro location '/tmp/jira/fact'") spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create table if not exists dim(id int) partitioned by (part_col int) stored as avro location '/tmp/jira/dim'") spark.sqlContext.sql("msck repair table dim"){code} *Inner join example:* {code:java} select * from fact f join dim d on (f.id = d.id and f.part_col = d.part_col) where f.part_col = 100{code} Excerpt from Spark-SQL physical explain plan: {code:java} HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], [isnotnull(part_col#412), (part_col#412 = 100)] HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], [isnotnull(part_col#414), (part_col#414 = 100)]{code} *Outer join example:* {code:java} select * from fact f left join dim d on (f.id = d.id and f.part_col = d.part_col) where f.part_col = 100{code} Excerpt from Spark-SQL physical explain plan: {code:java} HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], [isnotnull(part_col#427), (part_col#427 = 100)] HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code} As you can see the predicate is not pushed down to the HiveTableScan of the dim table on the outer join. was: I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a common column called part_col. Now I want to join both tables on their id but only for some of partitions. If I use an inner join, everything works well: {code:java} select * from FA f join DI d on(f.id = d.id and f.part_col = d.part_col) where f.part_col = 'xyz' {code} In the sql explain plan I can see, that the predicate part_col = 'xyz' is also used in the DIm HiveTableScan. When I execute the same query using a left join the full dim table is scanned. There are some workarounds for this issue, but I wanted to report this as a bug, since it works on an inner join, and i think the behaviour should be the same for an outer join. 
Here is a self contained example (created in Zeppelin): {code:java} val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col") val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col") fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact") dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim") spark.sqlContext.sql("create table if not exists fact(id int) partitioned by (part_col int) stored as avro location '/tmp/jira/fact'") spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create table if not exists dim(id int) partitioned by (part_col int) stored as avro location '/tmp/jira/dim'") spark.sqlContext.sql("msck repair table dim"){code} *Inner join example:* {code:java} select * from fact f join dim d on (f.id = d.id and f.part_col = d.part_col) where f.part_col = 100{code} Excerpt from Spark-SQL physical explain plan: {code:java} HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, org.apache.hadoop.hive.serde2.avro.AvroSerDe,
[jira] [Commented] (SPARK-24859) Predicates pushdown on outer joins
[ https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550541#comment-16550541 ] Johannes Mayer commented on SPARK-24859: Ok, i added the example > Predicates pushdown on outer joins > -- > > Key: SPARK-24859 > URL: https://issues.apache.org/jira/browse/SPARK-24859 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Cloudera CDH 5.13.1 >Reporter: Johannes Mayer >Priority: Major > > I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a > common column called part_col. Now I want to join both tables on their id but > only for some of partitions. > If I use an inner join, everything works well: > > {code:java} > select * > from FA f > join DI d > on(f.id = d.id and f.part_col = d.part_col) > where f.part_col = 'xyz' > {code} > > In the sql explain plan I can see, that the predicate part_col = 'xyz' is > also used in the DIm HiveTableScan. > > When I execute the same query using a left join the full dim table is > scanned. There are some workarounds for this issue, but I wanted to report > this as a bug, since it works on an inner join, and i think the behaviour > should be the same for an outer join. 
> Here is a self contained example (created in Zeppelin): > > {code:java} > val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col") > val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col") > fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact") > dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim") > > spark.sqlContext.sql("create table if not exists fact(id int) partitioned by > (part_col int) stored as avro location '/tmp/jira/fact'") > spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create > table if not exists dim(id int) partitioned by (part_col int) stored as avro > location '/tmp/jira/dim'") > spark.sqlContext.sql("msck repair table dim"){code} > > > > *Inner join example:* > {code:java} > select * from fact f > join dim d > on (f.id = d.id > and f.part_col = d.part_col) > where f.part_col = 100{code} > Excerpt from Spark-SQL physical explain plan: > {code:java} > HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], > [isnotnull(part_col#412), (part_col#412 = 100)] > HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], > [isnotnull(part_col#414), (part_col#414 = 100)]{code} > > *Outer join example:* > {code:java} > select * from fact f > left join dim d > on (f.id = d.id > and f.part_col = d.part_col) > where f.part_col = 100{code} > > Excerpt from Spark-SQL physical explain plan: > > {code:java} > HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], > [isnotnull(part_col#427), (part_col#427 = 100)] > HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, > 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code} > > > As you can see the predicate is not pushed down to the HiveTableScan of the > dim table on the outer join. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
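Until the optimizer pushes such predicates through outer joins, a common workaround is to state the partition filter on the dimension side explicitly, e.g. via a filtered subquery, so the dim HiveTableScan can prune partitions without relying on pushdown. A sketch against the fact/dim tables from the example above:

```sql
-- Repeat the partition filter inside a subquery on the dim side of the
-- left join; the dim scan then sees part_col = 100 directly.
select *
from fact f
left join (select * from dim where part_col = 100) d
  on f.id = d.id and f.part_col = d.part_col
where f.part_col = 100
```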
[jira] [Updated] (SPARK-24859) Predicates pushdown on outer joins
[ https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johannes Mayer updated SPARK-24859: --- Description: I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a common column called part_col. Now I want to join both tables on their id but only for some of partitions. If I use an inner join, everything works well: {code:java} select * from FA f join DI d on(f.id = d.id and f.part_col = d.part_col) where f.part_col = 'xyz' {code} In the sql explain plan I can see, that the predicate part_col = 'xyz' is also used in the DIm HiveTableScan. When I execute the same query using a left join the full dim table is scanned. There are some workarounds for this issue, but I wanted to report this as a bug, since it works on an inner join, and i think the behaviour should be the same for an outer join. Here is a self contained example (created in Zeppelin): {code:java} val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col") val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col") fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact") dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim") spark.sqlContext.sql("create table if not exists fact(id int) partitioned by (part_col int) stored as avro location '/tmp/jira/fact'") spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create table if not exists dim(id int) partitioned by (part_col int) stored as avro location '/tmp/jira/dim'") spark.sqlContext.sql("msck repair table dim"){code} *Inner join example:* {code:java} select * from fact f join dim d on (f.id = d.id and f.part_col = d.part_col) where f.part_col = 100{code} Excerpt from Spark-SQL physical explain plan: {code:java} HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], [isnotnull(part_col#412), (part_col#412 = 100)] HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], [isnotnull(part_col#414), (part_col#414 = 100)]{code} *Outer join example:* {code:java} select * from fact f left join dim d on (f.id = d.id and f.part_col = d.part_col) where f.part_col = 100{code} Excerpt from Spark-SQL physical explain plan: {code:java} HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], [isnotnull(part_col#427), (part_col#427 = 100)] HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code} As you can see the predicate is not pushed down to the HiveTableScan of the dim table on the outer join. was: I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a common column called part_col. Now I want to join both tables on their id but only for some of partitions. If I use an inner join, everything works well: {code:java} select * from FA f join DI d on(f.id = d.id and f.part_col = d.part_col) where f.part_col = 'xyz' {code} In the sql explain plan I can see, that the predicate part_col = 'xyz' is also used in the DIm HiveTableScan. When I execute the same query using a left join the full dim table is scanned. 
There are some workarounds for this issue, but I wanted to report this as a bug, since it works on an inner join, and i think the behaviour should be the same for an outer join > Predicates pushdown on outer joins > -- > > Key: SPARK-24859 > URL: https://issues.apache.org/jira/browse/SPARK-24859 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Cloudera CDH 5.13.1 >Reporter: Johannes Mayer >Priority: Major > > I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a > common column called part_col. Now I want to join both tables on their id but > only for some of partitions. > If I use an inner join, everything works well: > > {code:java} > select * > from FA f > join DI d > on(f.id = d.id and f.part_col = d.part_col) > where f.part_col = 'xyz' > {code} > > In the sql explain plan I can see, that the predicate part_col = 'xyz' is > also used in the DIm HiveTableScan. > > When I execute the same query using a left join the full dim table is > scanned. There are some workarounds for this issue, but I wanted to report > this as a bug, since it works on an inner join, and i think the behaviour > should be the same for an
[jira] [Updated] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23731: - Priority: Major (was: Minor) > FileSourceScanExec throws NullPointerException in subexpression elimination > --- > > Key: SPARK-23731 > URL: https://issues.apache.org/jira/browse/SPARK-23731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0, 2.3.1 >Reporter: Jacek Laskowski >Priority: Major > > While working with a SQL with many {{CASE WHEN}} and {{ScalarSubqueries}} I > faced the following exception (in Spark 2.3.0): > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:167) > at > org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502) > at > org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257) > at > org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36) > at 
scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358) > at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40) > at > scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136) > at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132) > at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.get(HashMap.scala:70) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:95) > at >
[jira] [Commented] (SPARK-24859) Predicates pushdown on outer joins
[ https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550359#comment-16550359 ] Johannes Mayer commented on SPARK-24859: I will provide an example. Could you test it on the master branch? > Predicates pushdown on outer joins > -- > > Key: SPARK-24859 > URL: https://issues.apache.org/jira/browse/SPARK-24859 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: Cloudera CDH 5.13.1 >Reporter: Johannes Mayer >Priority: Major > > I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a > common column called part_col. Now I want to join both tables on their id but > only for some of partitions. > If I use an inner join, everything works well: > > {code:java} > select * > from FA f > join DI d > on(f.id = d.id and f.part_col = d.part_col) > where f.part_col = 'xyz' > {code} > > In the sql explain plan I can see, that the predicate part_col = 'xyz' is > also used in the DIm HiveTableScan. > > When I execute the same query using a left join the full dim table is > scanned. There are some workarounds for this issue, but I wanted to report > this as a bug, since it works on an inner join, and i think the behaviour > should be the same for an outer join > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18188) Add checksum for block of broadcast
[ https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550354#comment-16550354 ] Apache Spark commented on SPARK-18188: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/21825 > Add checksum for block of broadcast > --- > > Key: SPARK-18188 > URL: https://issues.apache.org/jira/browse/SPARK-18188 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Major > Fix For: 2.1.0 > > > There has been a hard-to-diagnose issue for a long time: > https://issues.apache.org/jira/browse/SPARK-4105. Without any checksum on > the blocks, it is very hard for us to identify where the bug came from. > Shuffle blocks are compressed separately (and carry a checksum), but broadcast > blocks are compressed together; we should add a checksum for each of them separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
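The per-block checksum idea can be sketched in a few lines: compute a cheap checksum for each serialized block on the sender, ship it alongside the block, and verify on receipt before decompression, so corruption surfaces as a clear error instead of a confusing deserialization failure. This illustrates the concept only; it is not Spark's actual implementation, and Spark may use a different algorithm than CRC32:

```python
import zlib

def checksum(block: bytes) -> int:
    # CRC32 is cheap and good enough to detect corruption in transit.
    return zlib.crc32(block) & 0xFFFFFFFF

def verify(block: bytes, expected: int) -> None:
    # Raise early, with both checksums, when a block does not match.
    actual = checksum(block)
    if actual != expected:
        raise IOError(
            f"broadcast block corrupted: crc {actual:#010x} != {expected:#010x}"
        )

# Sender side: checksum each serialized block before shipping it.
blocks = [b"serialized-block-0", b"serialized-block-1"]
sums = [checksum(b) for b in blocks]

# Receiver side: verify each block on arrival.
for b, s in zip(blocks, sums):
    verify(b, s)

# A flipped byte is now detected at transfer time.
try:
    verify(b"serialized-block-X", sums[0])
except IOError as e:
    print("detected:", e)
```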
[jira] [Resolved] (SPARK-24424) Support ANSI-SQL compliant syntax for GROUPING SET
[ https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24424.
-----------------------------
Resolution: Fixed
Assignee: Dilip Biswal
Fix Version/s: 2.4.0

> Support ANSI-SQL compliant syntax for GROUPING SET
> --------------------------------------------------
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Assignee: Dilip Biswal
> Priority: Major
> Fix For: 2.4.0
>
> Currently, our GROUP BY clause follows the Hive syntax
> (https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup).
> However, this is not ANSI SQL compliant. The proposal is to update our
> parser and analyzer for ANSI compliance.
> For example:
> {code:sql}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SETS (...)
> {code}
> It would be nice to also support the ANSI SQL syntax:
> {code:sql}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SETS (...)
> {code}
> Note: we only need to support one-level grouping sets at this stage; nested
> grouping sets are not supported.
> Note: we should not break the existing syntax. The parser changes should be
> like:
> {code:sql}
> group-by-expressions
>
> >>-GROUP BY--+-hive-sql-group-by-expressions-----+------------------------><
>              '-ansi-sql-grouping-set-expressions-'
>
> hive-sql-group-by-expressions
>
>      .-,----------.
>      V            |
> >>-----expression-+--+----------------------------------------------+-----><
>                      +--WITH CUBE-----------------------------------+
>                      +--WITH ROLLUP---------------------------------+
>                      '--GROUPING SETS--(--grouping-set-expressions--)'
>
> grouping-expressions-list
>
>      .-,----------.
>      V            |
> >>-----expression-+-------------------------------------------------------><
>
> grouping-set-expressions
>
>      .-,--------------------------.
>      |      .-,----------.        |
>      V      V            |        |
> >>-----+------expression-+------+-+---------------------------------------><
>        '-(--expression-+--)-----'
>
> ansi-sql-grouping-set-expressions
>
> >>-+-ROLLUP--(--grouping-expression-list--)--------+----------------------><
>    +-CUBE--(--grouping-expression-list--)----------+
>    '-GROUPING SETS--(--grouping-set-expressions--)-'
> {code}
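The two styles can be compared directly. A hedged sketch (assumes a hypothetical table t(col1, col2, v); the ANSI forms require a Spark version where this issue's fix is available, i.e. 2.4+):

{code:scala}
// Illustrative sketch: the Hive-style and ANSI-style forms should produce the
// same result for the same grouping sets.
val hiveStyle = spark.sql(
  "SELECT col1, col2, sum(v) FROM t GROUP BY col1, col2 WITH ROLLUP")

val ansiStyle = spark.sql(
  "SELECT col1, col2, sum(v) FROM t GROUP BY ROLLUP(col1, col2)")

// Both are equivalent to enumerating the one-level grouping sets explicitly:
val explicit = spark.sql(
  "SELECT col1, col2, sum(v) FROM t " +
    "GROUP BY GROUPING SETS ((col1, col2), (col1), ())")
{code}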
[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for GROUPING SET
[ https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-24424:
----------------------------
Summary: Support ANSI-SQL compliant syntax for GROUPING SET (was: Support ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET)

> Support ANSI-SQL compliant syntax for GROUPING SET
> --------------------------------------------------
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Priority: Major
> Fix For: 2.4.0
>
> Currently, our GROUP BY clause follows the Hive syntax
> (https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup).
> However, this is not ANSI SQL compliant. The proposal is to update our
> parser and analyzer for ANSI compliance.
> For example:
> {code:sql}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SETS (...)
> {code}
> It would be nice to also support the ANSI SQL syntax:
> {code:sql}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SETS (...)
> {code}
> Note: we only need to support one-level grouping sets at this stage; nested
> grouping sets are not supported.
> Note: we should not break the existing syntax. The parser changes should be
> like:
> {code:sql}
> group-by-expressions
>
> >>-GROUP BY--+-hive-sql-group-by-expressions-----+------------------------><
>              '-ansi-sql-grouping-set-expressions-'
>
> hive-sql-group-by-expressions
>
>      .-,----------.
>      V            |
> >>-----expression-+--+----------------------------------------------+-----><
>                      +--WITH CUBE-----------------------------------+
>                      +--WITH ROLLUP---------------------------------+
>                      '--GROUPING SETS--(--grouping-set-expressions--)'
>
> grouping-expressions-list
>
>      .-,----------.
>      V            |
> >>-----expression-+-------------------------------------------------------><
>
> grouping-set-expressions
>
>      .-,--------------------------.
>      |      .-,----------.        |
>      V      V            |        |
> >>-----+------expression-+------+-+---------------------------------------><
>        '-(--expression-+--)-----'
>
> ansi-sql-grouping-set-expressions
>
> >>-+-ROLLUP--(--grouping-expression-list--)--------+----------------------><
>    +-CUBE--(--grouping-expression-list--)----------+
>    '-GROUPING SETS--(--grouping-set-expressions--)-'
> {code}
[jira] [Resolved] (SPARK-24268) DataType in error messages are not coherent
[ https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24268.
-----------------------------
Resolution: Fixed
Fix Version/s: 2.4.0

> DataType in error messages are not coherent
> -------------------------------------------
>
> Key: SPARK-24268
> URL: https://issues.apache.org/jira/browse/SPARK-24268
> Project: Spark
> Issue Type: Improvement
> Components: ML, SQL
> Affects Versions: 2.3.0
> Reporter: Marco Gaido
> Assignee: Marco Gaido
> Priority: Minor
> Fix For: 2.4.0
>
> In SPARK-22893 there was an attempt to unify the way data types are reported
> in error messages. There, we decided to always use {{dataType.simpleString}}.
> Unfortunately, we missed many places where this still needed to be fixed.
> Moreover, it turns out that the right method to use is not {{simpleString}}
> but {{catalogString}} (for further details please check the discussion in the
> PR https://github.com/apache/spark/pull/21321).
> So we should update all the remaining places in order to report error
> messages coherently throughout the project.
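The practical difference between the two methods shows up mainly for wide types. A hedged sketch (the exact truncation behaviour depends on the Spark version and the spark.sql.debug.maxToStringFields setting):

{code:scala}
import org.apache.spark.sql.types._

// Hedged sketch: for a wide struct, simpleString may truncate the field list
// (subject to spark.sql.debug.maxToStringFields), while catalogString prints
// the full Hive-compatible type name, which is why error messages prefer it.
val wide = StructType((1 to 100).map(i => StructField(s"c$i", IntegerType)))

println(wide.simpleString)  // may end with "... N more fields"
println(wide.catalogString) // full struct<c1:int,c2:int,...,c100:int>
{code}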
[jira] [Comment Edited] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550289#comment-16550289 ]

Dilip Biswal edited comment on SPARK-24864 at 7/20/18 6:23 AM:
---------------------------------------------------------------
[~abhimadav] I don't see a problem here. The generated column name differs
between Spark and Hive. Perhaps in Spark 1.6 the generated column names were
the same between Spark and Hive, i.e. they started with `_c[number]`. In this
repro, Spark by default generates the column name "upper(name)".
{code}
scala> spark.sql("SELECT id, upper(name) FROM src1").printSchema
root
 |-- id: integer (nullable = true)
 |-- upper(name): string (nullable = true)
{code}
So the following would work in Spark:
{code:java}
scala> spark.sql("CREATE VIEW vsrc1new AS SELECT id, `upper(name)` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new");
res13: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from vsrc1new").show()
+---+-----+
| id|uname|
+---+-----+
|  1| TEST|
+---+-----+
{code}
In my opinion, it is good practice to give explicit aliases instead of relying
on system-generated ones, especially if we are looking for portability across
different database systems:

spark.sql("CREATE VIEW vsrc1new AS SELECT id, upper_name AS uname FROM (SELECT id, upper(name) AS upper_name FROM src1)");

cc [~smilegator] We changed the generated column names on purpose to make them
more readable, right?

was (Author: dkbiswal):
[~abhimadav] I don't see a problem here. The generated column name differs
between Spark and Hive. Perhaps in Spark 1.6 the generated column names were
the same between Spark and Hive, i.e. they started with `_c[number]`. In this
repro, Spark by default generates the column name "upper(name)".
{code}
scala> spark.sql("SELECT id, upper(name) FROM src1").printSchema
root
 |-- id: integer (nullable = true)
 |-- upper(name): string (nullable = true)
{code}
So the following would work in Spark:
{code:java}
scala> spark.sql("CREATE VIEW vsrc1new AS SELECT id, `upper(name)` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new");
res13: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from vsrc1new").show()
+---+-----+
| id|uname|
+---+-----+
|  1| TEST|
+---+-----+
{code}
cc [~smilegator] We changed the generated column names on purpose to make them
more readable, right?

> Cannot resolve auto-generated column ordinals in a hive view
> ------------------------------------------------------------
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1, 2.1.0
> Reporter: Abhishek Madav
> Priority: Major
> Fix For: 2.4.0
>
> A Spark job reading from a Hive view fails with an analysis exception when
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>       +- SubqueryAlias vsrc1new
>          +- Project [id#634, upper(name#635) AS upper(name)#636]
>             +- MetastoreRelation default, src1
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1. Create a simple table, say src1:
> {code:java}
> CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> {code}
> 2. Create a view, say with name vsrc1new:
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show // throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly
> missing some spark-sql configuration here? I tried the repro case against
> Spark 1.6 and that worked fine.
[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view
[ https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550289#comment-16550289 ]

Dilip Biswal commented on SPARK-24864:
--------------------------------------
[~abhimadav] I don't see a problem here. The generated column name differs
between Spark and Hive. Perhaps in Spark 1.6 the generated column names were
the same between Spark and Hive, i.e. they started with `_c[number]`. In this
repro, Spark by default generates the column name "upper(name)".
{code}
scala> spark.sql("SELECT id, upper(name) FROM src1").printSchema
root
 |-- id: integer (nullable = true)
 |-- upper(name): string (nullable = true)
{code}
So the following would work in Spark:
{code:java}
scala> spark.sql("CREATE VIEW vsrc1new AS SELECT id, `upper(name)` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new");
res13: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from vsrc1new").show()
+---+-----+
| id|uname|
+---+-----+
|  1| TEST|
+---+-----+
{code}
cc [~smilegator] We changed the generated column names on purpose to make them
more readable, right?

> Cannot resolve auto-generated column ordinals in a hive view
> ------------------------------------------------------------
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1, 2.1.0
> Reporter: Abhishek Madav
> Priority: Major
> Fix For: 2.4.0
>
> A Spark job reading from a Hive view fails with an analysis exception when
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>       +- SubqueryAlias vsrc1new
>          +- Project [id#634, upper(name#635) AS upper(name)#636]
>             +- MetastoreRelation default, src1
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1. Create a simple table, say src1:
> {code:java}
> CREATE TABLE `src1`(`id` int, `name` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> {code}
> 2. Create a view, say with name vsrc1new:
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show // throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly
> missing some spark-sql configuration here? I tried the repro case against
> Spark 1.6 and that worked fine. Any inputs are appreciated.