[jira] [Commented] (SPARK-24873) Add a switch to suppress frequent interaction reports from YARN

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551527#comment-16551527
 ] 

Apache Spark commented on SPARK-24873:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21784

> Add a switch to suppress frequent interaction reports from YARN
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Priority: Major
> Attachments: pic.jpg
>
>
> There are too many frequent interaction reports from YARN when I use the spark-shell 
> command, which interfere with my input, so I think a switch is needed to suppress 
> these frequent interaction reports.
>  
> !pic.jpg!
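A possible mitigation sketch, under the assumption that the noise comes from the INFO-level 
YARN application-report log lines printed while spark-shell stays attached to the application 
(an assumption, not a confirmed diagnosis from this issue): raising the log level in the shell 
silences them without any code change.

{code:scala}
// Assumes a running spark-shell session where `sc` (the SparkContext) is in scope.
// Suppresses INFO-level output, including repeated YARN application reports,
// while keeping warnings and errors visible.
sc.setLogLevel("WARN")
{code}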



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24836) New option - ignoreExtension

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24836.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> New option - ignoreExtension
> 
>
> Key: SPARK-24836
> URL: https://issues.apache.org/jira/browse/SPARK-24836
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> A new option, *ignoreExtension*, needs to be added to the Avro datasource. It should 
> control whether the .avro extension check is ignored. If it is set to *true* (the 
> default), files both with and without the .avro extension should be loaded. Example 
> of usage:
> {code:scala}
> spark
>   .read
>   .option("ignoreExtension", false)
>   .avro("path to avro files")
> {code}
> The option duplicates Hadoop's config 
> avro.mapred.ignore.inputs.without.extension, which the Avro datasource currently 
> takes into account and which can be set like:
> {code:scala}
> spark
>   .sqlContext
>   .sparkContext
>   .hadoopConfiguration
>   .set("avro.mapred.ignore.inputs.without.extension", "true")
> {code}
> The ignoreExtension option must override 
> avro.mapred.ignore.inputs.without.extension.
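To illustrate the requested precedence, here is a minimal sketch with a hypothetical helper 
(not the actual datasource code); the defaults are illustrative, and note that the two settings 
appear to have opposite polarity (ignoreExtension=true loads files regardless of extension, 
while avro.mapred.ignore.inputs.without.extension=true skips files without the extension):

{code:scala}
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: returns true when files should be loaded regardless of extension.
def loadRegardlessOfExtension(options: Map[String, String],
                              hadoopConf: Configuration): Boolean =
  options.get("ignoreExtension") match {
    case Some(v) => v.toBoolean            // the explicit DataFrameReader option wins
    case None    =>                        // otherwise fall back to the legacy Hadoop conf
      !hadoopConf.getBoolean("avro.mapred.ignore.inputs.without.extension", false)
  }
{code}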



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24879.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.2

> NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
> ---
>
> Key: SPARK-24879
> URL: https://issues.apache.org/jira/browse/SPARK-24879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: William Sheu
>Assignee: William Sheu
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> The following query triggers a NPE:
> {code:java}
> create table foo (col1 int) partitioned by (col2 int);
> select * from foo where col2 in (1, NULL);
> {code}
> We try to push down the filter to Hive in order to do partition pruning, but 
> the filter converter breaks on a `null`.
> Here's the stack:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:355)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189)
> 
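As a standalone illustration of the failure mode (a hedged sketch, not the actual HiveShim code): 
pattern-matching on a literal's value without a null guard throws once the IN list contains NULL, 
so the converter needs to treat such literals as non-pushable.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Literal

// Hypothetical extractor: a NULL literal cannot be rendered into a Hive filter string,
// so it should be skipped (making the predicate non-pushable) instead of throwing.
def extractPushableValue(lit: Literal): Option[String] = lit.value match {
  case null => None                 // guard that avoids the NullPointerException
  case v    => Some(v.toString)
}
{code}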

[jira] [Assigned] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24879:
---

Assignee: William Sheu

> NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
> ---
>
> Key: SPARK-24879
> URL: https://issues.apache.org/jira/browse/SPARK-24879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: William Sheu
>Assignee: William Sheu
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> The following query triggers a NPE:
> {code:java}
> create table foo (col1 int) partitioned by (col2 int);
> select * from foo where col2 in (1, NULL);
> {code}
> We try to push down the filter to Hive in order to do partition pruning, but 
> the filter converter breaks on a `null`.
> Here's the stack:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:355)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189)
> at 
> 

[jira] [Comment Edited] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551457#comment-16551457
 ] 

Liang-Chi Hsieh edited comment on SPARK-24875 at 7/21/18 12:21 AM:
---

Hmm, I think for the calculation of precision, recall and true/false positive rate, 
we should care about exact numbers, not approximate ones. Given that, is it 
reasonable to use countByValueApprox here?


was (Author: viirya):
Hmm, I think for the calculation of precision, recall and true/false positive rate, 
we should care about exact calculation, not an approximate one. Given that, is it 
reasonable to use countByValueApprox here?

> MulticlassMetrics should offer a more efficient way to compute count by label
> -
>
> Key: SPARK-24875
> URL: https://issues.apache.org/jira/browse/SPARK-24875
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Antoine Galataud
>Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by 
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding 
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, 
> I don't know how this could be ported there.
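For reference, a hedged sketch of what an approximate label count could look like; 
countByValueApprox is an existing RDD API, but the helper and its parameters here are 
illustrative, not a proposed MulticlassMetrics change:

{code:scala}
import org.apache.spark.rdd.RDD

// Illustrative helper: approximate count per label, taking the result available
// after `timeoutMs` and using the mean of each bounded estimate.
def approxLabelCounts(labels: RDD[Double],
                      timeoutMs: Long = 10000L,
                      confidence: Double = 0.95): Map[Double, Long] =
  labels.countByValueApprox(timeoutMs, confidence)
    .initialValue
    .map { case (label, bounded) => label -> bounded.mean.toLong }
    .toMap
{code}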



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551457#comment-16551457
 ] 

Liang-Chi Hsieh commented on SPARK-24875:
-

Hmm, I think for the calculation of precision, recall and true/false positive rate, 
we should care about exact calculation, not an approximate one. Given that, is it 
reasonable to use countByValueApprox here?

> MulticlassMetrics should offer a more efficient way to compute count by label
> -
>
> Key: SPARK-24875
> URL: https://issues.apache.org/jira/browse/SPARK-24875
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Antoine Galataud
>Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by 
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding 
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in new ML library, 
> I don't know how this could be ported there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551416#comment-16551416
 ] 

Liang-Chi Hsieh commented on SPARK-24862:
-

Isn't that inconsistent between the schema and the ser/de? And on the serializer 
side, for example, how can we get {{y}} from {{Multi}} objects?
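To make the question concrete, here is a small standalone sketch of the Scala semantics involved: 
only the first parameter list of a case class takes part in the generated Product, and the 
second-list parameter has no accessor unless it is declared as a val.

{code:scala}
case class Multi(x: String)(y: Int)

val m = Multi("a")(1)
m.productArity        // 1 (only `x` is part of the Product)
// m.y                // does not compile: `y` is not a member unless declared `val y: Int`
{code}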

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>
> The Spark Encoder is not consistent with Scala case class semantics for multiple 
> argument lists.
> For example, if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a product with arity 1, while if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> at 
> 

[jira] [Resolved] (SPARK-24488) Analyzer throws when generator is aliased multiple times

2018-07-20 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-24488.
---
   Resolution: Fixed
 Assignee: Brandon Krieger
Fix Version/s: 2.4.0

> Analyzer throws when generator is aliased multiple times
> 
>
> Key: SPARK-24488
> URL: https://issues.apache.org/jira/browse/SPARK-24488
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Brandon Krieger
>Assignee: Brandon Krieger
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the Analyzer throws an exception if you try to nest a generator. 
> However, it special cases generators "nested" in an alias, and allows that. 
> If you try to alias a generator twice, it is not caught by the special case, 
> so an exception is thrown:
>  
> {code:java}
> scala> Seq(("a", "b"))
> .toDF("col1","col2")
> .select(functions.array('col1,'col2).as("arr"))
> .select(functions.explode('arr).as("first").as("second"))
> .collect()
> org.apache.spark.sql.AnalysisException: Generators are not supported when 
> it's nested in expressions, but got: explode(arr) AS `first`;
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$23.applyOrElse(Analyzer.scala:1604)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$23.applyOrElse(Analyzer.scala:1601)
> {code}
>  
> In reality, aliasing twice is fine, so we can fix this by trimming 
> non-top-level aliases.
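As an illustration (a sketch assuming a spark-shell session, where spark.implicits._ is in scope 
just as in the snippet above), the same query with a single alias analyzes fine today, which is 
why trimming the extra, non-top-level alias is expected to be safe:

{code:scala}
import org.apache.spark.sql.functions

Seq(("a", "b"))
  .toDF("col1", "col2")
  .select(functions.array('col1, 'col2).as("arr"))
  .select(functions.explode('arr).as("second"))   // one alias only, not rejected
  .collect()
{code}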



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests

2018-07-20 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-24880.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> Fix the group id for spark-kubernetes-integration-tests
> ---
>
> Key: SPARK-24880
> URL: https://issues.apache.org/jira/browse/SPARK-24880
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> The correct group id should be `org.apache.spark`. This is causing the 
> nightly build failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/2295/console
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on 
> project spark-kubernetes-integration-tests_2.11: Failed to deploy artifacts: 
> Could not transfer artifact 
> spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-20180720.101629-1
>  from/to apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Access denied 
> to: 
> https://repository.apache.org/content/repositories/snapshots/spark-kubernetes-integration-tests/spark-kubernetes-integration-tests_2.11/2.4.0-SNAPSHOT/spark-kubernetes-integration-tests_2.11-2.4.0-20180720.101629-1.jar,
>  ReasonPhrase: Forbidden. -> [Help 1]
> [ERROR] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24876) Simplify schema serialization

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24876.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Simplify schema serialization
> -
>
> Key: SPARK-24876
> URL: https://issues.apache.org/jira/browse/SPARK-24876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Previously, in the refactoring of the Avro Serializer and Deserializer, a new 
> class SerializableSchema was created for serializing the Avro schema.
> [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18]
>  
> On second thought, we can use the `toString` method for serialization and 
> then parse the JSON-format schema on the executor. This makes the code much 
> simpler.
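A minimal sketch of the simplification described above (standalone Avro API usage, not the actual 
Spark change): the schema's JSON string is an ordinary serializable String, and the executor side 
can rebuild the Schema from it.

{code:scala}
import org.apache.avro.Schema

// Driver side: Schema.toString yields the schema in its JSON form.
val original: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"r","fields":[{"name":"x","type":"string"}]}""")
val schemaJson: String = original.toString

// Executor side: re-parse the JSON string back into a Schema.
val rebuilt: Schema = new Schema.Parser().parse(schemaJson)
{code}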



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-20 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551312#comment-16551312
 ] 

Antonio Murgia commented on SPARK-24862:


Yeah, they are definitely not supported. Therefore I think the encoder 
generator should generate the schema based on the first parameter list and the 
ser/de based on all the parameter lists. I can think about a PR if you'd like.

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>
> The Spark Encoder is not consistent with Scala case class semantics for multiple 
> argument lists.
> For example, if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a product with arity 1, while if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> at 
> 

[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-07-20 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551299#comment-16551299
 ] 

Thomas Graves commented on SPARK-23128:
---

[~carsonwang] I'm curious whether you are still running with this and how it has 
been working out. This is definitely interesting; it looks like it just stalled 
waiting for reviews, but I wasn't sure.

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds the ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that can decrease 
> performance; we can see this in the EnsureRequirements rule when it adds the 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges, because we don't have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For example, for 
> a three-table join in a single stage, the same ExchangeCoordinator should be used 
> by all three Exchanges, but currently two separate ExchangeCoordinators will be 
> added. Thirdly, with the current framework it is not easy to flexibly implement other 
> features in adaptive execution, like changing the execution plan and 
> handling skewed joins at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
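For context, a brief sketch of how the existing adaptive execution is switched on today (these two 
SQL confs exist in current releases; whether the new approach described above keeps the same keys 
is an open question, so treat this as illustrative only):

{code:scala}
// Enables the current adaptive execution path (off by default) and tunes the
// target size used when coalescing post-shuffle partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864") // 64 MB
{code}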



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-20 Thread Abhishek Madav (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551296#comment-16551296
 ] 

Abhishek Madav commented on SPARK-24864:


Thanks for the reply. The views are currently created by the customer, and the 
Spark job hasn't been able to keep up with the upgrade from 1.6 -> 2.0+, hence 
they feel it is a regression. Is there anything that can be done to go back to 
the 1.6 way of column referencing?

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
>
> A Spark job reading from a Hive view fails with an analysis exception when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src1
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.
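A possible workaround sketch (an assumption based on the repro, not a confirmed fix): recreating 
the view with an explicit alias in the inner query avoids relying on the auto-generated `_c1` 
name entirely. The view name below is hypothetical.

{code:scala}
spark.sql("""
  CREATE VIEW vsrc1new_explicit AS
  SELECT id, uname
  FROM (SELECT id, upper(name) AS uname FROM src1) t
""")
spark.sql("SELECT * FROM vsrc1new_explicit").show()
{code}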



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-20 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551240#comment-16551240
 ] 

Thomas Graves commented on SPARK-24615:
---

The other thing, which I think I mentioned above: could this handle a request like 
"I want 1 node with 4 GB and 4 cores, and 3 nodes with 2 GPUs, 10 GB, and 1 core each"?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on a node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that every node is equipped with CPUs, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24879:

Shepherd: Xiao Li  (was: William Sheu)

> NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
> ---
>
> Key: SPARK-24879
> URL: https://issues.apache.org/jira/browse/SPARK-24879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: William Sheu
>Priority: Major
>
> The following query triggers a NPE:
> {code:java}
> create table foo (col1 int) partitioned by (col2 int);
> select * from foo where col2 in (1, NULL);
> {code}
> We try to push down the filter to Hive in order to do partition pruning, but 
> the filter converter breaks on a `null`.
> Here's the stack:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:355)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> 

[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-20 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551233#comment-16551233
 ] 

Thomas Graves commented on SPARK-24615:
---

Ok, so thinking about this a bit more, I slightly misread what it was applying 
to. Really you are associating it with the new RDD that will be created (val 
rddWithGPUResult = rdd.withResources.xxx), not the original, and to regenerate 
the new rddWithGPUResult you would want to know it was created using those 
resources. The thing that isn't clear to me is the scoping of this.

For instance, say I have the code

val rdd1 = sc.textFile("README.md")

val rdd2 = rdd1.withResources.mapPartitions().collect()

Does the withResources apply to the entire line, up to the action?

What if I change it to

val rdd2 = rdd1.withResources.mapPartitions()

val res = rdd2.collect()

Does the withResources then only apply to the mapPartitions, which is really what I 
think you want for some of the ML algorithms? So we need to define what it 
applies to. Something similar, but with a more obvious scope, would be:

val rdd2 = withResources {

   rdd1.mapPartitions()

}

The above would make the scope very obvious to the user.

You also have things people could do like:

val rdd1 = rdd.withResources(x).mapPartitions()

val rdd2 = rdd.withResources(y).mapPartitions()

val rdd3 = rdd1.join(rdd2)

I think in this case rdd1 and rdd2 have to be individually materialized before 
you do the join for rdd3.

So it's more like an implicit val rdd1 = 
rdd.withResources(x).mapPartitions().eval(). You end up putting in some 
stage boundaries.

Have you thought about the scope, and do you have ideas around this?

 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on a node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we always assume that every node is equipped with CPUs, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24879:


Assignee: (was: Apache Spark)

> NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
> ---
>
> Key: SPARK-24879
> URL: https://issues.apache.org/jira/browse/SPARK-24879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: William Sheu
>Priority: Major
>
> The following query triggers a NPE:
> {code:java}
> create table foo (col1 int) partitioned by (col2 int);
> select * from foo where col2 in (1, NULL);
> {code}
> We try to push down the filter to Hive in order to do partition pruning, but 
> the filter converter breaks on a `null`.
> Here's the stack:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:355)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 

[jira] [Commented] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551200#comment-16551200
 ] 

Apache Spark commented on SPARK-24879:
--

User 'PenguinToast' has created a pull request for this issue:
https://github.com/apache/spark/pull/21832

> NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
> ---
>
> Key: SPARK-24879
> URL: https://issues.apache.org/jira/browse/SPARK-24879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: William Sheu
>Priority: Major
>
> The following query triggers a NPE:
> {code:java}
> create table foo (col1 int) partitioned by (col2 int);
> select * from foo where col2 in (1, NULL);
> {code}
> We try to push down the filter to Hive in order to do partition pruning, but 
> the filter converter breaks on a `null`.
> Here's the stack:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:355)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
> at 
> 

[jira] [Assigned] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24879:


Assignee: Apache Spark

> NPE in Hive partition filter pushdown for `partCol IN (NULL, )`
> ---
>
> Key: SPARK-24879
> URL: https://issues.apache.org/jira/browse/SPARK-24879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: William Sheu
>Assignee: Apache Spark
>Priority: Major
>
> The following query triggers a NPE:
> {code:java}
> create table foo (col1 int) partitioned by (col2 int);
> select * from foo where col2 in (1, NULL);
> {code}
> We try to push down the filter to Hive in order to do partition pruning, but 
> the filter converter breaks on a `null`.
> Here's the stack:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:355)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
> at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
> at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
> at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189)
> at 
> 

[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-20 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551198#comment-16551198
 ] 

Mridul Muralidharan commented on SPARK-24615:
-

[~tgraves] This was indeed a recurring issue - the ability to modulate asks to 
the RM based on current requirements.
What you bring out is an excellent point - changing resource requirements would 
be very useful, particularly for applications with heterogeneous resource 
needs. Even today, when executor_memory/executor_cores does not align well 
with stage requirements, we end up with OOMs, which leads to over-provisioning 
memory and, in turn, suboptimal use. A GPU/accelerator-aware scheduler is an 
extension of the same idea - we simply have other resources to consider.

I agree with [~tgraves] that a more general way to model this would look at all 
resources (when declaratively specified, of course) and use that information to 
allocate resources (from the RM) and for task scheduling (within Spark).



> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark’s current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22880) Add option to cascade jdbc truncate if database supports this (PostgreSQL and Oracle)

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22880.
-
   Resolution: Fixed
 Assignee: Daniel van der Ende
Fix Version/s: 2.4.0

> Add option to cascade jdbc truncate if database supports this (PostgreSQL and 
> Oracle)
> -
>
> Key: SPARK-22880
> URL: https://issues.apache.org/jira/browse/SPARK-22880
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel van der Ende
>Assignee: Daniel van der Ende
>Priority: Minor
> Fix For: 2.4.0
>
>
> When truncating tables, PostgreSQL and Oracle support a cascading option for 
> `TRUNCATE`. This cascades the truncate to tables with foreign key constraints 
> on a column in the table being truncated. It would be nice to be able to 
> optionally enable this cascading behavior for PostgreSQL and Oracle.
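
For illustration, a minimal sketch of what using such an option from the JDBC 
writer could look like. The option name `cascadeTruncate` is an assumption here 
(matching the intent of the linked PR), not a confirmed API:
{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

// Sketch only: "cascadeTruncate" is a hypothetical option name for the behavior
// described above; "truncate" is the existing JDBC writer option.
val spark = SparkSession.builder().appName("cascade-truncate-sketch").getOrCreate()
val df = spark.range(10).toDF("id")

val props = new Properties()
props.setProperty("user", "postgres")

df.write
  .mode("overwrite")
  .option("truncate", "true")          // truncate the existing table instead of dropping it
  .option("cascadeTruncate", "true")   // hypothetical: cascade to FK-referencing tables
  .jdbc("jdbc:postgresql://localhost/test", "parent_table", props)
{code}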



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests

2018-07-20 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-24880:
-
Description: 
The correct group id should be `org.apache.spark`. This is causing the nightly 
build failure: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/2295/console

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on 
project spark-kubernetes-integration-tests_2.11: Failed to deploy artifacts: 
Could not transfer artifact 
spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-20180720.101629-1
 from/to apache.snapshots.https 
(https://repository.apache.org/content/repositories/snapshots): Access denied 
to: 
https://repository.apache.org/content/repositories/snapshots/spark-kubernetes-integration-tests/spark-kubernetes-integration-tests_2.11/2.4.0-SNAPSHOT/spark-kubernetes-integration-tests_2.11-2.4.0-20180720.101629-1.jar,
 ReasonPhrase: Forbidden. -> [Help 1]
[ERROR] 
{code}

> Fix the group id for spark-kubernetes-integration-tests
> ---
>
> Key: SPARK-24880
> URL: https://issues.apache.org/jira/browse/SPARK-24880
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> The correct group id should be `org.apache.spark`. This is causing the 
> nightly build failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/2295/console
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:2.8.2:deploy (default-deploy) on 
> project spark-kubernetes-integration-tests_2.11: Failed to deploy artifacts: 
> Could not transfer artifact 
> spark-kubernetes-integration-tests:spark-kubernetes-integration-tests_2.11:jar:2.4.0-20180720.101629-1
>  from/to apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Access denied 
> to: 
> https://repository.apache.org/content/repositories/snapshots/spark-kubernetes-integration-tests/spark-kubernetes-integration-tests_2.11/2.4.0-SNAPSHOT/spark-kubernetes-integration-tests_2.11-2.4.0-20180720.101629-1.jar,
>  ReasonPhrase: Forbidden. -> [Help 1]
> [ERROR] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24880:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix the group id for spark-kubernetes-integration-tests
> ---
>
> Key: SPARK-24880
> URL: https://issues.apache.org/jira/browse/SPARK-24880
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551191#comment-16551191
 ] 

Apache Spark commented on SPARK-24880:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/21831

> Fix the group id for spark-kubernetes-integration-tests
> ---
>
> Key: SPARK-24880
> URL: https://issues.apache.org/jira/browse/SPARK-24880
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24880:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix the group id for spark-kubernetes-integration-tests
> ---
>
> Key: SPARK-24880
> URL: https://issues.apache.org/jira/browse/SPARK-24880
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Kubernetes
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24880) Fix the group id for spark-kubernetes-integration-tests

2018-07-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-24880:


 Summary: Fix the group id for spark-kubernetes-integration-tests
 Key: SPARK-24880
 URL: https://issues.apache.org/jira/browse/SPARK-24880
 Project: Spark
  Issue Type: Bug
  Components: Build, Kubernetes
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551174#comment-16551174
 ] 

Liang-Chi Hsieh commented on SPARK-24862:
-

Even if we only retrieve the first parameter list in {{getConstructorParameters}}, 
when we need to deserialize {{Multi}} we don't have {{y}} among the input columns, 
because we only serialize {{x}}. I think a case class with multiple parameter 
lists is not supported by Encoder.
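
A small, self-contained Scala sketch of the mismatch being discussed (plain 
Scala, no Spark needed to see it):
{code:scala}
// Only the first parameter list participates in the case class's Product view.
case class Multi(x: String)(y: Int)

val m = Multi("a")(1)
assert(m.productArity == 1)   // only `x` is a product element
// No val is generated for `y`, so there is no field to read it back from.
// An Encoder that derives columns for both parameter lists therefore has
// nothing to bind `y` to at deserialization time.
{code}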

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>
> The Spark Encoder is not consistent with Scala case class semantics for 
> multiple argument lists.
> For example, if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a product with arity 1, while if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 

[jira] [Resolved] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.

2018-07-20 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24852.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21799
[https://github.com/apache/spark/pull/21799]

> Have spark.ml training use updated `Instrumentation` APIs.
> --
>
> Key: SPARK-24852
> URL: https://issues.apache.org/jira/browse/SPARK-24852
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Port spark.ml code to use the new methods on the `Instrumentation` class and 
> remove the old methods & constructor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24864.
-
Resolution: Won't Fix

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
>
> A Spark job reading from a Hive view fails with an analysis exception when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-20 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551123#comment-16551123
 ] 

Xiao Li commented on SPARK-24864:
-

Yeah, our generated alias names are different from the ones generated by Hive. 
Please explicitly specify the alias names in your query. 
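
As a concrete illustration of that suggestion, a sketch of the repro's view 
rewritten with an explicit alias (the new view name is chosen here only for 
illustration):
{code:scala}
// Avoid Hive's auto-generated `_c1` name by aliasing the computed column directly.
spark.sql(
  """CREATE VIEW vsrc1new_aliased AS
    |SELECT id, upper(name) AS uname
    |FROM src1""".stripMargin)

spark.sql("SELECT * FROM vsrc1new_aliased").show()
{code}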

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
>
> A Spark job reading from a Hive view fails with an analysis exception when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24879) NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`

2018-07-20 Thread William Sheu (JIRA)
William Sheu created SPARK-24879:


 Summary: NPE in Hive partition filter pushdown for `partCol IN 
(NULL, )`
 Key: SPARK-24879
 URL: https://issues.apache.org/jira/browse/SPARK-24879
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: William Sheu


The following query triggers a NPE:
{code:java}
create table foo (col1 int) partitioned by (col2 int);
select * from foo where col2 in (1, NULL);
{code}
We try to push down the filter to Hive in order to do partition pruning, but 
the filter converter breaks on a `null`.
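
Until the converter handles the NULL literal, a sketch of an equivalent query 
that avoids it; for row filtering, `col2 IN (1, NULL)` returns the same rows as 
`col2 IN (1)`, since a NULL element can never make the predicate true:
{code:scala}
// Illustrative workaround: drop the NULL from the IN list so the partition
// filter can still be pushed down to Hive without hitting the NPE.
spark.sql("SELECT * FROM foo WHERE col2 IN (1)").show()
{code}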

Here's the stack:
{code:java}
java.lang.NullPointerException
at 
org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$2$.unapply(HiveShim.scala:601)
at 
org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
at 
org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$$anonfun$5.apply(HiveShim.scala:609)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiterals$2$.unapply(HiveShim.scala:609)
at 
org.apache.spark.sql.hive.client.Shim_v0_13.org$apache$spark$sql$hive$client$Shim_v0_13$$convert$1(HiveShim.scala:671)
at 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
at 
org.apache.spark.sql.hive.client.Shim_v0_13$$anonfun$convertFilters$1.apply(HiveShim.scala:704)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at 
org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:704)
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:725)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:678)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:676)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:676)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:955)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions$lzycompute(HiveTableScanExec.scala:172)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.rawPartitions(HiveTableScanExec.scala:164)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:190)
at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2418)
at 
org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:189)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 

[jira] [Commented] (SPARK-24878) Fix reverse function for array type of primitive type containing null.

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550981#comment-16550981
 ] 

Apache Spark commented on SPARK-24878:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21830

> Fix reverse function for array type of primitive type containing null.
> --
>
> Key: SPARK-24878
> URL: https://issues.apache.org/jira/browse/SPARK-24878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> If we use the {{reverse}} function on an array of a primitive type containing 
> {{null}} and the child array is {{UnsafeArrayData}}, the function returns a 
> wrong result, because {{UnsafeArrayData}} doesn't define the behavior of 
> re-assignment; in particular, we can't set a valid value after we set {{null}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24878) Fix reverse function for array type of primitive type containing null.

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24878:


Assignee: Apache Spark

> Fix reverse function for array type of primitive type containing null.
> --
>
> Key: SPARK-24878
> URL: https://issues.apache.org/jira/browse/SPARK-24878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> If we use the {{reverse}} function on an array of a primitive type containing 
> {{null}} and the child array is {{UnsafeArrayData}}, the function returns a 
> wrong result, because {{UnsafeArrayData}} doesn't define the behavior of 
> re-assignment; in particular, we can't set a valid value after we set {{null}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24878) Fix reverse function for array type of primitive type containing null.

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24878:


Assignee: (was: Apache Spark)

> Fix reverse function for array type of primitive type containing null.
> --
>
> Key: SPARK-24878
> URL: https://issues.apache.org/jira/browse/SPARK-24878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> If we use the {{reverse}} function on an array of a primitive type containing 
> {{null}} and the child array is {{UnsafeArrayData}}, the function returns a 
> wrong result, because {{UnsafeArrayData}} doesn't define the behavior of 
> re-assignment; in particular, we can't set a valid value after we set {{null}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24878) Fix reverse function for array type of primitive type containing null.

2018-07-20 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-24878:
-

 Summary: Fix reverse function for array type of primitive type 
containing null.
 Key: SPARK-24878
 URL: https://issues.apache.org/jira/browse/SPARK-24878
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takuya Ueshin


If we use the {{reverse}} function on an array of a primitive type containing 
{{null}} and the child array is {{UnsafeArrayData}}, the function returns a 
wrong result, because {{UnsafeArrayData}} doesn't define the behavior of 
re-assignment; in particular, we can't set a valid value after we set {{null}}.
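
For illustration, a query of the shape that can exercise this path (whether the 
child array is actually backed by {{UnsafeArrayData}} depends on the physical 
plan):
{code:scala}
// reverse() over a primitive-typed array containing a null element.
// When the child array is UnsafeArrayData, re-assigning an element after a
// null has been written is undefined, which is what yields the wrong result.
spark.sql("SELECT reverse(array(1, null, 3))").show(truncate = false)
{code}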



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24792) Add API `.avro` in DataFrameReader/DataFrameWriter

2018-07-20 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550949#comment-16550949
 ] 

Xiao Li commented on SPARK-24792:
-

Since Avro is an external module, it does not make sense to have this API. 
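
For reference, with Avro as an external module the generic format-based API 
covers the same use case (a sketch; paths are placeholders):
{code:scala}
// No dedicated .avro shortcut; the data source is addressed by format name.
val df = spark.read.format("avro").load("/path/to/input.avro")
df.write.format("avro").save("/path/to/output")
{code}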

> Add API `.avro` in DataFrameReader/DataFrameWriter
> --
>
> Key: SPARK-24792
> URL: https://issues.apache.org/jira/browse/SPARK-24792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add API `.avro` in DataFrameReader/DataFrameWriter
> remove the implicit class AvroDataFrameWriter/Reader
>  
> https://github.com/apache/spark/pull/21742#pullrequestreview-136075421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24792) Add API `.avro` in DataFrameReader/DataFrameWriter

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24792.
-
Resolution: Won't Fix

> Add API `.avro` in DataFrameReader/DataFrameWriter
> --
>
> Key: SPARK-24792
> URL: https://issues.apache.org/jira/browse/SPARK-24792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add API `.avro` in DataFrameReader/DataFrameWriter
> remove the implicit class AvroDataFrameWriter/Reader
>  
> https://github.com/apache/spark/pull/21742#pullrequestreview-136075421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24811) Add function `from_avro` and `to_avro`

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24811.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Add function `from_avro` and `to_avro`
> --
>
> Key: SPARK-24811
> URL: https://issues.apache.org/jira/browse/SPARK-24811
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Add a new function from_avro for parsing a binary column in Avro format and 
> converting it into its corresponding Catalyst value.
> Add a new function to_avro for converting a column into Avro-format binary 
> with the specified schema.
>  
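
A usage sketch based on the description above (package and signatures as in the 
external Avro module; details may differ):
{code:scala}
import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.struct
import spark.implicits._   // assumes an active SparkSession named `spark`

// Assumed JSON-format Avro schema for the record being encoded and decoded.
val jsonFormatSchema =
  """{"type":"record","name":"User","fields":[
    |{"name":"name","type":"string"},{"name":"age","type":"int"}]}""".stripMargin

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// to_avro encodes a column (here a struct) as Avro binary;
// from_avro parses it back into its Catalyst value using the schema.
val encoded = df.select(to_avro(struct($"name", $"age")).as("value"))
val decoded = encoded.select(from_avro($"value", jsonFormatSchema).as("user"))
decoded.select("user.name", "user.age").show()
{code}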



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23451) Deprecate KMeans computeCost

2018-07-20 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23451.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> Deprecate KMeans computeCost
> 
>
> Key: SPARK-23451
> URL: https://issues.apache.org/jira/browse/SPARK-23451
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.4.0
>
>
> SPARK-11029 added the {{computeCost}} method as a temp fix for the lack of 
> proper cluster evaluators. Now SPARK-14516 introduces a proper 
> {{ClusteringEvaluator}}, so we should deprecate this method and maybe remove 
> it in the next releases.
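
A sketch of the replacement usage (assumes a DataFrame `dataset` with a 
"features" vector column):
{code:scala}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Instead of model.computeCost(dataset), evaluate the clustering with the
// ClusteringEvaluator from SPARK-14516 (silhouette score by default).
val model = new KMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
{code}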



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24877) Ignore the task completion event from a zombie barrier task

2018-07-20 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24877:


 Summary: Ignore the task completion event from a zombie barrier 
task
 Key: SPARK-24877
 URL: https://issues.apache.org/jira/browse/SPARK-24877
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jiang Xingbo


Currently we abort the barrier stage if a zombie barrier task can't be killed, in 
order to prevent data correctness issues. We can improve the behavior by letting a 
zombie barrier task continue running while preventing it from interacting with 
other barrier tasks (maybe from a different stage attempt), and by ignoring the 
task completion event from a zombie barrier task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-20 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550803#comment-16550803
 ] 

Thomas Graves commented on SPARK-24615:
---

Yes, if any requirement can't be satisfied it would use dynamic allocation to 
release and reacquire containers. I'm not saying we have to implement those 
parts right now; I'm saying we should keep them in mind during the design of 
this so they could be added later. I linked one old JIRA that was about 
dynamically changing things. It's been brought up many times since, in PRs and 
in conversations with customers - not sure if there are other JIRAs as well. It's 
also somewhat related to SPARK-20589, where people just want to configure things 
per stage.

I actually question whether this should be done at the RDD level as well. A set 
of partitions doesn't care what the resources are; it's generally the action you 
are taking on those RDD(s). Note it could be more than one RDD. I could do ETL 
work on an RDD whose resource needs would be totally different than if I ran 
TensorFlow on that RDD, for example. I do realize this is being tied in with 
the barrier work, which is on mapPartitions.

I'm not trying to be difficult, and I realize this JIRA is more specific to the 
external ML algos, but I don't want many APIs for the same thing.

I unfortunately haven't thought through a good solution for this. A while back 
my initial thought was to be able to pass that resource context into the API 
calls, but that obviously gets more tricky, especially with pure SQL support. I 
need to think about it some more. The above proposal for .withResources is 
definitely closer, but I still wonder about tying it to the RDD.

cc [~irashid] [~mridulm80], who I think this has been brought up with before.
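
To make the API question concrete, a purely hypothetical sketch of the two 
shapes being debated; none of these names exist in Spark:
{code:scala}
// Hypothetical sketch only - every name below is invented for illustration.
case class ResourceRequest(resource: String, amount: Int)

val etlResources   = Seq(ResourceRequest("cpus", 4))
val trainResources = Seq(ResourceRequest("cpus", 2), ResourceRequest("gpu", 1))

// Shape 1: resources attached to the RDD (the .withResources proposal):
//   rdd.withResources(trainResources).barrier().mapPartitions(trainPartition)
// Shape 2: resources attached to the action/stage that consumes the RDD:
//   rdd.mapPartitionsWithResources(trainResources)(trainPartition)
// The same RDD could then run ETL with etlResources and training with
// trainResources, without baking either profile into the data itself.
{code}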

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark’s current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that CPUs are present on each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24876) Simplify schema serialization

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550763#comment-16550763
 ] 

Apache Spark commented on SPARK-24876:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21829

> Simplify schema serialization
> -
>
> Key: SPARK-24876
> URL: https://issues.apache.org/jira/browse/SPARK-24876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Previously, in the refactoring of the Avro Serializer and Deserializer, a new 
> class SerializableSchema was created for serializing the Avro schema.
> [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18]
>  
> On second thought, we can use the `toString` method for serialization and 
> then parse the JSON-format schema on the executor. This makes the code much 
> simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24876) Simplify schema serialization

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24876:


Assignee: Apache Spark

> Simplify schema serialization
> -
>
> Key: SPARK-24876
> URL: https://issues.apache.org/jira/browse/SPARK-24876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Previously, in the refactoring of the Avro Serializer and Deserializer, a new 
> class SerializableSchema was created for serializing the Avro schema.
> [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18]
>  
> On second thought, we can use the `toString` method for serialization and 
> then parse the JSON-format schema on the executor. This makes the code much 
> simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24876) Simplify schema serialization

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24876:


Assignee: (was: Apache Spark)

> Simplify schema serialization
> -
>
> Key: SPARK-24876
> URL: https://issues.apache.org/jira/browse/SPARK-24876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Previously, in the refactoring of the Avro Serializer and Deserializer, a new 
> class SerializableSchema was created for serializing the Avro schema.
> [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18]
>  
> On second thought, we can use the `toString` method for serialization and 
> then parse the JSON-format schema on the executor. This makes the code much 
> simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24876) Simplify schema serialization

2018-07-20 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-24876:
---
Summary: Simplify schema serialization  (was: Remove SerializableSchema and 
use json format string schema)

> Simplify schema serialization
> -
>
> Key: SPARK-24876
> URL: https://issues.apache.org/jira/browse/SPARK-24876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Previously, in the refactoring of the Avro Serializer and Deserializer, a new 
> class SerializableSchema was created for serializing the Avro schema.
> [https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18]
>  
> On second thought, we can use the `toString` method for serialization and 
> then parse the JSON-format schema on the executor. This makes the code much 
> simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24876) Remove SerializableSchema and use json format string schema

2018-07-20 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-24876:
--

 Summary: Remove SerializableSchema and use json format string 
schema
 Key: SPARK-24876
 URL: https://issues.apache.org/jira/browse/SPARK-24876
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


Previously, in the refactoring of the Avro Serializer and Deserializer, a new 
class SerializableSchema was created for serializing the Avro schema.

[https://github.com/apache/spark/commit/96030876383822645a5b35698ee407a8d4eb76af#diff-7ca6378b3afe21467a274983522ec48eR18]
 

On second thought, we can use the `toString` method for serialization and then 
parse the JSON-format schema on the executor. This makes the code much simpler.
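
A minimal sketch of the simplification, using only the Avro API (how it is wired 
into the serializer is omitted):
{code:scala}
import org.apache.avro.Schema

// Driver side: ship the schema as its JSON string form instead of wrapping the
// Schema object in a SerializableSchema.
val avroSchema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[{"name":"name","type":"string"}]}""")
val schemaJson: String = avroSchema.toString   // JSON-format schema as a plain String

// Executor side: rebuild the Schema from the JSON string.
val rebuilt: Schema = new Schema.Parser().parse(schemaJson)
assert(rebuilt == avroSchema)
{code}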



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination

2018-07-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23731:
---

Assignee: Hyukjin Kwon

> FileSourceScanExec throws NullPointerException in subexpression elimination
> ---
>
> Key: SPARK-23731
> URL: https://issues.apache.org/jira/browse/SPARK-23731
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0, 2.3.1
>Reporter: Jacek Laskowski
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> While working with a SQL with many {{CASE WHEN}} and {{ScalarSubqueries}} I 
> faced the following exception (in Spark 2.3.0):
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:167)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36)
>   at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358)
>   at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40)
>   at 
> scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136)
>   at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132)
>   at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.get(HashMap.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54)
>   at 
> 

[jira] [Updated] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24864:
--
Fix Version/s: (was: 2.4.0)

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
>
> A Spark job reading from a Hive view fails with an analysis exception when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src1
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro-case against 
> spark 1.6 and that worked fine. Any inputs are appreciated.
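An editorial workaround sketch, assuming an active SparkSession with Hive support as in the repro above; the view name vsrc1new_aliased is made up. Aliasing the expression inside the view definition avoids the auto-generated `_c1` column name entirely:

{code:scala}
// Hypothetical rewrite of the view from step 2: alias upper(name) directly,
// so no auto-generated column name (_c1) is ever produced or referenced.
spark.sql("CREATE VIEW vsrc1new_aliased AS SELECT id, upper(name) AS uname FROM src1")
spark.sql("SELECT * FROM vsrc1new_aliased").show()
{code}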



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination

2018-07-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23731.
-
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21815
[https://github.com/apache/spark/pull/21815]

> FileSourceScanExec throws NullPointerException in subexpression elimination
> ---
>
> Key: SPARK-23731
> URL: https://issues.apache.org/jira/browse/SPARK-23731
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0, 2.3.1
>Reporter: Jacek Laskowski
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> While working with a SQL query with many {{CASE WHEN}} and {{ScalarSubqueries}} I 
> faced the following exception (in Spark 2.3.0):
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.<init>(DataSourceScanExec.scala:167)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36)
>   at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358)
>   at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40)
>   at 
> scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136)
>   at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132)
>   at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.get(HashMap.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54)
>   at 
> 
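For context, a minimal editorial sketch of the query shape described above ({{CASE WHEN}} branches whose conditions contain scalar subqueries, which is what leads to the {{ScalarSubquery.semanticEquals}} calls in the trace); the table and column names are made up:

{code:scala}
// Illustrative only, assuming an active SparkSession and file-based tables big/small.
spark.sql("""
  SELECT CASE WHEN t.v > (SELECT max(v) FROM small) THEN 'high'
              WHEN t.v > (SELECT avg(v) FROM small) THEN 'mid'
              ELSE 'low' END AS bucket
  FROM big t
""")
{code}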

[jira] [Assigned] (SPARK-24551) Add Integration tests for Secrets

2018-07-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24551:
-

Assignee: Stavros Kontopoulos

> Add Integration tests for Secrets
> -
>
> Key: SPARK-24551
> URL: https://issues.apache.org/jira/browse/SPARK-24551
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> Current 
> [suite|https://github.com/apache/spark/blob/7703b46d2843db99e28110c4c7ccf60934412504/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala]
>  needs to be expanded to cover secrets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24551) Add Integration tests for Secrets

2018-07-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24551.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21652
[https://github.com/apache/spark/pull/21652]

> Add Integration tests for Secrets
> -
>
> Key: SPARK-24551
> URL: https://issues.apache.org/jira/browse/SPARK-24551
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
> Fix For: 2.4.0
>
>
> Current 
> [suite|https://github.com/apache/spark/blob/7703b46d2843db99e28110c4c7ccf60934412504/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala]
>  needs to be expanded to cover secrets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-20 Thread Antoine Galataud (JIRA)
Antoine Galataud created SPARK-24875:


 Summary: MulticlassMetrics should offer a more efficient way to 
compute count by label
 Key: SPARK-24875
 URL: https://issues.apache.org/jira/browse/SPARK-24875
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.3.1
Reporter: Antoine Galataud


Currently _MulticlassMetrics_ calls _countByValue_() to get count by class/label
{code:java}
private lazy val labelCountByClass: Map[Double, Long] = 
predictionAndLabels.values.countByValue()
{code}
If the input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
test dataset), this will lead to poor execution performance.

One option could be to allow using _countByValueApprox_ (which could require 
adding an extra configuration parameter to MulticlassMetrics).

Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, I 
don't know how this could be ported there.
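A minimal editorial sketch of the approximate alternative, using only the existing RDD API; the timeout and confidence defaults are placeholders, not a proposed interface:

{code:scala}
import org.apache.spark.rdd.RDD

// Estimate label counts with countByValueApprox instead of the exact countByValue
// used today. getFinalValue() blocks for at most `timeout` ms and returns bounded
// estimates; the mean of each bound is taken as the estimated count.
def approxLabelCounts(predictionAndLabels: RDD[(Double, Double)],
                      timeout: Long = 10000L,
                      confidence: Double = 0.95): Map[Double, Long] = {
  predictionAndLabels.values
    .countByValueApprox(timeout, confidence)
    .getFinalValue()
    .map { case (label, bounded) => label -> bounded.mean.toLong }
    .toMap
}
{code}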



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-20 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550738#comment-16550738
 ] 

Takeshi Yamamuro commented on SPARK-24869:
--

In the example above, cache() is not called explicitly, so do you mean we cache 
the data implicitly when saving it?

> SaveIntoDataSourceCommand's input Dataset does not use Cached Data
> --
>
> Key: SPARK-24869
> URL: https://issues.apache.org/jira/browse/SPARK-24869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> withTable("t") {
>   withTempPath { path =>
> var numTotalCachedHit = 0
> val listener = new QueryExecutionListener {
>   override def onFailure(f: String, qe: QueryExecution, e: 
> Exception):Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, 
> duration: Long): Unit = {
> qe.withCachedData match {
>   case c: SaveIntoDataSourceCommand
>   if c.query.isInstanceOf[InMemoryRelation] =>
> numTotalCachedHit += 1
>   case _ =>
> println(qe.withCachedData)
> }
>   }
> }
> spark.listenerManager.register(listener)
> val udf1 = udf({ (x: Int, y: Int) => x + y })
> val df = spark.range(0, 3).toDF("a")
>   .withColumn("b", udf1(col("a"), lit(10)))
> df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", 
> properties)
> assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
>   }
> }
> {code}
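An editorial aside, reusing {{df}}, {{url1}} and {{properties}} from the snippet above: an explicit cache is the usual way to exercise the cached-data path the listener checks for, and the report is that even then the command's input is not read from the cache.

{code:scala}
// Illustration only: cache df explicitly and materialize it before the write; the
// listener above would then be expected to observe an InMemoryRelation under
// SaveIntoDataSourceCommand.
df.cache()
df.count()
df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", properties)
{code}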



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24874) Allow hybrid of both barrier tasks and regular tasks in a stage

2018-07-20 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24874:


 Summary: Allow hybrid of both barrier tasks and regular tasks in a 
stage
 Key: SPARK-24874
 URL: https://issues.apache.org/jira/browse/SPARK-24874
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jiang Xingbo


Currently we only allow barrier tasks in a barrier stage; however, consider the 
following query:
{code}
sc = new SparkContext(conf)
val rdd1 = sc.parallelize(1 to 100, 10)
val rdd2 = sc.parallelize(1 to 1000, 20).barrier().mapPartitions((it, ctx) => 
it)
val rdd = rdd1.union(rdd2).mapPartitions(t => t)
{code}

Now it requires 30 free slots to run `rdd.collect()`. Actually, we could launch 
regular tasks to collect data from rdd1's partitions; they are not required to 
be launched together with the barrier tasks. If we could do that, we would only 
need 20 free slots to run `rdd.collect()`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.

2018-07-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24871:
---

Assignee: Takuya Ueshin

> Refactor Concat and MapConcat to avoid creating concatenator object for each 
> row.
> -
>
> Key: SPARK-24871
> URL: https://issues.apache.org/jira/browse/SPARK-24871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Refactor {{Concat}} and {{MapConcat}} to:
>  - avoid creating concatenator object for each row.
>  - make {{Concat}} handle {{containsNull}} properly.
>  - make {{Concat}} shortcut if {{null}} child is found.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24871) Refactor Concat and MapConcat to avoid creating concatenator object for each row.

2018-07-20 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24871.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21824
[https://github.com/apache/spark/pull/21824]

> Refactor Concat and MapConcat to avoid creating concatenator object for each 
> row.
> -
>
> Key: SPARK-24871
> URL: https://issues.apache.org/jira/browse/SPARK-24871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Refactor {{Concat}} and {{MapConcat}} to:
>  - avoid creating concatenator object for each row.
>  - make {{Concat}} handle {{containsNull}} properly.
>  - make {{Concat}} shortcut if {{null}} child is found.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-20 Thread JieFang.He (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JieFang.He updated SPARK-24873:
---
Description: 
There are too many frequent interaction reports when I use the spark-shell 
command, which affect my input, so I think we need to add a switch to shield 
the frequent interaction reports from YARN.

 

!pic.jpg!

  was:There are too many frequent interaction reports when I use the spark-shell 
command, which affect my input, so I think we need to add a switch to shield the 
frequent interaction reports from YARN.


> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Priority: Major
> Attachments: pic.jpg
>
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which affect my input, so I think we need to add a switch to shield 
> the frequent interaction reports from YARN.
>  
> !pic.jpg!
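An editorial aside, not the switch proposed in this issue: if the noisy reports are logged at INFO (as shell progress reporting normally is), raising the log level in the running shell already quiets them.

{code:scala}
// Existing workaround sketch: suppress INFO-level output for the current session.
sc.setLogLevel("WARN")
{code}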



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-20 Thread JieFang.He (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JieFang.He updated SPARK-24873:
---
Attachment: pic.jpg

> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Priority: Major
> Attachments: pic.jpg
>
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which affect my input, so I think we need to add a switch to shield 
> the frequent interaction reports from YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24868) add sequence function in Python

2018-07-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24868:


Assignee: Huaxin Gao

> add sequence function in Python
> ---
>
> Key: SPARK-24868
> URL: https://issues.apache.org/jira/browse/SPARK-24868
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> It seems the sequence function is only in functions.scala, not in functions.py. 
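For reference, a small editorial sketch of the existing Scala API that the Python side lacks; the column alias is made up and an active SparkSession is assumed:

{code:scala}
import org.apache.spark.sql.functions.{lit, sequence}

// sequence(start, stop[, step]) builds an array column; here [1, 2, 3, 4, 5].
spark.range(1).select(sequence(lit(1), lit(5)).as("seq")).show(false)
{code}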



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24868) add sequence function in Python

2018-07-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24868.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21820
[https://github.com/apache/spark/pull/21820]

> add sequence function in Python
> ---
>
> Key: SPARK-24868
> URL: https://issues.apache.org/jira/browse/SPARK-24868
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> It seems the sequence function is only in functions.scala, not in functions.py. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24873:


Assignee: (was: Apache Spark)

> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Priority: Major
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which affect my input, so I think we need to add a switch to shield 
> the frequent interaction reports from YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550583#comment-16550583
 ] 

Apache Spark commented on SPARK-24873:
--

User 'hejiefang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21827

> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Priority: Major
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which affect my input, so I think we need to add a switch to shield 
> the frequent interaction reports from YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24873:


Assignee: Apache Spark

> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Assignee: Apache Spark
>Priority: Major
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which affect my input, so I think we need to add a switch to shield 
> the frequent interaction reports from YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24872) Remove the symbol “||” of the “OR” operation

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24872:


Assignee: (was: Apache Spark)

> Remove the symbol “||” of the “OR” operation
> 
>
> Key: SPARK-24872
> URL: https://issues.apache.org/jira/browse/SPARK-24872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Priority: Minor
>
> “||” performs STRING concatenation, but it is also the symbol of the "OR" 
> operation.
> When I want to use "||" as the "OR" operation, I find that it performs STRING 
> concatenation:
>   spark-sql> explain extended select * from aa where id==1 || id==2;
>    == Parsed Logical Plan ==
>     'Project [*]
>      +- 'Filter (('id = concat(1, 'id)) = 2)
>       +- 'UnresolvedRelation `aa`
>    spark-sql> select "abc" || "DFF" ;
>    And the result is "abcDFF".
> In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?
>  
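An editorial sketch of the unambiguous spelling, illustrative only: writing the disjunction with the {{OR}} keyword avoids the clash, since {{||}} is parsed as string concatenation.

{code:scala}
// Same query as above, spelled with OR so the parser cannot read || as concat.
spark.sql("explain extended select * from aa where id = 1 or id = 2").show(false)
{code}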



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24872) Remove the symbol “||” of the “OR” operation

2018-07-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24872:


Assignee: Apache Spark

> Remove the symbol “||” of the “OR” operation
> 
>
> Key: SPARK-24872
> URL: https://issues.apache.org/jira/browse/SPARK-24872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Assignee: Apache Spark
>Priority: Minor
>
> “||” performs STRING concatenation, but it is also the symbol of the "OR" 
> operation.
> When I want to use "||" as the "OR" operation, I find that it performs STRING 
> concatenation:
>   spark-sql> explain extended select * from aa where id==1 || id==2;
>    == Parsed Logical Plan ==
>     'Project [*]
>      +- 'Filter (('id = concat(1, 'id)) = 2)
>       +- 'UnresolvedRelation `aa`
>    spark-sql> select "abc" || "DFF" ;
>    And the result is "abcDFF".
> In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24872) Remove the symbol “||” of the “OR” operation

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550567#comment-16550567
 ] 

Apache Spark commented on SPARK-24872:
--

User 'httfighter' has created a pull request for this issue:
https://github.com/apache/spark/pull/21826

> Remove the symbol “||” of the “OR” operation
> 
>
> Key: SPARK-24872
> URL: https://issues.apache.org/jira/browse/SPARK-24872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Priority: Minor
>
> “||” performs STRING concatenation, but it is also the symbol of the "OR" 
> operation.
> When I want to use "||" as the "OR" operation, I find that it performs STRING 
> concatenation:
>   spark-sql> explain extended select * from aa where id==1 || id==2;
>    == Parsed Logical Plan ==
>     'Project [*]
>      +- 'Filter (('id = concat(1, 'id)) = 2)
>       +- 'UnresolvedRelation `aa`
>    spark-sql> select "abc" || "DFF" ;
>    And the result is "abcDFF".
> In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24872) Remove the symbol “||” of the “OR” operation

2018-07-20 Thread hantiantian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hantiantian updated SPARK-24872:

Description: 
“||” performs STRING concatenation, but it is also the symbol of the "OR" 
operation.

When I want to use "||" as the "OR" operation, I find that it performs STRING 
concatenation:

  spark-sql> explain extended select * from aa where id==1 || id==2;

   == Parsed Logical Plan ==
    'Project [*]
     +- 'Filter (('id = concat(1, 'id)) = 2)
      +- 'UnresolvedRelation `aa`

   spark-sql> select "abc" || "DFF" ;

   And the result is "abcDFF".

In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?

 

  was:
“||” performs STRING concatenation, but it is also the symbol of the "OR" 
operation.

   spark-sql> select "abc" || "DFF" ;

   And the result is "abcDFF".

In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?

 


> Remove the symbol “||” of the “OR” operation
> 
>
> Key: SPARK-24872
> URL: https://issues.apache.org/jira/browse/SPARK-24872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Priority: Minor
>
> “||” performs STRING concatenation, but it is also the symbol of the "OR" 
> operation.
> When I want to use "||" as the "OR" operation, I find that it performs STRING 
> concatenation:
>   spark-sql> explain extended select * from aa where id==1 || id==2;
>    == Parsed Logical Plan ==
>     'Project [*]
>      +- 'Filter (('id = concat(1, 'id)) = 2)
>       +- 'UnresolvedRelation `aa`
>    spark-sql> select "abc" || "DFF" ;
>    And the result is "abcDFF".
> In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-20 Thread JieFang.He (JIRA)
JieFang.He created SPARK-24873:
--

 Summary: increase switch to shielding frequent interaction reports 
with yarn
 Key: SPARK-24873
 URL: https://issues.apache.org/jira/browse/SPARK-24873
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, YARN
Affects Versions: 2.4.0
Reporter: JieFang.He


There are too many frequent interaction reports when I use the spark-shell 
command, which affect my input, so I think we need to add a switch to shield 
the frequent interaction reports from YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-07-20 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550551#comment-16550551
 ] 

Szilard Nemeth commented on SPARK-20327:


Pull request is updated with the latest fixes

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24872) Remove the symbol “||” of the “OR” operation

2018-07-20 Thread hantiantian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hantiantian updated SPARK-24872:

Description: 
“||” performs STRING concatenation, but it is also the symbol of the "OR" 
operation.

   spark-sql> select "abc" || "DFF" ;

   And the result is "abcDFF".

In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?

 

> Remove the symbol “||” of the “OR” operation
> 
>
> Key: SPARK-24872
> URL: https://issues.apache.org/jira/browse/SPARK-24872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: hantiantian
>Priority: Minor
>
> “||” performs STRING concatenation, but it is also the symbol of the "OR" 
> operation.
>    spark-sql> select "abc" || "DFF" ;
>    And the result is "abcDFF".
> In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24872) Remove the symbol “||” of the “OR” operation

2018-07-20 Thread hantiantian (JIRA)
hantiantian created SPARK-24872:
---

 Summary: Remove the symbol “||” of the “OR” operation
 Key: SPARK-24872
 URL: https://issues.apache.org/jira/browse/SPARK-24872
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: hantiantian






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24859) Predicates pushdown on outer joins

2018-07-20 Thread Johannes Mayer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Mayer updated SPARK-24859:
---
Description: 
I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
common column called part_col. Now I want to join both tables on their id but 
only for some of partitions.

If I use an inner join, everything works well:

 
{code:java}
select *
from FA f
join DI d
on(f.id = d.id and f.part_col = d.part_col)
where f.part_col = 'xyz'
{code}
 

In the sql explain plan I can see, that the predicate part_col = 'xyz' is also 
used in the DIm HiveTableScan.

 

When I execute the same query using a left join the full dim table is scanned. 
There are some workarounds for this issue, but I wanted to report this as a 
bug, since it works on an inner join, and I think the behaviour should be the 
same for an outer join.

Here is a self contained example (created in Zeppelin):

 
{code:java}
val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col")
val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col")
fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact")
dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim")
 
spark.sqlContext.sql("create table if not exists fact(id int) partitioned by 
(part_col int) stored as avro location '/tmp/jira/fact'")
spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create 
table if not exists dim(id int) partitioned by (part_col int) stored as avro 
location '/tmp/jira/dim'")
spark.sqlContext.sql("msck repair table dim"){code}
 
  
  
 *Inner join example:*
{code:java}
select * from fact f
join dim d
on (f.id = d.id
and f.part_col = d.part_col)
where f.part_col = 100{code}
Excerpt from Spark-SQL physical explain plan: 
{code:java}
HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], 
[isnotnull(part_col#412), (part_col#412 = 100)] 
HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], 
[isnotnull(part_col#414), (part_col#414 = 100)]{code}
 
 *Outer join example:*
{code:java}
select * from fact f
left join dim d
on (f.id = d.id
and f.part_col = d.part_col)
where f.part_col = 100{code}
 
 Excerpt from Spark-SQL physical explain plan:
  
{code:java}
HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], 
[isnotnull(part_col#427), (part_col#427 = 100)]   
HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code}
 
  

As you can see the predicate is not pushed down to the HiveTableScan of the dim 
table on the outer join.

 

  was:
I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
common column called part_col. Now I want to join both tables on their id but 
only for some of partitions.

If I use an inner join, everything works well:

 
{code:java}
select *
from FA f
join DI d
on(f.id = d.id and f.part_col = d.part_col)
where f.part_col = 'xyz'
{code}
 

In the sql explain plan I can see, that the predicate part_col = 'xyz' is also 
used in the DIm HiveTableScan.

 

When I execute the same query using a left join the full dim table is scanned. 
There are some workarounds for this issue, but I wanted to report this as a 
bug, since it works on an inner join, and I think the behaviour should be the 
same for an outer join.

Here is a self contained example (created in Zeppelin):

 
{code:java}
val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col")
val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col")
fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact")
dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim")
 
spark.sqlContext.sql("create table if not exists fact(id int) partitioned by 
(part_col int) stored as avro location '/tmp/jira/fact'")
spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create 
table if not exists dim(id int) partitioned by (part_col int) stored as avro 
location '/tmp/jira/dim'")
spark.sqlContext.sql("msck repair table dim"){code}
 
 
 
*Inner join example:*
{code:java}
select * from fact f
join dim d
on (f.id = d.id
and f.part_col = d.part_col)
where f.part_col = 100{code}
Excerpt from Spark-SQL physical explain plan: 
{code:java}
HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, 

[jira] [Commented] (SPARK-24859) Predicates pushdown on outer joins

2018-07-20 Thread Johannes Mayer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550541#comment-16550541
 ] 

Johannes Mayer commented on SPARK-24859:


OK, I added the example.

> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id but 
> only for some of partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the sql explain plan I can see, that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join the full dim table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works on an inner join, and I think the behaviour 
> should be the same for an outer join.
> Here is a self contained example (created in Zeppelin):
>  
> {code:java}
> val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col")
> val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col")
> fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact")
> dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim")
>  
> spark.sqlContext.sql("create table if not exists fact(id int) partitioned by 
> (part_col int) stored as avro location '/tmp/jira/fact'")
> spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create 
> table if not exists dim(id int) partitioned by (part_col int) stored as avro 
> location '/tmp/jira/dim'")
> spark.sqlContext.sql("msck repair table dim"){code}
>  
>  
>  
> *Inner join example:*
> {code:java}
> select * from fact f
> join dim d
> on (f.id = d.id
> and f.part_col = d.part_col)
> where f.part_col = 100{code}
> Excerpt from Spark-SQL physical explain plan: 
> {code:java}
> HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], 
> [isnotnull(part_col#412), (part_col#412 = 100)] 
> HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], 
> [isnotnull(part_col#414), (part_col#414 = 100)]{code}
>  
> *Outer join example:*
> {code:java}
> select * from fact f
> left join dim d
> on (f.id = d.id
> and f.part_col = d.part_col)
> where f.part_col = 100{code}
>  
> Excerpt from Spark-SQL physical explain plan:
>  
> {code:java}
> HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], 
> [isnotnull(part_col#427), (part_col#427 = 100)]   
> HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code}
>  
>  
> As you can see the predicate is not pushed down to the HiveTableScan of the 
> dim table on the outer join.
>  
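A possible workaround of the kind alluded to above, sketched editorially with the table and column names from the example (it sidesteps, rather than fixes, the missing pushdown): repeat the partition predicate on the dim side so pruning no longer depends on pushdown through the outer join.

{code:scala}
spark.sql("""
  select * from fact f
  left join (select * from dim where part_col = 100) d
    on f.id = d.id and f.part_col = d.part_col
  where f.part_col = 100
""").explain(true)
{code}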



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24859) Predicates pushdown on outer joins

2018-07-20 Thread Johannes Mayer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Mayer updated SPARK-24859:
---
Description: 
I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
common column called part_col. Now I want to join both tables on their id but 
only for some of partitions.

If I use an inner join, everything works well:

 
{code:java}
select *
from FA f
join DI d
on(f.id = d.id and f.part_col = d.part_col)
where f.part_col = 'xyz'
{code}
 

In the sql explain plan I can see, that the predicate part_col = 'xyz' is also 
used in the DIm HiveTableScan.

 

When I execute the same query using a left join the full dim table is scanned. 
There are some workarounds for this issue, but I wanted to report this as a 
bug, since it works on an inner join, and I think the behaviour should be the 
same for an outer join.

Here is a self contained example (created in Zeppelin):

 
{code:java}
val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col")
val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col")
fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact")
dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim")
 
spark.sqlContext.sql("create table if not exists fact(id int) partitioned by 
(part_col int) stored as avro location '/tmp/jira/fact'")
spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create 
table if not exists dim(id int) partitioned by (part_col int) stored as avro 
location '/tmp/jira/dim'")
spark.sqlContext.sql("msck repair table dim"){code}
 
 
 
*Inner join example:*
{code:java}
select * from fact f
join dim d
on (f.id = d.id
and f.part_col = d.part_col)
where f.part_col = 100{code}
Excerpt from Spark-SQL physical explain plan: 
{code:java}
HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], 
[isnotnull(part_col#412), (part_col#412 = 100)] 
HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], 
[isnotnull(part_col#414), (part_col#414 = 100)]{code}
 
*Outer join example:*
{code:java}
select * from fact f
left join dim d
on (f.id = d.id
and f.part_col = d.part_col)
where f.part_col = 100{code}
 
Excerpt from Spark-SQL physical explain plan:
 
{code:java}
HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], 
[isnotnull(part_col#427), (part_col#427 = 100)]   
HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code}
 
 

As you can see the predicate is not pushed down to the HiveTableScan of the dim 
table on the outer join.

 

  was:
I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
common column called part_col. Now I want to join both tables on their id but 
only for some of partitions.

If I use an inner join, everything works well:

 
{code:java}
select *
from FA f
join DI d
on(f.id = d.id and f.part_col = d.part_col)
where f.part_col = 'xyz'
{code}
 

In the sql explain plan I can see, that the predicate part_col = 'xyz' is also 
used in the DIm HiveTableScan.

 

When I execute the same query using a left join the full dim table is scanned. 
There are some workarounds for this issue, but I wanted to report this as a 
bug, since it works on an inner join, and I think the behaviour should be the 
same for an outer join

 

 


> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id but 
> only for some of partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the sql explain plan I can see, that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join the full dim table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works on an inner join, and I think the behaviour 
> should be the same for an 

[jira] [Updated] (SPARK-23731) FileSourceScanExec throws NullPointerException in subexpression elimination

2018-07-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23731:
-
Priority: Major  (was: Minor)

> FileSourceScanExec throws NullPointerException in subexpression elimination
> ---
>
> Key: SPARK-23731
> URL: https://issues.apache.org/jira/browse/SPARK-23731
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0, 2.3.1
>Reporter: Jacek Laskowski
>Priority: Major
>
> While working with a SQL query with many {{CASE WHEN}} and {{ScalarSubqueries}} I 
> faced the following exception (in Spark 2.3.0):
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.<init>(DataSourceScanExec.scala:167)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:502)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:158)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:257)
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.semanticEquals(subquery.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions$Expr.equals(EquivalentExpressions.scala:36)
>   at scala.collection.mutable.HashTable$class.elemEquals(HashTable.scala:358)
>   at scala.collection.mutable.HashMap.elemEquals(HashMap.scala:40)
>   at 
> scala.collection.mutable.HashTable$class.scala$collection$mutable$HashTable$$findEntry0(HashTable.scala:136)
>   at scala.collection.mutable.HashTable$class.findEntry(HashTable.scala:132)
>   at scala.collection.mutable.HashMap.findEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.get(HashMap.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExpr(EquivalentExpressions.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:95)
>   at 
> 

[jira] [Commented] (SPARK-24859) Predicates pushdown on outer joins

2018-07-20 Thread Johannes Mayer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550359#comment-16550359
 ] 

Johannes Mayer commented on SPARK-24859:


I will provide an example. Could you test it on the master branch?

> Predicates pushdown on outer joins
> --
>
> Key: SPARK-24859
> URL: https://issues.apache.org/jira/browse/SPARK-24859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Cloudera CDH 5.13.1
>Reporter: Johannes Mayer
>Priority: Major
>
> I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a 
> common column called part_col. Now I want to join both tables on their id but 
> only for some of partitions.
> If I use an inner join, everything works well:
>  
> {code:java}
> select *
> from FA f
> join DI d
> on(f.id = d.id and f.part_col = d.part_col)
> where f.part_col = 'xyz'
> {code}
>  
> In the sql explain plan I can see, that the predicate part_col = 'xyz' is 
> also used in the DIm HiveTableScan.
>  
> When I execute the same query using a left join the full dim table is 
> scanned. There are some workarounds for this issue, but I wanted to report 
> this as a bug, since it works on an inner join, and I think the behaviour 
> should be the same for an outer join
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18188) Add checksum for block of broadcast

2018-07-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550354#comment-16550354
 ] 

Apache Spark commented on SPARK-18188:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/21825

> Add checksum for block of broadcast
> ---
>
> Key: SPARK-18188
> URL: https://issues.apache.org/jira/browse/SPARK-18188
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Major
> Fix For: 2.1.0
>
>
> There has been a long-standing understanding issue: 
> https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for 
> the blocks, it is very hard for us to identify where the bug came from.
> Shuffle blocks are compressed separately (and carry a checksum), but broadcast 
> blocks are compressed together, so we should add a checksum for each block separately.
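A minimal editorial sketch of per-block checksumming of the kind described, not necessarily what the linked pull requests implement:

{code:scala}
import java.nio.ByteBuffer
import java.util.zip.Adler32

// Compute a cheap checksum for one serialized broadcast block so corruption can be
// pinpointed per block on the receiving side; duplicate() leaves the caller's
// buffer position untouched.
def blockChecksum(block: ByteBuffer): Int = {
  val adler = new Adler32
  adler.update(block.duplicate())
  adler.getValue.toInt
}
{code}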



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24424) Support ANSI-SQL compliant syntax for GROUPING SET

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24424.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.4.0

> Support ANSI-SQL compliant syntax for  GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this does not match ANSI SQL compliance. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It is nice to support ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note, we only need to support one-level grouping set in this stage. That 
> means, nested grouping set is not supported.
> Note, we should not break the existing syntax. The parser changes should be 
> like
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  
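An editorial example pair over a made-up table {{t(col1, col2, v)}}, showing the Hive-style spelling that already works next to the ANSI-style spelling this issue adds; both are expected to produce the same grouping:

{code:scala}
// Hive-style (already supported)
spark.sql("select col1, col2, sum(v) from t group by col1, col2 with rollup")
// ANSI-style (added by this issue)
spark.sql("select col1, col2, sum(v) from t group by rollup(col1, col2)")
{code}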



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24424) Support ANSI-SQL compliant syntax for GROUPING SET

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24424:

Summary: Support ANSI-SQL compliant syntax for  GROUPING SET  (was: Support 
ANSI-SQL compliant syntax for ROLLUP, CUBE and GROUPING SET)

> Support ANSI-SQL compliant syntax for  GROUPING SET
> ---
>
> Key: SPARK-24424
> URL: https://issues.apache.org/jira/browse/SPARK-24424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, our Group By clause follows Hive 
> [https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup]
>  :
>  However, this syntax is not ANSI SQL compliant. The proposal is to update 
> our parser and analyzer for ANSI compliance. 
>  For example,
> {code:java}
> GROUP BY col1, col2 WITH ROLLUP
> GROUP BY col1, col2 WITH CUBE
> GROUP BY col1, col2 GROUPING SET ...
> {code}
> It would be nice to support the ANSI SQL syntax at the same time.
> {code:java}
> GROUP BY ROLLUP(col1, col2)
> GROUP BY CUBE(col1, col2)
> GROUP BY GROUPING SET(...) 
> {code}
> Note that we only need to support one level of grouping sets at this stage; 
> nested grouping sets are not supported.
> Also note that we must not break the existing syntax. The parser changes would 
> look like:
> {code:sql}
> group-by-expressions
> >>-GROUP BY+-hive-sql-group-by-expressions-+---><
>'-ansi-sql-grouping-set-expressions-'
> hive-sql-group-by-expressions
> '--GROUPING SETS--(--grouping-set-expressions--)--'
>.-,--.   +--WITH CUBE--+
>V|   +--WITH ROLLUP+
> >>---+-expression-+-+---+-+-><
> grouping-expressions-list
>.-,--.  
>V|  
> >>---+-expression-+-+--><
> grouping-set-expressions
> .-,.
> |  .-,--.  |
> |  V|  |
> V '-(--expression---+-)-'  |
> >>+-expression--+--+-><
> ansi-sql-grouping-set-expressions
> >>-+-ROLLUP--(--grouping-expression-list--)-+--><
>+-CUBE--(--grouping-expression-list--)---+   
>'-GROUPING SETS--(--grouping-set-expressions--)--'  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24268) DataType in error messages are not coherent

2018-07-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24268.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> DataType in error messages are not coherent
> ---
>
> Key: SPARK-24268
> URL: https://issues.apache.org/jira/browse/SPARK-24268
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> In SPARK-22893 there was an attempt to unify the way data types are reported 
> in error messages. There, we decided to always use {{dataType.simpleString}}. 
> Unfortunately, we missed many places where this still needed to be fixed. 
> Moreover, it turns out that the right method to use is not {{simpleString}}; 
> we should use {{catalogString}} instead (for further details please check 
> the discussion in the PR https://github.com/apache/spark/pull/21321).
> So we should update all the remaining places in order to report error messages 
> coherently throughout the project.
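
The practical difference (a small sketch, not taken from the PR itself): for wide struct types {{simpleString}} may truncate the field list, while {{catalogString}} keeps the full type description, which is why the latter is preferred in error messages.

{code:scala}
import org.apache.spark.sql.types._

// A wide struct type with 30 integer fields.
val wide = StructType((1 to 30).map(i => StructField(s"c$i", IntegerType)))

// May be truncated for very wide schemas, e.g. ending in "... N more fields".
println(wide.simpleString)

// The full representation: struct<c1:int,c2:int,...,c30:int>.
println(wide.catalogString)
{code}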



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-20 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550289#comment-16550289
 ] 

Dilip Biswal edited comment on SPARK-24864 at 7/20/18 6:23 AM:
---

[~abhimadav] I don't see a problem here. The generated column name is different 
between Spark and Hive. Perhaps in Spark 1.6 the generated column names were the 
same between Spark and Hive, i.e. they started with `_c[number]`. In this repro, 
Spark by default generates the column name as "upper(name)".

{code}
scala> spark.sql("SELECT id, upper(name) FROM src1").printSchema
root
 |-- id: integer (nullable = true)
 |-- upper(name): string (nullable = true)
 {code}
 

So the following would work in Spark:
 
{code:java}
scala> spark.sql("CREATE VIEW vsrc1new AS SELECT id, `upper(name)` AS uname 
FROM (SELECT id, upper(name) FROM src1) vsrc1new");
 res13: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from vsrc1new").show()
+--+-----+
|id|uname|
+--+-----+
| 1| TEST|
+--+-----+
{code}

In my opinion, it's good practice to give explicit aliases instead of relying 
on system-generated ones, especially if we are looking for portability across 
different database systems:
 
spark.sql("CREATE VIEW vsrc1new AS SELECT id, upper_name AS uname FROM (SELECT 
id, upper(name) as upper_name FROM src1) ");
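
Or, as a fenced sketch of the same statement (the subquery alias src1_sub below is hypothetical, added only for clarity):

{code:scala}
scala> spark.sql(
  "CREATE VIEW vsrc1new AS SELECT id, upper_name AS uname " +
    "FROM (SELECT id, upper(name) AS upper_name FROM src1) src1_sub")
{code}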

cc [~smilegator] We changed the generated column names on purpose to make them 
more readable, right ?


was (Author: dkbiswal):
[~abhimadav] I don't see a problem here. The generated column name is different 
between Spark and Hive. Perhaps in Spark 1.6 the generated column names were the 
same between Spark and Hive, i.e. they started with `_c[number]`. In this repro, 
Spark by default generates the column name as "upper(name)".

{code}
scala> spark.sql("SELECT id, upper(name) FROM src1").printSchema
root
 |-- id: integer (nullable = true)
 |-- upper(name): string (nullable = true)
 {code}
 

So the following would work in Spark:
 
{code:java}
scala> spark.sql("CREATE VIEW vsrc1new AS SELECT id, `upper(name)` AS uname 
FROM (SELECT id, upper(name) FROM src1) vsrc1new");
 res13: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from vsrc1new").show()
+--+-----+
|id|uname|
+--+-----+
| 1| TEST|
+--+-----+
{code}

cc [~smilegator] We changed the generated column names on purpose to make them 
more readable, right ?

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
> Fix For: 2.4.0
>
>
> A Spark job reading from a Hive view fails with an AnalysisException when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src1
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro case against 
> Spark 1.6 and that worked fine. Any inputs are appreciated.

[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-20 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550289#comment-16550289
 ] 

Dilip Biswal commented on SPARK-24864:
--

[~abhimadav] I don't see a problem here. The generated column name is different 
between Spark and Hive. Perhaps in Spark 1.6 the generated column names were the 
same between Spark and Hive, i.e. they started with `_c[number]`. In this repro, 
Spark by default generates the column name as "upper(name)".

{code}
scala> spark.sql("SELECT id, upper(name) FROM src1").printSchema
root
 |-- id: integer (nullable = true)
 |-- upper(name): string (nullable = true)
 {code}
 

So the following would work in Spark:
 
{code:java}
scala> spark.sql("CREATE VIEW vsrc1new AS SELECT id, `upper(name)` AS uname 
FROM (SELECT id, upper(name) FROM src1) vsrc1new");
 res13: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from vsrc1new").show()
+--+-----+
|id|uname|
+--+-----+
| 1| TEST|
+--+-----+
{code}

cc [~smilegator] We changed the generated column names on purpose to make them 
more readable, right ?

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
> Fix For: 2.4.0
>
>
> A Spark job reading from a Hive view fails with an AnalysisException when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src1
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some spark-sql configuration here? I tried the repro case against 
> Spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org