[jira] [Created] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception
Bago Amirbekian created SPARK-25268:
---------------------------------------

             Summary: runParallelPersonalizedPageRank throws serialization Exception
                 Key: SPARK-25268
                 URL: https://issues.apache.org/jira/browse/SPARK-25268
             Project: Spark
          Issue Type: Bug
          Components: GraphX
    Affects Versions: 2.4.0
            Reporter: Bago Amirbekian

A recent change to PageRank introduced a bug in the ParallelPersonalizedPageRank implementation. The change prevents serialization of a Map which needs to be broadcast to all workers. The issue is in this line:
[https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]

Because the GraphX unit tests run in local mode, the serialization issue is not caught.

{code:java}
[info] - Star example parallel personalized PageRank *** FAILED *** (2 seconds, 160 milliseconds)
[info]   java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$2
[info] Serialization stack:
[info]   - object not serializable (class: scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> SparseVector(3)((2,1.0
[info]   at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
[info]   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
[info]   at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
[info]   at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
[info]   at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
[info]   at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
[info]   at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
[info]   at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
[info]   at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
[info]   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
[info]   at org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
[info]   at org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
[info]   at org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
[info]   at org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
[info]   at org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
[info]   at org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
[info]   at org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:383)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info]   at
{code}
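The underlying Scala pitfall can be reproduced without Spark: `mapValues` returns a lazy view (the `scala.collection.immutable.MapLike$$anon$2` in the stack trace above) that does not implement `Serializable`, so Java serialization fails until the view is materialized with `.toMap`. A minimal sketch, with our own helper names (this is not the actual Spark patch):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object MapValuesSerialization {
  // Attempt Java serialization of a value; returns true on success.
  def serializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val base = Map(1 -> 1.0, 2 -> 2.0)

    // mapValues produces a lazy, non-serializable view over `base`.
    val view = base.mapValues(v => v * 2)

    // Materializing with .toMap yields a plain immutable Map, which is serializable.
    val materialized = base.mapValues(v => v * 2).toMap

    println(serializable(view))         // false
    println(serializable(materialized)) // true
  }
}
```

Broadcasting the materialized map instead of the `mapValues` view avoids the `NotSerializableException`; because local-mode broadcasts never actually serialize, only a cluster run (or an explicit serialization round-trip like the one above) surfaces the bug.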
[jira] [Resolved] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches
[ https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-25235.
-----------------------------
    Resolution: Fixed
 Fix Version/s: 2.4.0

Issue resolved by pull request 22246
[https://github.com/apache/spark/pull/22246]

> Merge the REPL code in Scala 2.11 and 2.12 branches
> ---------------------------------------------------
>
>                 Key: SPARK-25235
>                 URL: https://issues.apache.org/jira/browse/SPARK-25235
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Shell
>    Affects Versions: 2.3.1
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>            Priority: Major
>             Fix For: 2.4.0
>
> Using some reflection tricks to merge Scala 2.11 and 2.12 codebase

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25248) Audit barrier APIs for Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595891#comment-16595891 ]

Apache Spark commented on SPARK-25248:
--------------------------------------

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/22261

> Audit barrier APIs for Spark 2.4
> --------------------------------
>
>                 Key: SPARK-25248
>                 URL: https://issues.apache.org/jira/browse/SPARK-25248
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Major
>
> Make a pass over APIs added for barrier execution mode.
[jira] [Resolved] (SPARK-25112) Spark history cleaner does not work.
[ https://issues.apache.org/jira/browse/SPARK-25112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gu Chao resolved SPARK-25112.
-----------------------------
    Resolution: Invalid

Sorry, it's not a problem.

> Spark history cleaner does not work.
> ------------------------------------
>
>                 Key: SPARK-25112
>                 URL: https://issues.apache.org/jira/browse/SPARK-25112
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 2.2.1
>            Reporter: Gu Chao
>            Priority: Major
>              Labels: newbie
>
> spark-env.sh configuration:
> {code:java}
> export SPARK_DAEMON_MEMORY=1g
> export SPARK_HISTORY_OPTS="-Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
>   -Dspark.history.fs.logDirectory=hdfs://xxx:9000/spark-logs
>   -Dspark.history.fs.cleaner.enabled=true
>   -Dspark.history.fs.cleaner.interval=1h
>   -Dspark.history.fs.cleaner.maxAge=7d"{code}
> hdfs /spark-logs:
> {code:java}
> -rwxrwx--- 2 root supergroup 0 2018-08-08 12:25 /spark-logs/app-20180804141138-0546.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 16:02 /spark-logs/app-20180804150203-0547.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 16:02 /spark-logs/app-20180804150204-0548.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 17:02 /spark-logs/app-20180804160203-0549.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 17:02 /spark-logs/app-20180804160204-0550.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 18:02 /spark-logs/app-20180804170203-0551.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 18:02 /spark-logs/app-20180804170204-0552.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 19:02 /spark-logs/app-20180804180203-0553.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 19:02 /spark-logs/app-20180804180204-0554.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 20:02 /spark-logs/app-20180804190203-0555.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 20:02 /spark-logs/app-20180804190204-0556.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 21:02 /spark-logs/app-20180804200203-0557.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 21:02 /spark-logs/app-20180804200204-0558.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 22:02 /spark-logs/app-20180804210203-0559.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 22:02 /spark-logs/app-20180804210204-0560.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 23:02 /spark-logs/app-20180804220203-0561.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-04 23:02 /spark-logs/app-20180804220204-0562.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 00:02 /spark-logs/app-20180804230203-0563.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 00:02 /spark-logs/app-20180804230204-0564.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 01:02 /spark-logs/app-20180805000203-0565.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 01:02 /spark-logs/app-20180805000204-0566.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 02:02 /spark-logs/app-20180805010203-0567.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 02:02 /spark-logs/app-20180805010204-0568.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 03:02 /spark-logs/app-20180805020203-0569.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 03:02 /spark-logs/app-20180805020204-0570.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 04:02 /spark-logs/app-20180805030203-0571.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 04:02 /spark-logs/app-20180805030204-0572.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 05:02 /spark-logs/app-20180805040203-0573.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 05:02 /spark-logs/app-20180805040204-0574.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 06:02 /spark-logs/app-20180805050203-0575.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 06:02 /spark-logs/app-20180805050204-0576.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 07:02 /spark-logs/app-20180805060203-0577.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 07:02 /spark-logs/app-20180805060204-0578.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 08:02 /spark-logs/app-20180805070203-0579.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 08:02 /spark-logs/app-20180805070204-0580.inprogress
> -rwxrwx--- 2 root supergroup 0 2018-08-05 09:02
> {code}
[jira] [Comment Edited] (SPARK-19490) Hive partition columns are case-sensitive
[ https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595852#comment-16595852 ]

Harish edited comment on SPARK-19490 at 8/29/18 2:19 AM:
---------------------------------------------------------

I see the same issue in 2.3.1. After changing the column to lower case (in the select statement), the join works fine.

was (Author: harishk15):
I

> Hive partition columns are case-sensitive
> -----------------------------------------
>
>                 Key: SPARK-19490
>                 URL: https://issues.apache.org/jira/browse/SPARK-19490
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: cen yuhai
>            Priority: Major
>
> The real partition columns are lower case (year, month, day)
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
> 	at scala.sys.package$.error(package.scala:27)
> 	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
> 	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
> 	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
> 	at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
> 	at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
> 	at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
> 	at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
> 	at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
> 	at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> 	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
> 	at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
> 	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
> 	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> 	at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
> 	at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
> 	at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
> 	at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
> 	at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
> 	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> These SQL statements reproduce the bug:
> CREATE TABLE partition_test (key Int) PARTITIONED BY (date string)
> SELECT * FROM partition_test WHERE DATE = '20170101'
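The workaround Harish describes, referencing the partition column with the lower-case spelling the metastore actually stores, can be sketched as below. This is an illustrative snippet only, assuming a Hive-enabled session and the `partition_test` table from the repro above; it is not part of any Spark patch:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCaseWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    // On affected versions this fails: the predicate names the partition
    // column as DATE while the metastore stores it as lower-case `date`.
    // spark.sql("SELECT * FROM partition_test WHERE DATE = '20170101'")

    // Workaround: use the metastore's (lower-case) spelling in predicates
    // and joins, so partition pruning recognizes the column.
    spark.sql("SELECT * FROM partition_test WHERE date = '20170101'").show()

    spark.stop()
  }
}
```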
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595835#comment-16595835 ]

Bryan Cutler commented on SPARK-23874:
--------------------------------------

[~smilegator] I linked the most relevant pyarrow issues above, which deal mostly with binary type support. If I get the time, I'll go through the changelog here https://arrow.apache.org/release/0.10.0.html and look for other relevant ones.

> Upgrade apache/arrow to 0.10.0
> ------------------------------
>
>                 Key: SPARK-23874
>                 URL: https://issues.apache.org/jira/browse/SPARK-23874
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Xiao Li
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 2.4.0
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
> * Allow for adding BinaryType support ARROW-2141
> * Bug fix related to array serialization ARROW-1973
> * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
> * Python bytearrays are supported as input to pyarrow ARROW-2141
> * Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962
> * Cleanup pyarrow type equality checks ARROW-2423
> * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
> * Improved low level handling of messages for RecordBatch ARROW-2704
[jira] [Assigned] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-25253:
------------------------------------
    Assignee: Imran Rashid

> Refactor pyspark connection & authentication
> --------------------------------------------
>
>                 Key: SPARK-25253
>                 URL: https://issues.apache.org/jira/browse/SPARK-25253
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Minor
>             Fix For: 2.4.0
>
> We've got a few places in pyspark that connect to local sockets, with varying levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. Should be pretty easy to clean this up.
[jira] [Resolved] (SPARK-25253) Refactor pyspark connection & authentication
[ https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-25253.
----------------------------------
    Resolution: Fixed
 Fix Version/s: 2.4.0

Issue resolved by pull request 22247
[https://github.com/apache/spark/pull/22247]

> Refactor pyspark connection & authentication
> --------------------------------------------
>
>                 Key: SPARK-25253
>                 URL: https://issues.apache.org/jira/browse/SPARK-25253
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Minor
>             Fix For: 2.4.0
>
> We've got a few places in pyspark that connect to local sockets, with varying levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. Should be pretty easy to clean this up.
[jira] [Updated] (SPARK-25266) Fix memory leak in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-25266:
-----------------------------------
    Description: 
BarrierCoordinator uses Timer and TimerTask. `TimerTask#cancel()` is invoked in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. Once a TimerTask is scheduled, the reference to it is not released until `Timer#purge()` is invoked, even though `TimerTask#cancel()` is invoked.

  was:
BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. Once a TimerTask is scheduled, the reference to it is not released until `Timer#purge()` is invoked, even though `TimerTask#cancel()` is invoked.

> Fix memory leak in Barrier Execution Mode
> -----------------------------------------
>
>                 Key: SPARK-25266
>                 URL: https://issues.apache.org/jira/browse/SPARK-25266
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Kousuke Saruta
>            Assignee: Kousuke Saruta
>            Priority: Critical
>
> BarrierCoordinator uses Timer and TimerTask. `TimerTask#cancel()` is invoked in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. Once a TimerTask is scheduled, the reference to it is not released until `Timer#purge()` is invoked, even though `TimerTask#cancel()` is invoked.
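The leak mechanism described here is plain `java.util.Timer` behavior and can be demonstrated without Spark: `TimerTask#cancel()` only marks the task, and the timer's internal queue keeps referencing it until `Timer#purge()` (or the task's scheduled time) removes it. A minimal illustration with our own names, not Spark's code:

```scala
import java.util.{Timer, TimerTask}

object TimerPurgeSketch {
  // Schedule n far-future tasks, cancel them all, then purge.
  // Returns how many cancelled entries purge() removed from the queue,
  // i.e. how many references were still being held after cancel().
  def leakedThenPurged(n: Int): Int = {
    val timer = new Timer("barrier-sketch", /* isDaemon = */ true)
    val tasks = (1 to n).map { _ =>
      val t = new TimerTask { override def run(): Unit = () }
      timer.schedule(t, /* delay = */ 1000L * 3600) // one hour out
      t
    }
    tasks.foreach(_.cancel()) // marks tasks cancelled; queue still holds them
    val removed = timer.purge() // drops cancelled tasks, releasing the references
    timer.cancel()
    removed
  }

  def main(args: Array[String]): Unit =
    println(leakedThenPurged(100)) // 100: every cancelled task was still referenced
}
```

This is why invoking `Timer#purge()` after `TimerTask#cancel()` (as the issue proposes) releases the retained `TimerTask` references instead of leaving them queued until their original fire time.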
[jira] [Updated] (SPARK-25266) Fix memory leak in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-25266:
-----------------------------------
    Summary: Fix memory leak in Barrier Execution Mode  (was: Fix memory leak vulnerability in Barrier Execution Mode)

> Fix memory leak in Barrier Execution Mode
> -----------------------------------------
>
>                 Key: SPARK-25266
>                 URL: https://issues.apache.org/jira/browse/SPARK-25266
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Kousuke Saruta
>            Assignee: Kousuke Saruta
>            Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. Once a TimerTask is scheduled, the reference to it is not released until `Timer#purge()` is invoked, even though `TimerTask#cancel()` is invoked.
[jira] [Assigned] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType
[ https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-25260:
------------------------------------
    Assignee: Arun Mahadevan

> Fix namespace handling in SchemaConverters.toAvroType
> -----------------------------------------------------
>
>                 Key: SPARK-25260
>                 URL: https://issues.apache.org/jira/browse/SPARK-25260
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Arun Mahadevan
>            Assignee: Arun Mahadevan
>            Priority: Major
>             Fix For: 2.4.0
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends the record name to the namespace, so it's impossible to have an Avro namespace independent of the record name.
>
> When invoked with a Spark data type like:
>
> {code:java}
> val sparkSchema = StructType(Seq(
>   StructField("name", StringType, nullable = false),
>   StructField("address", StructType(Seq(
>     StructField("city", StringType, nullable = false),
>     StructField("state", StringType, nullable = false))),
>     nullable = false)))
>
> // map it to an avro schema with top level namespace "foo.bar"
> val avroSchema = SchemaConverters.toAvroType(sparkSchema, false, "employee", "foo.bar")
>
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
> {code}
[jira] [Resolved] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType
[ https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-25260.
----------------------------------
    Resolution: Fixed
 Fix Version/s: 2.4.0

Issue resolved by pull request 22251
[https://github.com/apache/spark/pull/22251]

> Fix namespace handling in SchemaConverters.toAvroType
> -----------------------------------------------------
>
>                 Key: SPARK-25260
>                 URL: https://issues.apache.org/jira/browse/SPARK-25260
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Arun Mahadevan
>            Assignee: Arun Mahadevan
>            Priority: Major
>             Fix For: 2.4.0
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends the record name to the namespace, so it's impossible to have an Avro namespace independent of the record name.
>
> When invoked with a Spark data type like:
>
> {code:java}
> val sparkSchema = StructType(Seq(
>   StructField("name", StringType, nullable = false),
>   StructField("address", StructType(Seq(
>     StructField("city", StringType, nullable = false),
>     StructField("state", StringType, nullable = false))),
>     nullable = false)))
>
> // map it to an avro schema with top level namespace "foo.bar"
> val avroSchema = SchemaConverters.toAvroType(sparkSchema, false, "employee", "foo.bar")
>
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
> {code}
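For reference, in standard Avro semantics a record's full name is `namespace + "." + name`, so a top-level record named `employee` in namespace `foo.bar` should have full name `foo.bar.employee` (not the doubled `foo.bar.employee.employee` reported above); nested records conventionally nest under the parent's full name. A sketch of that target shape built directly with Avro's `SchemaBuilder` (our own construction, not the Spark code under discussion):

```scala
import org.apache.avro.SchemaBuilder

object AvroNamespaceSketch {
  def main(args: Array[String]): Unit = {
    // Nested record placed under the parent's full name, per convention.
    val address = SchemaBuilder.record("address").namespace("foo.bar.employee")
      .fields()
      .requiredString("city")
      .requiredString("state")
      .endRecord()

    // Top-level record: the caller-supplied namespace is used as-is.
    val employee = SchemaBuilder.record("employee").namespace("foo.bar")
      .fields()
      .requiredString("name")
      .name("address").`type`(address).noDefault()
      .endRecord()

    println(employee.getFullName)                               // foo.bar.employee
    println(employee.getField("address").schema().getFullName)  // foo.bar.employee.address
  }
}
```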
[jira] [Assigned] (SPARK-22357) SparkContext.binaryFiles ignore minPartitions parameter
[ https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reassigned SPARK-22357:
---------------------------------
    Assignee: Bo Meng

> SparkContext.binaryFiles ignore minPartitions parameter
> -------------------------------------------------------
>
>                 Key: SPARK-22357
>                 URL: https://issues.apache.org/jira/browse/SPARK-22357
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.2, 2.2.0
>            Reporter: Weichen Xu
>            Assignee: Bo Meng
>            Priority: Major
>             Fix For: 2.4.0
>
> This is a bug in binaryFiles: even though we give it the partitions, binaryFiles ignores it.
> The bug was introduced in Spark 2.1 (from Spark 2.0): in PortableDataStream.scala the argument "minPartitions" is no longer used (with the push to master on 11/7/6):
> {code}
> /**
>  * Allow minPartitions set by end-user in order to keep compatibility with old
>  * Hadoop API, which is set through setMaxSplitSize
>  */
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
>   val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
>   val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
>   val defaultParallelism = sc.defaultParallelism
>   val files = listStatus(context).asScala
>   val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
>   val bytesPerCore = totalBytes / defaultParallelism
>   val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The code previously, in version 2.0, was:
> {code}
> def setMinPartitions(context: JobContext, minPartitions: Int) {
>   val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
>   val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The new code is very smart, but it ignores what the user passes in and uses the data size, which is kind of a breaking change in some sense.
> In our specific case this was a problem, because we initially read in just the file names, and only after that does the dataframe become very large, when reading in the images themselves; in this case the new code does not handle the partitioning very well.
> I'm not sure if it can be easily fixed because I don't understand the full context of the change in Spark (but at the very least the unused parameter should be removed to avoid confusion).
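One way to restore the parameter's effect, sketched from the two versions quoted above, is to fold `minPartitions` back in as a floor on the parallelism used for sizing splits. This is our reconstruction for illustration only, a diff-style excerpt of the quoted internal method rather than standalone code; see the linked pull request for the actual change:

```scala
// Sketch: make setMinPartitions honor minPartitions again. Identical to the
// 2.1+ version quoted above except that the user-supplied minPartitions now
// floors the parallelism, which shrinks maxSplitSize and so yields at least
// that many splits. Relies on PortableDataStream internals (config,
// listStatus, super.setMaxSplitSize), so it is not runnable on its own.
def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int): Unit = {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  // Changed line: never size for fewer partitions than the caller asked for.
  val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  super.setMaxSplitSize(maxSplitSize)
}
```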
[jira] [Resolved] (SPARK-22357) SparkContext.binaryFiles ignore minPartitions parameter
[ https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22357. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21638 [https://github.com/apache/spark/pull/21638] > SparkContext.binaryFiles ignore minPartitions parameter > --- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25212. - Resolution: Fixed Assignee: Bogdan Raducanu Fix Version/s: 2.4.0 > Support Filter in ConvertToLocalRelation > > > Key: SPARK-25212 > URL: https://issues.apache.org/jira/browse/SPARK-25212 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu >Priority: Major > Fix For: 2.4.0 > > > ConvertToLocalRelation can make short queries faster but currently it only > supports Project and Limit. > It can be extended with other operators such as Filter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2+
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595650#comment-16595650 ]

Sean Owen commented on SPARK-25258:
---

Both of these suggest Kryo 4. The question is whether it causes any breaking changes. I don't know of any problems so far, but that's the kind of input that would be great for confirming a change like this, from those who may be watching Kryo closely.

> Upgrade kryo package to version 4.0.2+
> --
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.1
> Reporter: liupengcheng
> Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The issue affects all Kryo versions below 4.0.2, so any Spark version might encounter it.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark uses the Kryo serializer to serialize data if `spark.serializer` is set to `KryoSerializer`. However, when the data contains some extremely large records, KryoSerializer's MapReferenceResolver would expand, and its `reset` method will take a long time to reset all items in the writtenObjects table to null.
> com.esotericsoftware.kryo.util.MapReferenceResolver:
> {code:java}
> public void reset () {
>   readObjects.clear();
>   writtenObjects.clear();
> }
> public void clear () {
>   K[] keyTable = this.keyTable;
>   for (int i = capacity + stashSize; i-- > 0;)
>     keyTable[i] = null;
>   size = 0;
>   stashSize = 0;
> }
> {code}
> I checked the Kryo project on GitHub, and this issue seems to be fixed in 4.0.2+:
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
> I was wondering if we could upgrade Spark's Kryo dependency to 4.0.2+ to fix this problem.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
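The cost pattern described in the issue can be sketched in miniature. This is a toy Python model of the behavior (not Kryo's actual code): the backing table grows to fit the largest record ever serialized, and from then on every reset pays for the full table capacity even when the current record is tiny:

```python
# Toy model of the MapReferenceResolver behavior described above.
class ToyReferenceResolver:
    def __init__(self):
        self.capacity = 16   # backing-table size; only ever grows
        self.size = 0

    def add_written_objects(self, n):
        # Serializing a record registers its object references.
        self.size += n
        while self.capacity < 2 * self.size:
            self.capacity *= 2   # table expands for a huge record

    def reset(self):
        # Like clear(): nulls every slot, so cost is O(capacity),
        # not O(number of live entries).
        ops = self.capacity
        self.size = 0
        return ops

r = ToyReferenceResolver()
r.add_written_objects(5)
cheap = r.reset()                 # small table: cheap reset
r.add_written_objects(1_000_000)  # one extremely large record
r.reset()
r.add_written_objects(5)          # later records are small again...
expensive = r.reset()             # ...but reset still walks the huge table
print(cheap, expensive)
```

The linked Kryo commit addresses this full-table walk; the exact change is in the commit diff referenced in the issue.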
[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595647#comment-16595647 ]

Apache Spark commented on SPARK-25044:
---

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22259

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 2.4.0
> Reporter: Sean Owen
> Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct<>                   struct
> ![0,10,null]                [0,10,0]
> ![1,12,null]                [1,12,1]
> ![2,14,null]                [2,14,2] (QueryTest.scala:163)
> {code}
> You can kind of get what's going on by reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
>   // assume(!ClosureCleanerSuite2.supportsLMFs)
>   // This test won't test what it intends to in 2.12, as lambda metafactory closures
>   // have arg types that are not primitive, but Object
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10
>     .withColumn("c", udf1($"a", lit(null)))
>   val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
>   comparePlans(df.logicalPlan, plan)
>   checkAnswer(
>     df,
>     Seq(
>       Row(0, 10, null),
>       Row(1, 12, null),
>       Row(2, 14, null)))
> }
> {code}
> It seems that the closure that is fed in as a UDF changes behavior, in a way that primitive-type arguments are handled differently. For example an Int argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are understood, but not exactly sure of the cause yet.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595633#comment-16595633 ]

Steve Loughran commented on SPARK-20799:
---

[~jzijlstra] yes, the final listing. Note that in HADOOP-14833 I'm removing the user:secret-in-URI feature. It's just too dangerous; it ends up in logs, it ends up in support JIRAs. Per-bucket options in Hadoop 2.8 and JCEKS files in 2.9+ give you much better security and work well with other tools.

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
> Reporter: Jork Zijlstra
> Priority: Minor
>
> We are getting the following exception:
> {code}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.
> {code}
> Combining the following factors will cause it:
> - Use S3
> - Use format ORC
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in PartitioningAwareFileIndex, def allFiles():
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while the qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred.
> Spark does output the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but this should not mean that it shouldn't work anymore.
> Workaround: move the AWS credentials from the path to the SparkSession:
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
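The key mismatch the reporter describes is easy to reproduce in isolation. A minimal sketch, with hypothetical paths and plain dict lookups standing in for Spark's file index: the map is keyed by the URI with the userinfo stripped, while the qualified lookup path still carries the embedded credentials, so the lookup always misses:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_userinfo(uri):
    # Drop a leading "user:secret@" from the authority, if present.
    parts = urlsplit(uri)
    host = parts.netloc.rsplit("@", 1)[-1]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

listed_dir = "s3n://bucket/table"               # key as produced by the listing
qualified = "s3n://ACCESS:SECRET@bucket/table"  # qualified path, secrets embedded

# Stand-in for leafDirToChildrenFiles: keyed by the credential-free path.
leaf_dir_to_children = {listed_dir: ["part-00000.orc"]}

miss = leaf_dir_to_children.get(qualified)                   # None: keys differ
hit = leaf_dir_to_children.get(strip_userinfo(qualified))    # found
print(miss, hit)
```

Normalizing both sides to the same credential-free form (or keeping credentials out of the URI entirely, as the workaround does) makes the lookup succeed.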
[jira] [Resolved] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
[ https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25004. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21977 [https://github.com/apache/spark/pull/21977] > Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS > -- > > Key: SPARK-25004 > URL: https://issues.apache.org/jira/browse/SPARK-25004 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > Some platforms support limiting Python's addressable memory space by limiting > [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS]. > We've found that adding a limit is very useful when running in YARN because > when Python doesn't know about memory constraints, it doesn't know when to > garbage collect and will continue using memory when it doesn't need to. > Adding a limit reduces PySpark memory consumption and avoids YARN killing > containers because Python hasn't cleaned up memory. > This also improves error messages for users, allowing them to see when Python > is allocating too much memory instead of YARN killing the container: > {code:lang=python} > File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in > fe_engineer > fe_eval_rec.update(f(src_rec_prep, mat_rec_prep)) > File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp > comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, > []), mat_rec_prep.get(item, [])) > File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in > leven_list_compare > permutations = sorted(permutations, reverse=True) > MemoryError > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
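The mechanism behind the new config is the standard resource module. A minimal standalone sketch of the failure mode it enables (limit and allocation sizes are illustrative; RLIMIT_AS is enforced on Linux but not on all platforms, and the limit is applied in a child process so it doesn't affect the parent):

```python
import subprocess
import sys

CHILD = r"""
import resource
# Cap the address space at 1 GiB, roughly what spark.executor.pyspark.memory
# does for the Python worker per this issue.
resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))
try:
    buf = bytearray(4 << 30)   # try to allocate 4 GiB
    print("allocated")
except MemoryError:
    print("MemoryError")
"""

result = subprocess.run([sys.executable, "-c", CHILD],
                        capture_output=True, text=True).stdout.strip()
print(result)
```

With the limit in place the runaway allocation surfaces as a catchable Python MemoryError, instead of YARN killing the whole container for exceeding its memory allotment.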
[jira] [Assigned] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
[ https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25004: -- Assignee: Ryan Blue > Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS > -- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595450#comment-16595450 ] Dilip Biswal commented on SPARK-25267: -- [~smilegator] Thanks Sean.. will take a look. > Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive > - > > Key: SPARK-25267 > URL: https://issues.apache.org/jira/browse/SPARK-25267 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > In SharedSparkSession and TestHive, we need to disable the rule > ConvertToLocalRelation for better test case coverage. To exclude the rules, > we can use the SQLConf `spark.sql.optimizer.excludedRules` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25240.
---

    Resolution: Fixed
    Assignee: Maxim Gekk
    Fix Version/s: 2.4.0

> A deadlock in ALTER TABLE RECOVER PARTITIONS
>
> Key: SPARK-25240
> URL: https://issues.apache.org/jira/browse/SPARK-25240
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Blocker
> Fix For: 2.4.0
>
> Recover Partitions in ALTER TABLE is performed recursively by calling the scanPartitions() method. scanPartitions() lists files sequentially, or in parallel if the [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685] is true:
> {code:scala}
> partitionNames.length > 1 && statuses.length > threshold ||
>   partitionNames.length > 2
> {code}
> Parallel listing is executed on [the fixed thread pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622], which has 8 threads in total. A deadlock occurs when all 8 threads are already occupied and a recursive call of scanPartitions() submits new parallel file listing work to the same pool.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
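The hazard described in the issue is the classic "recursive submission into a bounded pool" deadlock. A toy Python reproduction (not Spark's Scala code): with a single worker standing in for the fixed 8-thread pool, the outer task occupies the only thread while waiting on an inner task it just submitted, so the inner task can never start. A timeout stands in for the real, unbounded wait so the sketch terminates:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

pool = ThreadPoolExecutor(max_workers=1)   # stand-in for the fixed 8-thread pool

def scan_partitions(depth):
    if depth == 0:
        return "leaf"
    # Recursive submission: this level's work goes back into the same pool.
    child = pool.submit(scan_partitions, depth - 1)
    try:
        # The only worker is busy running *this* call, so the child is stuck
        # in the queue; without the timeout this wait would never return.
        return child.result(timeout=2)
    except FuturesTimeout:
        child.cancel()
        return "deadlock"

outcome = pool.submit(scan_partitions, 1).result()
pool.shutdown(wait=False)
print(outcome)
```

With 8 workers the same wedge needs 8 simultaneously blocked outer calls, which is exactly the "all 8 threads already occupied" condition the issue describes.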
[jira] [Commented] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
[ https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595388#comment-16595388 ] Xiao Li commented on SPARK-25267: - cc [~dkbiswal] > Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive > - > > Key: SPARK-25267 > URL: https://issues.apache.org/jira/browse/SPARK-25267 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > In SharedSparkSession and TestHive, we need to disable the rule > ConvertToLocalRelation for better test case coverage. To exclude the rules, > we can use the SQLConf `spark.sql.optimizer.excludedRules` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
Xiao Li created SPARK-25267: --- Summary: Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive Key: SPARK-25267 URL: https://issues.apache.org/jira/browse/SPARK-25267 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li In SharedSparkSession and TestHive, we need to disable the rule ConvertToLocalRelation for better test case coverage. To exclude the rules, we can use the SQLConf `spark.sql.optimizer.excludedRules` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595367#comment-16595367 ] Apache Spark commented on SPARK-25266: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/22258 > Fix memory leak vulnerability in Barrier Execution Mode > --- > > Key: SPARK-25266 > URL: https://issues.apache.org/jira/browse/SPARK-25266 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.4.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Critical > > BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked > in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. > Once a TimerTask is scheduled, the reference to it is not released until > `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
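The retention behavior behind the leak can be modeled in a few lines. This is a toy Python model mirroring the java.util.Timer semantics described above, not Spark's or the JDK's code: cancel() only marks a task, while the scheduler's queue keeps a strong reference to it until purge() (or its fire time) removes it:

```python
import heapq

class MiniTimerTask:
    def __init__(self):
        self.cancelled = False
    def cancel(self):
        self.cancelled = True      # marks only; does NOT remove from the queue

class MiniTimer:
    def __init__(self):
        self._queue = []           # (fire_time, tiebreak, task) heap

    def schedule(self, task, fire_time):
        heapq.heappush(self._queue, (fire_time, id(task), task))

    def purge(self):
        # Drop every cancelled task, like Timer#purge(); returns count removed.
        live = [e for e in self._queue if not e[2].cancelled]
        removed = len(self._queue) - len(live)
        self._queue = live
        heapq.heapify(self._queue)
        return removed

    def queued(self):
        return len(self._queue)

timer = MiniTimer()
tasks = [MiniTimerTask() for _ in range(1000)]
for i, t in enumerate(tasks):
    timer.schedule(t, fire_time=i)
for t in tasks:
    t.cancel()                     # like ContextBarrierState#cancelTimerTask
leak = timer.queued()              # still 1000: cancel() released nothing
reclaimed = timer.purge()          # purge() is what actually frees them
print(leak, reclaimed, timer.queued())
```

This is why the fix needs a Timer#purge() call alongside the existing TimerTask#cancel(): without it, every cancelled barrier-timeout task stays reachable from the Timer's queue.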
[jira] [Assigned] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25266: Assignee: Apache Spark (was: Kousuke Saruta) > Fix memory leak vulnerability in Barrier Execution Mode > --- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode
[ https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25266: Assignee: Kousuke Saruta (was: Apache Spark) > Fix memory leak vulnerability in Barrier Execution Mode > --- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect
[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595363#comment-16595363 ]

Vasanthkumar Velayudham commented on SPARK-25219:
---

Hi [~mgaido] - Thanks for your response!

PFA the Excel workbook with the details of the text data used for performing clustering in Spark and SKLearn. The workbook contains two sheets - one with the results of Spark and the other with the results of SKLearn. I have also attached the code that I used for performing the clustering. As you will notice, the code is a straightforward implementation of text transformations.

In the results, you can observe that the clustering from SKLearn is more uniform, whereas the Spark results contain one cluster with almost 44 records.

Please let me know if you require any additional details.

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Vasanthkumar Velayudham
> Priority: Major
> Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, Spark_Kmeans.txt
>
> Hello Everyone,
> I am facing issues with the usage of KMeans clustering on my text data. When I apply clustering on my text data, after performing various transformations such as RegexTokenizer, stopword processing, HashingTF, and IDF, the generated clusters are not proper and one cluster is found to have a lot of data points assigned to it.
> I am able to perform clustering with similar processing and the same attributes using the SKLearn KMeans algorithm.
> Searching on the internet, I see that many have reported the same issue with the KMeans clustering library of Spark.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect
[ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vasanthkumar Velayudham updated SPARK-25219: Attachment: SKLearn_Kmeans.txt Apache_Logs_Results.xlsx Spark_Kmeans.txt > KMeans Clustering - Text Data - Results are incorrect > - -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24704) The order of stages in the DAG graph is incorrect
[ https://issues.apache.org/jira/browse/SPARK-24704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24704: --- Fix Version/s: 2.3.2 > The order of stages in the DAG graph is incorrect > - > > Key: SPARK-24704 > URL: https://issues.apache.org/jira/browse/SPARK-24704 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: StanZhai >Assignee: StanZhai >Priority: Minor > Labels: regression > Fix For: 2.3.2, 2.4.0 > > Attachments: WX20180630-161907.png > > > The regression is introduced by Spark 2.3.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode
Kousuke Saruta created SPARK-25266: -- Summary: Fix memory leak vulnerability in Barrier Execution Mode Key: SPARK-25266 URL: https://issues.apache.org/jira/browse/SPARK-25266 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. Once a TimerTask is scheduled, the reference to it is not released until `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25265) Fix memory leak vulnerability in Barrier Execution Mode
Kousuke Saruta created SPARK-25265: -- Summary: Fix memory leak vulnerability in Barrier Execution Mode Key: SPARK-25265 URL: https://issues.apache.org/jira/browse/SPARK-25265 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked. Once a TimerTask is scheduled, the reference to it is not released until `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25119) stages in wrong order within job page DAG chart
[ https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25119. Resolution: Duplicate > stages in wrong order within job page DAG chart > --- > > Key: SPARK-25119 > URL: https://issues.apache.org/jira/browse/SPARK-25119 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Yunjian Zhang >Priority: Minor > Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png > > > Multiple stages for the same job are shown in the wrong order in the DAG Visualization of the job page. > e.g. > stage27 stage19 stage20 stage24 stage21 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23679) uiWebUrl show inproper URL when running on YARN
[ https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23679: -- Assignee: Saisai Shao > uiWebUrl show inproper URL when running on YARN > --- > > Key: SPARK-23679 > URL: https://issues.apache.org/jira/browse/SPARK-23679 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Assignee: Saisai Shao >Priority: Major > Fix For: 2.4.0 > > > uiWebUrl returns local url. > Using it will cause HTTP ERROR 500 > {code} > Problem accessing /. Reason: > Server Error > Caused by: > javax.servlet.ServletException: Could not determine the proxy server for > redirection > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > {code} > We should give address to yarn proxy instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23679) uiWebUrl shows improper URL when running on YARN
[ https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23679. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22164 [https://github.com/apache/spark/pull/22164] > uiWebUrl show inproper URL when running on YARN > --- > > Key: SPARK-23679 > URL: https://issues.apache.org/jira/browse/SPARK-23679 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Assignee: Saisai Shao >Priority: Major > Fix For: 2.4.0 > > > uiWebUrl returns local url. > Using it will cause HTTP ERROR 500 > {code} > Problem accessing /. Reason: > Server Error > Caused by: > javax.servlet.ServletException: Could not determine the proxy server for > redirection > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > {code} > We should give address to yarn proxy instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
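The fix resolved above makes `uiWebUrl` point at the YARN proxy rather than the driver-local address. A minimal sketch of the idea, with illustrative names (`proxiedUiUrl`, `rmWebApp` are not Spark's actual identifiers):

```scala
// On YARN, the driver's own UI address (e.g. http://driver-host:4040) is only
// reachable through the AmIpFilter, so a direct request fails with the
// ServletException above. The usable address is the RM web proxy for the app.
def proxiedUiUrl(rmWebApp: String, appId: String): String =
  s"$rmWebApp/proxy/$appId"

val url = proxiedUiUrl("http://rm-host:8088", "application_1535000000000_0042")
// "http://rm-host:8088/proxy/application_1535000000000_0042"
```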
[jira] [Resolved] (SPARK-23997) Configurable max number of buckets
[ https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23997. - Resolution: Fixed Assignee: Fernando Pereira Fix Version/s: 2.4.0 > Configurable max number of buckets > -- > > Key: SPARK-23997 > URL: https://issues.apache.org/jira/browse/SPARK-23997 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Major > Fix For: 2.4.0 > > > When exporting data as a table the user can choose to split data in buckets > by choosing the columns and the number of buckets. Currently there is a > hard-coded limit of 99'999 buckets. > However, for heavy workloads this limit might be too restrictive, a situation > that will eventually become more common as workloads grow. > As per the comments in SPARK-19618 this limit could be made configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
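The shape of a configurable bucket-count check can be sketched as below. This is illustrative only: `SQLConf` here is a stand-in case class, and the actual config key and error message added by the fix may differ.

```scala
// Replace the hard-coded 99999 cap with a configurable maximum.
case class SQLConf(maxBuckets: Int = 99999) // stand-in for Spark's SQLConf

def validateBuckets(numBuckets: Int, conf: SQLConf): Unit =
  require(numBuckets > 0 && numBuckets <= conf.maxBuckets,
    s"Number of buckets should be greater than 0 but less than or equal to ${conf.maxBuckets}")

validateBuckets(50000, SQLConf())                      // fine under the default cap
validateBuckets(150000, SQLConf(maxBuckets = 200000))  // allowed once the cap is raised
```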
[jira] [Assigned] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner
[ https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25264: Assignee: Apache Spark > Fix comma-delineated arguments passed into PythonRunner and RRunner > --- > > Key: SPARK-25264 > URL: https://issues.apache.org/jira/browse/SPARK-25264 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Assignee: Apache Spark >Priority: Major > > The arguments passed into the PythonRunner and RRunner are comma-delineated. > Because the Runners do a arg.slice(2,...) This means that the delineation in > the entrypoint needs to be a space, as it would be expected by the Runner > arguments. > This issue was logged here: > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner
[ https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25264: Assignee: (was: Apache Spark) > Fix comma-delineated arguments passed into PythonRunner and RRunner > --- > > Key: SPARK-25264 > URL: https://issues.apache.org/jira/browse/SPARK-25264 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Priority: Major > > The arguments passed into the PythonRunner and RRunner are comma-delineated. > Because the Runners do a arg.slice(2,...) This means that the delineation in > the entrypoint needs to be a space, as it would be expected by the Runner > arguments. > This issue was logged here: > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner
[ https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595306#comment-16595306 ] Apache Spark commented on SPARK-25264: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22257 > Fix comma-delineated arguments passed into PythonRunner and RRunner > --- > > Key: SPARK-25264 > URL: https://issues.apache.org/jira/browse/SPARK-25264 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Priority: Major > > The arguments passed into the PythonRunner and RRunner are comma-delineated. > Because the Runners do a arg.slice(2,...) This means that the delineation in > the entrypoint needs to be a space, as it would be expected by the Runner > arguments. > This issue was logged here: > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
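The slicing behaviour described above can be illustrated in miniature. The argv layout below is only a sketch of the Runner convention (user arguments taken from position 2 onward), not the exact argument order the Runners use:

```scala
// PythonRunner/RRunner take the user arguments from argv positions 2 onward.
// If the K8s entrypoint joins them with commas into one token, the Runner
// sees a single argument instead of several.
def userArgs(argv: Array[String]): Array[String] = argv.slice(2, argv.length)

val commaJoined = Array("app.py", "py-files", "a,b,c")
val spaceSplit  = Array("app.py", "py-files", "a", "b", "c")

userArgs(commaJoined) // Array("a,b,c") -- one argument, wrong
userArgs(spaceSplit)  // Array("a", "b", "c") -- three arguments, as intended
```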
[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298 ] yucai edited comment on SPARK-25206 at 8/28/18 5:06 PM: Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it could be more meaningful. was (Author: yucai): Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful. > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter > cases between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. 
Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even with spark.sql.caseSensitive > set to false. > SPARK-25132 addressed this issue already. > > The biggest difference is that, in Spark 2.1, users will get an Exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know the issue and fix the query. > But in Spark 2.3, users will get the wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298 ] yucai edited comment on SPARK-25206 at 8/28/18 5:05 PM: Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful. was (Author: yucai): Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful. > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter > cases between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. 
Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even with spark.sql.caseSensitive > set to false. > SPARK-25132 addressed this issue already. > > The biggest difference is that, in Spark 2.1, users will get an Exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know the issue and fix the query. > But in Spark 2.3, users will get the wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298 ] yucai commented on SPARK-25206: --- Do you want to simulate an Exception in Spark? Backporting SPARK-25132 and SPARK-24716 is to fix a bug for our data source table, it looks more meaningful. > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter > cases between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even spark.sql.caseSensitive > set to false. 
> SPARK-25132 addressed this issue already. > > The biggest difference is that, in Spark 2.1, users will get an Exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know the issue and fix the query. > But in Spark 2.3, users will get the wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
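The resolution behaviour discussed in SPARK-25206 can be sketched as follows. This is not Spark's actual code, only an illustration of resolving a metastore-cased column name against the Parquet file's own schema before pushdown, as SPARK-24716/SPARK-25132 do:

```scala
// Resolve a filter column against the Parquet file's physical field names.
// Case-sensitive mode requires an exact match; case-insensitive mode matches
// ignoring case but must fail when the match is ambiguous.
def resolveField(parquetFields: Seq[String], metastoreName: String,
                 caseSensitive: Boolean): Option[String] =
  if (caseSensitive) parquetFields.find(_ == metastoreName)
  else parquetFields.filter(_.equalsIgnoreCase(metastoreName)) match {
    case Seq(single) => Some(single)
    case Seq()       => None
    case _           => throw new RuntimeException(
      s"""Found duplicate field(s) "$metastoreName" in case-insensitive mode""")
  }

resolveField(Seq("id", "name"), "ID", caseSensitive = false) // Some("id"): pushdown uses "id"
resolveField(Seq("id", "name"), "ID", caseSensitive = true)  // None: "ID" not in the file
```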
[jira] [Updated] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner
[ https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-25264: --- Description: The arguments passed into the PythonRunner and RRunner are comma-delineated. Because the Runners do a arg.slice(2,...) This means that the delineation in the entrypoint needs to be a space, as it would be expected by the Runner arguments. This issue was logged here: [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273] was: The arguments passed into the PythonRunner and RRunner are comma-delineated. This issue was logged here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273 > Fix comma-delineated arguments passed into PythonRunner and RRunner > --- > > Key: SPARK-25264 > URL: https://issues.apache.org/jira/browse/SPARK-25264 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Ilan Filonenko >Priority: Major > > The arguments passed into the PythonRunner and RRunner are comma-delineated. > Because the Runners do a arg.slice(2,...) This means that the delineation in > the entrypoint needs to be a space, as it would be expected by the Runner > arguments. > This issue was logged here: > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner
Ilan Filonenko created SPARK-25264: -- Summary: Fix comma-delineated arguments passed into PythonRunner and RRunner Key: SPARK-25264 URL: https://issues.apache.org/jira/browse/SPARK-25264 Project: Spark Issue Type: Bug Components: Kubernetes, PySpark Affects Versions: 2.4.0 Reporter: Ilan Filonenko The arguments passed into the PythonRunner and RRunner are comma-delineated. This issue was logged here: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595254#comment-16595254 ] Apache Spark commented on SPARK-25262: -- User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/22256 > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. 
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
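The two proposed changes to `LocalDirsFeatureStep` can be sketched with plain case classes standing in for the fabric8 Kubernetes model objects Spark actually builds. Names and structure here are illustrative only:

```scala
// Sketch: optionally back local-dir emptyDir volumes with tmpfs, and skip
// generating a volume whose name is already defined (e.g. via a pod template).
case class Volume(name: String, medium: Option[String]) // medium Some("Memory") => tmpfs

def localDirVolumes(dirNames: Seq[String], existing: Set[String],
                    useTmpfs: Boolean): Seq[Volume] =
  dirNames.filterNot(existing.contains).map { n =>
    Volume(n, medium = if (useTmpfs) Some("Memory") else None)
  }

localDirVolumes(Seq("spark-local-dir-1", "spark-local-dir-2"),
                existing = Set("spark-local-dir-1"), useTmpfs = true)
// only spark-local-dir-2 gets a new, tmpfs-backed volume
```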
[jira] [Assigned] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25262: Assignee: Apache Spark > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. 
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25262: Assignee: (was: Apache Spark) > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. 
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595240#comment-16595240 ] Chenxiao Mao commented on SPARK-25175: -- [~cloud_fan] Does it make sense? > Field resolution should fail if there is ambiguity for ORC data source native > implementation > > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25175: - Summary: Field resolution should fail if there is ambiguity for ORC data source native implementation (was: Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC) > Field resolution should fail if there is ambiguity for ORC data source native > implementation > > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595225#comment-16595225 ] Chenxiao Mao commented on SPARK-25175: -- After a deep dive into ORC file read paths (data source native, data source hive, hive serde), I realized that this is a little complicated. I'm not sure whether it's technically possible to make all three read paths consistent with respect to case sensitivity, because we rely on hive InputFormat/SerDe which we might not be able to change. Please also see [~cloud_fan]'s comment on Parquet: [https://github.com/apache/spark/pull/22184/files#r212849852] So I changed the title of this Jira to reduce the scope. This ticket aims to make ORC data source native impl consistent with Parquet data source. The gap is that field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC. Does it make sense? As for duplicate fields with different letter cases, we don't have real use cases. It's just for testing purpose. > Field resolution should fail if there is ambiguity in case-insensitive mode > when reading from ORC > - > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. 
This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
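The gap described above (fail on ambiguity in case-insensitive mode, as Parquet now does) can be illustrated with a small sketch. This is not Spark's actual resolution code; the method name and error message are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Spark's actual code) of case-insensitive field
// resolution that fails on ambiguity, mirroring the Parquet behavior from
// SPARK-25132 that this ticket wants for the native ORC reader.
class FieldResolver {
    // Returns the matched physical field name, or null if no match; throws
    // when case-insensitive matching is ambiguous (e.g. file has both c and C).
    static String resolve(String requested, List<String> fileFields, boolean caseSensitive) {
        if (caseSensitive) {
            return fileFields.contains(requested) ? requested : null;
        }
        List<String> matches = new ArrayList<>();
        for (String f : fileFields) {
            if (f.equalsIgnoreCase(requested)) matches.add(f);
        }
        if (matches.size() > 1) {
            // Ambiguous: fail instead of silently returning the first match,
            // which is what the hive OrcFileFormat does today.
            throw new RuntimeException("Found duplicate field(s) \"" + requested
                + "\": " + matches + " in case-insensitive mode");
        }
        return matches.isEmpty() ? null : matches.get(0);
    }
}
```

Against the file schema (a, B, c, C) used in the tests below, `resolve("c", …, false)` would throw rather than return the lower-case field.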
[jira] [Updated] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25175: - Summary: Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC (was: Case-insensitive field resolution when reading from ORC) > Field resolution should fail if there is ambiguity in case-insensitive mode > when reading from ORC > - > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. > This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. 
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593194#comment-16593194 ] Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 4:00 PM: --- [~dongjoon] [~yucai] Here is a brief summary. We can see that * The data source tables with hive impl always return a,B,c, no matter whether spark.sql.caseSensitive is set to true or false and no matter metastore table schema is in lower case or upper case. They always do case-insensitive field resolution, and if there is ambiguity they return the first matched one. Given ORC file schema is (a,B,c,C) ** Is it better to return null in scenario 2 and 10? ** Is it better to return C in scenario 12? ** Is it better to fail due to ambiguity in scenario 15, 18, 21, 24, rather than always return lower case one? * The data source tables with native impl, compared to hive impl, handle scenario 2, 10, 12 in a more reasonable way. However, they handles ambiguity in the same way as hive impl, which is not consistent with Parquet data source. * The hive serde tables always throw IndexOutOfBoundsException at runtime when ORC file schema has more fields than table schema. If ORC schema does NOT have more fields, hive serde tables do field resolution by ordinal rather than by name. * Since in case-sensitive mode analysis should fail if a column name in query and metastore schema are in different cases, all AnalysisException(s) are reasonable. 
Stacktrace of IndexOutOfBoundsException: {code:java} java.lang.IndexOutOfBoundsException: toIndex = 4 at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) at java.util.ArrayList.subList(ArrayList.java:996) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} was (Author: seancxmao): [~dongjoon] [~yucai] Here is a brief summary. We can
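The `IndexOutOfBoundsException: toIndex = 4` above comes from `ArrayList.subList` being asked for a range sized by the ORC file's field count while the backing list holds only the table schema's columns. A minimal reproduction of just that failure mode (the class name here is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal reproduction of the failure behind the stack trace above: hive's
// RecordReaderFactory effectively requests a sub-list sized by the ORC
// file's field count (4: a, B, c, C) from a list holding only the table
// schema's 3 columns.
class SubListRepro {
    static boolean tableNarrowerThanFile() {
        List<String> tableColumns = new ArrayList<>();
        tableColumns.add("a");
        tableColumns.add("b");
        tableColumns.add("c");
        try {
            tableColumns.subList(0, 4); // file has 4 fields -> toIndex = 4
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true; // same "toIndex = 4" as in the stack trace
        }
    }
}
```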
[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-25175: - Description: SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found ORC files have similar issues, but not identical to Parquet. Spark has two OrcFileFormat. * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive dependency. This hive OrcFileFormat always do case-insensitive field resolution regardless of case sensitivity mode. When there is ambiguity, hive OrcFileFormat always returns the first matched field, rather than failing the reading operation. * SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution, however it cannot handle duplicate fields. Besides data source tables, hive serde tables also have issues. If ORC data file has more fields than table schema, we just can't read hive serde tables. If ORC data file does not have more fields, hive serde tables always do field resolution by ordinal, rather than by name. Both ORC data source hive impl and hive serde table rely on the hive orc InputFormat/SerDe to read table. I'm not sure whether we can change underlying hive classes to make all orc read behaviors consistent. This ticket aims to make read behavior of ORC data source native impl consistent with Parquet data source. was: SPARK-25132 adds support for case-insensitive field resolution when reading from Parquet files. We found ORC files have similar issues. Since Spark has 2 OrcFileFormat, we should add support for both. * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive dependency. This hive OrcFileFormat always do case-insensitive field resolution regardless of case sensitivity mode. When there is ambiguity, hive OrcFileFormat always returns the lower case field, rather than failing the reading operation. 
* SPARK-20682 adds a new ORC data source inside sql/core. This native OrcFileFormat supports case-insensitive field resolution, however it cannot handle duplicate fields. Besides data source tables, hive serde tables also have issues. When there are duplicate fields (e.g. c, C), we just can't read hive serde tables. If there are no duplicate fields, hive serde tables always do case-insensitive field resolution regardless of case sensitivity mode. > Case-insensitive field resolution when reading from ORC > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues, but not identical > to Parquet. Spark has two OrcFileFormat. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat always do case-insensitive field > resolution regardless of case sensitivity mode. When there is ambiguity, hive > OrcFileFormat always returns the first matched field, rather than failing the > reading operation. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. > Besides data source tables, hive serde tables also have issues. If ORC data > file has more fields than table schema, we just can't read hive serde tables. > If ORC data file does not have more fields, hive serde tables always do field > resolution by ordinal, rather than by name. > Both ORC data source hive impl and hive serde table rely on the hive orc > InputFormat/SerDe to read table. I'm not sure whether we can change > underlying hive classes to make all orc read behaviors consistent. 
> This ticket aims to make read behavior of ORC data source native impl > consistent with Parquet data source. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595182#comment-16595182 ] Imran Rashid commented on SPARK-24918:
--
Yes, that's right, it's to avoid putting a reference to the static initializer in every single task. That's a nuisance when your task definitions are [...] For my intended use, the task itself doesn't depend on the plugin at all. The plugin gives you added debug info, that's all. That's why it's OK for the original code to not know anything about the plugin. There were other requests on the earlier jira to do lifecycle management, e.g. eager initialization of resources. I agree that's not as clear (if you need the resource, your task will have to reference it, so you have a spot for your static initializer). In any case, you could use this same mechanism for that as well, if you wanted.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
> Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each
> executor for debugging and instrumentation. Its hard to do this currently
> because:
> a) you have no idea when executors will come and go with DynamicAllocation,
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark
> app itself to run a special task to "install" the plugin, which is often
> tough in production cases when those maintaining regularly running
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a
> debugging context to understand memory use, just by re-running an application
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
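One way to read the "base class that has no-op implementations" idea from the description: an interface whose lifecycle hooks all default to no-ops, so new events can be added later without breaking existing plugins. The names below are hypothetical, not the final Spark API.

```java
// Hypothetical shape of the plugin API discussed above (names are
// illustrative, not Spark's final API): every lifecycle hook has a no-op
// default, so adding events later does not break compiled plugins.
interface ExecutorPlugin {
    default void init() {}                 // called once when the executor starts
    default void taskStart(long taskId) {}
    default void taskEnd(long taskId) {}
    default void shutdown() {}
}

// A debugging plugin overrides only the hooks it needs.
class TaskCountingPlugin implements ExecutorPlugin {
    int started = 0;
    @Override public void taskStart(long taskId) { started++; }
}
```

The default-method route trades explicit versioning for source/binary compatibility: a plugin compiled against today's interface keeps working when a new hook is added.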
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595178#comment-16595178 ] Rob Vesse commented on SPARK-25262: --- I have changes for this almost ready and plan to open a PR tomorrow > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. 
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
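For reference, a minimal pod-spec fragment showing the `medium: Memory` variant of `emptyDir` from the Kubernetes documentation that the proposed config setting would emit (volume name and mount path are illustrative):

```yaml
# emptyDir backed by tmpfs instead of node storage; names are illustrative.
volumes:
  - name: spark-local-dir-1
    emptyDir:
      medium: Memory
containers:
  - name: spark-executor
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /tmp/spark-local
```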
[jira] [Resolved] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)
[ https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-25005. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.4.0 > Structured streaming doesn't support kafka transaction (creating empty offset > with abort & markers) > --- > > Key: SPARK-25005 > URL: https://issues.apache.org/jira/browse/SPARK-25005 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Quentin Ambard >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Structured streaming can't consume kafka transaction. > We could try to apply SPARK-24720 (DStream) logic to Structured Streaming > source -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25218) Potential resource leaks in TransportServer and SocketAuthHelper
[ https://issues.apache.org/jira/browse/SPARK-25218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-25218. -- Resolution: Fixed Fix Version/s: 2.4.0 > Potential resource leaks in TransportServer and SocketAuthHelper > > > Key: SPARK-25218 > URL: https://issues.apache.org/jira/browse/SPARK-25218 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.4.0 > > > They don't release the resources for all types of errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
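The leak pattern described ("don't release the resources for all types of errors") is typically fixed by running the cleanup on every path while keeping the original failure primary, as Spark's `Utils.tryWithSafeFinally` does (it appears in the broadcast stack trace earlier in this digest). A simplified sketch of that pattern, not the actual Spark source:

```java
// Simplified sketch of the try-with-safe-finally pattern: cleanup runs on
// every path, and an exception thrown by the cleanup is attached as
// suppressed rather than masking the original failure.
class SafeFinally {
    interface Action { void run() throws Exception; }

    static void tryWithSafeFinally(Action block, Action cleanup) throws Exception {
        Throwable original = null;
        try {
            block.run();
        } catch (Throwable t) {
            original = t;
            throw t;
        } finally {
            try {
                cleanup.run();
            } catch (Throwable t) {
                if (original != null) {
                    original.addSuppressed(t); // keep the original as primary
                } else {
                    throw t;                   // cleanup failed on the success path
                }
            }
        }
    }
}
```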
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593185#comment-16593185 ] Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 3:27 PM:
---
Investigation about ORC tables with duplicate fields (c and C), where the data file therefore also has more fields than the table schema.
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data'
DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
||no.||caseSensitive||table columns||select column||orc column (select via data source table, hive impl)||orc column (select via data source table, native impl)||orc column (select via hive serde table)||
|1|true|a, b, c|a|a|a|IndexOutOfBoundsException|
|2| | |b|B|null|IndexOutOfBoundsException|
|3| | |c|c|c|IndexOutOfBoundsException|
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException|AnalysisException|AnalysisException|
|8| | |b|AnalysisException|AnalysisException|AnalysisException|
|9| | |c|AnalysisException|AnalysisException|AnalysisException|
|10| | |A|a|null|IndexOutOfBoundsException|
|11| | |B|B|B|IndexOutOfBoundsException|
|12| | |C|c|C|IndexOutOfBoundsException|
|13|false|a, b, c|a|a|a|IndexOutOfBoundsException|
|14| | |b|B|B|IndexOutOfBoundsException|
|15| | |c|c|c|IndexOutOfBoundsException|
|16| | |A|a|a|IndexOutOfBoundsException|
|17| | |B|B|B|IndexOutOfBoundsException|
|18| | |C|c|c|IndexOutOfBoundsException|
|19| |A, B, C|a|a|a|IndexOutOfBoundsException|
|20| | |b|B|B|IndexOutOfBoundsException|
|21| | |c|c|c|IndexOutOfBoundsException|
|22| | |A|a|a|IndexOutOfBoundsException|
|23| | |B|B|B|IndexOutOfBoundsException|
|24| | |C|c|c|IndexOutOfBoundsException|

Followup tests that use ORC files with no duplicate fields (only a, B).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_nodup")

$> hive --orcfiledump /user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Type: struct

CREATE TABLE orc_nodup_hive_serde_lower (a LONG, b LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data_nodup'
CREATE TABLE orc_nodup_hive_serde_upper (A LONG, B LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data_nodup'
DESC EXTENDED orc_nodup_hive_serde_lower;
DESC EXTENDED orc_nodup_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
||no.||caseSensitive||table columns||select column||orc column (select via hive serde table)||
|1|true|a, b|a|a|
|2| | |b|B|
|4| | |A|AnalysisException|
|5| | |B|AnalysisException|
|7| |A, B|a|AnalysisException|
|8| | |b|AnalysisException|
|10| | |A|a|
|11| | |B|B|
|13|false|a, b|a|a|
|14| | |b|B|
|16| | |A|a|
|17| | |B|B|
|19| |A, B|a|a|
|20| | |b|B|
|22| | |A|a|
|23| | |B|B|

Tests show that for hive serde tables, field resolution is by ordinal, not by name.
{code}
spark.conf.set("spark.sql.caseSensitive", true)
spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
val data = spark.range(1).selectExpr("id + 1 as x", "id + 2 as y", "id + 3 as z")
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_xyz")
sql("CREATE TABLE orc_table_ABC (A LONG, B LONG, C LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data_xyz'")
sql("select B from orc_table_ABC").show
+---+
|  B|
+---+
|  2|
+---+
{code}
was (Author: seancxmao): Thorough investigation about ORC tables with duplicate fields (c and C). {code:java} val data =
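The by-ordinal behavior in the last test above (table column `B` returning file column `y`'s value, 2) can be contrasted with by-name resolution in a tiny sketch. This is illustrative only, not hive's actual resolution code.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative contrast (not hive's code) between the two resolution
// strategies observed above: file schema (x, y, z), table schema (A, B, C).
class Resolution {
    static final List<String> FILE_SCHEMA  = Arrays.asList("x", "y", "z");
    static final List<String> TABLE_SCHEMA = Arrays.asList("A", "B", "C");

    // hive serde tables: table column B -> position 1 -> file column y
    static String byOrdinal(String tableCol) {
        return FILE_SCHEMA.get(TABLE_SCHEMA.indexOf(tableCol));
    }

    // by-name (case-insensitive) resolution would find no match at all here
    static String byName(String tableCol) {
        for (String f : FILE_SCHEMA) {
            if (f.equalsIgnoreCase(tableCol)) return f;
        }
        return null;
    }
}
```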
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593194#comment-16593194 ] Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 3:22 PM: --- [~dongjoon] [~yucai] Here is a brief summary. We can see that * The data source tables with hive impl always return a,B,c, no matter whether spark.sql.caseSensitive is set to true or false and no matter metastore table schema is in lower case or upper case. They always do case-insensitive field resolution, and if there is ambiguity they return the first matched one. Given ORC file schema is (a,B,c,C) ** Is it better to return null in scenario 2 and 10? ** Is it better to return C in scenario 12? ** Is it better to fail due to ambiguity in scenario 15, 18, 21, 24, rather than always return lower case one? * The data source tables with native impl, compared to hive impl, handle scenario 2, 10, 12 in a more reasonable way. However, they handles ambiguity in the same way as hive impl. * The hive serde tables always throw IndexOutOfBoundsException at runtime when ORC schema has more fields than table schema. If ORC schema does NOT have more fields, hive serde tables do field resolution by ordinal rather than by name. * Since in case-sensitive mode analysis should fail if a column name in query and metastore schema are in different cases, all AnalysisException(s) are reasonable. 
Stacktrace of IndexOutOfBoundsException: {code:java} java.lang.IndexOutOfBoundsException: toIndex = 4 at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) at java.util.ArrayList.subList(ArrayList.java:996) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} was (Author: seancxmao): [~dongjoon] [~yucai] Here is a brief summary. We can see that * The data source tables with hive impl always
[jira] [Commented] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595068#comment-16595068 ] Marco Gaido commented on SPARK-23622: - This failure became permanent in the last build (at least seems so). > Flaky Test: HiveClientSuites > > > Key: SPARK-23622 > URL: https://issues.apache.org/jira/browse/SPARK-23622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test > (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325 > {code} > Error Message > java.lang.reflect.InvocationTargetException: null > Stacktrace > sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270) > at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58) > at > org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41) > at > org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48) > at > org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) > at > org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24) > at org.scalatest.Suite$class.run(Suite.scala:1144) > at > org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444) > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117) > ... 
29 more > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:63) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425) > ... 31 more > Caused by: sbt.ForkMain$ForkError: >
[jira] [Commented] (SPARK-23663) Spark Streaming Kafka 010 , fails with "java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access"
[ https://issues.apache.org/jira/browse/SPARK-23663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595044#comment-16595044 ] Gabor Somogyi commented on SPARK-23663: --- [~sipeti] from the stacktrace the root cause looks the same. There are additional JIRAs that try to address this issue, but they are not yet closed. > Spark Streaming Kafka 010 , fails with > "java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access" > --- > > Key: SPARK-23663 > URL: https://issues.apache.org/jira/browse/SPARK-23663 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 > Spark streaming kafka 010 > >Reporter: kaushik srinivas >Priority: Major > > test being tried: > 10 kafka topics created. Streamed with avro data from kafka producers. > org.apache.spark.streaming.kafka010 used for creating directStream to kafka. > A single direct stream is created for all the ten topics. > And on each RDD (batch of 50 seconds), the key of each kafka consumer record is checked > and separate RDDs are created for separate topics. > Each topic has records with key as topic name and value of avro messages. > Finally ten RDDs are converted to data frames and registered as separate temp > tables. > Once all the temp tables are registered, a few SQL queries are run on these > temp tables. 
> > Exception seen: > 18/03/12 11:58:34 WARN TaskSetManager: Lost task 23.0 in stage 4.0 (TID 269, > 192.168.24.145, executor 7): java.util.ConcurrentModificationException: > KafkaConsumer is not safe for multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:80) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189) > at > org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918) > at > org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 18/03/12 11:58:34 INFO TaskSetManager: Starting task 23.1 in stage 4.0 (TID > 828, 192.168.24.145, executor 7, partition 23, PROCESS_LOCAL, 4758 bytes) > 18/03/12 11:58:34 INFO TaskSetManager: Lost task 23.1 in stage 4.0 (TID 828) > on 192.168.24.145, executor 7: 
java.util.ConcurrentModificationException > (KafkaConsumer is not safe for multi-threaded access) [duplicate 1] > 18/03/12 11:58:34 INFO TaskSetManager: Starting task 23.2 in stage 4.0 (TID > 829, 192.168.24.145, executor 7, partition 23, PROCESS_LOCAL, 4758 bytes) > 18/03/12 11:58:40 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 30, > 192.168.24.147, executor 6): java.lang.IllegalStateException: This consumer > has already been closed. > at > org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417) > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189) > at >
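The exception above comes from KafkaConsumer's lightweight ownership check: each call into the consumer first verifies that the calling thread is the thread that currently "owns" it, and throws ConcurrentModificationException otherwise. When a cached per-partition consumer ends up being used from more than one task thread on the same executor, that check fires. The sketch below illustrates the shape of that guard in plain Scala; the class and field names are illustrative, not Kafka's actual source.

```scala
import java.util.ConcurrentModificationException
import java.util.concurrent.atomic.AtomicLong

// Minimal sketch of the single-thread ownership check KafkaConsumer performs
// on every public method (acquire/release). Names are illustrative.
class SingleThreadedConsumer {
  private val NoThread = -1L
  private val owner = new AtomicLong(NoThread)   // id of the owning thread
  private var refCount = 0

  def acquire(): Unit = {
    val tid = Thread.currentThread().getId
    // Fail fast if a different thread already holds the consumer.
    if (tid != owner.get() && !owner.compareAndSet(NoThread, tid))
      throw new ConcurrentModificationException(
        "KafkaConsumer is not safe for multi-threaded access")
    refCount += 1
  }

  def release(): Unit = {
    refCount -= 1
    if (refCount == 0) owner.set(NoThread)       // give up ownership
  }
}
```

The check is deliberately cheap (one atomic read in the common case), which is why Kafka fails loudly instead of locking: the consumer was never designed to be shared across threads, so Spark's consumer cache must hand each consumer to at most one task thread at a time.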
[jira] [Commented] (SPARK-23663) Spark Streaming Kafka 010 , fails with "java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access"
[ https://issues.apache.org/jira/browse/SPARK-23663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595035#comment-16595035 ] Peter Simon commented on SPARK-23663: - [~gsomogyi] isn't this the same as SPARK-19185? > Spark Streaming Kafka 010 , fails with > "java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access" > --- > > Key: SPARK-23663 > URL: https://issues.apache.org/jira/browse/SPARK-23663 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: kaushik srinivas >Priority: Major
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595020#comment-16595020 ] Sean Owen commented on SPARK-24918: --- Wait, why doesn't a static init "run everywhere"? Is "everywhere" the same as "on every executor that will run a task"? And why would an executor be able to init without touching a static initializer that the user code touches? Yes, you just call it somewhere in any task that needs the init. There's no overhead in checking the init and initting if not. Therein is the issue, I suppose: you have to put some reference to a class or method in every task. The upshot is it requires no additional mechanism or reasoning about what thread runs what. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the API, and how it's versioned. > Does it just get created when the executor starts, and that's it? 
Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
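The "static init" pattern discussed above — touching a class from task code so that each executor JVM runs its initializer exactly once — can be sketched with a plain Scala object, since the JVM runs an object's initializer once per classloader, under the class-init lock. The names below are illustrative, not part of any proposed Spark API.

```scala
// Runs once per JVM (i.e., once per executor) the first time any task
// references it. Subsequent references are essentially free.
object ExecutorSideInit {
  @volatile private var initCount = 0
  // The object body is the "static initializer": it runs exactly once,
  // e.g. to start a monitoring thread or open a connection pool.
  initCount += 1

  // Calling this from a task forces initialization on its executor.
  def ensureInitialized(): Int = initCount
}

// In user code, every task that needs the setup just references the object.
// (Stand-in for the closure you would pass to rdd.map/foreach.)
def taskBody(partition: Seq[Int]): Int = {
  ExecutorSideInit.ensureInitialized() // cheap after the first call
  partition.sum
}
```

This is exactly the trade-off the comment describes: no extra mechanism or threading rules, but every task closure has to carry some reference to the initializing class, which the proposed Executor Plugin API would remove.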
[jira] [Created] (SPARK-25263) Add scheduler integration test for SPARK-24909
Thomas Graves created SPARK-25263: - Summary: Add scheduler integration test for SPARK-24909 Key: SPARK-25263 URL: https://issues.apache.org/jira/browse/SPARK-25263 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.1 Reporter: Thomas Graves For SPARK-24909, the unit test isn't sufficient to cover the interactions between DAGScheduler and TaskSetManager. Add a test to the SchedulerIntegrationSuite to cover this. The SchedulerIntegrationSuite needs to be updated to handle multiple executors and also to be able to control the flow of tasks starting/ending.
[jira] [Comment Edited] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594981#comment-16594981 ] Chao Fang edited comment on SPARK-25091 at 8/28/18 1:39 PM: Yes, I think it's UI issue. Today I run the CACHE/UNCACHE TABLE for three times and finally REFRESH TABLE. And I get the attached files. As you can see, the storage tab is ok, while Storage Memory in Executor Tab always increase. And as you can see, the old gen space release memory space as expected and the Disk Memory in Executor Tab is 0.0B was (Author: chao fang): Yes, I think it's UI issue. Today I run the CACHE/UNCACHE TABLE for three times and finally REFRESH TABLE. And I get the attached files. As you can see, the storage tab is ok, while Storage Memory in Executor Tab always increase. And as you can see, the old gen space release memory space as expected and the Disk Memory in Executor Tab is 0.0B !4.png! > UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory > - > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > Attachments: 0.png, 1.png, 2.png, 3.png, 4.png > > > UNCACHE TABLE and CLEAR CACHE does not clean up executor memory. > Through Spark UI, although in Storage, we see the cached table removed. In > Executor, the executors continue to hold the RDD and the memory is not > cleared. This results in huge waste in executor memory usage. As we call > CACHE TABLE, we run into issues where the cached tables are spilled to disk > instead of reclaiming the memory storage. 
> > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist().
[jira] [Commented] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594981#comment-16594981 ] Chao Fang commented on SPARK-25091: --- Yes, I think it's UI issue. Today I run the CACHE/UNCACHE TABLE for three times and finally REFRESH TABLE. And I get the attached files. As you can see, the storage tab is ok, while Storage Memory in Executor Tab always increase. And as you can see, the old gen space release memory space as expected and the Disk Memory in Executor Tab is 0.0B !4.png!
[jira] [Updated] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Fang updated SPARK-25091: -- Attachment: 4.png
[jira] [Updated] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Fang updated SPARK-25091: -- Attachment: 3.png 2.png 1.png 0.png
[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows
[ https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594973#comment-16594973 ] Li Jin commented on SPARK-25213: This is resolved by https://github.com/apache/spark/pull/22104 > DataSourceV2 doesn't seem to produce unsafe rows > - > > Key: SPARK-25213 > URL: https://issues.apache.org/jira/browse/SPARK-25213 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 2.4.0 > > > Reproduce (Need to compile test-classes): > bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes > {code:java} > datasource_v2_df = spark.read \ > .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") > \ > .load() > result = datasource_v2_df.withColumn('x', udf(lambda x: x, > 'int')(datasource_v2_df['i'])) > result.show() > {code} > The above code fails with: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast > to org.apache.spark.sql.catalyst.expressions.UnsafeRow > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > {code} > Seems like Data Source V2 doesn't produce unsafeRows here.
[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594958#comment-16594958 ] yucai commented on SPARK-25206: --- [~smilegator] , 2.1's exception is from parquet. {code:java} java.lang.IllegalArgumentException: Column [ID] was not found in schema! at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58) at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194) at org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121) {code} 2.1 uses Parquet 1.8.1, while 2.3 uses Parquet 1.8.3; this is a behavior change in Parquet. 
See: https://issues.apache.org/jira/browse/PARQUET-389 [https://github.com/apache/parquet-mr/commit/2282c22c5b252859b459cc2474350fbaf2a588e9] > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter > cases between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even spark.sql.caseSensitive > set to false. > SPARK-25132 addressed this issue already. 
> > The biggest difference is, in Spark 2.1, users will get an exception for the same > query: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > So they will know the issue and fix the query. > But in Spark 2.3, users will get the wrong results silently. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
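The issue described above boils down to resolving a metastore column name ("ID") against the physical Parquet field name ("id") before pushing a filter down. A simplified sketch of the case-insensitive lookup that the SPARK-25132/SPARK-24716 fixes effectively perform (not Spark's actual code; the function name is illustrative):

```scala
// Resolve a Hive metastore column against the physical Parquet schema.
// Returns the Parquet field name to use for pushdown, or None if the
// column is missing or ambiguous (in which case pushdown must be skipped).
def resolveColumn(metastoreName: String,
                  parquetFields: Seq[String],
                  caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) {
    parquetFields.find(_ == metastoreName)
  } else {
    val matches = parquetFields.filter(_.equalsIgnoreCase(metastoreName))
    // If the file has duplicate fields differing only by case,
    // a case-insensitive pushdown would be unsafe.
    if (matches.size == 1) Some(matches.head) else None
  }
}
```

With the buggy behavior, the metastore name "ID" was pushed down verbatim; since Parquet is case sensitive, the filter matched nothing and the query silently returned zero rows instead of resolving "ID" to the physical field "id".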
[jira] [Created] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
Rob Vesse created SPARK-25262: - Summary: Make Spark local dir volumes configurable with Spark on Kubernetes Key: SPARK-25262 URL: https://issues.apache.org/jira/browse/SPARK-25262 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.3.1, 2.3.0 Reporter: Rob Vesse As discussed during review of the design document for SPARK-24434, while providing pod templates will provide more in-depth customisation for Spark on Kubernetes, there are some things that cannot be modified because Spark code generates pod specs in very specific ways. The particular issue identified relates to handling of {{spark.local.dirs}}, which is done by {{LocalDirsFeatureStep.scala}}. For each directory specified, or a single default if no explicit specification, it creates a Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation this will be backed by the node storage (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some compute environments this may be extremely undesirable. For example with diskless compute resources the node storage will likely be a non-performant remote mounted disk, often with limited capacity. For such environments it would likely be better to set {{medium: Memory}} on the volume per the K8S documentation to use a {{tmpfs}} volume instead. Another closely related issue is that users might want to use a different volume type to back the local directories and there is no possibility to do that. Pod templates will not really solve either of these issues because Spark is always going to attempt to generate a new volume for each local directory and always going to set these as {{emptyDir}}. 
Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: * Provide a new config setting to enable using {{tmpfs}}-backed {{emptyDir}} volumes * Modify the logic to check whether a volume with that name is already defined and, if so, skip generating a volume definition for it
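The two proposed changes amount to a small amount of decision logic inside {{LocalDirsFeatureStep}}: pick the {{emptyDir}} medium from a new config flag, and skip any local-dir volume the pod spec already defines. A sketch of that logic under stated assumptions — the config key, the volume naming scheme, and the {{EmptyDirVolume}} case class are all hypothetical placeholders, not the actual Spark/Kubernetes-client API:

```scala
// Hypothetical config key; the real name would be decided in review.
val UseTmpfsKey = "spark.kubernetes.local.dirs.tmpfs"

// Simplified stand-in for a Kubernetes emptyDir volume definition.
case class EmptyDirVolume(name: String, medium: String)

def localDirVolumes(localDirs: Seq[String],
                    conf: Map[String, String],
                    existingVolumes: Set[String]): Seq[EmptyDirVolume] = {
  // medium "Memory" asks Kubernetes for a tmpfs-backed emptyDir;
  // the empty string (the default) backs it with node storage.
  val medium =
    if (conf.getOrElse(UseTmpfsKey, "false").toBoolean) "Memory" else ""
  localDirs.zipWithIndex.collect {
    // Skip dirs for which the (template-provided) pod already has a volume.
    case (_, i) if !existingVolumes.contains(s"spark-local-dir-$i") =>
      EmptyDirVolume(s"spark-local-dir-$i", medium)
  }
}
```

The second rule is what makes pod templates composable with this step: a user-supplied volume of the expected name takes precedence, and Spark only fills in volumes that are still missing.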
[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported
[ https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594933#comment-16594933 ] Rob Vesse commented on SPARK-25024: --- [~tgraves] Attempting to answer your questions: * We never used cluster mode so can't comment * Yes and no ** Similar to YARN it does the login locally in the client and then uses HDFS delegation tokens so it doesn't ship the keytabs AFAIK but it does ship the delegation tokens * We never used Spark Shuffle Service either so can't comment * Yes ** Mesos does authentication at the framework level rather than the user level so it depends on your setup. You might have setups where there is a single principal and secret used by all Spark users or you might have setups where you create a principal and secret for each user. You can optionally do ACLs within Mesos for each framework principal including configuring things like which users a framework is allowed to launch jobs as. * Again not used this feature, think these are similar to K8S secrets in that they are created separately and you are just passing identifiers for these to Spark and Mesos takes care of providing these securely to your jobs. Generally we have dropped use of Spark on Mesos in favour of Spark on K8S because the security story for Mesos was poor and we had to do a lot of extra stuff to provide multi-tenancy whereas with K8S a lot more was available out of the box (even if secure HDFS support has yet to land in mainline Spark) > Update mesos documentation to be clear about security supported > --- > > Key: SPARK-25024 > URL: https://issues.apache.org/jira/browse/SPARK-25024 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.2.2 >Reporter: Thomas Graves >Priority: Major > > I was reading through our mesos deployment docs and security docs and its not > clear at all what type of security and how to set it up for mesos. 
I think > we should clarify this and have something about exactly what is supported and > what is not.
[jira] [Assigned] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25102: Assignee: Apache Spark > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Assignee: Apache Spark >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.
[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594914#comment-16594914 ] Apache Spark commented on SPARK-25102: -- User 'npoberezkin' has created a pull request for this issue: https://github.com/apache/spark/pull/22255 > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25102: Assignee: (was: Apache Spark) > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24411) Adding native Java tests for `isInCollection`
[ https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24411: Assignee: (was: Apache Spark) > Adding native Java tests for `isInCollection` > - > > Key: SPARK-24411 > URL: https://issues.apache.org/jira/browse/SPARK-24411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Minor > Labels: starter > > In the past, some of our Java APIs have been difficult to call from Java. We > should add tests in Java directly to make sure it works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24411) Adding native Java tests for `isInCollection`
[ https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594855#comment-16594855 ] Apache Spark commented on SPARK-24411: -- User 'aai95' has created a pull request for this issue: https://github.com/apache/spark/pull/22253 > Adding native Java tests for `isInCollection` > - > > Key: SPARK-24411 > URL: https://issues.apache.org/jira/browse/SPARK-24411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Minor > Labels: starter > > In the past, some of our Java APIs have been difficult to call from Java. We > should add tests in Java directly to make sure it works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24411) Adding native Java tests for `isInCollection`
[ https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24411: Assignee: Apache Spark > Adding native Java tests for `isInCollection` > - > > Key: SPARK-24411 > URL: https://issues.apache.org/jira/browse/SPARK-24411 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Minor > Labels: starter > > In the past, some of our Java APIs have been difficult to call from Java. We > should add tests in Java directly to make sure it works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594809#comment-16594809 ] Nikita Poberezkin edited comment on SPARK-25102 at 8/28/18 10:43 AM: - Hi, [~zi]. I've tried to override the getName method in org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. The problem is that the only way (that I know about) to find the Spark version programmatically on an executor node is SparkSession.builder().getOrCreate().version. But when I ran the tests I received the following error: Caused by: java.lang.IllegalStateException: SparkSession should only be created and accessed on the driver. Do you know any other way to find the Spark version on an executor? was (Author: npoberezkin): Hi, Zoltan. I've tried to override the getName method in org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. The problem is that the only way (that I know about) to find the Spark version programmatically on an executor node is SparkSession.builder().getOrCreate().version. But when I ran the tests I received the following error: Caused by: java.lang.IllegalStateException: SparkSession should only be created and accessed on the driver. Do you know any other way to find the Spark version on an executor? > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594809#comment-16594809 ] Nikita Poberezkin commented on SPARK-25102: --- Hi, Zoltan. I've tried to override the getName method in org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. The problem is that the only way (that I know about) to find the Spark version programmatically on an executor node is SparkSession.builder().getOrCreate().version. But when I ran the tests I received the following error: Caused by: java.lang.IllegalStateException: SparkSession should only be created and accessed on the driver. Do you know any other way to find the Spark version on an executor? > Write Spark version information to Parquet file footers > --- > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Zoltan Ivanfi >Priority: Major > > -PARQUET-352- added support for the "writer.model.name" property in the > Parquet metadata to identify the object model (application) that wrote the > file. > The easiest way to write this property is by overriding getName() of > org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding > getName() to the > org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
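One possibility worth checking for the question above (an assumption on my part, not a confirmed answer): spark-core exposes a version constant that is compiled into the jar and therefore needs no SparkSession, so it should be readable on executors as well.

```scala
// Sketch: read the version from the constant baked into spark-core rather
// than from a SparkSession. Assumption: org.apache.spark.SPARK_VERSION (a
// val in the org.apache.spark package object) is visible from this code path.
import org.apache.spark.SPARK_VERSION

object WriterVersionString {
  // A string of this shape could be returned from getName(), e.g.
  // "spark" plus the version, without touching the driver-only SparkSession.
  def value: String = "spark version " + SPARK_VERSION
}
```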
[jira] [Assigned] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory
[ https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25261: Assignee: (was: Apache Spark) > Update configuration.md, correct the default units of > spark.driver|executor.memory > -- > > Key: SPARK-25261 > URL: https://issues.apache.org/jira/browse/SPARK-25261 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.0 >Reporter: huangtengfei >Priority: Minor > > From > [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464] > and > [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265],we > can see that spark.driver.memory and spark.executor.memory are parsed as > bytes if no units specified. But in the doc, they are described as mb in > default, which may lead to some misunderstanding. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory
[ https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594716#comment-16594716 ] Apache Spark commented on SPARK-25261: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/22252 > Update configuration.md, correct the default units of > spark.driver|executor.memory > -- > > Key: SPARK-25261 > URL: https://issues.apache.org/jira/browse/SPARK-25261 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.0 >Reporter: huangtengfei >Priority: Minor > > From > [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464] > and > [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265],we > can see that spark.driver.memory and spark.executor.memory are parsed as > bytes if no units specified. But in the doc, they are described as mb in > default, which may lead to some misunderstanding. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory
[ https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25261: Assignee: Apache Spark > Update configuration.md, correct the default units of > spark.driver|executor.memory > -- > > Key: SPARK-25261 > URL: https://issues.apache.org/jira/browse/SPARK-25261 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.0 >Reporter: huangtengfei >Assignee: Apache Spark >Priority: Minor > > From > [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464] > and > [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265],we > can see that spark.driver.memory and spark.executor.memory are parsed as > bytes if no units specified. But in the doc, they are described as mb in > default, which may lead to some misunderstanding. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory
huangtengfei created SPARK-25261: Summary: Update configuration.md, correct the default units of spark.driver|executor.memory Key: SPARK-25261 URL: https://issues.apache.org/jira/browse/SPARK-25261 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.3.0 Reporter: huangtengfei From [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464] and [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265], we can see that spark.driver.memory and spark.executor.memory are parsed as bytes if no unit is specified. But in the docs they are described as defaulting to MB, which may lead to some misunderstanding. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
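The byte-string grammar the ticket refers to can be seen through the helper Spark itself ships, org.apache.spark.network.util.JavaUtils. This sketch only illustrates that helper's behaviour; whether every code path for spark.driver|executor.memory goes through it is exactly what the ticket is about, so treat the mapping to those configs as an assumption.

```scala
// Illustration: how Spark's byte-string parser treats suffixed vs. bare
// values. A bare number is interpreted as bytes, not megabytes, which is
// why describing the default unit as MB in configuration.md is misleading.
import org.apache.spark.network.util.JavaUtils

object MemoryUnitsDemo extends App {
  // Suffixed values are unambiguous.
  println(JavaUtils.byteStringAsBytes("1g"))      // 1 GiB expressed in bytes
  println(JavaUtils.byteStringAsBytes("512m"))    // 512 MiB expressed in bytes
  // A bare number is taken as a count of bytes.
  println(JavaUtils.byteStringAsBytes("1048576")) // already bytes, no scaling
}
```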
[jira] [Assigned] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType
[ https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25260: Assignee: Apache Spark > Fix namespace handling in SchemaConverters.toAvroType > - > > Key: SPARK-25260 > URL: https://issues.apache.org/jira/browse/SPARK-25260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Major > > `toAvroType` converts spark data type to avro schema. It always appends the > record name to namespace so its impossible to have an Avro namespace > independent of the record name. > > When invoked with a spark data type like, > > {code:java} > val sparkSchema = StructType(Seq( > StructField("name", StringType, nullable = false), > StructField("address", StructType(Seq( > StructField("city", StringType, nullable = false), > StructField("state", StringType, nullable = false))), > nullable = false))) > > // map it to an avro schema with top level namespace "foo.bar", > val avroSchema = SchemaConverters.toAvroType(sparkSchema, false, "employee", > "foo.bar") > // result is > // avroSchema.getName = employee > // avroSchema.getNamespace = foo.bar.employee > // avroSchema.getFullname = foo.bar.employee.employee > > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType
[ https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594682#comment-16594682 ] Apache Spark commented on SPARK-25260: -- User 'arunmahadevan' has created a pull request for this issue: https://github.com/apache/spark/pull/22251 > Fix namespace handling in SchemaConverters.toAvroType > - > > Key: SPARK-25260 > URL: https://issues.apache.org/jira/browse/SPARK-25260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Major > > `toAvroType` converts spark data type to avro schema. It always appends the > record name to namespace so its impossible to have an Avro namespace > independent of the record name. > > When invoked with a spark data type like, > > {code:java} > val sparkSchema = StructType(Seq( > StructField("name", StringType, nullable = false), > StructField("address", StructType(Seq( > StructField("city", StringType, nullable = false), > StructField("state", StringType, nullable = false))), > nullable = false))) > > // map it to an avro schema with top level namespace "foo.bar", > val avroSchema = SchemaConverters.toAvroType(sparkSchema, false, "employee", > "foo.bar") > // result is > // avroSchema.getName = employee > // avroSchema.getNamespace = foo.bar.employee > // avroSchema.getFullname = foo.bar.employee.employee > > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType
[ https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25260: Assignee: (was: Apache Spark) > Fix namespace handling in SchemaConverters.toAvroType > - > > Key: SPARK-25260 > URL: https://issues.apache.org/jira/browse/SPARK-25260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Major > > `toAvroType` converts spark data type to avro schema. It always appends the > record name to namespace so its impossible to have an Avro namespace > independent of the record name. > > When invoked with a spark data type like, > > {code:java} > val sparkSchema = StructType(Seq( > StructField("name", StringType, nullable = false), > StructField("address", StructType(Seq( > StructField("city", StringType, nullable = false), > StructField("state", StringType, nullable = false))), > nullable = false))) > > // map it to an avro schema with top level namespace "foo.bar", > val avroSchema = SchemaConverters.toAvroType(sparkSchema, false, "employee", > "foo.bar") > // result is > // avroSchema.getName = employee > // avroSchema.getNamespace = foo.bar.employee > // avroSchema.getFullname = foo.bar.employee.employee > > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType
Arun Mahadevan created SPARK-25260: -- Summary: Fix namespace handling in SchemaConverters.toAvroType Key: SPARK-25260 URL: https://issues.apache.org/jira/browse/SPARK-25260 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Arun Mahadevan `toAvroType` converts a Spark data type to an Avro schema. It always appends the record name to the namespace, so it's impossible to have an Avro namespace independent of the record name. When invoked with a Spark data type like:
{code:java}
val sparkSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("address", StructType(Seq(
    StructField("city", StringType, nullable = false),
    StructField("state", StringType, nullable = false))),
    nullable = false)))

// map it to an avro schema with top-level namespace "foo.bar"
val avroSchema = SchemaConverters.toAvroType(sparkSchema, false, "employee", "foo.bar")

// result:
// avroSchema.getName      = employee
// avroSchema.getNamespace = foo.bar.employee
// avroSchema.getFullName  = foo.bar.employee.employee
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
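For contrast, this is how Avro itself treats record name and namespace: the two are independent, which is the behaviour the ticket asks toAvroType to match. A minimal sketch using Avro's own SchemaBuilder (no Spark involved):

```scala
// Avro keeps name and namespace independent: a record named "employee" in
// namespace "foo.bar" has the full name "foo.bar.employee", with no extra
// record-name segment appended to the namespace.
import org.apache.avro.SchemaBuilder

object AvroNamingDemo extends App {
  val schema = SchemaBuilder.record("employee").namespace("foo.bar")
    .fields().requiredString("name").endRecord()

  println(schema.getName)      // employee
  println(schema.getNamespace) // foo.bar
  println(schema.getFullName)  // foo.bar.employee
}
```

This is the naming the reporter expects from toAvroType(sparkSchema, false, "employee", "foo.bar"): namespace "foo.bar" and full name "foo.bar.employee", not "foo.bar.employee.employee".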
[jira] [Commented] (SPARK-20977) NPE in CollectionAccumulator
[ https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594659#comment-16594659 ] howie yu commented on SPARK-20977: -- I use PySpark 2.3.1 and also hit this problem, but at a different line: [2018-08-28 15:29:06,146][ERROR] Utils : Uncaught exception in thread heartbeat-receiver-event-loop-thread java.lang.NullPointerException at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:477) at org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:449) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8$$anonfun$9.apply(TaskSchedulerImpl.scala:449) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8$$anonfun$9.apply(TaskSchedulerImpl.scala:449) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8.apply(TaskSchedulerImpl.scala:449) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8.apply(TaskSchedulerImpl.scala:448) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186) at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:448) at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360) at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) > NPE in CollectionAccumulator > > > Key: SPARK-20977 > URL: https://issues.apache.org/jira/browse/SPARK-20977 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: sharkd tu >Priority: Major > > {code:java} > 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread > heartbeat-receiver-event-loop-thread > java.lang.NullPointerException > at > org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464) > at > org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407) > at >
[jira] [Resolved] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25226. -- Resolution: Won't Fix Please don't reopen this until it is clear that it is going to be fixed. As things stand, I wouldn't fix it, and I am not sure what there is to fix for now. > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594648#comment-16594648 ] Hyukjin Kwon commented on SPARK-25226: -- {quote} 1. What exactly is a catalog string? {quote} A Hive DDL-formatted schema string. {quote} 2. We need JSON, because the client will expect JSON. I am not sure, why you are saying, that JSON should be deprecated. {quote} The JSON format support only exists for backward compatibility, and it can be replaced by the catalog string in most cases. It can be retrieved via {{dataType.catalogString}}. {quote} 3. Yes, I can use 'array' in the current master. But I cannot use array or more complex types (nested arrays, etc.), which is what I really need (see ticket description). {quote} Spark SQL requires the schema to be known ahead of time so that we can do other optimizations. What type do you expect for that? A string can contain all of the data. Do you propose an object-like type to contain all of them? If so, please ask on the mailing list. FWIW, I don't think it's a good idea, since Spark SQL's advantage basically comes from the fact that we know the schema ahead of time. > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. 
It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
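The Spark 2.4 behaviour referenced above can be sketched as follows. This is a hedged illustration (object and column names are made up): from_json can target a homogeneous array type, but there is no ArrayType that describes mixed-type elements, which is the gap the ticket is about.

```scala
// from_json with an array schema works only when every element shares one
// element type; a mixed array such as ["string_value", 0, true, null] has
// no corresponding ArrayType to pass as the schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, IntegerType}

object FromJsonArrayDemo extends App {
  val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
  import spark.implicits._

  val df = Seq("[1, 2, 3]").toDF("js")
  // Parses because every element is an integer.
  df.select(from_json($"js", ArrayType(IntegerType))).show(false)
  spark.stop()
}
```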
[jira] [Reopened] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy Davygora reopened SPARK-25226: > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements
[ https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594643#comment-16594643 ] Yuriy Davygora commented on SPARK-25226: [~hyukjin.kwon] 1. What exactly is a catalog string? 2. We need JSON because the client will expect JSON. I am not sure why you are saying that JSON should be deprecated. 3. Yes, I can use 'array' in the current master. But I cannot use array or more complex types (nested arrays, etc.), which is what I really need (see ticket description). Reopening > Extend functionality of from_json to support arrays of differently-typed > elements > - > > Key: SPARK-25226 > URL: https://issues.apache.org/jira/browse/SPARK-25226 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.3.1 >Reporter: Yuriy Davygora >Priority: Minor > > At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of > STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with > Spark 2.4, but it will only support arrays of elements of same data type. It > will not, for example, support JSON-arrays like > {noformat} > ["string_value", 0, true, null] > {noformat} > which is JSON-valid with schema > {noformat} > {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"} > {noformat} > We would like to kindly ask you to add support for different-typed element > arrays in the 'from_json' function. This will necessitate extending the > functionality of ArrayType or maybe adding a new type (refer to > [[SPARK-25225]]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-25259:
--------------------------------
    Comment: was deleted

(was: I'm working on.)

> Left/Right join support push down during-join predicates
> --------------------------------------------------------
>
>                 Key: SPARK-25259
>                 URL: https://issues.apache.org/jira/browse/SPARK-25259
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
> ("10", "HAAS", "A00"),
> ("10", "THOMPSON", "B01"),
> ("30", "KWAN", "C01"),
> ("000110", "LUCCHESSI", "A00"),
> ("000120", "O'CONNELL", "A))"),
> ("000130", "QUINTANA", "C01")
> as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
>
> create temporary view DEPARTMENT as select * from values
> ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
> ("B01", "PLANNING", "20"),
> ("C01", "INFORMATION CENTER", "30"),
> ("D01", "DEVELOPMENT CENTER", null)
> as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);
>
> create temporary view PROJECT as select * from values
> ("AD3100", "ADMIN SERVICES", "D01"),
> ("IF1000", "QUERY SERVICES", "C01"),
> ("IF2000", "USER EDUCATION", "E01"),
> ("MA2100", "WELD LINE AUDOMATION", "D01"),
> ("PL2100", "WELD LINE PLANNING", "01")
> as PROJECT(PROJNO, PROJNAME, DEPTNO);
> {code}
> The SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO = 'E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO = 'E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO = 'E01';
> {code}
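Why is the rewrite safe? Because `D.DEPTNO = P.DEPTNO` together with `P.DEPTNO = 'E01'` implies `D.DEPTNO = 'E01'`, so pre-filtering DEPARTMENT cannot remove any row that would have matched, and a left outer join preserves every left row regardless. A minimal standalone Python sketch (not Spark code; the `left_outer_join` and `run` helpers are illustrative) checks that both plans produce identical results on the ticket's PROJECT/DEPARTMENT data:

```python
# In-memory copies of the ticket's sample tables: (PROJNO, PROJNAME, DEPTNO)
# and (DEPTNO, DEPTNAME, MGRNO).
project = [
    ("AD3100", "ADMIN SERVICES", "D01"),
    ("IF1000", "QUERY SERVICES", "C01"),
    ("IF2000", "USER EDUCATION", "E01"),
    ("MA2100", "WELD LINE AUDOMATION", "D01"),
    ("PL2100", "WELD LINE PLANNING", "01"),
]
department = [
    ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
    ("B01", "PLANNING", "20"),
    ("C01", "INFORMATION CENTER", "30"),
    ("D01", "DEVELOPMENT CENTER", None),
]

def left_outer_join(left, right, cond):
    """Naive LEFT OUTER JOIN: every left row survives; unmatched rows pair with None."""
    out = []
    for l in left:
        matches = [r for r in right if cond(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

def run(dept_key):
    # ON P.DEPTNO = D.DEPTNO AND P.DEPTNO = dept_key
    cond = lambda p, d: p[2] == d[0] and p[2] == dept_key

    # Plan 1: evaluate the whole predicate during the join.
    plan1 = left_outer_join(project, department, cond)

    # Plan 2: push the derived filter D.DEPTNO = dept_key into the
    # right side before joining (the proposed optimization).
    pushed = [d for d in department if d[0] == dept_key]
    plan2 = left_outer_join(project, pushed, cond)

    assert plan1 == plan2  # identical results, smaller right side
    return plan1

run("E01")  # no matching department: all 5 PROJECT rows padded with None
run("C01")  # IF1000 matches INFORMATION CENTER; the other 4 rows still survive
```

Note that the pushdown only shrinks the build side of the join; the output cardinality is unchanged, which is what makes this rewrite legal for outer joins where filter pushdown is normally restricted.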
[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25259:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25259:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594606#comment-16594606 ]

Apache Spark commented on SPARK-25259:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22250
[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates
[ https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594563#comment-16594563 ]

Yuming Wang commented on SPARK-25259:
-------------------------------------

I'm working on it.
[jira] [Created] (SPARK-25259) Left/Right join support push down during-join predicates
Yuming Wang created SPARK-25259:
-----------------------------------

             Summary: Left/Right join support push down during-join predicates
                 Key: SPARK-25259
                 URL: https://issues.apache.org/jira/browse/SPARK-25259
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Yuming Wang