[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973985#comment-16973985 ] sandeshyapuram commented on SPARK-29682: Thanks! > Failure when resolving conflicting references in Join: > -- > > Key: SPARK-29682 > URL: https://issues.apache.org/jira/browse/SPARK-29682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > When I try to self-join a parentDf with multiple childDfs, say childDf1 ... ... > where the childDfs are derived from a cube or rollup and are filtered on the > group-by columns, > I get an error > {{Failure when resolving conflicting references in Join: }} > followed by a long error message which is quite unreadable. On the other hand, > if I replace the cube or rollup with a plain groupBy, it works without issues. > > *Sample code:* > {code:java} > val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums") > val cubeDF = numsDF > .cube("nums") > .agg( > max(lit(0)).as("agcol"), > grouping_id().as("gid") > ) > > val group0 = cubeDF.filter(col("gid") <=> lit(0)) > val group1 = cubeDF.filter(col("gid") <=> lit(1)) > cubeDF.printSchema > group0.printSchema > group1.printSchema > //Recreating cubeDf > cubeDF.select("nums").distinct > .join(group0, Seq("nums"), "inner") > .join(group1, Seq("nums"), "inner") > .show > {code} > *Sample output:* > {code:java} > numsDF: org.apache.spark.sql.DataFrame = [nums: int] > cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more > field] > group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 1 more field] > group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ...
1 more field] > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > org.apache.spark.sql.AnalysisException: > Failure when resolving conflicting references in Join: > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > Conflicting attributes: nums#220 > ;; > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96) > at > org.apache.spark.sql.catalyst.analysis.Check
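For readers who want to try the reporter's observation that the same self-join pattern works once cube is replaced with a plain groupBy, a minimal sketch follows. It is adapted from the sample code above and is illustrative only, not the analyzer change that eventually fixed the issue.

{code:scala}
// Sketch only, adapted from the sample code above: the reporter notes that the same
// self-join pattern analyzes fine when the aggregate uses a plain groupBy instead of cube.
import spark.implicits._
import org.apache.spark.sql.functions._

val numsDF = Seq(1, 2, 3, 4, 5, 6).toDF("nums")

val groupedDF = numsDF
  .groupBy("nums")
  .agg(max(lit(0)).as("agcol"))          // a single grouping set, so no grouping_id()/gid column

numsDF.select("nums").distinct
  .join(groupedDF, Seq("nums"), "inner") // no "conflicting references" error here
  .show()
{code}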
[jira] [Updated] (SPARK-29888) New interval string parser parse '.111 seconds' to null
[ https://issues.apache.org/jira/browse/SPARK-29888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-29888: - Issue Type: Bug (was: Improvement) > New interval string parser parse '.111 seconds' to null > > > Key: SPARK-29888 > URL: https://issues.apache.org/jira/browse/SPARK-29888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > The current string-to-interval cast logic does not support e.g. cast('.111 > seconds' as interval): parsing fails in the SIGN state and returns null, although > the correct result is 00:00:00.111. > {code:java} > These are the results of the master branch. > -- !query 63 > select interval '.111 seconds' > -- !query 63 schema > struct<0.111 seconds:interval> > -- !query 63 output > 0.111 seconds > -- !query 64 > select cast('.111 seconds' as interval) > -- !query 64 schema > struct > -- !query 64 output > NULL > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29888) New interval string parser parse '.111 seconds' to null
Kent Yao created SPARK-29888: Summary: New interval string parser parse '.111 seconds' to null Key: SPARK-29888 URL: https://issues.apache.org/jira/browse/SPARK-29888 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao The current string-to-interval cast logic does not support e.g. cast('.111 seconds' as interval): parsing fails in the SIGN state and returns null, although the correct result is 00:00:00.111. {code:java} These are the results of the master branch. -- !query 63 select interval '.111 seconds' -- !query 63 schema struct<0.111 seconds:interval> -- !query 63 output 0.111 seconds -- !query 64 select cast('.111 seconds' as interval) -- !query 64 schema struct -- !query 64 output NULL {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
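The mismatch above can also be checked from a Scala shell; a minimal sketch follows, with the expected results taken from the query output quoted in the description.

{code:scala}
// Minimal reproduction sketch of the mismatch described above, run from a Scala shell
// against the master branch of the time; expected results are the ones quoted in the report.
spark.sql("SELECT interval '.111 seconds'").show()          // 0.111 seconds
spark.sql("SELECT CAST('.111 seconds' AS interval)").show() // NULL -- the bug; expected 0.111 seconds
{code}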
[jira] [Resolved] (SPARK-29682) Failure when resolving conflicting references in Join:
[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29682. --- Fix Version/s: 3.0.0 2.4.5 Assignee: Terry Kim Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26441 > Failure when resolving conflicting references in Join: > -- > > Key: SPARK-29682 > URL: https://issues.apache.org/jira/browse/SPARK-29682 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > When I try to self-join a parentDf with multiple childDfs, say childDf1 ... ... > where the childDfs are derived from a cube or rollup and are filtered on the > group-by columns, > I get an error > {{Failure when resolving conflicting references in Join: }} > followed by a long error message which is quite unreadable. On the other hand, > if I replace the cube or rollup with a plain groupBy, it works without issues. > > *Sample code:* > {code:java} > val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums") > val cubeDF = numsDF > .cube("nums") > .agg( > max(lit(0)).as("agcol"), > grouping_id().as("gid") > ) > > val group0 = cubeDF.filter(col("gid") <=> lit(0)) > val group1 = cubeDF.filter(col("gid") <=> lit(1)) > cubeDF.printSchema > group0.printSchema > group1.printSchema > //Recreating cubeDf > cubeDF.select("nums").distinct > .join(group0, Seq("nums"), "inner") > .join(group1, Seq("nums"), "inner") > .show > {code} > *Sample output:* > {code:java} > numsDF: org.apache.spark.sql.DataFrame = [nums: int] > cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more > field] > group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ... 1 more field] > group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, > agcol: int ...
1 more field] > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > root > |-- nums: integer (nullable = true) > |-- agcol: integer (nullable = true) > |-- gid: integer (nullable = false) > org.apache.spark.sql.AnalysisException: > Failure when resolving conflicting references in Join: > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > Conflicting attributes: nums#220 > ;; > 'Join Inner > :- Deduplicate [nums#220] > : +- Project [nums#220] > : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > : +- Project [nums#212, nums#212 AS nums#219] > : +- Project [value#210 AS nums#212] > : +- SerializeFromObject [input[0, int, false] AS value#210] > :+- ExternalRDD [obj#209] > +- Filter (gid#217 <=> 0) >+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS > agcol#216, spark_grouping_id#218 AS gid#217] > +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], > [nums#212, nums#220, spark_grouping_id#218] > +- Project [nums#212, nums#212 AS nums#219] > +- Project [value#210 AS nums#212] >+- SerializeFromObject [input[0, int, false] AS value#210] > +- ExternalRDD [obj#209] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.a
[jira] [Resolved] (SPARK-29873) Support `--import` directive to load queries from another test case in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29873. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26497 [https://github.com/apache/spark/pull/26497] > Support `--import` directive to load queries from another test case in > SQLQueryTestSuite > > > Key: SPARK-29873 > URL: https://issues.apache.org/jira/browse/SPARK-29873 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > > To reduce duplicate test queries in `SQLQueryTestSuite`, this ticket intends > to support `--import` directive to load queries from another test case in > SQLQueryTestSuite. > This fix comes from the @cloud-fan suggestion in > https://github.com/apache/spark/pull/26479#discussion_r345086978 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29873) Support `--import` directive to load queries from another test case in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-29873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29873: --- Assignee: Takeshi Yamamuro > Support `--import` directive to load queries from another test case in > SQLQueryTestSuite > > > Key: SPARK-29873 > URL: https://issues.apache.org/jira/browse/SPARK-29873 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > > To reduce duplicate test queries in `SQLQueryTestSuite`, this ticket intends > to support `--import` directive to load queries from another test case in > SQLQueryTestSuite. > This fix comes from the @cloud-fan suggestion in > https://github.com/apache/spark/pull/26479#discussion_r345086978 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29885) Improve the exception message when reading the daemon port
[ https://issues.apache.org/jira/browse/SPARK-29885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-29885: Fix Version/s: (was: 3.0.0) > Improve the exception message when reading the daemon port > -- > > Key: SPARK-29885 > URL: https://issues.apache.org/jira/browse/SPARK-29885 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: jiaan.geng >Priority: Major > > In production environment, my pyspark application occurs an exception and > it's message as below: > {code:java} > 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.apache.spark.SparkException: No port number in pyspark.daemon's stdout > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){code} > > At first, I think a physical node has many ports are occupied by a large > number of processes. > But I found the total number of ports in use is only 671. 
> > {code:java} > [yarn@r1115 ~]$ netstat -a | wc -l > 671 > {code} > I checked the code of PythonWorkerFactory in line 204 and found: > {code:java} > daemon = pb.start() > val in = new DataInputStream(daemon.getInputStream) > try { > daemonPort = in.readInt() > } catch { > case _: EOFException => > throw new SparkException(s"No port number in $daemonModule's stdout") > } > {code} > I added some code here: > {code:java} > logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}") > logError("Exit value: ${daemon.exitValue()}") > {code} > Then I recurrent the exception and it's message as below: > {code:java} > 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is > alive: false > 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139 > 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.apache.spark.SparkException: No port number in pyspark.daemon's stdout > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) >
[jira] [Commented] (SPARK-29887) PostgreSQL dialect: cast to smallint
[ https://issues.apache.org/jira/browse/SPARK-29887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973942#comment-16973942 ] jobit mathew commented on SPARK-29887: -- I will work on this > PostgreSQL dialect: cast to smallint > > > Key: SPARK-29887 > URL: https://issues.apache.org/jira/browse/SPARK-29887 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > Make SparkSQL's cast to smallint behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29887) PostgreSQL dialect: cast to smallint
jobit mathew created SPARK-29887: Summary: PostgreSQL dialect: cast to smallint Key: SPARK-29887 URL: https://issues.apache.org/jira/browse/SPARK-29887 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: jobit mathew Make SparkSQL's cast to smallint behavior be consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
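For context, the behavioral gap being targeted looks roughly like the following. The PostgreSQL side is standard PostgreSQL behavior; the exact Spark output depends on the build and on the ANSI/dialect settings, so treat the comments as assumptions rather than authoritative results.

{code:scala}
// Assumed illustration of the behavioral gap (not authoritative; exact results depend on
// the build and on the dialect/ANSI settings).
spark.sql("SELECT CAST(100000 AS smallint)").show()
// Spark's default cast silently wraps the value into the smallint range,
// while PostgreSQL rejects the same cast with "ERROR: smallint out of range".
{code}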
[jira] [Created] (SPARK-29886) Add support for Spark style HashDistribution and Partitioning to V2 Datasource
Andrew K Long created SPARK-29886: - Summary: Add support for Spark style HashDistribution and Partitioning to V2 Datasource Key: SPARK-29886 URL: https://issues.apache.org/jira/browse/SPARK-29886 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Andrew K Long Currently a v2 datasource does not have the ability to specify that its Distribution is compatible with Spark's HashClusteredDistribution. We need to add the appropriate class to the interface and add support in DataSourcePartitioning so that EnsureRequirements is aware of the table's partitioning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
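A rough sketch of the kind of hook the ticket is asking for is shown below. All names in it are hypothetical and are not existing Spark classes; it only illustrates the intent of letting a V2 scan declare hash-clustered output.

{code:scala}
// Purely illustrative sketch -- these names are hypothetical, not existing Spark classes.
// The intent: let a V2 scan report that its splits are already hash-clustered on some
// columns, so EnsureRequirements can match it against HashClusteredDistribution and
// avoid inserting an extra shuffle.
case class HashClusteredReport(clusteringColumns: Seq[String], numPartitions: Int)

trait ReportsHashPartitioning {
  // Implemented by a V2 scan to declare its existing hash partitioning.
  def hashPartitioning: HashClusteredReport
}
{code}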
[jira] [Assigned] (SPARK-29837) PostgreSQL dialect: cast to boolean
[ https://issues.apache.org/jira/browse/SPARK-29837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29837: --- Assignee: wuyi > PostgreSQL dialect: cast to boolean > --- > > Key: SPARK-29837 > URL: https://issues.apache.org/jira/browse/SPARK-29837 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Make SparkSQL's *cast to boolean* behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29837) PostgreSQL dialect: cast to boolean
[ https://issues.apache.org/jira/browse/SPARK-29837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29837. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26463 [https://github.com/apache/spark/pull/26463] > PostgreSQL dialect: cast to boolean > --- > > Key: SPARK-29837 > URL: https://issues.apache.org/jira/browse/SPARK-29837 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Make SparkSQL's *cast to boolean* behavior be consistent with PostgreSQL when > spark.sql.dialect is configured as PostgreSQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
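For readers unfamiliar with the difference, a rough illustration follows. The accepted literals and the error text are assumptions based on general PostgreSQL behavior; the authoritative semantics are whatever the pull request above implements.

{code:scala}
// Assumed illustration only; the authoritative accepted literals and error text are
// defined by the pull request referenced above.
spark.sql("SELECT CAST('yes' AS boolean)")  // PostgreSQL accepts yes/no, on/off, t/f, 1/0
spark.sql("SELECT CAST('abc' AS boolean)")
// Spark's default cast historically yields NULL for an unparsable string, while PostgreSQL
// raises "invalid input syntax for type boolean"; the dialect mode follows PostgreSQL.
{code}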
[jira] [Updated] (SPARK-29619) Add retry times when reading the daemon port.
[ https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29619: --- Description: This ticket is related to https://issues.apache.org/jira/browse/SPARK-29885 and adds a retry mechanism. > Add retry times when reading the daemon port. > - > > Key: SPARK-29619 > URL: https://issues.apache.org/jira/browse/SPARK-29619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > This ticket is related to https://issues.apache.org/jira/browse/SPARK-29885 > and adds a retry mechanism. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
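A minimal sketch of the retry idea named in the summary is shown below. It is illustrative only: the helper and the retry knob are hypothetical, and the eventual patch may structure this differently.

{code:scala}
import org.apache.spark.SparkException

// Sketch only: wrap the existing "launch pyspark.daemon and read its port" logic
// (passed in here as startAndReadPort) in a bounded retry loop. maxAttempts is a
// hypothetical knob; the eventual patch may structure this differently.
def startDaemonWithRetry(maxAttempts: Int)(startAndReadPort: () => Int): Int = {
  var lastError: Throwable = null
  for (attempt <- 1 to maxAttempts) {
    try {
      return startAndReadPort()
    } catch {
      case e: SparkException =>
        lastError = e
        // In PythonWorkerFactory this would go through logWarning rather than println.
        println(s"Failed to read pyspark.daemon port (attempt $attempt of $maxAttempts): ${e.getMessage}")
    }
  }
  throw new SparkException(s"Giving up reading the daemon port after $maxAttempts attempts", lastError)
}
{code}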
[jira] [Updated] (SPARK-29619) Add retry times when reading the daemon port.
[ https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29619: --- Summary: Add retry times when reading the daemon port. (was: Improve the exception message when reading the daemon port and add retry times.) > Add retry times when reading the daemon port. > - > > Key: SPARK-29619 > URL: https://issues.apache.org/jira/browse/SPARK-29619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > In production environment, my pyspark application occurs an exception and > it's message as below: > {code:java} > 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.apache.spark.SparkException: No port number in pyspark.daemon's stdout > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){code} > > At first, I think a physical node has many ports are occupied by a large > number of processes. > But I found the total number of ports in use is only 671. 
> > {code:java} > [yarn@r1115 ~]$ netstat -a | wc -l > 671 > {code} > I checked the code of PythonWorkerFactory in line 204 and found: > {code:java} > daemon = pb.start() > val in = new DataInputStream(daemon.getInputStream) > try { > daemonPort = in.readInt() > } catch { > case _: EOFException => > throw new SparkException(s"No port number in $daemonModule's stdout") > } > {code} > I added some code here: > {code:java} > logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}") > logError("Exit value: ${daemon.exitValue()}") > {code} > Then I recurrent the exception and it's message as below: > {code:java} > 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is > alive: false > 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139 > 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.apache.spark.SparkException: No port number in pyspark.daemon's stdout > at > org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206) > at > org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at
[jira] [Updated] (SPARK-29619) Add retry times when reading the daemon port.
[ https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29619: --- Description: (was: In production environment, my pyspark application occurs an exception and it's message as below: {code:java} 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} At first, I think a physical node has many ports are occupied by a large number of processes. But I found the total number of ports in use is only 671. 
{code:java} [yarn@r1115 ~]$ netstat -a | wc -l 671 {code} I checked the code of PythonWorkerFactory in line 204 and found: {code:java} daemon = pb.start() val in = new DataInputStream(daemon.getInputStream) try { daemonPort = in.readInt() } catch { case _: EOFException => throw new SparkException(s"No port number in $daemonModule's stdout") } {code} I added some code here: {code:java} logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}") logError("Exit value: ${daemon.exitValue()}") {code} Then I recurrent the exception and it's message as below: {code:java} 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is alive: false 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at or
[jira] [Updated] (SPARK-29885) Improve the exception message when reading the daemon port
[ https://issues.apache.org/jira/browse/SPARK-29885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-29885: --- Description: In production environment, my pyspark application occurs an exception and it's message as below: {code:java} 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745){code} At first, I think a physical node has many ports are occupied by a large number of processes. But I found the total number of ports in use is only 671. 
{code:java} [yarn@r1115 ~]$ netstat -a | wc -l 671 {code} I checked the code of PythonWorkerFactory in line 204 and found: {code:java} daemon = pb.start() val in = new DataInputStream(daemon.getInputStream) try { daemonPort = in.readInt() } catch { case _: EOFException => throw new SparkException(s"No port number in $daemonModule's stdout") } {code} I added some code here: {code:java} logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}") logError("Exit value: ${daemon.exitValue()}") {code} Then I recurrent the exception and it's message as below: {code:java} 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is alive: false 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.apache.spark.SparkException: No port number in pyspark.daemon's stdout at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) at org.apache.
[jira] [Created] (SPARK-29885) Improve the exception message when reading the daemon port
jiaan.geng created SPARK-29885: -- Summary: Improve the exception message when reading the daemon port Key: SPARK-29885 URL: https://issues.apache.org/jira/browse/SPARK-29885 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: jiaan.geng Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
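Based on the PythonWorkerFactory snippet and the extra logging quoted in the descriptions above, the requested improvement is roughly the following; this is a sketch, not the merged change.

{code:scala}
// Sketch of the requested improvement: when reading the daemon port hits EOF, surface
// whether pyspark.daemon is still alive and its exit code (e.g. 139 for a crash) in the
// exception itself. Variable names follow the snippet quoted in the descriptions above;
// this is not the merged change.
try {
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    val detail =
      if (!daemon.isAlive()) s" The daemon exited with code ${daemon.exitValue()}." else ""
    throw new SparkException(s"No port number in $daemonModule's stdout.$detail")
}
{code}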
[jira] [Resolved] (SPARK-29649) Stop task set if FileAlreadyExistsException was thrown when writing to output file
[ https://issues.apache.org/jira/browse/SPARK-29649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29649. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26312 [https://github.com/apache/spark/pull/26312] > Stop task set if FileAlreadyExistsException was thrown when writing to output > file > -- > > Key: SPARK-29649 > URL: https://issues.apache.org/jira/browse/SPARK-29649 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > We already know that task attempts that do not clean up output files in the staging > directory can cause job failure (SPARK-27194). There were proposals trying to > fix it by changing the output filename, or deleting existing output files. These > proposals are not completely reliable. > The difficulty is that, because the previous failed task attempt wrote the output file, at > the next task attempt the output file is still under the same staging directory, even > if the output file name is different. > If the job is going to fail eventually, there is no point in re-running the task > until the max attempts are reached. For long-running jobs, > re-running the task can waste a lot of time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29644. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26301 [https://github.com/apache/spark/pull/26301] > ShortType is wrongly set as Int in JDBCUtils.scala > -- > > Key: SPARK-29644 > URL: https://issues.apache.org/jira/browse/SPARK-29644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shiv Prashant Sood >Assignee: Shiv Prashant Sood >Priority: Minor > Fix For: 3.0.0 > > > @maropu pointed out this issue during [PR > 25344|https://github.com/apache/spark/pull/25344] review discussion. > In > [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala] > line number 547 > case ShortType => > (stmt: PreparedStatement, row: Row, pos: Int) => > stmt.setInt(pos + 1, row.getShort(pos)) > I dont see any reproducible issue, but this is clearly a problem that must be > fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29644) Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29644: -- Summary: Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils (was: ShortType is wrongly set as Int in JDBCUtils.scala) > Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils > - > > Key: SPARK-29644 > URL: https://issues.apache.org/jira/browse/SPARK-29644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shiv Prashant Sood >Assignee: Shiv Prashant Sood >Priority: Minor > Fix For: 3.0.0 > > > @maropu pointed out this issue during [PR > 25344|https://github.com/apache/spark/pull/25344] review discussion. > In > [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala] > line number 547 > case ShortType => > (stmt: PreparedStatement, row: Row, pos: Int) => > stmt.setInt(pos + 1, row.getShort(pos)) > I dont see any reproducible issue, but this is clearly a problem that must be > fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
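Given the corrected title above, the intended mapping presumably looks something like the sketch below, which reuses the pattern from the snippet quoted in the description; it is not necessarily the exact merged code.

{code:scala}
// Sketch of the corrected setters: bind SMALLINT/TINYINT through the dedicated JDBC
// methods instead of setInt, mirroring the snippet quoted in the description.
// Not necessarily the exact merged code.
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setShort(pos + 1, row.getShort(pos))

case ByteType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setByte(pos + 1, row.getByte(pos))
{code}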
[jira] [Assigned] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala
[ https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29644: - Assignee: Shiv Prashant Sood > ShortType is wrongly set as Int in JDBCUtils.scala > -- > > Key: SPARK-29644 > URL: https://issues.apache.org/jira/browse/SPARK-29644 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Shiv Prashant Sood >Assignee: Shiv Prashant Sood >Priority: Minor > > @maropu pointed out this issue during [PR > 25344|https://github.com/apache/spark/pull/25344] review discussion. > In > [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala] > line number 547 > case ShortType => > (stmt: PreparedStatement, row: Row, pos: Int) => > stmt.setInt(pos + 1, row.getShort(pos)) > I dont see any reproducible issue, but this is clearly a problem that must be > fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29287) Executors should not receive any offers before they are actually constructed
[ https://issues.apache.org/jira/browse/SPARK-29287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-29287. -- Fix Version/s: 3.0.0 Assignee: Kent Yao Resolution: Done Fixed by https://github.com/apache/spark/pull/25964 > Executors should not receive any offers before they are actually constructed > - > > Key: SPARK-29287 > URL: https://issues.apache.org/jira/browse/SPARK-29287 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > The executors send RegisterExecutor messages to the driver in onStart. > The driver puts the executor data in “the ready to serve map” if it can, > then sends RegisteredExecutor back to the executor. The driver can now make > an offer to this executor. > But the executor is not fully constructed yet. When it receives > RegisteredExecutor, it starts to construct itself: initializing the block manager, > possibly registering with the local shuffle server (with retries), then starting > to heartbeat to the driver ... > A task allocated here may fail if the executor fails to start or cannot start > heartbeating to the driver in time. > Even worse, when dynamic allocation and blacklisting are enabled > and the runtime executor count is down to the minimum executor setting, if those > executors receive tasks before they are fully constructed and any error happens, > the application may be blocked or torn down. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29778) saveAsTable append mode is not passing writer options
[ https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29778. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26474 [https://github.com/apache/spark/pull/26474] > saveAsTable append mode is not passing writer options > - > > Key: SPARK-29778 > URL: https://issues.apache.org/jira/browse/SPARK-29778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Wesley Hoffman >Priority: Critical > Fix For: 3.0.0 > > > There was an oversight where AppendData is not getting the WriterOptions in > saveAsTable. > [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
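A short illustration of the user-visible effect follows. The option and table name are made up for the example; the point is only that options set on the writer do not reach the AppendData plan that saveAsTable builds in append mode before the fix.

{code:scala}
// Illustration of the user-visible effect; the option and table name below are made up.
// Before the fix, options set on the writer are not forwarded to the AppendData plan
// that saveAsTable builds in append mode.
df.write
  .option("compression", "zstd")   // silently dropped in append mode before the fix
  .mode("append")
  .saveAsTable("db.events")
{code}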
[jira] [Assigned] (SPARK-29778) saveAsTable append mode is not passing writer options
[ https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29778: - Assignee: Wesley Hoffman > saveAsTable append mode is not passing writer options > - > > Key: SPARK-29778 > URL: https://issues.apache.org/jira/browse/SPARK-29778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Wesley Hoffman >Priority: Critical > > There was an oversight where AppendData is not getting the WriterOptions in > saveAsTable. > [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-24203. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26331 [https://github.com/apache/spark/pull/26331] > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Assignee: Nishchal Venkataramana >Priority: Major > Labels: bulk-closed > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973715#comment-16973715 ] koert kuipers commented on SPARK-28945: --- I understand there is a great deal of complexity in the committer and this might require more work to get it right, but it's still unclear to me whether the committer is doing anything at all in the case of dynamic partition overwrite. What do I lose by disabling all committer activity (committer.setupJob, committer.commitJob, etc.) when dynamicPartitionOverwrite is true? And if I lose nothing, is that a good thing, or does that mean I should be worried about the current state? > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within the same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesn't seem that difficult. I suspect the only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
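For concreteness, the scenario under discussion looks roughly like the sketch below; the path and DataFrame name are made up, while the configuration key is the standard dynamic partition overwrite setting.

{code:scala}
// Sketch of the scenario under discussion: two independent jobs, each overwriting a
// different partition under the same base path, with dynamic partition overwrite enabled.
// dfA and the path are made up for the example.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Job A rewrites date=2019-11-12; a concurrent Job B does the same for date=2019-11-13.
// Both go through the same commit protocol under /data/events, which is what the ticket
// wants to make safe to run concurrently.
dfA.write
  .mode("overwrite")
  .partitionBy("date")
  .parquet("/data/events")
{code}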
[jira] [Comment Edited] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973695#comment-16973695 ] Maciej Szymkiewicz edited comment on SPARK-29748 at 11/13/19 9:13 PM: -- While this is a step in the right direction, I think it justifies a broader discussion about the purpose, API, and behavior guarantees of {{Row}}, especially if we're going to introduce diverging implementations with {{Row}} and {{LegacyRow}}. Over the years the Spark code has accumulated a lot of conflicting behaviors and special cases related to {{Row}}: * Sometimes {{Row}} fields are reordered (the subject of this JIRA), sometimes they are not. * Sometimes they are treated as ordered products ({{tuples}}), sometimes as unordered dictionaries. * We provide efficient access only by position, but the primary access method is by name. * etc. Some of the unusual properties are well documented (but still confusing), others are not. For example, objects that are indistinguishable using the public API {code:python} from pyspark.sql.types import StructType, StructField, IntegerType, StringType a = Row(x1=1, x2="foo") b = Row("x1", "x2")(1, "foo") a == b # True type(a) == type(b) #True list(a.__fields__) == list(b.__fields__) # Not really public, but just to make a point {code} cannot be substituted in practice. {code:python} schema = StructType([ StructField("x2", StringType()), StructField("x1", IntegerType())]) spark.createDataFrame([a], schema) # DataFrame[x2: string, x1: int] spark.createDataFrame([b], schema) # TypeError Traceback (most recent call last) # ... # TypeError: field x1: IntegerType can not accept object 'foo' in type {code} To make things even worse, the primary access method - by name - is _O(M)_, where M is the width of the schema (I don't have hard data here, but by-name access is common both in the internal API and in the code I've seen in the wild). So if we're going to modify the core behavior (sorting), it makes sense to rethink the whole design.
Since the schema is carried around with each object and pass over the wire we might as well convert {{Row}} into a proxy of {{OrderedDict}} getting something around these lines: {code:python} import sys from collections import OrderedDict class Row: slots = ["_store"] def __init__(self, *args, **kwargs): if args and kwargs: raise ValueError("Can not use both args " "and kwargs to create Row") if args: self._store = OrderedDict.fromkeys(args) else: self._store = OrderedDict(kwargs) def __getattr__(self, x): return self._store[x] def __getitem__(self, x): if isinstance(x, int): return list(self._store.values())[x] else: return self._store[x] def __iter__(self): return iter(self._store.values()) def __repr__(self): return "Row({})".format(", ".join( "{}={}".format(k, v) for k, v in self._store.items() )) def __len__(self): return len(self._store) def __call__(self, *args): if len(args) > len(self): raise ValueError("Can not create Row with fields %s, expected %d values " "but got %s" % (self, len(self), args)) self._store.update(zip(self._store.keys(), args)) return self def __eq__(self, other): return isinstance(other, Row) and self._store == other._store @property def _fields(self): return self._store.keys() @staticmethod def _conv(obj): if isinstance(obj, Row): return obj.asDict(True) elif isinstance(obj, list): return [conv(o) for o in obj] elif isinstance(obj, dict): return dict((k, conv(v)) for k, v in obj.items()) else: return obj def asDict(self, recursive=False): if recursive: result = OrderedDict.fromkeys(self._fields) for key in self._fields: result[key] = Row._conv(self._store[key]) return result else: return self._store @classmethod def from_dict(cls, d): if sys.version_info >= (3, 6): if not(isinstance(d, dict)): raise ValueError( "from_dict requires dict but got {}".format( type(d))) else: if not(isinstance(d, OrderedDict)): raise ValueError( "from_dict requires collections.OrderedDict {}".format( type(d))) return cls(**d) {code} If we're committed to {{Row}} being a {{tuple}} (with _O(1)_
[jira] [Comment Edited] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973695#comment-16973695 ] Maciej Szymkiewicz edited comment on SPARK-29748 at 11/13/19 9:11 PM: -- While this is a step in the right direction, I think it justifies a broader discussion about the purpose, API, and behavior guarantees of {{Row}}, especially if we're going to introduce diverging implementations with {{Row}} and {{LegacyRow}}. Over the years the Spark code has accumulated a lot of conflicting behaviors and special cases related to {{Row}}: * Sometimes {{Row}} fields are reordered (the subject of this JIRA), sometimes they are not. * Sometimes they are treated as ordered products ({{tuples}}), sometimes as unordered dictionaries. * We provide efficient access only by position, but the primary access method is by name. * etc. Some of the unusual properties are well documented (but still confusing), others are not. For example, objects that are indistinguishable using the public API {code:python} from pyspark.sql.types import StructType, StructField, IntegerType, StringType a = Row(x1=1, x2="foo") b = Row("x1", "x2")(1, "foo") a == b # True type(a) == type(b) #True list(a.__fields__) == list(b.__fields__) # Not really public, but just to make a point {code} cannot be substituted in practice. {code:python} schema = StructType([ StructField("x2", StringType()), StructField("x1", IntegerType())]) spark.createDataFrame([a], schema) # DataFrame[x2: string, x1: int] spark.createDataFrame([b], schema) # TypeError Traceback (most recent call last) # ... # TypeError: field x1: IntegerType can not accept object 'foo' in type {code} To make things even worse, the primary access method - by name - is _O(M)_, where M is the width of the schema (I don't have hard data here, but by-name access is common both in the internal API and in the code I've seen in the wild). So if we're going to modify the core behavior (sorting), it makes sense to rethink the whole design.
Since the schema is carried around with each object and passed over the wire, we might as well convert {{Row}} into a proxy of {{OrderedDict}}, getting something along these lines:
{code:python}
import sys
from collections import OrderedDict


class Row:
    __slots__ = ["_store"]

    def __init__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if args:
            # Names only; values are filled in later via __call__.
            self._store = OrderedDict.fromkeys(args)
        else:
            self._store = OrderedDict(kwargs)

    def __getattr__(self, x):
        return self._store[x]

    def __getitem__(self, x):
        if isinstance(x, int):
            return list(self._store.values())[x]
        else:
            return self._store[x]

    def __iter__(self):
        return iter(self._store.values())

    def __repr__(self):
        return "Row({})".format(", ".join(
            "{}={}".format(k, v) for k, v in self._store.items()
        ))

    def __len__(self):
        return len(self._store)

    def __call__(self, *args):
        if len(args) > len(self):
            raise ValueError("Can not create Row with fields %s, expected %d values "
                             "but got %s" % (self, len(self), args))
        self._store.update(zip(self._store.keys(), args))
        return self

    def __eq__(self, other):
        return isinstance(other, Row) and self._store == other._store

    @property
    def _fields(self):
        return self._store.keys()

    @staticmethod
    def _conv(obj):
        if isinstance(obj, Row):
            return obj.asDict(True)
        elif isinstance(obj, list):
            return [Row._conv(o) for o in obj]
        elif isinstance(obj, dict):
            return dict((k, Row._conv(v)) for k, v in obj.items())
        else:
            return obj

    def asDict(self, recursive=False):
        if recursive:
            result = OrderedDict.fromkeys(self._fields)
            for key in self._fields:
                result[key] = Row._conv(self._store[key])
            return result
        else:
            return self._store

    @classmethod
    def from_dict(cls, d):
        if sys.version_info >= (3, 6):
            if not isinstance(d, dict):
                raise ValueError(
                    "from_dict requires dict but got {}".format(type(d)))
        else:
            if not isinstance(d, OrderedDict):
                raise ValueError(
                    "from_dict requires collections.OrderedDict but got {}".format(type(d)))
        return cls(**d)
{code}
If we're committed to {{Row}} being a {{tuple}} (with _O(1)_ by index access) we could actually try to hack {{
[jira] [Updated] (SPARK-29884) spark-submit to kuberentes can not parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy updated SPARK-29884: --- Summary: spark-submit to kuberentes can not parse valid ca certificate (was: spark-Submit to kuberentes can not parse valid ca certificate) > spark-submit to kuberentes can not parse valid ca certificate > - > > Key: SPARK-29884 > URL: https://issues.apache.org/jira/browse/SPARK-29884 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.4 > Environment: A kuberentes cluster that has been in use for over 2 > years and handles large amounts of production payloads. >Reporter: Jeremy >Priority: Major > > spark submit can not be used to to schedule to kuberentes with oauth token > and cacert > {code:java} > spark-submit \ > --deploy-mode cluster \ > --class org.apache.spark.examples.SparkPi \ > --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt > \ > --conf spark.kubernetes.namespace=here-olp-3dds-sit \ > --conf spark.executor.instances=1 \ > --conf spark.app.name=spark-pi \ > --conf > spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 > \ > --conf > spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 > \ > local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar > {code} > returns > {code:java} > log4j:WARN No appenders could be found for logger > (io.fabric8.kubernetes.client.Config). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. 
> at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) > at > org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) > at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) > at > org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.security.cert.CertificateException: Could not parse > certificate: java.io.IOException: Empty input > at > sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) > at > java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) > at > io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) > at > io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) > at > io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78) > ... 13 more > Caused by: java.io.IOException: Empty input > at > sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106) > ... 19 more > {code} > The cacert and token are both valid and work even with curl > {code:java} > cur
[jira] [Updated] (SPARK-29884) spark-Submit to kuberentes can not parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy updated SPARK-29884: --- Description: spark submit can not be used to to schedule to kuberentes with oauth token and cacert {code:java} spark-submit \ --deploy-mode cluster \ --class org.apache.spark.examples.SparkPi \ --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt \ --conf spark.kubernetes.namespace=here-olp-3dds-sit \ --conf spark.executor.instances=1 \ --conf spark.app.name=spark-pi \ --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \ --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \ local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar {code} returns {code:java} log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) at org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.security.cert.CertificateException: Could not parse certificate: java.io.IOException: Empty input at sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) at java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78) ... 13 more Caused by: java.io.IOException: Empty input at sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106) ... 19 more {code} The cacert and token are both valid and work even with curl {code:java} curl --cacert /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt -H "Authorization: bearer $TOKEN" -v https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com/api/v1/namespaces/here-olp-3dds-sit/pods -o out % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.117.233.37:443... * TCP_NODELAY set * Connected to api.borg-dev-1-aws-eu-west-1.k8s.in.here.com (10.117.233.37) port 443 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshak
[jira] [Updated] (SPARK-29884) spark-Submit to kuberentes can not parse valid ca certificate
[ https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy updated SPARK-29884: --- Description: spark submit can not be used to to schedule to kuberentes with oauth token and cacert {code:java} spark-submit \ --deploy-mode cluster \ --class org.apache.spark.examples.SparkPi \ --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt \ --conf spark.kubernetes.namespace=here-olp-3dds-sit \ --conf spark.executor.instances=1 \ --conf spark.app.name=spark-pi \ --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \ --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \ local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar {code} returns {code:java} log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) at org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.security.cert.CertificateException: Could not parse certificate: java.io.IOException: Empty input at sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) at java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78) ... 13 more Caused by: java.io.IOException: Empty input at sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106) ... 19 more {code} The cacert and token are both valid and work even with curl {code:java} curl --cacert /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt -H "Authorization: bearer $TOKEN" -v https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com/api/v1/namespaces/here-olp-3dds-sit/pods -o out % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.117.233.37:443... * TCP_NODELAY set * Connected to api.borg-dev-1-aws-eu-west-1.k8s.in.here.com (10.117.233.37) port 443 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshak
[jira] [Created] (SPARK-29884) spark-Submit to kuberentes can not parse valid ca certificate
Jeremy created SPARK-29884: -- Summary: spark-Submit to kuberentes can not parse valid ca certificate Key: SPARK-29884 URL: https://issues.apache.org/jira/browse/SPARK-29884 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.4 Environment: A kuberentes cluster that has been in use for over 2 years and handles large amounts of production payloads. Reporter: Jeremy spark submit can not be used to to schedule to kuberentes with oauth token and cacert {code:java} spark-submit \ --deploy-mode cluster \ --class org.apache.spark.examples.SparkPi \ --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \ --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt \ --conf spark.kubernetes.namespace=here-olp-3dds-sit \ --conf spark.executor.instances=1 \ --conf spark.app.name=spark-pi \ --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \ --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0 \ local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar {code} returns {code:java} log4j:WARN No appenders could be found for logger (io.fabric8.kubernetes.client.Config). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53) at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183) at org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.security.cert.CertificateException: Could not parse certificate: java.io.IOException: Empty input at sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110) at java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339) at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104) at 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128) at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122) at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78) ... 13 more Caused by: java.io.IOException: Empty input at sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106) ... 19 more {code} The cacert and token are both valid and work even with curl {code:java} curl --cacert /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt -H "Authorization: bearer $TOKEN" -v https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com/api/v1/namespaces/here-olp-3dds-sit/pods -o out % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.117.233.37:443... *
[jira] [Updated] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29152: -- Summary: Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled (was: Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled[SPARK-24918]) > Spark Executor Plugin API shutdown is not proper when dynamic allocation > enabled > > > Key: SPARK-29152 > URL: https://issues.apache.org/jira/browse/SPARK-29152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Major > > *Issue Description* > Spark Executor Plugin API *shutdown handling is not proper*, when dynamic > allocation enabled .Plugin's shutdown method is not processed when dynamic > allocation is enabled and *executors become dead* after inactive time. > *Test Precondition* > 1. Create a plugin and make a jar named SparkExecutorplugin.jar > import org.apache.spark.ExecutorPlugin; > public class ExecutoTest1 implements ExecutorPlugin{ > public void init(){ > System.out.println("Executor Plugin Initialised."); > } > public void shutdown(){ > System.out.println("Executor plugin closed successfully."); > } > } > 2. Create the jars with the same and put it in folder /spark/examples/jars > *Test Steps* > 1. launch bin/spark-sql with dynamic allocation enabled > ./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1 --jars > /opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.initialExecutors=2 --conf > spark.dynamicAllocation.minExecutors=1 > 2 create a table , insert the data and select * from tablename > 3.Check the spark UI Jobs tab/SQL tab > 4. Check all Executors(executor tab will give all executors details) > application log file for Executor plugin Initialization and Shutdown messages > or operations. > Example > /yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/ > stdout > 5. Wait for the executor to be dead after the inactive time and check the > same container log > 6. Kill the spark sql and check the container log for executor plugin > shutdown. > *Expect Output* > 1. Job should be success. Create table ,insert and select query should be > success. > 2.While running query All Executors log should contain the executor plugin > Init messages or operations. > "Executor Plugin Initialised. > 3.Once the executors are dead ,shutdown message should be there in log file. > “ Executor plugin closed successfully. > 4.Once the sql application closed ,shutdown message should be there in log. > “ Executor plugin closed successfully". > *Actual Output* > Shutdown message is not called when executor is dead after inactive time. > *Observation* > Without dynamic allocation Executor plugin is working fine. But after > enabling dynamic allocation,Executor shutdown is not processed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
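For reference, a minimal Scala sketch of the plugin described in the test precondition above; the class name is illustrative and the behavior simply mirrors the Java snippet quoted in the issue.
{code:scala}
// Minimal sketch mirroring the Java plugin in the issue's test precondition.
// org.apache.spark.ExecutorPlugin is the Spark 2.4 plugin interface with
// init()/shutdown() hooks; the class name here is made up.
import org.apache.spark.ExecutorPlugin

class ExecutorLifecycleLogger extends ExecutorPlugin {
  override def init(): Unit = println("Executor Plugin Initialised.")
  override def shutdown(): Unit = println("Executor plugin closed successfully.")
}
{code}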
[jira] [Updated] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled
[ https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29152: -- Affects Version/s: 2.4.0 2.4.1 2.4.2 2.4.3 > Spark Executor Plugin API shutdown is not proper when dynamic allocation > enabled > > > Key: SPARK-29152 > URL: https://issues.apache.org/jira/browse/SPARK-29152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: jobit mathew >Priority: Major > > *Issue Description* > Spark Executor Plugin API *shutdown handling is not proper*, when dynamic > allocation enabled .Plugin's shutdown method is not processed when dynamic > allocation is enabled and *executors become dead* after inactive time. > *Test Precondition* > 1. Create a plugin and make a jar named SparkExecutorplugin.jar > import org.apache.spark.ExecutorPlugin; > public class ExecutoTest1 implements ExecutorPlugin{ > public void init(){ > System.out.println("Executor Plugin Initialised."); > } > public void shutdown(){ > System.out.println("Executor plugin closed successfully."); > } > } > 2. Create the jars with the same and put it in folder /spark/examples/jars > *Test Steps* > 1. launch bin/spark-sql with dynamic allocation enabled > ./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1 --jars > /opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf > spark.dynamicAllocation.enabled=true --conf > spark.dynamicAllocation.initialExecutors=2 --conf > spark.dynamicAllocation.minExecutors=1 > 2 create a table , insert the data and select * from tablename > 3.Check the spark UI Jobs tab/SQL tab > 4. Check all Executors(executor tab will give all executors details) > application log file for Executor plugin Initialization and Shutdown messages > or operations. > Example > /yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/ > stdout > 5. Wait for the executor to be dead after the inactive time and check the > same container log > 6. Kill the spark sql and check the container log for executor plugin > shutdown. > *Expect Output* > 1. Job should be success. Create table ,insert and select query should be > success. > 2.While running query All Executors log should contain the executor plugin > Init messages or operations. > "Executor Plugin Initialised. > 3.Once the executors are dead ,shutdown message should be there in log file. > “ Executor plugin closed successfully. > 4.Once the sql application closed ,shutdown message should be there in log. > “ Executor plugin closed successfully". > *Actual Output* > Shutdown message is not called when executor is dead after inactive time. > *Observation* > Without dynamic allocation Executor plugin is working fine. But after > enabling dynamic allocation,Executor shutdown is not processed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29878) Improper cache strategies in GraphX
[ https://issues.apache.org/jira/browse/SPARK-29878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973637#comment-16973637 ] Aman Omer commented on SPARK-29878: --- [https://github.com/apache/spark/pull/7469] this PR had introduced the cache() on newEdges and vertices > Improper cache strategies in GraphX > --- > > Key: SPARK-29878 > URL: https://issues.apache.org/jira/browse/SPARK-29878 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Major > > I have run examples.graphx.SSPExample and looked through the RDD dependency > graphs as well as persist operations. There are some improper cache > strategies in GraphX. The same situations also exist when I run > ConnectedComponentsExample. > 1. vertices.cache() and newEdges.cache() are unnecessary > In SSPExample, a graph is initialized by GraphImpl.mapVertices(). In this > method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), > and RDD vertices/newEdges are cached in apply(). But these two RDDs are not > directly used anymore (their children RDDs has been cached) in SSPExample, so > the persists can be unnecessary here. > However, the other examples may need these two persists, so I think they > cannot be simply removed. It might be hard to fix. > {code:scala} > def apply[VD: ClassTag, ED: ClassTag]( > vertices: VertexRDD[VD], > edges: EdgeRDD[ED]): GraphImpl[VD, ED] = { > vertices.cache() // It is unnecessary for SSPExample and > ConnectedComponentsExample > // Convert the vertex partitions in edges to the correct type > val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]] > .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD]) > .cache() // It is unnecessary for SSPExample and > ConnectedComponentsExample > GraphImpl.fromExistingRDDs(vertices, newEdges) > } > {code} > 2. Missing persist on newEdges > SSSPExample will invoke pregel to do execution. Pregel will ultilize > ReplicatedVertexView.upgrade(). I find that RDD newEdges will be directly use > by multiple actions in Pregel. So newEdges should be persisted. > Same as the above issue, this issue is also found in > ConnectedComponentsExample. It is also hard to fix, because the persist added > may be unnecessary for other examples. > {code:scala} > // Pregel.scala > // compute the messages > var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // > newEdges is created here > val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)]( > checkpointInterval, graph.vertices.sparkContext) > messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]]) > var activeMessages = messages.count() // The first time use newEdges > ... > while (activeMessages > 0 && i < maxIterations) { > // Receive the messages and update the vertices. > prevG = g > g = g.joinVertices(messages)(vprog) // Generate g will depends on > newEdges > ... > activeMessages = messages.count() // The second action to use newEdges. > newEdges should be unpersisted after this instruction. > {code} > {code:scala} > // ReplicatedVertexView.scala > def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: > Boolean): Unit = { > ... 
>val newEdges = > edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) { > (ePartIter, shippedVertsIter) => ePartIter.map { > case (pid, edgePartition) => > (pid, > edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator))) > } > }) > edges = newEdges // newEdges should be persisted > hasSrcId = includeSrc > hasDstId = includeDst > } > } > {code} > As I don't have much knowledge about Graphx, so I don't know how to fix these > issues well. > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
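As a reference point for the persist()/unpersist() discussion above, a minimal sketch (plain RDD API, not the GraphX internals themselves) of the pattern the report argues for: persist an RDD that feeds more than one action, and unpersist it only after the last use. The data and names are made up.
{code:scala}
// Sketch only: persist an RDD consumed by several actions, release it afterwards.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L)))
  .persist(StorageLevel.MEMORY_ONLY)            // reused by two actions below

val degreeCount = edges.map { case (src, _) => (src, 1) }.reduceByKey(_ + _).count() // action 1
val distinctEdges = edges.distinct().count()                                         // action 2

edges.unpersist()                               // release only after the last use
println(s"degrees=$degreeCount distinct=$distinctEdges")
{code}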
[jira] [Commented] (SPARK-29883) Improve error messages when function name is an alias
[ https://issues.apache.org/jira/browse/SPARK-29883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973633#comment-16973633 ] Aman Omer commented on SPARK-29883: --- Working on this. > Improve error messages when function name is an alias > - > > Key: SPARK-29883 > URL: https://issues.apache.org/jira/browse/SPARK-29883 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > > A general issue in error message when the function name is just an alias name > of the actual built-in function. For example, every is an alias of bool_and > in Spark 3.0 > {code:java} > cannot resolve 'every('true')' due to data type mismatch: Input to function > 'every' should have been boolean, but it's [string].; line 1 pos 7 > {code} > {code:java} > cannot resolve 'bool_and('true')' due to data type mismatch: Input to > function 'bool_and' should have been boolean, but it's [string].; line 1 pos > 7{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29883) Improve error messages when function name is an alias
Xiao Li created SPARK-29883: --- Summary: Improve error messages when function name is an alias Key: SPARK-29883 URL: https://issues.apache.org/jira/browse/SPARK-29883 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li A general issue in error message when the function name is just an alias name of the actual built-in function. For example, every is an alias of bool_and in Spark 3.0 {code:java} cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7 {code} {code:java} cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29882) SPARK on Kubernetes is Broken for SPARK with no Hadoop release
[ https://issues.apache.org/jira/browse/SPARK-29882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29882. Fix Version/s: (was: 2.4.4) Target Version/s: (was: 2.4.4) Resolution: Duplicate > SPARK on Kubernetes is Broken for SPARK with no Hadoop release > -- > > Key: SPARK-29882 > URL: https://issues.apache.org/jira/browse/SPARK-29882 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.4 > Environment: h3. How was this patch tested? > Kubernetes 1.14, Spark 2.4.4, Hadoop 3.2.1. Adding $SPARK_DIST_CLASSPATH to > {{-cp }} param of entrypoint.sh enables launching the executors correctly. >Reporter: Shahin Shakeri >Priority: Major > > h3. What changes were proposed in this pull request? > Include {{$SPARK_DIST_CLASSPATH}} in class path when launching > {{CoarseGrainedExecutorBackend}} on Kubernetes executors using the provided > {{entrypoint.sh}} > h3. Why are the changes needed? > For user provided Hadoop (3.2.1 in this example) {{$SPARK_DIST_CLASSPATH}} > contains the required jars. > h3. Does this PR introduce any user-facing change? > no > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29875) Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-29875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29875: - Assignee: Hyukjin Kwon > Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x > -- > > Key: SPARK-29875 > URL: https://issues.apache.org/jira/browse/SPARK-29875 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > In Spark 2.4.x, if we use PyArrow higher then 0.12.0, it shows a bunch of > warnings as below: > {code} > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP > def add_one(x): > return x + 1 > spark.range(100).select(add_one("id")).collect() > {code} > {code} > UserWarning: pyarrow.open_stream is deprecated, please use > pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29875) Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-29875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29875. --- Fix Version/s: 2.4.5 Resolution: Fixed Issue resolved by pull request 26501 [https://github.com/apache/spark/pull/26501] > Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x > -- > > Key: SPARK-29875 > URL: https://issues.apache.org/jira/browse/SPARK-29875 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.4 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.5 > > > In Spark 2.4.x, if we use PyArrow higher then 0.12.0, it shows a bunch of > warnings as below: > {code} > from pyspark.sql.functions import pandas_udf, PandasUDFType > @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP > def add_one(x): > return x + 1 > spark.range(100).select(add_one("id")).collect() > {code} > {code} > UserWarning: pyarrow.open_stream is deprecated, please use > pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: > pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream > warnings.warn("pyarrow.open_stream is deprecated, please use " > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, 
e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-23151. --- Resolution: Duplicate > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973546#comment-16973546 ] Dongjoon Hyun commented on SPARK-23151: --- Yes. We did as we shipped in 3.0-preview release. > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29149) Update YARN cluster manager For Stage Level Scheduling
[ https://issues.apache.org/jira/browse/SPARK-29149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-29149: - Assignee: Thomas Graves > Update YARN cluster manager For Stage Level Scheduling > -- > > Key: SPARK-29149 > URL: https://issues.apache.org/jira/browse/SPARK-29149 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For Stage Level Scheduling, we need to update the YARN allocator to handle > requesting executors for multiple ResourceProfiles. > * The container requests have to be updated to be based on a number of > containers per ResourceProfile. This is a larger change then you might > expect because on YARN you can’t ask for different container sizes within the > same YARN container priority. So we will have to ask for containers at > different priorities to be able to get different container sizes. Other YARN > applications like Tez handle this now so it shouldn’t be a big deal just a > matter of mapping stages to different priorities. > * The allocation response from YARN has to match the containers to a > resource profile. > * We need to launch the container with additional parameters so the executor > knows its resource profile to report back to the ExecutorMonitor. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
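A purely illustrative sketch of the priority-per-ResourceProfile idea described above, not the actual YarnAllocator change; the helper names are hypothetical and the defaults are assumptions.
{code:scala}
// Hypothetical sketch: derive a distinct YARN request priority and container size
// per ResourceProfile, since YARN allows only one container size per priority.
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.spark.resource.ResourceProfile

def priorityFor(rp: ResourceProfile): Priority =
  Priority.newInstance(rp.id)                   // assumes profile ids are small, stable ints

def containerFor(rp: ResourceProfile, defaultMemMb: Int, defaultCores: Int): Resource = {
  val mem = rp.executorResources.get(ResourceProfile.MEMORY)
    .map(_.amount.toInt).getOrElse(defaultMemMb)
  val cores = rp.executorResources.get(ResourceProfile.CORES)
    .map(_.amount.toInt).getOrElse(defaultCores)
  Resource.newInstance(mem, cores)
}
{code}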
[jira] [Created] (SPARK-29882) SPARK on Kubernetes is Broken for SPARK with no Hadoop release
Shahin Shakeri created SPARK-29882: -- Summary: SPARK on Kubernetes is Broken for SPARK with no Hadoop release Key: SPARK-29882 URL: https://issues.apache.org/jira/browse/SPARK-29882 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.4 Environment: h3. How was this patch tested? Kubernetes 1.14, Spark 2.4.4, Hadoop 3.2.1. Adding $SPARK_DIST_CLASSPATH to {{-cp }} param of entrypoint.sh enables launching the executors correctly. Reporter: Shahin Shakeri Fix For: 2.4.4 h3. What changes were proposed in this pull request? Include {{$SPARK_DIST_CLASSPATH}} in class path when launching {{CoarseGrainedExecutorBackend}} on Kubernetes executors using the provided {{entrypoint.sh}} h3. Why are the changes needed? For user provided Hadoop (3.2.1 in this example) {{$SPARK_DIST_CLASSPATH}} contains the required jars. h3. Does this PR introduce any user-facing change? no -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29872) Improper cache strategy in examples
[ https://issues.apache.org/jira/browse/SPARK-29872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29872. -- Resolution: Won't Fix > Improper cache strategy in examples > --- > > Key: SPARK-29872 > URL: https://issues.apache.org/jira/browse/SPARK-29872 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 3.0.0 >Reporter: Dong Wang >Priority: Minor > > 1. Improper cache in examples.SparkTC > The RDD edges should be cached because it is used multiple times in while > loop. And it should be unpersisted before the last action tc.count(), because > tc has been persisted. > On the other hand, many tc objects is cached in while loop but never > uncached, which will waste memory. > {code:scala} > val edges = tc.map(x => (x._2, x._1)) // Edges should be cached > // This join is iterated until a fixed point is reached. > var oldCount = 0L > var nextCount = tc.count() > do { > oldCount = nextCount > // Perform the join, obtaining an RDD of (y, (z, x)) pairs, > // then project the result to obtain the new (x, z) paths. > tc = tc.union(tc.join(edges).map(x => (x._2._2, > x._2._1))).distinct().cache() > nextCount = tc.count() > } while (nextCount != oldCount) > println(s"TC has ${tc.count()} edges.") > {code} > 2. Cache needed in examples.ml.LogisticRegressionSummary > The DataFrame fMeasure should be cached. > {code:scala} > // Set the model threshold to maximize F-Measure > val fMeasure = trainingSummary.fMeasureByThreshold // fMeasures should be > cached > val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0) > val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure) > .select("threshold").head().getDouble(0) > lrModel.setThreshold(bestThreshold) > {code} > 3. Cache needed in examples.sql.SparkSQLExample > {code:scala} > val peopleDF = spark.sparkContext > .textFile("examples/src/main/resources/people.txt") > .map(_.split(",")) > .map(attributes => Person(attributes(0), attributes(1).trim.toInt)) // > This RDD should be cahced > .toDF() > // Register the DataFrame as a temporary view > peopleDF.createOrReplaceTempView("people") > val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age > BETWEEN 13 AND 19") > teenagersDF.map(teenager => "Name: " + teenager(0)).show() > teenagersDF.map(teenager => "Name: " + > teenager.getAs[String]("name")).show() > implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, > Any]] > teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", > "age"))).collect() > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
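A self-contained sketch of the SparkTC caching pattern suggested above: cache {{edges}} because it is reused in every iteration, and unpersist each superseded {{tc}} once its successor has been materialized. The tiny input graph is made up.
{code:scala}
// Sketch of the suggested fix: cache the reused RDD, drop superseded ones.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tc-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

var tc: RDD[(Int, Int)] = sc.parallelize(Seq((1, 2), (2, 3), (3, 4))).cache()
val edges = tc.map { case (a, b) => (b, a) }.cache()   // reused in every iteration

var oldCount = 0L
var nextCount = tc.count()
do {
  oldCount = nextCount
  val newTc = tc.union(tc.join(edges).map { case (_, (z, x)) => (x, z) }).distinct().cache()
  nextCount = newTc.count()      // materialize the new iteration first
  tc.unpersist()                 // then release the superseded RDD instead of leaking it
  tc = newTc
} while (nextCount != oldCount)

edges.unpersist()                // no longer needed once the fixed point is reached
println(s"TC has ${tc.count()} edges.")
{code}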
[jira] [Comment Edited] (SPARK-24353) Add support for pod affinity/anti-affinity
[ https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973392#comment-16973392 ] Rafael Felix Correa edited comment on SPARK-24353 at 11/13/19 2:35 PM: --- Hi all, I gave this problem a shot in my PR [https://github.com/apache/spark/pull/26505], which implements tolerations it in a slightly different way than the proposed one due to the key field being optional (therefore it cannot be used as a unique identifier). It doesn't entirely close this issue because pod affinity/anti-affinity is not covered in my PR. I have a custom build with this patch already working internally here in OLX, please have a look at the PR if you have time. Hope it helps more people as well :) Cheers, was (Author: rafaelfc): Hi all, I gave this problem a shot in my PR [https://github.com/apache/spark/pull/26505], which implements tolerations it in a slightly different way than the proposed one due to the key field being optional (therefore it cannot be used as a unique identifier). I have a custom build with this patch already working internally here in OLX, please have a look at the PR if you have time. Hope it helps more people as well :) Cheers, > Add support for pod affinity/anti-affinity > -- > > Key: SPARK-24353 > URL: https://issues.apache.org/jira/browse/SPARK-24353 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos >Priority: Major > > Spark on K8s allows to place driver/executor pods on specific k8s nodes, > using nodeSelector. NodeSelector is a very simple way to constrain pods to > nodes with particular labels. The > [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity] > feature, currently in beta, greatly expands the types of constraints you can > express. Aim here is to bring support of this feature to Spark on K8s, in > detail: > * Node affinity > * Toleration/taints > * Inter-Pod affinity/anti-affinity > Note that nodeSelector will be deprecated in the future. > Design doc: > [https://docs.google.com/document/d/1izk75I4A0I-nJaE57m7wkpgUZTXM0o6-c0Tb-NxdOMU|https://docs.google.com/document/d/1izk75I4A0I-nJaE57m7wkpgUZTXM0o6-c0Tb-NxdOMU/edit?usp=sharing] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29808) StopWordsRemover should support multi-cols
[ https://issues.apache.org/jira/browse/SPARK-29808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29808. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26480 [https://github.com/apache/spark/pull/26480] > StopWordsRemover should support multi-cols > -- > > Key: SPARK-29808 > URL: https://issues.apache.org/jira/browse/SPARK-29808 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > As a basic Transformer, StopWordsRemover should support multi-cols. > Param {color:#93a6f5}stopWords{color} can be applied across all columns. > {color:#93a6f5} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
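For context on what the requested multi-column support looks like from the user side, here is a rough sketch. The setInputCols/setOutputCols calls mirror other multi-column transformers in ML and are exactly what this ticket adds; the column names and data below are made up for illustration.
{code:scala}
import org.apache.spark.ml.feature.StopWordsRemover

// One StopWordsRemover handling two token columns at once; the single
// stopWords param is shared across all input columns.
val remover = new StopWordsRemover()
  .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))
  .setInputCols(Array("title_tokens", "body_tokens"))
  .setOutputCols(Array("title_filtered", "body_filtered"))

import spark.implicits._
val df = Seq(
  (Seq("the", "quick", "fox"), Seq("a", "lazy", "dog"))
).toDF("title_tokens", "body_tokens")

remover.transform(df).show(false)
{code}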
[jira] [Resolved] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()
[ https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29823. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26483 [https://github.com/apache/spark/pull/26483] > Improper persist strategy in mllib.clustering.KMeans.run() > -- > > Key: SPARK-29823 > URL: https://issues.apache.org/jira/browse/SPARK-29823 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.4.3 >Reporter: Dong Wang >Assignee: Aman Omer >Priority: Major > Fix For: 3.0.0 > > > In mllib.clustering.KMeans.run(), the rdd {color:#de350b}_norms_{color} is > persisted. But {color:#de350b}_norms_{color} only has a single child, i.e., > the rdd {color:#de350b}_zippedData_{color}, which was not persisted. So all the actions that rely > on _norms_ also rely on _{color:#de350b}zippedData{color}_. The rdd > {color:#de350b}_zippedData_{color} will be used multiple times in > _runAlgorithm()_. Therefore _{color:#de350b}zippedData{color}_ should be > persisted, not _{color:#de350b}norms{color}_. > {code:scala} > private[spark] def run( > data: RDD[Vector], > instr: Option[Instrumentation]): KMeansModel = { > if (data.getStorageLevel == StorageLevel.NONE) { > logWarning("The input data is not directly cached, which may hurt > performance if its" > + " parent RDDs are also uncached.") > } > // Compute squared norms and cache them. > val norms = data.map(Vectors.norm(_, 2.0)) > norms.persist() // Unnecessary persist. Only used to generate zippedData. > val zippedData = data.zip(norms).map { case (v, norm) => > new VectorWithNorm(v, norm) > } // zippedData needs to be persisted > val model = runAlgorithm(zippedData, instr) > norms.unpersist() // Change to zippedData.unpersist() > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
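For reference, a minimal sketch of the persist placement the description argues for, using the names from the quoted snippet. It just rearranges that snippet to illustrate the proposal and is not necessarily the exact change merged in the pull request.
{code:scala}
// Compute squared norms, zip them with the input, and cache the RDD that is
// actually reused across iterations (zippedData) rather than its parent (norms).
val norms = data.map(Vectors.norm(_, 2.0))
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
}
zippedData.persist()          // reused by multiple actions inside runAlgorithm()

val model = runAlgorithm(zippedData, instr)

zippedData.unpersist()        // release the cached copy once training is finished
{code}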
[jira] [Assigned] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()
[ https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29823: Assignee: Aman Omer > Improper persist strategy in mllib.clustering.KMeans.run() > -- > > Key: SPARK-29823 > URL: https://issues.apache.org/jira/browse/SPARK-29823 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.4.3 >Reporter: Dong Wang >Assignee: Aman Omer >Priority: Major > > In mllib.clustering.KMeans.run(), the rdd {color:#de350b}_norms_{color} is > persisted. But {color:#de350b}_norms_{color} only has a single child, i.e., > the rdd {color:#de350b}_zippedData_{color}, which was not persisted. So all the actions that rely > on _norms_ also rely on _{color:#de350b}zippedData{color}_. The rdd > {color:#de350b}_zippedData_{color} will be used multiple times in > _runAlgorithm()_. Therefore _{color:#de350b}zippedData{color}_ should be > persisted, not _{color:#de350b}norms{color}_. > {code:scala} > private[spark] def run( > data: RDD[Vector], > instr: Option[Instrumentation]): KMeansModel = { > if (data.getStorageLevel == StorageLevel.NONE) { > logWarning("The input data is not directly cached, which may hurt > performance if its" > + " parent RDDs are also uncached.") > } > // Compute squared norms and cache them. > val norms = data.map(Vectors.norm(_, 2.0)) > norms.persist() // Unnecessary persist. Only used to generate zippedData. > val zippedData = data.zip(norms).map { case (v, norm) => > new VectorWithNorm(v, norm) > } // zippedData needs to be persisted > val model = runAlgorithm(zippedData, instr) > norms.unpersist() // Change to zippedData.unpersist() > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()
[ https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29823: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) These aren't bugs. > Improper persist strategy in mllib.clustering.KMeans.run() > -- > > Key: SPARK-29823 > URL: https://issues.apache.org/jira/browse/SPARK-29823 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.4.3 >Reporter: Dong Wang >Assignee: Aman Omer >Priority: Minor > Fix For: 3.0.0 > > > In mllib.clustering.KMeans.run(), the rdd {color:#de350b}_norms_{color} is > persisted. But {color:#de350b}_norms_{color} only has a single child, i.e., > the rdd {color:#de350b}_zippedData_{color}, which was not persisted. So all the actions that rely > on _norms_ also rely on _{color:#de350b}zippedData{color}_. The rdd > {color:#de350b}_zippedData_{color} will be used multiple times in > _runAlgorithm()_. Therefore _{color:#de350b}zippedData{color}_ should be > persisted, not _{color:#de350b}norms{color}_. > {code:scala} > private[spark] def run( > data: RDD[Vector], > instr: Option[Instrumentation]): KMeansModel = { > if (data.getStorageLevel == StorageLevel.NONE) { > logWarning("The input data is not directly cached, which may hurt > performance if its" > + " parent RDDs are also uncached.") > } > // Compute squared norms and cache them. > val norms = data.map(Vectors.norm(_, 2.0)) > norms.persist() // Unnecessary persist. Only used to generate zippedData. > val zippedData = data.zip(norms).map { case (v, norm) => > new VectorWithNorm(v, norm) > } // zippedData needs to be persisted > val model = runAlgorithm(zippedData, instr) > norms.unpersist() // Change to zippedData.unpersist() > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detect persist()/unpersist() API misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28168) release-build.sh for Hadoop-3.2
[ https://issues.apache.org/jira/browse/SPARK-28168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-28168. - Resolution: Duplicate The issue was fixed by SPARK-29608. > release-build.sh for Hadoop-3.2 > --- > > Key: SPARK-28168 > URL: https://issues.apache.org/jira/browse/SPARK-28168 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hadoop 3.2 support was added. We might have to add Hadoop-3.2 to > release-build.sh. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973372#comment-16973372 ] Yuming Wang commented on SPARK-23151: - [~dongjoon] Is this issue fixed by SPARK-29608? > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29881) Introduce API for manually breaking up dataset plan
[ https://issues.apache.org/jira/browse/SPARK-29881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devyn Cairns updated SPARK-29881: - Description: I have an interesting situation where I'm calling functions that are relatively expensive from Spark SQL, and then using the result several times in a loop through {{transform}}. Although the WholeStageCodegen is usually helpful, it always calls expressions as they're used, which means that in the case of, for example: {{SELECT transform(sequence(0, 32), x -> expensive_result * x)}} {{FROM (}} {{ SELECT expensive_operation(foo) AS expensive_result FROM source}} {{)}} the expensive_operation function will almost certainly be called 32 times for each source row, without any explicit way to cache that value intermediately. I've found a workaround for now is to insert something like {{.filter \{ _ => true }}} in the middle, which will create a barrier to whole-stage codegen without much negative impact, aside from preventing other optimizations like PushDown. This does indeed produce the intended result and expensive_operation is only run once. But it would be great to have an API on Dataset like {{.barrier()}} to introduce an explicit barrier to whole-stage codegen without adding any additional behavior or getting in the way of any PushDown optimizations. was: I have an interesting situation where I'm calling functions that are relatively expensive from Spark SQL, and then using the result several times in a loop through {{transform}}. Although the WholeStageCodegen is usually helpful, it always calls expressions as they're used, which means that in the case of, for example: {{SELECT transform(sequence(0, 32), x -> expensive_result * x)}} {{FROM (}} {{ SELECT expensive_operation(foo) AS expensive_result FROM source}} {{)}} the expensive_operation function will almost certainly be called 32 times for each source row, without any explicit way to cache that value intermediately. I've found a workaround for now is to insert something like {{.filter \{ _ => true }}} in the middle, which will create a barrier to whole-stage codegen without much negative impact, aside from preventing other optimizations like PushDown. This does indeed produce the intended result and expensive_operation is only run once. But it would be great to have an API on Dataset like {{.barrier()}} to introduce an explicit barrier to whole-stage codegen without adding any additional behavior or getting in the way of any PushDown optimizations. > Introduce API for manually breaking up dataset plan > --- > > Key: SPARK-29881 > URL: https://issues.apache.org/jira/browse/SPARK-29881 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.4 >Reporter: Devyn Cairns >Priority: Trivial > > I have an interesting situation where I'm calling functions that are > relatively expensive from Spark SQL, and then using the result several times > in a loop through {{transform}}. > Although the WholeStageCodegen is usually helpful, it always calls > expressions as they're used, which means that in the case of, for example: > {{SELECT transform(sequence(0, 32), x -> expensive_result * x)}} > {{FROM (}} > {{ SELECT expensive_operation(foo) AS expensive_result FROM source}} > {{)}} > the expensive_operation function will almost certainly be called 32 times for > each source row, without any explicit way to cache that value intermediately. 
> I've found a workaround for now is to insert something like {{.filter \{ _ => > true }}} in the middle, which will create a barrier to whole-stage codegen > without much negative impact, aside from preventing other optimizations like > PushDown. This does indeed produce the intended result and > expensive_operation is only run once. > But it would be great to have an API on Dataset like {{.barrier()}} to > introduce an explicit barrier to whole-stage codegen without adding any > additional behavior or getting in the way of any PushDown optimizations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
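A self-contained sketch of the workaround described above. The table name "source", the column "foo", and the expensiveOperation UDF are placeholders standing in for the reporter's expensive_operation; the no-op typed filter splits the plan so the inner projection is evaluated once per row before the transform loop consumes it.
{code:scala}
import org.apache.spark.sql.functions._

// Placeholder for the expensive function from the description.
val expensiveOperation = udf { s: String => s.length * 1000 }

val withExpensive = spark.table("source")
  .select(expensiveOperation(col("foo")).as("expensive_result"))
  // No-op filter: acts as a barrier between the two codegen stages, so
  // expensive_result is computed once per row rather than once per array element.
  .filter { _ => true }

withExpensive
  .selectExpr("transform(sequence(0, 32), x -> expensive_result * x) AS products")
  .show(false)
{code}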
[jira] [Created] (SPARK-29881) Introduce API for manually breaking up dataset plan
Devyn Cairns created SPARK-29881: Summary: Introduce API for manually breaking up dataset plan Key: SPARK-29881 URL: https://issues.apache.org/jira/browse/SPARK-29881 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.4.4 Reporter: Devyn Cairns I have an interesting situation where I'm calling functions that are relatively expensive from Spark SQL, and then using the result several times in a loop through {{transform}}. Although the WholeStageCodegen is usually helpful, it always calls expressions as they're used, which means that in the case of, for example: {{SELECT transform(sequence(0, 32), x -> expensive_result * x)}} {{FROM (}} {{ SELECT expensive_operation(foo) AS expensive_result FROM source}} {{)}} the expensive_operation function will almost certainly be called 32 times for each source row, without any explicit way to cache that value intermediately. I've found a workaround for now is to insert something like {{.filter \{ _ => true }}} in the middle, which will create a barrier to whole-stage codegen without much negative impact, aside from preventing other optimizations like PushDown. This does indeed produce the intended result and expensive_operation is only run once. But it would be great to have an API on Dataset like {{.barrier()}} to introduce an explicit barrier to whole-stage codegen without adding any additional behavior or getting in the way of any PushDown optimizations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29863) rename EveryAgg/AnyAgg to BoolAnd/BoolOr
[ https://issues.apache.org/jira/browse/SPARK-29863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29863. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26486 [https://github.com/apache/spark/pull/26486] > rename EveryAgg/AnyAgg to BoolAnd/BoolOr > > > Key: SPARK-29863 > URL: https://issues.apache.org/jira/browse/SPARK-29863 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29753) refine the default catalog config
[ https://issues.apache.org/jira/browse/SPARK-29753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29753. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26395 [https://github.com/apache/spark/pull/26395] > refine the default catalog config > - > > Key: SPARK-29753 > URL: https://issues.apache.org/jira/browse/SPARK-29753 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29877) static PageRank allow checkPoint from previous computations
[ https://issues.apache.org/jira/browse/SPARK-29877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joan Fontanals updated SPARK-29877: --- Priority: Major (was: Minor) > static PageRank allow checkPoint from previous computations > --- > > Key: SPARK-29877 > URL: https://issues.apache.org/jira/browse/SPARK-29877 > Project: Spark > Issue Type: Wish > Components: GraphX >Affects Versions: 2.3.0 >Reporter: Joan Fontanals >Priority: Major > Labels: graphx, pagerank > Fix For: 2.3.0 > > > It would be really helpful to have the possibility, when computing > staticPageRank to use a previous computation as a checkpoint to continue the > iterations. > I have done a small code proposal, but there is a problem because at the end > of the iterations a normalization step is performed. > Therefore, if we use this normalized graph as a checkpoint for a new set of > iterations, the algorithm will not restart in a coherent way. > I created a branch in my fork: > [https://github.com/JoanFM/spark/tree/pageRank_checkPoint] > I hope you can consider this feature, and if you do that you can implement > yourself or propose me a way to handle this identified problem > Thank you very much, > Best regards, > Joan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29880) Handle submit exception when target hadoop cluster is Federation
zhoukang created SPARK-29880: Summary: Handle submit exception when target hadoop cluster is Federation Key: SPARK-29880 URL: https://issues.apache.org/jira/browse/SPARK-29880 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.4 Reporter: zhoukang When we submit an application to a federated YARN cluster, the submission exits with a failure because getYarnClusterMetrics is not implemented. {code:java} def submitApplication(): ApplicationId = { ResourceRequestHelper.validateResources(sparkConf) var appId: ApplicationId = null try { launcherBackend.connect() yarnClient.init(hadoopConf) yarnClient.start() logInfo("Requesting a new application from cluster with %d NodeManagers" .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers)) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
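One possible direction, sketched as a guard around the logging call in submitApplication(): treat cluster-metrics retrieval as best effort so a ResourceManager that does not implement getYarnClusterMetrics cannot fail the whole submission. The exception type actually thrown by a Federation router may differ, so this is only an illustration of the idea, not the eventual fix.
{code:scala}
try {
  logInfo("Requesting a new application from cluster with %d NodeManagers"
    .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
} catch {
  case e: Exception =>
    // Federation routers may not implement getYarnClusterMetrics; keep submitting.
    logWarning("Could not retrieve YARN cluster metrics, continuing submission", e)
}
{code}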
[jira] [Created] (SPARK-29879) Support unbounded waiting for RPC reply
wenxuanguan created SPARK-29879: --- Summary: Support unbounded waiting for RPC reply Key: SPARK-29879 URL: https://issues.apache.org/jira/browse/SPARK-29879 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: wenxuanguan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29835) Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE
[ https://issues.apache.org/jira/browse/SPARK-29835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29835. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26464 [https://github.com/apache/spark/pull/26464] > Remove the unnecessary conversion from Statement to LogicalPlan for > DELETE/UPDATE > - > > Key: SPARK-29835 > URL: https://issues.apache.org/jira/browse/SPARK-29835 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > Fix For: 3.0.0 > > > The current parse and analyze flow for DELETE is: 1, the SQL string will be > firstly parsed to `DeleteFromStatement`; 2, the `DeleteFromStatement` be > converted to `DeleteFromTable`. However, the SQL string can be parsed to > `DeleteFromTable` directly, where a `DeleteFromStatement` seems to be > redundant. > It is the same for UPDATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29835) Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE
[ https://issues.apache.org/jira/browse/SPARK-29835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29835: --- Assignee: Xianyin Xin > Remove the unnecessary conversion from Statement to LogicalPlan for > DELETE/UPDATE > - > > Key: SPARK-29835 > URL: https://issues.apache.org/jira/browse/SPARK-29835 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > > The current parse and analyze flow for DELETE is: 1, the SQL string will be > firstly parsed to `DeleteFromStatement`; 2, the `DeleteFromStatement` be > converted to `DeleteFromTable`. However, the SQL string can be parsed to > `DeleteFromTable` directly, where a `DeleteFromStatement` seems to be > redundant. > It is the same for UPDATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29866) Upper case enum values
[ https://issues.apache.org/jira/browse/SPARK-29866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29866. -- Resolution: Won't Fix > Upper case enum values > -- > > Key: SPARK-29866 > URL: https://issues.apache.org/jira/browse/SPARK-29866 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Major > > Unify naming of enum values and upper case their names. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29878) Improper cache strategies in GraphX
Dong Wang created SPARK-29878: - Summary: Improper cache strategies in GraphX Key: SPARK-29878 URL: https://issues.apache.org/jira/browse/SPARK-29878 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 3.0.0 Reporter: Dong Wang I have run examples.graphx.SSSPExample and looked through the RDD dependency graphs as well as persist operations. There are some improper cache strategies in GraphX. The same situations also exist when I run ConnectedComponentsExample. 1. vertices.cache() and newEdges.cache() are unnecessary In SSSPExample, a graph is initialized by GraphImpl.mapVertices(). In this method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), and the RDDs vertices/newEdges are cached in apply(). But these two RDDs are not directly used anymore (their child RDDs have been cached) in SSSPExample, so the persists can be unnecessary here. However, the other examples may need these two persists, so I think they cannot be simply removed. It might be hard to fix. {code:scala} def apply[VD: ClassTag, ED: ClassTag]( vertices: VertexRDD[VD], edges: EdgeRDD[ED]): GraphImpl[VD, ED] = { vertices.cache() // It is unnecessary for SSSPExample and ConnectedComponentsExample // Convert the vertex partitions in edges to the correct type val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]] .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD]) .cache() // It is unnecessary for SSSPExample and ConnectedComponentsExample GraphImpl.fromExistingRDDs(vertices, newEdges) } {code} 2. Missing persist on newEdges SSSPExample will invoke Pregel to do the execution. Pregel will utilize ReplicatedVertexView.upgrade(). I find that the RDD newEdges is directly used by multiple actions in Pregel. So newEdges should be persisted. Same as the above issue, this issue is also found in ConnectedComponentsExample. It is also hard to fix, because the persist added may be unnecessary for other examples. {code:scala} // Pregel.scala // compute the messages var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // newEdges is created here val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)]( checkpointInterval, graph.vertices.sparkContext) messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]]) var activeMessages = messages.count() // The first action to use newEdges ... while (activeMessages > 0 && i < maxIterations) { // Receive the messages and update the vertices. prevG = g g = g.joinVertices(messages)(vprog) // Generating g depends on newEdges ... activeMessages = messages.count() // The second action to use newEdges. newEdges should be unpersisted after this instruction. {code} {code:scala} // ReplicatedVertexView.scala def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: Boolean): Unit = { ... val newEdges = edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) { (ePartIter, shippedVertsIter) => ePartIter.map { case (pid, edgePartition) => (pid, edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator))) } }) edges = newEdges // newEdges should be persisted hasSrcId = includeSrc hasDstId = includeDst } } {code} As I don't have much knowledge about GraphX, I don't know how to fix these issues well. This issue is reported by our tool CacheCheck, which is used to dynamically detect persist()/unpersist() API misuses. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
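For reproduction purposes, here is a self-contained version of the shortest-paths workload the SPARK-29878 report above is based on, using a small generated graph instead of a data file. Note that graph.cache() here only covers the user-visible vertex and edge RDDs; it does not by itself address the internal newEdges persist discussed in the report.
{code:scala}
import org.apache.spark.graphx.Graph
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sssp-cache-repro").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Small random graph with Double edge weights; cached at the user level.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble).cache()

val sourceId = 42L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

// Pregel runs several actions per iteration, which is where the uncached
// internal newEdges RDD described above gets recomputed.
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),          // vertex program
  triplet =>                                                // send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    },
  (a, b) => math.min(a, b))                                 // merge messages

println(sssp.vertices.take(5).mkString("\n"))
spark.stop()
{code}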
[jira] [Assigned] (SPARK-29851) V2 Catalog: Default behavior of dropping namespace is cascading
[ https://issues.apache.org/jira/browse/SPARK-29851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29851: --- Assignee: Terry Kim > V2 Catalog: Default behavior of dropping namespace is cascading > --- > > Key: SPARK-29851 > URL: https://issues.apache.org/jira/browse/SPARK-29851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Instead of introducing additional 'cascade' option to dropNamespace(), the > default behavior of dropping a namespace will be cascading. Now, to implement > the cascade option, Spark side needs to ensure a namespace is empty before > calling dropNamespace(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29851) V2 Catalog: Default behavior of dropping namespace is cascading
[ https://issues.apache.org/jira/browse/SPARK-29851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29851. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26476 [https://github.com/apache/spark/pull/26476] > V2 Catalog: Default behavior of dropping namespace is cascading > --- > > Key: SPARK-29851 > URL: https://issues.apache.org/jira/browse/SPARK-29851 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Instead of introducing additional 'cascade' option to dropNamespace(), the > default behavior of dropping a namespace will be cascading. Now, to implement > the cascade option, Spark side needs to ensure a namespace is empty before > calling dropNamespace(). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29877) static PageRank allow checkPoint from previous computations
Joan Fontanals created SPARK-29877: -- Summary: static PageRank allow checkPoint from previous computations Key: SPARK-29877 URL: https://issues.apache.org/jira/browse/SPARK-29877 Project: Spark Issue Type: Wish Components: GraphX Affects Versions: 2.3.0 Reporter: Joan Fontanals Fix For: 2.3.0 It would be really helpful to have the possibility, when computing staticPageRank to use a previous computation as a checkpoint to continue the iterations. I have done a small code proposal, but there is a problem because at the end of the iterations a normalization step is performed. Therefore, if we use this normalized graph as a checkpoint for a new set of iterations, the algorithm will not restart in a coherent way. I created a branch in my fork: [https://github.com/JoanFM/spark/tree/pageRank_checkPoint] I hope you can consider this feature, and if you do that you can implement yourself or propose me a way to handle this identified problem Thank you very much, Best regards, Joan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29876) Delete/archive file source completed files in separate thread
[ https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-29876: -- Summary: Delete/archive file source completed files in separate thread (was: Delete/archive file source data in separate thread) > Delete/archive file source completed files in separate thread > - > > Key: SPARK-29876 > URL: https://issues.apache.org/jira/browse/SPARK-29876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > SPARK-20568 added the possibility to clean up completed files in streaming > query. Deleting/archiving uses the main thread which can slow down > processing. It would be good to do this on separate thread(s). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29876) Delete/archive file source data in separate thread
[ https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973216#comment-16973216 ] Gabor Somogyi commented on SPARK-29876: --- I'm working on this. > Delete/archive file source data in separate thread > -- > > Key: SPARK-29876 > URL: https://issues.apache.org/jira/browse/SPARK-29876 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > SPARK-20568 added the possibility to clean up completed files in streaming > query. Deleting/archiving uses the main thread which can slow down > processing. It would be good to do this on separate thread(s). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29876) Delete/archive file source data in separate thread
Gabor Somogyi created SPARK-29876: - Summary: Delete/archive file source data in separate thread Key: SPARK-29876 URL: https://issues.apache.org/jira/browse/SPARK-29876 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi SPARK-20568 added the possibility to clean up completed files in streaming query. Deleting/archiving uses the main thread which can slow down processing. It would be good to do this on separate thread(s). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
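A minimal sketch of the SPARK-29876 idea, independent of how it ends up wired into the file stream source: push the delete of completed files onto a dedicated single-threaded executor so the main processing thread is not blocked on filesystem calls. Names and structure here are illustrative only.
{code:scala}
import java.util.concurrent.Executors

import scala.concurrent.{ExecutionContext, Future}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object CompletedFileCleaner {
  // One background thread dedicated to cleaning up processed source files.
  private val cleanupPool = Executors.newSingleThreadExecutor()
  private implicit val cleanupEc: ExecutionContext = ExecutionContext.fromExecutor(cleanupPool)

  /** Schedule deletion of an already-processed file without blocking the caller. */
  def deleteAsync(pathStr: String, hadoopConf: Configuration): Future[Boolean] = Future {
    val path = new Path(pathStr)
    val fs = path.getFileSystem(hadoopConf)
    fs.delete(path, false)
  }
}
{code}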
[jira] [Created] (SPARK-29875) Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x
Hyukjin Kwon created SPARK-29875: Summary: Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x Key: SPARK-29875 URL: https://issues.apache.org/jira/browse/SPARK-29875 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.4 Reporter: Hyukjin Kwon In Spark 2.4.x, if we use PyArrow higher then 0.12.0, it shows a bunch of warnings as below: {code} from pyspark.sql.functions import pandas_udf, PandasUDFType @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP def add_one(x): return x + 1 spark.range(100).select(add_one("id")).collect() {code} {code} UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly
[ https://issues.apache.org/jira/browse/SPARK-29710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-29710. --- Resolution: Incomplete Please attach logs or any further information and re-open it. I'm closing it for now. > Seeing offsets not resetting even when reset policy is configured explicitly > > > Key: SPARK-29710 > URL: https://issues.apache.org/jira/browse/SPARK-29710 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.1 > Environment: Window10 , eclipse neos >Reporter: Shyam >Priority: Major > > > even after setting *"auto.offset.reset" to "latest"* I am getting below > error > > org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of > range with no configured reset policy for partitions: > \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > \{COMPANY_TRANSACTIONS_INBOUND-16=168} at > org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348) > at > org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209) > at > org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234) > > [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
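For anyone hitting the same error as SPARK-29710 above: the Structured Streaming Kafka source ignores the consumer's auto.offset.reset setting; start positions and data-loss handling are controlled through source options instead. A sketch follows, where the bootstrap servers are a placeholder, the topic name is taken from the reported error, and spark is an existing SparkSession.
{code:scala}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "COMPANY_TRANSACTIONS_INBOUND")
  // Used only when there is no checkpointed progress yet:
  .option("startingOffsets", "latest")
  // Don't fail the query when previously planned offsets were aged out by retention:
  .option("failOnDataLoss", "false")
  .load()
{code}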
[jira] [Created] (SPARK-29874) Optimize Dataset.isEmpty()
angerszhu created SPARK-29874: - Summary: Optimize Dataset.isEmpty() Key: SPARK-29874 URL: https://issues.apache.org/jira/browse/SPARK-29874 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29873) Support `--import` directive to load queries from another test case in SQLQueryTestSuite
Takeshi Yamamuro created SPARK-29873: Summary: Support `--import` directive to load queries from another test case in SQLQueryTestSuite Key: SPARK-29873 URL: https://issues.apache.org/jira/browse/SPARK-29873 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro To reduce duplicate test queries in `SQLQueryTestSuite`, this ticket intends to support `--import` directive to load queries from another test case in SQLQueryTestSuite. This fix comes from the @cloud-fan suggestion in https://github.com/apache/spark/pull/26479#discussion_r345086978 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org