[jira] [Commented] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-11-13 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973985#comment-16973985
 ] 

sandeshyapuram commented on SPARK-29682:


Thanks!

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.3
>Reporter: sandeshyapuram
>Assignee: Terry Kim
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> When I try to self-join a parentDf with multiple childDfs (say childDf1 ... ...), 
> where the childDfs are derived from a cube or rollup and are filtered based on 
> group-bys,
> I get an error 
> {{Failure when resolving conflicting references in Join: }}
> followed by a long error message which is quite unreadable. On the other hand, 
> if I replace the cube or rollup with a plain groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
>   at 
> 

[jira] [Updated] (SPARK-29888) New interval string parser parses '.111 seconds' to null

2019-11-13 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-29888:
-
Issue Type: Bug  (was: Improvement)

> New interval string parser parses '.111 seconds' to null 
> 
>
> Key: SPARK-29888
> URL: https://issues.apache.org/jira/browse/SPARK-29888
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> The current string-to-interval cast logic does not support, e.g., cast('.111 
> second' as interval): parsing fails in the SIGN state and returns null, when 
> the correct result is 00:00:00.111. 
> {code:java}
> These are the results of the master branch.
> -- !query 63
> select interval '.111 seconds'
> -- !query 63 schema
> struct<0.111 seconds:interval>
> -- !query 63 output
> 0.111 seconds
> -- !query 64
> select cast('.111 seconds' as interval)
> -- !query 64 schema
> struct
> -- !query 64 output
> NULL
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29888) New interval string parser parses '.111 seconds' to null

2019-11-13 Thread Kent Yao (Jira)
Kent Yao created SPARK-29888:


 Summary: New interval string parser parses '.111 seconds' to null 
 Key: SPARK-29888
 URL: https://issues.apache.org/jira/browse/SPARK-29888
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


The current string-to-interval cast logic does not support, e.g., cast('.111 second' 
as interval): parsing fails in the SIGN state and returns null, when the correct result 
is 00:00:00.111. 


{code:java}
These are the results of the master branch.

-- !query 63
select interval '.111 seconds'
-- !query 63 schema
struct<0.111 seconds:interval>
-- !query 63 output
0.111 seconds


-- !query 64
select cast('.111 seconds' as interval)
-- !query 64 schema
struct
-- !query 64 output
NULL
{code}
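
For completeness, the same behavior can be reproduced from Scala (a minimal sketch, 
assuming an active SparkSession named spark; the expected value comes from the literal 
form above):

{code:java}
// The interval literal parses as expected (0.111 seconds) ...
spark.sql("select interval '.111 seconds'").show()

// ... while the equivalent cast currently returns NULL instead of 0.111 seconds.
spark.sql("select cast('.111 seconds' as interval)").show()
{code}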




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29682) Failure when resolving conflicting references in Join:

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29682.
---
Fix Version/s: 3.0.0
   2.4.5
 Assignee: Terry Kim
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26441

> Failure when resolving conflicting references in Join:
> --
>
> Key: SPARK-29682
> URL: https://issues.apache.org/jira/browse/SPARK-29682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.3
>Reporter: sandeshyapuram
>Assignee: Terry Kim
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> When I try to self-join a parentDf with multiple childDfs (say childDf1 ... ...), 
> where the childDfs are derived from a cube or rollup and are filtered based on 
> group-bys,
> I get an error 
> {{Failure when resolving conflicting references in Join: }}
> followed by a long error message which is quite unreadable. On the other hand, 
> if I replace the cube or rollup with a plain groupBy, it works without issues.
>  
> *Sample code:* 
> {code:java}
> val numsDF = sc.parallelize(Seq(1,2,3,4,5,6)).toDF("nums")
> val cubeDF = numsDF
> .cube("nums")
> .agg(
> max(lit(0)).as("agcol"),
> grouping_id().as("gid")
> )
> 
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> //Recreating cubeDf
> cubeDF.select("nums").distinct
> .join(group0, Seq("nums"), "inner")
> .join(group1, Seq("nums"), "inner")
> .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more 
> field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, 
> agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> : +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
> :+- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
> :   +- Project [nums#212, nums#212 AS nums#219]
> :  +- Project [value#210 AS nums#212]
> : +- SerializeFromObject [input[0, int, false] AS value#210]
> :+- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>+- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS 
> agcol#216, spark_grouping_id#218 AS gid#217]
>   +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], 
> [nums#212, nums#220, spark_grouping_id#218]
>  +- Project [nums#212, nums#212 AS nums#219]
> +- Project [value#210 AS nums#212]
>+- SerializeFromObject [input[0, int, false] AS value#210]
>   +- ExternalRDD [obj#209]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at 
> 

[jira] [Resolved] (SPARK-29873) Support `--import` directive to load queries from another test case in SQLQueryTestSuite

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29873.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26497
[https://github.com/apache/spark/pull/26497]

> Support `--import` directive to load queries from another test case in 
> SQLQueryTestSuite
> 
>
> Key: SPARK-29873
> URL: https://issues.apache.org/jira/browse/SPARK-29873
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> To reduce duplicate test queries in `SQLQueryTestSuite`, this ticket intends 
> to support `--import` directive to load queries from another test case in 
> SQLQueryTestSuite.
> This fix comes from the @cloud-fan suggestion in 
> https://github.com/apache/spark/pull/26479#discussion_r345086978



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29873) Support `--import` directive to load queries from another test case in SQLQueryTestSuite

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29873:
---

Assignee: Takeshi Yamamuro

> Support `--import` directive to load queries from another test case in 
> SQLQueryTestSuite
> 
>
> Key: SPARK-29873
> URL: https://issues.apache.org/jira/browse/SPARK-29873
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> To reduce duplicate test queries in `SQLQueryTestSuite`, this ticket intends 
> to support `--import` directive to load queries from another test case in 
> SQLQueryTestSuite.
> This fix comes from the @cloud-fan suggestion in 
> https://github.com/apache/spark/pull/26479#discussion_r345086978



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29885) Improve the exception message when reading the daemon port

2019-11-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29885:

Fix Version/s: (was: 3.0.0)

> Improve the exception message when reading the daemon port
> --
>
> Key: SPARK-29885
> URL: https://issues.apache.org/jira/browse/SPARK-29885
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> In a production environment, my PySpark application throws an exception whose 
> message is as below:
> {code:java}
> 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
>  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>  at 
> org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
>  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){code}
>  
> At first, I thought the physical node had many ports occupied by a large 
> number of processes.
> But I found that the total number of ports in use was only 671.
>  
> {code:java}
> [yarn@r1115 ~]$ netstat -a | wc -l
> 671
> {code}
> I checked the code of PythonWorkerFactory at line 204 and found:
> {code:java}
> daemon = pb.start()
> val in = new DataInputStream(daemon.getInputStream)
> try {
>   daemonPort = in.readInt()
> } catch {
>   case _: EOFException =>
>     throw new SparkException(s"No port number in $daemonModule's stdout")
> }
> {code}
> I added some code here:
> {code:java}
> logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
> logError("Exit value: ${daemon.exitValue()}")
> {code}
> Then I reproduced the exception, and its message is as below:
> {code:java}
> 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is 
> alive: false
> 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
> 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
>  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>  at 
> org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
>  

[jira] [Commented] (SPARK-29887) PostgreSQL dialect: cast to smallint

2019-11-13 Thread jobit mathew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973942#comment-16973942
 ] 

jobit mathew commented on SPARK-29887:
--

I will work on this 

> PostgreSQL dialect: cast to smallint
> 
>
> Key: SPARK-29887
> URL: https://issues.apache.org/jira/browse/SPARK-29887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> Make SparkSQL's cast to smallint behavior consistent with PostgreSQL when
> spark.sql.dialect is configured as PostgreSQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29887) PostgreSQL dialect: cast to smallint

2019-11-13 Thread jobit mathew (Jira)
jobit mathew created SPARK-29887:


 Summary: PostgreSQL dialect: cast to smallint
 Key: SPARK-29887
 URL: https://issues.apache.org/jira/browse/SPARK-29887
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jobit mathew


Make SparkSQL's cast to smallint behavior consistent with PostgreSQL when

spark.sql.dialect is configured as PostgreSQL.
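
To make the behavior gap concrete, a minimal sketch (assuming a SparkSession named spark; 
the outputs in the comments are summarized from general PostgreSQL/Spark behavior and are 
not verified against the PostgreSQL dialect work):

{code:java}
// PostgreSQL:  SELECT CAST(100000 AS smallint);   => ERROR: smallint out of range
// Spark today: the default (non-ANSI) cast silently narrows and wraps instead of failing.
spark.sql("SELECT CAST(100000 AS smallint)").show()
{code}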



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29886) Add support for Spark style HashDistribution and Partitioning to V2 Datasource

2019-11-13 Thread Andrew K Long (Jira)
Andrew K Long created SPARK-29886:
-

 Summary: Add support for Spark style HashDistribution and 
Partitioning to V2 Datasource
 Key: SPARK-29886
 URL: https://issues.apache.org/jira/browse/SPARK-29886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Andrew K Long


Currently a V2 datasource does not have the ability to specify that its 
Distribution is compatible with Spark's HashClusteredDistribution. We need to 
add the appropriate class to the interface and add support in 
DataSourcePartitioning so that EnsureRequirements is aware of the table's 
partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29837) PostgreSQL dialect: cast to boolean

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29837:
---

Assignee: wuyi

> PostgreSQL dialect: cast to boolean
> ---
>
> Key: SPARK-29837
> URL: https://issues.apache.org/jira/browse/SPARK-29837
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Make SparkSQL's *cast to boolean* behavior consistent with PostgreSQL when 
> spark.sql.dialect is configured as PostgreSQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29837) PostgreSQL dialect: cast to boolean

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29837.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26463
[https://github.com/apache/spark/pull/26463]

> PostgreSQL dialect: cast to boolean
> ---
>
> Key: SPARK-29837
> URL: https://issues.apache.org/jira/browse/SPARK-29837
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Make SparkSQL's *cast to boolean* behavior consistent with PostgreSQL when 
> spark.sql.dialect is configured as PostgreSQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29619) Add retry times when reading the daemon port.

2019-11-13 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-29619:
---
Description: This ticket is related to 
https://issues.apache.org/jira/browse/SPARK-29885 and adds a retry mechanism.

> Add retry times when reading the daemon port.
> -
>
> Key: SPARK-29619
> URL: https://issues.apache.org/jira/browse/SPARK-29619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> This ticket is related to https://issues.apache.org/jira/browse/SPARK-29885 
> and adds a retry mechanism.
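
As a rough illustration of the proposed retry, a minimal sketch (the attempt count and 
back-off are placeholder values, not from this ticket; daemon and daemonModule refer to 
the PythonWorkerFactory fields quoted in SPARK-29885):

{code:java}
// Hypothetical sketch only: a bounded retry helper; 3 attempts and the 1 s back-off
// are made-up placeholders, not taken from the ticket or any patch.
def withRetries[T](attempts: Int)(block: => T): T =
  try block catch {
    case e: Exception if attempts > 1 =>
      Thread.sleep(1000)               // brief back-off before the next attempt
      withRetries(attempts - 1)(block)
  }

// e.g. wrapping the port read (names as quoted in SPARK-29885):
// daemonPort = withRetries(3) { new DataInputStream(daemon.getInputStream).readInt() }
{code}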



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29619) Add retry times when reading the daemon port.

2019-11-13 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-29619:
---
Summary: Add retry times when reading the daemon port.  (was: Improve the 
exception message when reading the daemon port and add retry times.)

> Add retry times when reading the daemon port.
> -
>
> Key: SPARK-29619
> URL: https://issues.apache.org/jira/browse/SPARK-29619
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> In a production environment, my PySpark application throws an exception whose 
> message is as below:
> {code:java}
> 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
>  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>  at 
> org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
>  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745){code}
>  
> At first, I thought the physical node had many ports occupied by a large 
> number of processes.
> But I found that the total number of ports in use was only 671.
>  
> {code:java}
> [yarn@r1115 ~]$ netstat -a | wc -l
> 671
> {code}
> I checked the code of PythonWorkerFactory at line 204 and found:
> {code:java}
> daemon = pb.start()
> val in = new DataInputStream(daemon.getInputStream)
> try {
>   daemonPort = in.readInt()
> } catch {
>   case _: EOFException =>
>     throw new SparkException(s"No port number in $daemonModule's stdout")
> }
> {code}
> I added some code here:
> {code:java}
> logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
> logError("Exit value: ${daemon.exitValue()}")
> {code}
> Then I reproduced the exception, and its message is as below:
> {code:java}
> 19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is 
> alive: false
> 19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
> 19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
>  at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
>  at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
>  

[jira] [Updated] (SPARK-29619) Add retry times when reading the daemon port.

2019-11-13 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-29619:
---
Description: (was: In a production environment, my PySpark application 
throws an exception whose message is as below:
{code:java}
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745){code}
 

At first, I thought the physical node had many ports occupied by a large number 
of processes.

But I found that the total number of ports in use was only 671.

 
{code:java}
[yarn@r1115 ~]$ netstat -a | wc -l
671
{code}
I checked the code of PythonWorkerFactory at line 204 and found:
{code:java}
daemon = pb.start()
val in = new DataInputStream(daemon.getInputStream)
try {
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
}
{code}
I added some code here:
{code:java}
logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
logError("Exit value: ${daemon.exitValue()}")
{code}
Then I reproduced the exception, and its message is as below:
{code:java}
19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is 
alive: false
19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 

[jira] [Updated] (SPARK-29885) Improve the exception message when reading the daemon port

2019-11-13 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-29885:
---
Description: 
In a production environment, my PySpark application throws an exception whose 
message is as below:
{code:java}
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745){code}
 

At first, I thought the physical node had many ports occupied by a large number 
of processes.

But I found that the total number of ports in use was only 671.

 
{code:java}
[yarn@r1115 ~]$ netstat -a | wc -l
671
{code}
I checked the code of PythonWorkerFactory at line 204 and found:
{code:java}
daemon = pb.start()
val in = new DataInputStream(daemon.getInputStream)
try {
  daemonPort = in.readInt()
} catch {
  case _: EOFException =>
    throw new SparkException(s"No port number in $daemonModule's stdout")
}
{code}
I added some code here:
{code:java}
logError("Meet EOFException, daemon is alive: ${daemon.isAlive()}")
logError("Exit value: ${daemon.exitValue()}")
{code}
Then I reproduced the exception, and its message is as below:
{code:java}
19/10/28 16:15:03 ERROR PythonWorkerFactory: Meet EOFException, daemon is 
alive: false
19/10/28 16:15:03 ERROR PythonWorkerFactory: Exit value: 139
19/10/28 16:15:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
 at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:206)
 at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
 at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
 at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
 at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at 

[jira] [Created] (SPARK-29885) Improve the exception message when reading the daemon port

2019-11-13 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-29885:
--

 Summary: Improve the exception message when reading the daemon port
 Key: SPARK-29885
 URL: https://issues.apache.org/jira/browse/SPARK-29885
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: jiaan.geng
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29649) Stop task set if FileAlreadyExistsException was thrown when writing to output file

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29649.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26312
[https://github.com/apache/spark/pull/26312]

> Stop task set if FileAlreadyExistsException was thrown when writing to output 
> file
> --
>
> Key: SPARK-29649
> URL: https://issues.apache.org/jira/browse/SPARK-29649
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> We already know that task attempts that do not clean up output files in the staging 
> directory can cause job failure (SPARK-27194). There were proposals trying to 
> fix it by changing the output filename or deleting existing output files. These 
> proposals are not completely reliable.
> The difficulty is that, because a previous failed task attempt wrote the output file, at 
> the next task attempt the output file is still under the same staging directory, even if 
> the output file name is different.
> If the job is going to fail eventually, there is no point in re-running the task 
> until max attempts are reached. For long-running jobs, 
> re-running the task can waste a lot of time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29644.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26301
[https://github.com/apache/spark/pull/26301]

> ShortType is wrongly set as Int in JDBCUtils.scala
> --
>
> Key: SPARK-29644
> URL: https://issues.apache.org/jira/browse/SPARK-29644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shiv Prashant Sood
>Assignee: Shiv Prashant Sood
>Priority: Minor
> Fix For: 3.0.0
>
>
> @maropu pointed out this issue during  [PR 
> 25344|https://github.com/apache/spark/pull/25344]  review discussion.
>  In 
> [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala]
>  line number 547
> case ShortType =>
>  (stmt: PreparedStatement, row: Row, pos: Int) =>
>  stmt.setInt(pos + 1, row.getShort(pos))
> I don't see any reproducible issue, but this is clearly a problem that must be 
> fixed.
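
For readers skimming the digest: the snippet above goes through the INTEGER setter; the 
corrected mapping implied by the updated summary later in this digest would look roughly 
like the fragment below, which mirrors the quoted snippet and is not copied from the 
actual patch:

{code:java}
// Sketch only, not the actual patch: use the SMALLINT/TINYINT setters instead of setInt.
case ShortType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setShort(pos + 1, row.getShort(pos))   // SMALLINT

case ByteType =>
  (stmt: PreparedStatement, row: Row, pos: Int) =>
    stmt.setByte(pos + 1, row.getByte(pos))     // TINYINT
{code}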



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29644) Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29644:
--
Summary: Corrected ShortType and ByteType mapping to SmallInt and TinyInt 
in JDBCUtils  (was: ShortType is wrongly set as Int in JDBCUtils.scala)

> Corrected ShortType and ByteType mapping to SmallInt and TinyInt in JDBCUtils
> -
>
> Key: SPARK-29644
> URL: https://issues.apache.org/jira/browse/SPARK-29644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shiv Prashant Sood
>Assignee: Shiv Prashant Sood
>Priority: Minor
> Fix For: 3.0.0
>
>
> @maropu pointed out this issue during  [PR 
> 25344|https://github.com/apache/spark/pull/25344]  review discussion.
>  In 
> [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala]
>  line number 547
> case ShortType =>
>  (stmt: PreparedStatement, row: Row, pos: Int) =>
>  stmt.setInt(pos + 1, row.getShort(pos))
> I don't see any reproducible issue, but this is clearly a problem that must be 
> fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29644) ShortType is wrongly set as Int in JDBCUtils.scala

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29644:
-

Assignee: Shiv Prashant Sood

> ShortType is wrongly set as Int in JDBCUtils.scala
> --
>
> Key: SPARK-29644
> URL: https://issues.apache.org/jira/browse/SPARK-29644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Shiv Prashant Sood
>Assignee: Shiv Prashant Sood
>Priority: Minor
>
> @maropu pointed out this issue during  [PR 
> 25344|https://github.com/apache/spark/pull/25344]  review discussion.
>  In 
> [JDBCUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala]
>  line number 547
> case ShortType =>
>  (stmt: PreparedStatement, row: Row, pos: Int) =>
>  stmt.setInt(pos + 1, row.getShort(pos))
> I don't see any reproducible issue, but this is clearly a problem that must be 
> fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29287) Executors should not receive any offers before they are actually constructed

2019-11-13 Thread Xingbo Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-29287.
--
Fix Version/s: 3.0.0
 Assignee: Kent Yao
   Resolution: Done

Fixed by https://github.com/apache/spark/pull/25964

> Executors should not receive any offers before they are actually constructed
> -
>
> Key: SPARK-29287
> URL: https://issues.apache.org/jira/browse/SPARK-29287
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> The executors send RegisterExecutor messages to the driver in onStart.
> The driver puts the executor data into “the ready to serve map” if it can, 
> then sends RegisteredExecutor back to the executor. The driver can now make 
> an offer to this executor.
> But the executor is not fully constructed yet. When it receives 
> RegisteredExecutor, it starts to construct itself: initializing the block manager, 
> possibly registering with the local shuffle server (with retries), then starting 
> to heartbeat to the driver ... 
> A task allocated at this point may fail if the executor fails to start or cannot 
> start heartbeating to the driver in time.
> Sometimes, even worse, when dynamic allocation and blacklisting are enabled 
> and the runtime executor count is down to the min executor setting, those 
> executors may receive tasks before they are fully constructed; if any error happens, 
> the application may be blocked or torn down. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29778) saveAsTable append mode is not passing writer options

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29778.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26474
[https://github.com/apache/spark/pull/26474]

> saveAsTable append mode is not passing writer options
> -
>
> Key: SPARK-29778
> URL: https://issues.apache.org/jira/browse/SPARK-29778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Wesley Hoffman
>Priority: Critical
> Fix For: 3.0.0
>
>
> There was an oversight where AppendData is not getting the WriterOptions in 
> saveAsTable. 
> [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530]
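
In other words, options set on the writer were not forwarded on the append path. A minimal 
sketch of the affected call pattern (the format and option names below are placeholders, 
and spark is assumed to be an active SparkSession):

{code:java}
val df = spark.range(10).toDF("id")
df.write
  .format("myV2Source")             // hypothetical V2 source, placeholder name
  .option("someOption", "value")    // options like this were dropped before the fix
  .mode("append")
  .saveAsTable("target_table")      // the append path goes through AppendData
{code}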



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29778) saveAsTable append mode is not passing writer options

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29778:
-

Assignee: Wesley Hoffman

> saveAsTable append mode is not passing writer options
> -
>
> Key: SPARK-29778
> URL: https://issues.apache.org/jira/browse/SPARK-29778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Wesley Hoffman
>Priority: Critical
>
> There was an oversight where AppendData is not getting the WriterOptions in 
> saveAsTable. 
> [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24203) Make executor's bindAddress configurable

2019-11-13 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24203.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26331
[https://github.com/apache/spark/pull/26331]

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Assignee: Nishchal Venkataramana
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite

2019-11-13 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973715#comment-16973715
 ] 

koert kuipers commented on SPARK-28945:
---

I understand there is a great deal of complexity in the committer and this 
might require more work to get right.

But it is still unclear to me whether the committer is doing anything at all in the case 
of dynamic partition overwrite.
What do I lose by disabling all committer activity (committer.setupJob, 
committer.commitJob, etc.) when dynamicPartitionOverwrite is true? And if I 
lose nothing, is that a good thing, or does that mean I should be worried about 
the current state?
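
For context, the scenario under discussion is roughly the following sketch (dfA/dfB stand 
for DataFrames that each contain rows for a single, distinct date partition; the path and 
column names are placeholders):

{code:java}
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Two jobs running concurrently, each overwriting a different partition
// under the same baseDir:
dfA.write.mode("overwrite").partitionBy("date").parquet("/data/baseDir")  // e.g. date=2019-11-12
dfB.write.mode("overwrite").partitionBy("date").parquet("/data/baseDir")  // e.g. date=2019-11-13
{code}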

> Allow concurrent writes to different partitions with dynamic partition 
> overwrite
> 
>
> Key: SPARK-28945
> URL: https://issues.apache.org/jira/browse/SPARK-28945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: koert kuipers
>Priority: Minor
>
> It is desirable to run concurrent jobs that write to different partitions 
> within the same baseDir using partitionBy and dynamic partitionOverwriteMode.
> See for example here:
> https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning
> Or the discussion here:
> https://github.com/delta-io/delta/issues/9
> This doesn't seem that difficult. I suspect the only changes needed are in 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has 
> a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling 
> all committer activity (committer.setupJob, committer.commitJob, etc.) when 
> dynamicPartitionOverwrite is true. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2019-11-13 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973695#comment-16973695
 ] 

Maciej Szymkiewicz edited comment on SPARK-29748 at 11/13/19 9:13 PM:
--

While this is a step in the right direction I think it justifies a broader 
discussion about {{Row}} purpose, API, and behavior guarantees. Especially if 
we're going to introduce diverging implementations, with {{Row}} and 
{{LegacyRow}}.

Over the years Spark code has accumulated a lot of conflicting behaviors and 
special cases related to {{Row}}:
 * Sometimes {{Row}} fields are reordered (the subject of this JIRA), sometimes they are not.
 * Sometimes they are treated as ordered products ({{tuples}}), sometimes as 
unordered dictionaries.
 * We provide efficient access only by position, but the primary access method 
is by name.
 * etc.

Some of the unusual properties are well documented (but still confusing); 
others are not. For example, objects that are indistinguishable using the public API

 
{code:python}
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

a = Row(x1=1, x2="foo")
b = Row("x1", "x2")(1, "foo")

a == b
# True 
type(a) == type(b)
#True

list(a.__fields__) == list(b.__fields__)  # Not really public, but just to make a point
{code}
cannot be substituted in practice.
{code:python}
schema = StructType([
StructField("x2", StringType()),
StructField("x1", IntegerType())])

spark.createDataFrame([a], schema)  


# DataFrame[x2: string, x1: int]

spark.createDataFrame([b], schema) 
# TypeError Traceback (most recent call last)
# ...
# TypeError: field x1: IntegerType can not accept object 'foo' in type 
{code}
To make things even worse, the primary access method - by name, which (while I don't 
have hard data here) is common both in the internal API and in the code I've seen in the 
wild - is _O(M)_, where M is the width of the schema.

So if we're going to modify the core behavior (sorting) it makes sense to 
rethink the whole design.

Since the schema is carried around with each object and passed over the wire, we 
might as well convert {{Row}} into a proxy over {{OrderedDict}}, getting something 
along these lines:
{code:python}
import sys
from collections import OrderedDict

class Row:
    __slots__ = ["_store"]

    def __init__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if args:
            self._store = OrderedDict.fromkeys(args)
        else:
            self._store = OrderedDict(kwargs)

    def __getattr__(self, x):
        return self._store[x]

    def __getitem__(self, x):
        if isinstance(x, int):
            return list(self._store.values())[x]
        else:
            return self._store[x]

    def __iter__(self):
        return iter(self._store.values())

    def __repr__(self):
        return "Row({})".format(", ".join(
            "{}={}".format(k, v) for k, v in self._store.items()
        ))

    def __len__(self):
        return len(self._store)

    def __call__(self, *args):
        if len(args) > len(self):
            raise ValueError("Can not create Row with fields %s, expected %d values "
                             "but got %s" % (self, len(self), args))

        self._store.update(zip(self._store.keys(), args))
        return self

    def __eq__(self, other):
        return isinstance(other, Row) and self._store == other._store

    @property
    def _fields(self):
        return self._store.keys()

    @staticmethod
    def _conv(obj):
        if isinstance(obj, Row):
            return obj.asDict(True)
        elif isinstance(obj, list):
            return [Row._conv(o) for o in obj]
        elif isinstance(obj, dict):
            return dict((k, Row._conv(v)) for k, v in obj.items())
        else:
            return obj

    def asDict(self, recursive=False):
        if recursive:
            result = OrderedDict.fromkeys(self._fields)
            for key in self._fields:
                result[key] = Row._conv(self._store[key])
            return result
        else:
            return self._store

    @classmethod
    def from_dict(cls, d):
        if sys.version_info >= (3, 6):
            if not(isinstance(d, dict)):
                raise ValueError(
                    "from_dict requires dict but got {}".format(
                        type(d)))

        else:
            if not(isinstance(d, OrderedDict)):
                raise ValueError(
                    "from_dict requires collections.OrderedDict {}".format(
                        type(d)))
        return cls(**d)
{code}
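For illustration, the sketch above would behave along these lines (the field 
names and values are arbitrary):
{code:python}
r = Row("x1", "x2")(1, "foo")

r.x1                      # 1 - by name, a plain dict lookup
r[0]                      # 1 - by position (O(M) in this sketch, it rebuilds a list)
r == Row(x1=1, x2="foo")  # True - both construction paths produce the same store
r.asDict()                # OrderedDict([('x1', 1), ('x2', 'foo')])
{code}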
If we're committed to {{Row}} being a {{tuple}} (with _O(1)_ by-index access) 
we could actually try to hack {{namedtuple}}:
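A rough sketch of that direction (illustrative only, not a concrete proposal; 
the helper name below is made up) would be to generate and cache one 
{{namedtuple}} class per distinct set of field names, so by-name access no 
longer scans the schema:
{code:python}
from collections import namedtuple

# Illustrative sketch only: one generated class per schema, cached by field names.
_row_classes = {}

def row_class(*names):
    if names not in _row_classes:
        _row_classes[names] = namedtuple("Row", names)
    return _row_classes[names]

r = row_class("x1", "x2")(1, "foo")
r.x1         # 1 - attribute resolves to a fixed tuple index, no field scan
r[0]         # 1 - ordinary tuple indexing, O(1)
r._fields    # ('x1', 'x2')
r._asdict()  # {'x1': 1, 'x2': 'foo'} (an OrderedDict on older Python versions)
{code}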

[jira] [Updated] (SPARK-29884) spark-submit to kuberentes can not parse valid ca certificate

2019-11-13 Thread Jeremy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy updated SPARK-29884:
---
Summary: spark-submit to kuberentes can not parse valid ca certificate  
(was: spark-Submit to kuberentes can not parse valid ca certificate)

> spark-submit to kuberentes can not parse valid ca certificate
> -
>
> Key: SPARK-29884
> URL: https://issues.apache.org/jira/browse/SPARK-29884
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4
> Environment: A kuberentes cluster that has been in use for over 2 
> years and handles large amounts of production payloads.
>Reporter: Jeremy
>Priority: Major
>
> spark submit can not be used to to schedule to kuberentes with oauth token 
> and cacert
> {code:java}
> spark-submit \
> --deploy-mode cluster \
> --class org.apache.spark.examples.SparkPi \
> --master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \
> --conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf 
> spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt
>  \
> --conf spark.kubernetes.namespace=here-olp-3dds-sit \
> --conf spark.executor.instances=1 \
> --conf spark.app.name=spark-pi \
> --conf 
> spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
>  \
> --conf 
> spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
>  \
> local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
> {code}
> returns
> {code:java}
> log4j:WARN No appenders could be found for logger 
> (io.fabric8.kubernetes.client.Config).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183)
>   at 
> org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84)
>   at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235)
>   at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542)
>   at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
>   at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.security.cert.CertificateException: Could not parse 
> certificate: java.io.IOException: Empty input
>   at 
> sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110)
>   at 
> java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197)
>   at 
> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>   at 
> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78)
>   ... 13 more
> Caused by: java.io.IOException: Empty input
>   at 
> sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106)
>   ... 19 more
> {code}
> The cacert and token are both valid and work even with curl
> {code:java}
> 

[jira] [Updated] (SPARK-29884) spark-Submit to kuberentes can not parse valid ca certificate

2019-11-13 Thread Jeremy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy updated SPARK-29884:
---
Description: 
spark submit can not be used to to schedule to kuberentes with oauth token and 
cacert
{code:java}
spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \
--conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf 
spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt
 \
--conf spark.kubernetes.namespace=here-olp-3dds-sit \
--conf spark.executor.instances=1 \
--conf spark.app.name=spark-pi \
--conf 
spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
 \
--conf 
spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
{code}
returns
{code:java}
log4j:WARN No appenders could be found for logger 
(io.fabric8.kubernetes.client.Config).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183)
at 
org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.security.cert.CertificateException: Could not parse 
certificate: java.io.IOException: Empty input
at 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110)
at 
java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
at 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104)
at 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197)
at 
io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
at 
io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122)
at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78)
... 13 more
Caused by: java.io.IOException: Empty input
at 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106)
... 19 more
{code}
The cacert and token are both valid and work even with curl
{code:java}
curl --cacert /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt -H 
"Authorization: bearer $TOKEN" -v 
https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com/api/v1/namespaces/here-olp-3dds-sit/pods
 -o out
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
  0 00 00 0  0  0 --:--:-- --:--:-- --:--:-- 0* 
  Trying 10.117.233.37:443...
* TCP_NODELAY set
* Connected to api.borg-dev-1-aws-eu-west-1.k8s.in.here.com (10.117.233.37) 
port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt
  CApath: none
} [5 bytes data]
* TLSv1.3 (OUT), TLS 

[jira] [Created] (SPARK-29884) spark-Submit to kuberentes can not parse valid ca certificate

2019-11-13 Thread Jeremy (Jira)
Jeremy created SPARK-29884:
--

 Summary: spark-Submit to kuberentes can not parse valid ca 
certificate
 Key: SPARK-29884
 URL: https://issues.apache.org/jira/browse/SPARK-29884
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.4
 Environment: A kuberentes cluster that has been in use for over 2 
years and handles large amounts of production payloads.
Reporter: Jeremy


spark submit can not be used to to schedule to kuberentes with oauth token and 
cacert
{code:java}
spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com \
--conf spark.kubernetes.authenticate.submission.oauthToken=$TOKEN \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf 
spark.kubernetes.authenticate.submission.caCertFile=/home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt
 \
--conf spark.kubernetes.namespace=here-olp-3dds-sit \
--conf spark.executor.instances=1 \
--conf spark.app.name=spark-pi \
--conf 
spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
 \
--conf 
spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
 \
local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar
{code}
returns
{code:java}
log4j:WARN No appenders could be found for logger 
(io.fabric8.kubernetes.client.Config).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:183)
at 
org.apache.spark.deploy.k8s.SparkKubernetesClientFactory$.createKubernetesClient(SparkKubernetesClientFactory.scala:84)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$4.apply(KubernetesClientApplication.scala:235)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2542)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:241)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:204)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.security.cert.CertificateException: Could not parse 
certificate: java.io.IOException: Empty input
at 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:110)
at 
java.security.cert.CertificateFactory.generateCertificate(CertificateFactory.java:339)
at 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:104)
at 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:197)
at 
io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
at 
io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122)
at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:78)
... 13 more
Caused by: java.io.IOException: Empty input
at 
sun.security.provider.X509Factory.engineGenerateCertificate(X509Factory.java:106)
... 19 more
{code}
The cacert and token are both valid and work even with curl
{code:java}
curl --cacert /home/jeremybr/.kube/borg-dev-1-aws-eu-west-1.crt -H 
"Authorization: bearer $TOKEN" -v 
https://api.borg-dev-1-aws-eu-west-1.k8s.in.here.com/api/v1/namespaces/here-olp-3dds-sit/pods
 -o out
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
  0 00 00 0  0  0 --:--:-- --:--:-- --:--:-- 0* 
  Trying 10.117.233.37:443...
* 

[jira] [Updated] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29152:
--
Summary: Spark Executor Plugin API shutdown is not proper when dynamic 
allocation enabled  (was: Spark Executor Plugin API shutdown is not proper when 
dynamic allocation enabled[SPARK-24918])

> Spark Executor Plugin API shutdown is not proper when dynamic allocation 
> enabled
> 
>
> Key: SPARK-29152
> URL: https://issues.apache.org/jira/browse/SPARK-29152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Major
>
> *Issue Description*
> Spark Executor Plugin API *shutdown handling is not proper*, when dynamic 
> allocation enabled .Plugin's shutdown method is not processed when dynamic 
> allocation is enabled and *executors become dead* after inactive time.
> *Test Precondition*
> 1. Create a plugin and make a jar named SparkExecutorplugin.jar
> import org.apache.spark.ExecutorPlugin;
> public class ExecutoTest1 implements ExecutorPlugin{
> public void init(){
> System.out.println("Executor Plugin Initialised.");
> }
> public void shutdown(){
> System.out.println("Executor plugin closed successfully.");
> }
> }
> 2. Create the  jars with the same and put it in folder /spark/examples/jars
> *Test Steps*
> 1. launch bin/spark-sql with dynamic allocation enabled
> ./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1  --jars 
> /opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=2 --conf 
> spark.dynamicAllocation.minExecutors=1
> 2 create a table , insert the data and select * from tablename
> 3.Check the spark UI Jobs tab/SQL tab
> 4. Check all Executors(executor tab will give all executors details) 
> application log file for Executor plugin Initialization and Shutdown messages 
> or operations.
> Example 
> /yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/
>  stdout
> 5. Wait for the executor to be dead after the inactive time and check the 
> same container log 
> 6. Kill the spark sql and check the container log  for executor plugin 
> shutdown.
> *Expect Output*
> 1. Job should be success. Create table ,insert and select query should be 
> success.
> 2.While running query All Executors  log should contain the executor plugin 
> Init messages or operations.
> "Executor Plugin Initialised.
> 3.Once the executors are dead ,shutdown message should be there in log file.
> “ Executor plugin closed successfully.
> 4.Once the sql application closed ,shutdown message should be there in log.
> “ Executor plugin closed successfully". 
> *Actual Output*
> Shutdown message is not called when executor is dead after inactive time.
> *Observation*
> Without dynamic allocation Executor plugin is working fine. But after 
> enabling dynamic allocation,Executor shutdown is not processed.






[jira] [Updated] (SPARK-29152) Spark Executor Plugin API shutdown is not proper when dynamic allocation enabled

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29152:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> Spark Executor Plugin API shutdown is not proper when dynamic allocation 
> enabled
> 
>
> Key: SPARK-29152
> URL: https://issues.apache.org/jira/browse/SPARK-29152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: jobit mathew
>Priority: Major
>
> *Issue Description*
> Spark Executor Plugin API *shutdown handling is not proper*, when dynamic 
> allocation enabled .Plugin's shutdown method is not processed when dynamic 
> allocation is enabled and *executors become dead* after inactive time.
> *Test Precondition*
> 1. Create a plugin and make a jar named SparkExecutorplugin.jar
> import org.apache.spark.ExecutorPlugin;
> public class ExecutoTest1 implements ExecutorPlugin{
> public void init(){
> System.out.println("Executor Plugin Initialised.");
> }
> public void shutdown(){
> System.out.println("Executor plugin closed successfully.");
> }
> }
> 2. Create the  jars with the same and put it in folder /spark/examples/jars
> *Test Steps*
> 1. launch bin/spark-sql with dynamic allocation enabled
> ./spark-sql --master yarn --conf spark.executor.plugins=ExecutoTest1  --jars 
> /opt/HA/C10/install/spark/spark/examples/jars/SparkExecutorPlugin.jar --conf 
> spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=2 --conf 
> spark.dynamicAllocation.minExecutors=1
> 2 create a table , insert the data and select * from tablename
> 3.Check the spark UI Jobs tab/SQL tab
> 4. Check all Executors(executor tab will give all executors details) 
> application log file for Executor plugin Initialization and Shutdown messages 
> or operations.
> Example 
> /yarn/logdir/application_1567156749079_0025/container_e02_1567156749079_0025_01_05/
>  stdout
> 5. Wait for the executor to be dead after the inactive time and check the 
> same container log 
> 6. Kill the spark sql and check the container log  for executor plugin 
> shutdown.
> *Expect Output*
> 1. Job should be success. Create table ,insert and select query should be 
> success.
> 2.While running query All Executors  log should contain the executor plugin 
> Init messages or operations.
> "Executor Plugin Initialised.
> 3.Once the executors are dead ,shutdown message should be there in log file.
> “ Executor plugin closed successfully.
> 4.Once the sql application closed ,shutdown message should be there in log.
> “ Executor plugin closed successfully". 
> *Actual Output*
> Shutdown message is not called when executor is dead after inactive time.
> *Observation*
> Without dynamic allocation Executor plugin is working fine. But after 
> enabling dynamic allocation,Executor shutdown is not processed.






[jira] [Commented] (SPARK-29878) Improper cache strategies in GraphX

2019-11-13 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973637#comment-16973637
 ] 

Aman Omer commented on SPARK-29878:
---

[https://github.com/apache/spark/pull/7469] is the PR that introduced the 
cache() calls on newEdges and vertices.

> Improper cache strategies in GraphX
> ---
>
> Key: SPARK-29878
> URL: https://issues.apache.org/jira/browse/SPARK-29878
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> I have run examples.graphx.SSPExample and looked through the RDD dependency 
> graphs as well as persist operations. There are some improper cache 
> strategies in GraphX. The same situations also exist when I run 
> ConnectedComponentsExample.
> 1.  vertices.cache() and newEdges.cache() are unnecessary
> In SSPExample, a graph is initialized by GraphImpl.mapVertices(). In this 
> method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), 
> and RDD vertices/newEdges are cached in apply(). But these two RDDs are not 
> directly used anymore (their children RDDs has been cached) in SSPExample, so 
> the persists can be unnecessary here. 
> However, the other examples may need these two persists, so I think they 
> cannot be simply removed. It might be hard to fix.
> {code:scala}
>   def apply[VD: ClassTag, ED: ClassTag](
>   vertices: VertexRDD[VD],
>   edges: EdgeRDD[ED]): GraphImpl[VD, ED] = {
> vertices.cache() // It is unnecessary for SSPExample and 
> ConnectedComponentsExample
> // Convert the vertex partitions in edges to the correct type
> val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]]
>   .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD])
>   .cache() // It is unnecessary for SSPExample and 
> ConnectedComponentsExample
> GraphImpl.fromExistingRDDs(vertices, newEdges)
>   }
> {code}
> 2. Missing persist on newEdges
> SSSPExample will invoke pregel to do execution. Pregel will ultilize 
> ReplicatedVertexView.upgrade(). I find that RDD newEdges will be directly use 
> by multiple actions in Pregel. So newEdges should be persisted.
> Same as the above issue, this issue is also found in 
> ConnectedComponentsExample. It is also hard to fix, because the persist added 
> may be unnecessary for other examples.
> {code:scala}
> // Pregel.scala
> // compute the messages
> var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // 
> newEdges is created here
> val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
>   checkpointInterval, graph.vertices.sparkContext)
> messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
> var activeMessages = messages.count() // The first time use newEdges
> ...
> while (activeMessages > 0 && i < maxIterations) {
>   // Receive the messages and update the vertices.
>   prevG = g
>   g = g.joinVertices(messages)(vprog) // Generate g will depends on 
> newEdges
>   ...
>   activeMessages = messages.count() // The second action to use newEdges. 
> newEdges should be unpersisted after this instruction.
> {code}
> {code:scala}
> // ReplicatedVertexView.scala
>   def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: 
> Boolean): Unit = {
>   ...
>val newEdges = 
> edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) {
> (ePartIter, shippedVertsIter) => ePartIter.map {
>   case (pid, edgePartition) =>
> (pid, 
> edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator)))
> }
>   })
>   edges = newEdges // newEdges should be persisted
>   hasSrcId = includeSrc
>   hasDstId = includeDst
> }
>   }
> {code}
> As I don't have much knowledge about Graphx, so I don't know how to fix these 
> issues well.
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detecting persist()/unpersist() api misuses.






[jira] [Commented] (SPARK-29883) Improve error messages when function name is an alias

2019-11-13 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973633#comment-16973633
 ] 

Aman Omer commented on SPARK-29883:
---

Working on this.

> Improve error messages when function name is an alias
> -
>
> Key: SPARK-29883
> URL: https://issues.apache.org/jira/browse/SPARK-29883
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
>  
> A general issue in error message when the function name is just an alias name 
> of the actual built-in function. For example, every is an alias of bool_and 
> in Spark 3.0 
> {code:java}
> cannot resolve 'every('true')' due to data type mismatch: Input to function 
> 'every' should have been boolean, but it's [string].; line 1 pos 7 
> {code}
> {code:java}
> cannot resolve 'bool_and('true')' due to data type mismatch: Input to 
> function 'bool_and' should have been boolean, but it's [string].; line 1 pos 
> 7{code}






[jira] [Created] (SPARK-29883) Improve error messages when function name is an alias

2019-11-13 Thread Xiao Li (Jira)
Xiao Li created SPARK-29883:
---

 Summary: Improve error messages when function name is an alias
 Key: SPARK-29883
 URL: https://issues.apache.org/jira/browse/SPARK-29883
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


 

A general issue in error message when the function name is just an alias name 
of the actual built-in function. For example, every is an alias of bool_and in 
Spark 3.0 
{code:java}
cannot resolve 'every('true')' due to data type mismatch: Input to function 
'every' should have been boolean, but it's [string].; line 1 pos 7 
{code}
{code:java}
cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 
'bool_and' should have been boolean, but it's [string].; line 1 pos 7{code}






[jira] [Resolved] (SPARK-29882) SPARK on Kubernetes is Broken for SPARK with no Hadoop release

2019-11-13 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29882.

   Fix Version/s: (was: 2.4.4)
Target Version/s:   (was: 2.4.4)
  Resolution: Duplicate

> SPARK on Kubernetes is Broken for SPARK with no Hadoop release
> --
>
> Key: SPARK-29882
> URL: https://issues.apache.org/jira/browse/SPARK-29882
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4
> Environment: h3. How was this patch tested?
> Kubernetes 1.14, Spark 2.4.4, Hadoop 3.2.1. Adding $SPARK_DIST_CLASSPATH to 
> {{-cp }} param of entrypoint.sh enables launching the executors correctly.
>Reporter: Shahin Shakeri
>Priority: Major
>
> h3. What changes were proposed in this pull request?
> Include {{$SPARK_DIST_CLASSPATH}} in class path when launching 
> {{CoarseGrainedExecutorBackend}} on Kubernetes executors using the provided 
> {{entrypoint.sh}}
> h3. Why are the changes needed?
> For user provided Hadoop (3.2.1 in this example) {{$SPARK_DIST_CLASSPATH}} 
> contains the required jars.
> h3. Does this PR introduce any user-facing change?
> no
>  






[jira] [Assigned] (SPARK-29875) Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29875:
-

Assignee: Hyukjin Kwon

> Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x
> --
>
> Key: SPARK-29875
> URL: https://issues.apache.org/jira/browse/SPARK-29875
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> In Spark 2.4.x, if we use PyArrow higher then 0.12.0, it shows a bunch of 
> warnings as below:
> {code}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
> def add_one(x):
> return x + 1
> spark.range(100).select(add_one("id")).collect()
> {code}
> {code}
> UserWarning: pyarrow.open_stream is deprecated, please use 
> pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> {code}






[jira] [Resolved] (SPARK-29875) Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29875.
---
Fix Version/s: 2.4.5
   Resolution: Fixed

Issue resolved by pull request 26501
[https://github.com/apache/spark/pull/26501]

> Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x
> --
>
> Key: SPARK-29875
> URL: https://issues.apache.org/jira/browse/SPARK-29875
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.5
>
>
> In Spark 2.4.x, if we use PyArrow higher then 0.12.0, it shows a bunch of 
> warnings as below:
> {code}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
> def add_one(x):
> return x + 1
> spark.range(100).select(add_one("id")).collect()
> {code}
> {code}
> UserWarning: pyarrow.open_stream is deprecated, please use 
> pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> /usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
> pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
>   warnings.warn("pyarrow.open_stream is deprecated, please use "
> {code}






[jira] [Resolved] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2019-11-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23151.
---
Resolution: Duplicate

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication 
> is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.






[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2019-11-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973546#comment-16973546
 ] 

Dongjoon Hyun commented on SPARK-23151:
---

Yes. We did as we shipped in 3.0-preview release.

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication 
> is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.






[jira] [Assigned] (SPARK-29149) Update YARN cluster manager For Stage Level Scheduling

2019-11-13 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-29149:
-

Assignee: Thomas Graves

> Update YARN cluster manager For Stage Level Scheduling
> --
>
> Key: SPARK-29149
> URL: https://issues.apache.org/jira/browse/SPARK-29149
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> For Stage Level Scheduling, we need to update the YARN allocator to handle 
> requesting executors for multiple ResourceProfiles.
>  * The container requests have to be updated to be based on a number of 
> containers per ResourceProfile.  This is a larger change then you might 
> expect because on YARN you can’t ask for different container sizes within the 
> same YARN container priority.  So we will have to ask for containers at 
> different priorities to be able to get different container sizes. Other YARN 
> applications like Tez handle this now so it shouldn’t be a big deal just a 
> matter of mapping stages to different priorities.
>  * The allocation response from YARN has to match the containers to a 
> resource profile.
>  * We need to launch the container with additional parameters so the executor 
> knows its resource profile to report back to the ExecutorMonitor.  
>  






[jira] [Created] (SPARK-29882) SPARK on Kubernetes is Broken for SPARK with no Hadoop release

2019-11-13 Thread Shahin Shakeri (Jira)
Shahin Shakeri created SPARK-29882:
--

 Summary: SPARK on Kubernetes is Broken for SPARK with no Hadoop 
release
 Key: SPARK-29882
 URL: https://issues.apache.org/jira/browse/SPARK-29882
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.4
 Environment: h3. How was this patch tested?

Kubernetes 1.14, Spark 2.4.4, Hadoop 3.2.1. Adding $SPARK_DIST_CLASSPATH to 
{{-cp }} param of entrypoint.sh enables launching the executors correctly.
Reporter: Shahin Shakeri
 Fix For: 2.4.4


h3. What changes were proposed in this pull request?

Include {{$SPARK_DIST_CLASSPATH}} in class path when launching 
{{CoarseGrainedExecutorBackend}} on Kubernetes executors using the provided 
{{entrypoint.sh}}
h3. Why are the changes needed?

For user provided Hadoop (3.2.1 in this example) {{$SPARK_DIST_CLASSPATH}} 
contains the required jars.
h3. Does this PR introduce any user-facing change?

no

 






[jira] [Resolved] (SPARK-29872) Improper cache strategy in examples

2019-11-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29872.
--
Resolution: Won't Fix

> Improper cache strategy in examples
> ---
>
> Key: SPARK-29872
> URL: https://issues.apache.org/jira/browse/SPARK-29872
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Minor
>
> 1. Improper cache in examples.SparkTC
> The RDD edges should be cached because it is used multiple times in while 
> loop. And it should be unpersisted before the last action tc.count(), because 
> tc has been persisted.
> On the other hand, many tc objects is cached in while loop but never 
> uncached, which will waste memory.
> {code:scala}
> val edges = tc.map(x => (x._2, x._1)) // Edges should be cached
> // This join is iterated until a fixed point is reached.
> var oldCount = 0L
> var nextCount = tc.count()
> do { 
>   oldCount = nextCount
>   // Perform the join, obtaining an RDD of (y, (z, x)) pairs,
>   // then project the result to obtain the new (x, z) paths.
>   tc = tc.union(tc.join(edges).map(x => (x._2._2, 
> x._2._1))).distinct().cache()
>   nextCount = tc.count()
> } while (nextCount != oldCount)
> println(s"TC has ${tc.count()} edges.")
> {code}
> 2. Cache needed in examples.ml.LogisticRegressionSummary
> The DataFrame fMeasure should be cached.
> {code:scala}
> // Set the model threshold to maximize F-Measure
> val fMeasure = trainingSummary.fMeasureByThreshold // fMeasures should be 
> cached
> val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
> val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
>   .select("threshold").head().getDouble(0)
> lrModel.setThreshold(bestThreshold)
> {code}
> 3. Cache needed in examples.sql.SparkSQLExample
> {code:scala}
> val peopleDF = spark.sparkContext
>   .textFile("examples/src/main/resources/people.txt")
>   .map(_.split(","))
>   .map(attributes => Person(attributes(0), attributes(1).trim.toInt)) // 
> This RDD should be cached
>   .toDF()
> // Register the DataFrame as a temporary view
> peopleDF.createOrReplaceTempView("people")
> val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age 
> BETWEEN 13 AND 19")
> teenagersDF.map(teenager => "Name: " + teenager(0)).show()
> teenagersDF.map(teenager => "Name: " + 
> teenager.getAs[String]("name")).show()
> implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, 
> Any]]
> teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", 
> "age"))).collect()
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24353) Add support for pod affinity/anti-affinity

2019-11-13 Thread Rafael Felix Correa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973392#comment-16973392
 ] 

Rafael Felix Correa edited comment on SPARK-24353 at 11/13/19 2:35 PM:
---

Hi all,

I gave this problem a shot in my PR 
[https://github.com/apache/spark/pull/26505], which implements tolerations in 
a slightly different way than the proposed one because the key field is 
optional (and therefore cannot be used as a unique identifier). It doesn't 
entirely close this issue because pod affinity/anti-affinity is not covered in 
my PR.

I have a custom build with this patch already working internally here at OLX; 
please have a look at the PR if you have time. Hope it helps more people as 
well :)

 

Cheers,


was (Author: rafaelfc):
Hi all,

I gave this problem a shot in my PR 
[https://github.com/apache/spark/pull/26505], which implements tolerations it 
in a slightly different way than the proposed one due to the key field being 
optional (therefore it cannot be used as a unique identifier).

I have a custom build with this patch already working internally here in OLX, 
please have a look at the PR if you have time. Hope it helps more people as 
well :) 

 

Cheers,

> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
> nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. The aim here is to bring support for this feature to Spark on K8s, 
> in detail:
>  * Node affinity
>  * Toleration/taints
>  * Inter-Pod affinity/anti-affinity
> Note that nodeSelector will be deprecated in the future.
> Design doc: 
> [https://docs.google.com/document/d/1izk75I4A0I-nJaE57m7wkpgUZTXM0o6-c0Tb-NxdOMU|https://docs.google.com/document/d/1izk75I4A0I-nJaE57m7wkpgUZTXM0o6-c0Tb-NxdOMU/edit?usp=sharing]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24353) Add support for pod affinity/anti-affinity

2019-11-13 Thread Rafael Felix Correa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973392#comment-16973392
 ] 

Rafael Felix Correa commented on SPARK-24353:
-

Hi all,

I gave this problem a shot in my PR 
[https://github.com/apache/spark/pull/26505], which implements tolerations in 
a slightly different way than the proposed one because the key field is 
optional (and therefore cannot be used as a unique identifier).

I have a custom build with this patch already working internally here at OLX; 
please have a look at the PR if you have time. Hope it helps more people as 
well :)

 

Cheers,

> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
> nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. The aim here is to bring support for this feature to Spark on K8s, 
> in detail:
>  * Node affinity
>  * Toleration/taints
>  * Inter-Pod affinity/anti-affinity
> Note that nodeSelector will be deprecated in the future.
> Design doc: 
> [https://docs.google.com/document/d/1izk75I4A0I-nJaE57m7wkpgUZTXM0o6-c0Tb-NxdOMU|https://docs.google.com/document/d/1izk75I4A0I-nJaE57m7wkpgUZTXM0o6-c0Tb-NxdOMU/edit?usp=sharing]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29808) StopWordsRemover should support multi-cols

2019-11-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29808.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26480
[https://github.com/apache/spark/pull/26480]

> StopWordsRemover should support multi-cols
> --
>
> Key: SPARK-29808
> URL: https://issues.apache.org/jira/browse/SPARK-29808
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> As a basic Transformer, StopWordsRemover should support multiple input/output 
> columns.
> The stopWords param can be applied across all columns.
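> A minimal sketch of the multi-column usage this ticket asks for (a sketch 
> only, assuming the setInputCols/setOutputCols API targeted at Spark 3.0 and a 
> DataFrame df with two string-array columns):
> {code:scala}
> import org.apache.spark.ml.feature.StopWordsRemover
>
> val remover = new StopWordsRemover()
>   .setStopWords(StopWordsRemover.loadDefaultStopWords("english")) // one shared stop-word list
>   .setInputCols(Array("raw1", "raw2"))
>   .setOutputCols(Array("filtered1", "filtered2"))
>
> // remover.transform(df) removes the same stop words from both input columns.
> {code}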



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29823.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26483
[https://github.com/apache/spark/pull/26483]

> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Assignee: Aman Omer
>Priority: Major
> Fix For: 3.0.0
>
>
> In mllib.clustering.KMeans.run(), the RDD _norms_ is persisted, but _norms_ 
> only has a single child, i.e. the RDD _zippedData_, which is not persisted. 
> So all the actions that rely on _norms_ also rely on _zippedData_. The RDD 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
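> A minimal sketch of the change suggested above (a sketch only; runAlgorithm 
> and the rest of the method are assumed unchanged):
> {code:scala}
> // Compute squared norms, then persist the RDD that is actually reused.
> val norms = data.map(Vectors.norm(_, 2.0))
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> }.persist()            // zippedData is used multiple times in runAlgorithm()
> val model = runAlgorithm(zippedData, instr)
> zippedData.unpersist() // release the cache once training is done
> {code}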
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29823:


Assignee: Aman Omer

> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Assignee: Aman Omer
>Priority: Major
>
> In mllib.clustering.KMeans.run(), the RDD _norms_ is persisted, but _norms_ 
> only has a single child, i.e. the RDD _zippedData_, which is not persisted. 
> So all the actions that rely on _norms_ also rely on _zippedData_. The RDD 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29823:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

These aren't bugs.

> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Assignee: Aman Omer
>Priority: Minor
> Fix For: 3.0.0
>
>
> In mllib.clustering.KMeans.run(), the RDD _norms_ is persisted, but _norms_ 
> only has a single child, i.e. the RDD _zippedData_, which is not persisted. 
> So all the actions that rely on _norms_ also rely on _zippedData_. The RDD 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28168) release-build.sh for Hadoop-3.2

2019-11-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-28168.
-
Resolution: Duplicate

This issue was fixed by SPARK-29608.

> release-build.sh for Hadoop-3.2
> ---
>
> Key: SPARK-28168
> URL: https://issues.apache.org/jira/browse/SPARK-28168
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hadoop 3.2 support was added. We might have to add Hadoop 3.2 to 
> release-build.sh.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2019-11-13 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973372#comment-16973372
 ] 

Yuming Wang commented on SPARK-23151:
-

[~dongjoon] Is this issue fixed by SPARK-29608?

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The 
> implication is that using up-to-date Kinesis libraries alongside S3 causes a 
> clash w.r.t. aws-java-sdk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29881) Introduce API for manually breaking up dataset plan

2019-11-13 Thread Devyn Cairns (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devyn Cairns updated SPARK-29881:
-
Description: 
I have an interesting situation where I'm calling functions that are relatively 
expensive from Spark SQL, and then using the result several times in a loop 
through {{transform}}.

Although the WholeStageCodegen is usually helpful, it always calls expressions 
as they're used, which means that in the case of, for example:

{{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
 {{FROM (}}
 {{  SELECT expensive_operation(foo) AS expensive_result FROM source}}
 {{)}}

the expensive_operation function will almost certainly be called 32 times for 
each source row, without any explicit way to cache that value intermediately.

I've found that a workaround for now is to insert something like 
{{.filter \{ _ => true }}} in the middle, which creates a barrier to 
whole-stage codegen without much negative impact, aside from preventing other 
optimizations like PushDown. This does indeed produce the intended result, and 
expensive_operation is only run once.

But it would be great to have an API on Dataset like {{.barrier()}} to 
introduce an explicit barrier to whole-stage codegen without adding any 
additional behavior or getting in the way of any PushDown optimizations.

  was:
I have an interesting situation where I'm calling functions that are relatively 
expensive from Spark SQL, and then using the result several times in a loop 
through {{transform}}.

Although the WholeStageCodegen is usually helpful, it always calls expressions 
as they're used, which means that in the case of, for example:

{{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
{{FROM (}}
{{  SELECT expensive_operation(foo) AS expensive_result FROM source}}
{{)}}

the expensive_operation function will almost certainly be called 32 times for 
each source row, without any explicit way to cache that value intermediately.

I've found a workaround for now is to insert something like {{.filter \{ _ => 
true }}} in the middle, which will create a barrier to whole-stage codegen 
without much negative impact, aside from preventing other optimizations like 
PushDown. This does indeed produce the intended result and expensive_operation 
is only run once.

But it would be great to have an API on Dataset like {{.barrier()}} to 
introduce an explicit barrier to whole-stage codegen without adding any 
additional behavior or getting in the way of any PushDown optimizations.


> Introduce API for manually breaking up dataset plan
> ---
>
> Key: SPARK-29881
> URL: https://issues.apache.org/jira/browse/SPARK-29881
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Devyn Cairns
>Priority: Trivial
>
> I have an interesting situation where I'm calling functions that are 
> relatively expensive from Spark SQL, and then using the result several times 
> in a loop through {{transform}}.
> Although the WholeStageCodegen is usually helpful, it always calls 
> expressions as they're used, which means that in the case of, for example:
> {{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
>  {{FROM (}}
>  {{  SELECT expensive_operation(foo) AS expensive_result FROM source}}
>  {{)}}
> the expensive_operation function will almost certainly be called 32 times for 
> each source row, without any explicit way to cache that value intermediately.
> I've found that a workaround for now is to insert something like 
> {{.filter \{ _ => true }}} in the middle, which creates a barrier to 
> whole-stage codegen without much negative impact, aside from preventing other 
> optimizations like PushDown. This does indeed produce the intended result, 
> and expensive_operation is only run once.
> But it would be great to have an API on Dataset like {{.barrier()}} to 
> introduce an explicit barrier to whole-stage codegen without adding any 
> additional behavior or getting in the way of any PushDown optimizations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29881) Introduce API for manually breaking up dataset plan

2019-11-13 Thread Devyn Cairns (Jira)
Devyn Cairns created SPARK-29881:


 Summary: Introduce API for manually breaking up dataset plan
 Key: SPARK-29881
 URL: https://issues.apache.org/jira/browse/SPARK-29881
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.4.4
Reporter: Devyn Cairns


I have an interesting situation where I'm calling functions that are relatively 
expensive from Spark SQL, and then using the result several times in a loop 
through {{transform}}.

Although the WholeStageCodegen is usually helpful, it always calls expressions 
as they're used, which means that in the case of, for example:

{{SELECT transform(sequence(0, 32), x -> expensive_result * x)}}
{{FROM (}}
{{  SELECT expensive_operation(foo) AS expensive_result FROM source}}
{{)}}

the expensive_operation function will almost certainly be called 32 times for 
each source row, without any explicit way to cache that value intermediately.

I've found that a workaround for now is to insert something like 
{{.filter \{ _ => true }}} in the middle, which creates a barrier to 
whole-stage codegen without much negative impact, aside from preventing other 
optimizations like PushDown. This does indeed produce the intended result, and 
expensive_operation is only run once.

But it would be great to have an API on Dataset like {{.barrier()}} to 
introduce an explicit barrier to whole-stage codegen without adding any 
additional behavior or getting in the way of any PushDown optimizations.
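
For illustration, a minimal sketch of the workaround described above (a sketch 
only: expensive_operation stands in for a hypothetical registered UDF, and the 
source table/column names are made up):
{code:scala}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("codegen-barrier-sketch").getOrCreate()

// A typed filter that keeps every row; passing it as a plain function value
// mirrors the .filter { _ => true } trick and breaks whole-stage codegen here.
val alwaysTrue: Row => Boolean = _ => true

val expensive = spark.table("source")
  .selectExpr("expensive_operation(foo) AS expensive_result")
  .filter(alwaysTrue)   // expensive_result is materialized once per row here

val result = expensive
  .selectExpr("transform(sequence(0, 32), x -> expensive_result * x) AS products")
{code}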



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29863) rename EveryAgg/AnyAgg to BoolAnd/BoolOr

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29863.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26486
[https://github.com/apache/spark/pull/26486]

> rename EveryAgg/AnyAgg to BoolAnd/BoolOr
> 
>
> Key: SPARK-29863
> URL: https://issues.apache.org/jira/browse/SPARK-29863
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29753) refine the default catalog config

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29753.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26395
[https://github.com/apache/spark/pull/26395]

> refine the default catalog config
> -
>
> Key: SPARK-29753
> URL: https://issues.apache.org/jira/browse/SPARK-29753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29877) static PageRank allow checkPoint from previous computations

2019-11-13 Thread Joan Fontanals (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joan Fontanals updated SPARK-29877:
---
Priority: Major  (was: Minor)

> static PageRank allow checkPoint from previous computations
> ---
>
> Key: SPARK-29877
> URL: https://issues.apache.org/jira/browse/SPARK-29877
> Project: Spark
>  Issue Type: Wish
>  Components: GraphX
>Affects Versions: 2.3.0
>Reporter: Joan Fontanals
>Priority: Major
>  Labels: graphx, pagerank
> Fix For: 2.3.0
>
>
> It would be really helpful to have the possibility, when computing 
> staticPageRank, to use a previous computation as a checkpoint and continue 
> the iterations.
> I have made a small code proposal, but there is a problem: at the end of the 
> iterations a normalization step is performed.
> Therefore, if we use this normalized graph as a checkpoint for a new set of 
> iterations, the algorithm will not restart in a coherent way.
> I created a branch in my fork: 
> [https://github.com/JoanFM/spark/tree/pageRank_checkPoint]
> I hope you can consider this feature; if you do, you could implement it 
> yourselves or suggest a way for me to handle this identified problem.
> Thank you very much,
> Best regards,
> Joan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29880) Handle submit exception when target hadoop cluster is Federation

2019-11-13 Thread zhoukang (Jira)
zhoukang created SPARK-29880:


 Summary: Handle submit exception when target hadoop cluster is 
Federation
 Key: SPARK-29880
 URL: https://issues.apache.org/jira/browse/SPARK-29880
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: zhoukang


When we submit an application to a federated YARN cluster, the submission exits 
with failure because getYarnClusterMetrics is not implemented there. A sketch 
of a possible mitigation follows the snippet below.
{code:java}
def submitApplication(): ApplicationId = {
  ResourceRequestHelper.validateResources(sparkConf)

  var appId: ApplicationId = null
  try {
    launcherBackend.connect()
    yarnClient.init(hadoopConf)
    yarnClient.start()

    logInfo("Requesting a new application from cluster with %d NodeManagers"
      .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
{code}
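
One possible direction (a sketch only, not the actual change; it reuses the 
names from the snippet above and Scala's util.Try) is to tolerate clusters 
that do not implement getYarnClusterMetrics instead of failing the submission:
{code:scala}
import scala.util.{Failure, Success, Try}

Try(yarnClient.getYarnClusterMetrics.getNumNodeManagers) match {
  case Success(numNodeManagers) =>
    logInfo(s"Requesting a new application from cluster with $numNodeManagers NodeManagers")
  case Failure(e) =>
    // e.g. a federated cluster where this API is unsupported
    logWarning(s"Could not fetch YARN cluster metrics, continuing submission: ${e.getMessage}")
}
{code}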




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29879) Support unbounded waiting for RPC reply

2019-11-13 Thread wenxuanguan (Jira)
wenxuanguan created SPARK-29879:
---

 Summary: Support unbounded waiting for RPC reply
 Key: SPARK-29879
 URL: https://issues.apache.org/jira/browse/SPARK-29879
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: wenxuanguan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29835) Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29835.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26464
[https://github.com/apache/spark/pull/26464]

> Remove the unnecessary conversion from Statement to LogicalPlan for 
> DELETE/UPDATE
> -
>
> Key: SPARK-29835
> URL: https://issues.apache.org/jira/browse/SPARK-29835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> The current parse and analyze flow for DELETE is: 1) the SQL string is first 
> parsed to `DeleteFromStatement`; 2) the `DeleteFromStatement` is then 
> converted to `DeleteFromTable`. However, the SQL string can be parsed to 
> `DeleteFromTable` directly, which makes `DeleteFromStatement` redundant.
> It is the same for UPDATE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29835) Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29835:
---

Assignee: Xianyin Xin

> Remove the unnecessary conversion from Statement to LogicalPlan for 
> DELETE/UPDATE
> -
>
> Key: SPARK-29835
> URL: https://issues.apache.org/jira/browse/SPARK-29835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
>
> The current parse and analyze flow for DELETE is: 1) the SQL string is first 
> parsed to `DeleteFromStatement`; 2) the `DeleteFromStatement` is then 
> converted to `DeleteFromTable`. However, the SQL string can be parsed to 
> `DeleteFromTable` directly, which makes `DeleteFromStatement` redundant.
> It is the same for UPDATE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29866) Upper case enum values

2019-11-13 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29866.
--
Resolution: Won't Fix

> Upper case enum values
> --
>
> Key: SPARK-29866
> URL: https://issues.apache.org/jira/browse/SPARK-29866
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Major
>
> Unify naming of enum values and upper case their names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29878) Improper cache strategies in GraphX

2019-11-13 Thread Dong Wang (Jira)
Dong Wang created SPARK-29878:
-

 Summary: Improper cache strategies in GraphX
 Key: SPARK-29878
 URL: https://issues.apache.org/jira/browse/SPARK-29878
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 3.0.0
Reporter: Dong Wang


I have run examples.graphx.SSSPExample and looked through the RDD dependency 
graphs as well as the persist operations. There are some improper cache 
strategies in GraphX. The same situations also exist when I run 
ConnectedComponentsExample.

1. vertices.cache() and newEdges.cache() are unnecessary
In SSSPExample, a graph is initialized by GraphImpl.mapVertices(). In this 
method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), 
and the RDDs vertices/newEdges are cached in apply(). But these two RDDs are 
not directly used anymore in SSSPExample (their child RDDs have been cached), 
so the persists can be unnecessary here.
However, the other examples may need these two persists, so I think they cannot 
be simply removed. It might be hard to fix.
{code:scala}
  def apply[VD: ClassTag, ED: ClassTag](
  vertices: VertexRDD[VD],
  edges: EdgeRDD[ED]): GraphImpl[VD, ED] = {
vertices.cache() // It is unnecessary for SSSPExample and 
ConnectedComponentsExample
// Convert the vertex partitions in edges to the correct type
val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]]
  .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD])
  .cache() // It is unnecessary for SSSPExample and 
ConnectedComponentsExample
GraphImpl.fromExistingRDDs(vertices, newEdges)
  }
{code}

2. Missing persist on newEdges
SSSPExample invokes Pregel to do the execution, and Pregel utilizes 
ReplicatedVertexView.upgrade(). I find that the RDD newEdges is directly used 
by multiple actions in Pregel, so newEdges should be persisted.
As with the above issue, this one is also found in 
ConnectedComponentsExample. It is also hard to fix, because the added persist 
may be unnecessary for other examples.
{code:scala}
// Pregel.scala
// compute the messages
var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // 
newEdges is created here
val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
  checkpointInterval, graph.vertices.sparkContext)
messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
var activeMessages = messages.count() // The first time use newEdges
...
while (activeMessages > 0 && i < maxIterations) {
  // Receive the messages and update the vertices.
  prevG = g
  g = g.joinVertices(messages)(vprog) // Generate g will depends on newEdges
  ...
  activeMessages = messages.count() // The second action to use newEdges. 
newEdges should be unpersisted after this instruction.
{code}
{code:scala}
// ReplicatedVertexView.scala
  def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: 
Boolean): Unit = {
  ...
   val newEdges = 
edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) {
(ePartIter, shippedVertsIter) => ePartIter.map {
  case (pid, edgePartition) =>
(pid, 
edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator)))
}
  })
  edges = newEdges // newEdges should be persisted
  hasSrcId = includeSrc
  hasDstId = includeDst
}
  }
{code}
As I don't have much knowledge about GraphX, I don't know how to fix these 
issues well.

This issue is reported by our tool CacheCheck, which is used to dynamically 
detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29851) V2 Catalog: Default behavior of dropping namespace is cascading

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29851:
---

Assignee: Terry Kim

> V2 Catalog: Default behavior of dropping namespace is cascading
> ---
>
> Key: SPARK-29851
> URL: https://issues.apache.org/jira/browse/SPARK-29851
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Instead of introducing an additional 'cascade' option to dropNamespace(), the 
> default behavior of dropping a namespace will be cascading. Now, to implement 
> the cascade option, the Spark side needs to ensure a namespace is empty 
> before calling dropNamespace().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29851) V2 Catalog: Default behavior of dropping namespace is cascading

2019-11-13 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29851.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26476
[https://github.com/apache/spark/pull/26476]

> V2 Catalog: Default behavior of dropping namespace is cascading
> ---
>
> Key: SPARK-29851
> URL: https://issues.apache.org/jira/browse/SPARK-29851
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> Instead of introducing an additional 'cascade' option to dropNamespace(), the 
> default behavior of dropping a namespace will be cascading. Now, to implement 
> the cascade option, the Spark side needs to ensure a namespace is empty 
> before calling dropNamespace().
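> A minimal sketch of the emptiness check described above (a sketch only; it 
> assumes a catalog implementing both TableCatalog and SupportsNamespaces, and 
> a namespace given as Array[String]):
> {code:scala}
> // Non-cascading drop: verify the namespace is empty first, since
> // dropNamespace() itself now cascades by default.
> if (catalog.listTables(namespace).nonEmpty ||
>     catalog.listNamespaces(namespace).nonEmpty) {
>   throw new IllegalStateException(
>     s"Namespace ${namespace.mkString(".")} is not empty")
> }
> catalog.dropNamespace(namespace)
> {code}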



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29877) static PageRank allow checkPoint from previous computations

2019-11-13 Thread Joan Fontanals (Jira)
Joan Fontanals created SPARK-29877:
--

 Summary: static PageRank allow checkPoint from previous 
computations
 Key: SPARK-29877
 URL: https://issues.apache.org/jira/browse/SPARK-29877
 Project: Spark
  Issue Type: Wish
  Components: GraphX
Affects Versions: 2.3.0
Reporter: Joan Fontanals
 Fix For: 2.3.0


It would be really helpful to have the possibility, when computing 
staticPageRank, to use a previous computation as a checkpoint and continue the 
iterations.

I have made a small code proposal, but there is a problem: at the end of the 
iterations a normalization step is performed.

Therefore, if we use this normalized graph as a checkpoint for a new set of 
iterations, the algorithm will not restart in a coherent way.

I created a branch in my fork: 
[https://github.com/JoanFM/spark/tree/pageRank_checkPoint]

I hope you can consider this feature; if you do, you could implement it 
yourselves or suggest a way for me to handle this identified problem.

Thank you very much,

Best regards,

Joan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29876) Delete/archive file source completed files in separate thread

2019-11-13 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-29876:
--
Summary: Delete/archive file source completed files in separate thread  
(was: Delete/archive file source data in separate thread)

> Delete/archive file source completed files in separate thread
> -
>
> Key: SPARK-29876
> URL: https://issues.apache.org/jira/browse/SPARK-29876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> SPARK-20568 added the possibility to clean up completed files in a streaming 
> query. Deleting/archiving uses the main thread, which can slow down 
> processing. It would be good to do this on separate thread(s).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29876) Delete/archive file source data in separate thread

2019-11-13 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973216#comment-16973216
 ] 

Gabor Somogyi commented on SPARK-29876:
---

I'm working on this.

> Delete/archive file source data in separate thread
> --
>
> Key: SPARK-29876
> URL: https://issues.apache.org/jira/browse/SPARK-29876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> SPARK-20568 added the possibility to clean up completed files in a streaming 
> query. Deleting/archiving uses the main thread, which can slow down 
> processing. It would be good to do this on separate thread(s).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29876) Delete/archive file source data in separate thread

2019-11-13 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-29876:
-

 Summary: Delete/archive file source data in separate thread
 Key: SPARK-29876
 URL: https://issues.apache.org/jira/browse/SPARK-29876
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


SPARK-20568 added the possibility to clean up completed files in a streaming 
query. Deleting/archiving uses the main thread, which can slow down processing. 
It would be good to do this on separate thread(s).
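
For illustration, a minimal sketch of the idea (a sketch only, with made-up 
class and method names, not the eventual Spark implementation): hand 
completed-file cleanup to a dedicated thread so the query's main thread is not 
blocked.
{code:scala}
import java.util.concurrent.Executors
import org.apache.hadoop.fs.{FileSystem, Path}

class AsyncCompletedFileCleaner(fs: FileSystem) {
  // A single background thread keeps deletions/archiving off the micro-batch path.
  private val executor = Executors.newSingleThreadExecutor()

  def cleanAsync(file: Path): Unit = executor.submit(new Runnable {
    override def run(): Unit = fs.delete(file, false)
  })

  def stop(): Unit = executor.shutdown()
}
{code}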



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29875) Avoid to use deprecated pyarrow.open_stream API in Spark 2.4.x

2019-11-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-29875:


 Summary: Avoid to use deprecated pyarrow.open_stream API in Spark 
2.4.x
 Key: SPARK-29875
 URL: https://issues.apache.org/jira/browse/SPARK-29875
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.4
Reporter: Hyukjin Kwon


In Spark 2.4.x, if we use a PyArrow version higher than 0.12.0, it shows a 
bunch of warnings as below:

{code}
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("integer", PandasUDFType.SCALAR)  # doctest: +SKIP
def add_one(x):
return x + 1

spark.range(100).select(add_one("id")).collect()
{code}

{code}
UserWarning: pyarrow.open_stream is deprecated, please use 
pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
/usr/local/lib/python3.7/site-packages/pyarrow/__init__.py:157: UserWarning: 
pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29710) Seeing offsets not resetting even when reset policy is configured explicitly

2019-11-13 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-29710.
---
Resolution: Incomplete

Please attach logs or any further information and re-open it. I'm closing it 
for now.

> Seeing offsets not resetting even when reset policy is configured explicitly
> 
>
> Key: SPARK-29710
> URL: https://issues.apache.org/jira/browse/SPARK-29710
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.1
> Environment: Window10 , eclipse neos
>Reporter: Shyam
>Priority: Major
>
>  
>  Even after setting *"auto.offset.reset" to "latest"*, I am getting the below 
> error:
>  
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
> range with no configured reset policy for partitions: 
> \{COMPANY_TRANSACTIONS_INBOUND-16=168}org.apache.kafka.clients.consumer.OffsetOutOfRangeException:
>  Offsets out of range with no configured reset policy for partitions: 
> \{COMPANY_TRANSACTIONS_INBOUND-16=168} at 
> org.apache.kafka.clients.consumer.internals.Fetcher.throwIfOffsetOutOfRange(Fetcher.java:348)
>  at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:396)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:999)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937) 
> at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
>  at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
>  at 
> org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
>  
> [https://stackoverflow.com/questions/58653885/even-after-setting-auto-offset-reset-to-latest-getting-error-offsetoutofrang]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29874) Optimize Dataset.isEmpty()

2019-11-13 Thread angerszhu (Jira)
angerszhu created SPARK-29874:
-

 Summary: Optimize Dataset.isEmpty()
 Key: SPARK-29874
 URL: https://issues.apache.org/jira/browse/SPARK-29874
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29873) Support `--import` directive to load queries from another test case in SQLQueryTestSuite

2019-11-13 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29873:


 Summary: Support `--import` directive to load queries from another 
test case in SQLQueryTestSuite
 Key: SPARK-29873
 URL: https://issues.apache.org/jira/browse/SPARK-29873
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


To reduce duplicate test queries in `SQLQueryTestSuite`, this ticket intends to 
support an `--import` directive that loads queries from another test case.

This fix comes from the @cloud-fan suggestion in 
https://github.com/apache/spark/pull/26479#discussion_r345086978



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org