[jira] [Resolved] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27193.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24133
[https://github.com/apache/spark/pull/24133]

> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Trivial
> Fix For: 3.0.0
>
>
> When `spark.sql.codegen.comments` is enabled, there will be multiple comment
> lines. However, CodeFormatter currently cannot handle multi-line comments:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
>  * Codegend pipeline for stage (id=1)
>  * *(1) Project [(id#0L + 1) AS (id + 1)#3L]
>  * +- *(1) Filter (id#0L = 1)
>  *    +- *(1) Range (0, 10, step=1, splits=4)
>  */
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
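
For reference, output like the quoted snippet can be reproduced by enabling codegen comments when the session is created and dumping the generated code with the debug helpers. This is a minimal sketch, assuming `spark.sql.codegen.comments` is applied at session creation time:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._   // provides debugCodegen()

// Sketch: enable codegen comments and print the whole-stage generated code,
// which contains the multi-line pipeline comment shown above.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.codegen.comments", "true")
  .getOrCreate()

spark.range(0, 10, 1, 4)
  .filter("id = 1")
  .selectExpr("id + 1")
  .debugCodegen()
```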






[jira] [Assigned] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27193:
---

Assignee: wuyi

> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Trivial
>
> When `spark.sql.codegen.comments` is enabled, there will be multiple comment
> lines. However, CodeFormatter currently cannot handle multi-line comments:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
>  * Codegend pipeline for stage (id=1)
>  * *(1) Project [(id#0L + 1) AS (id + 1)#3L]
>  * +- *(1) Filter (id#0L = 1)
>  *    +- *(1) Range (0, 10, step=1, splits=4)
>  */
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {






[jira] [Resolved] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap

2019-03-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27162.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24094
[https://github.com/apache/spark/pull/24094]

> Add new method getOriginalMap in CaseInsensitiveStringMap
> -
>
> Key: SPARK-27162
> URL: https://issues.apache.org/jira/browse/SPARK-27162
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, DataFrameReader/DataFrameWriter support setting Hadoop
> configurations via the `.option()` method.
> E.g.
> ```
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   val df = spark.range(2)
>   df.write.orc(path + "/p=1")
>   df.write.orc(path + "/p=2")
>   assert(spark.read.orc(path).count() === 4)
>   val extraOptions = Map(
> "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
> "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
>   )
>   assert(spark.read.options(extraOptions).orc(path).count() === 2)
> }
> ```
> While Hadoop configurations are case-sensitive, the current data source V2
> APIs use `CaseInsensitiveStringMap` in TableProvider.
> To create Hadoop configurations correctly, I suggest adding a
> `getOriginalMap` method to `CaseInsensitiveStringMap`.
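
To illustrate the proposal, here is a rough, hypothetical sketch (not the actual `org.apache.spark.sql.util.CaseInsensitiveStringMap` implementation) of a case-insensitive map that also keeps the original, case-preserving entries so that case-sensitive consumers such as Hadoop `Configuration` keys are not broken:

```scala
import java.util.Locale

// Hypothetical sketch of the proposed behaviour; names mirror the proposal
// but the implementation is illustrative only.
class CaseInsensitiveStringMapSketch(original: Map[String, String]) {
  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  // Case-insensitive lookup, as the existing class already provides.
  def get(key: String): Option[String] =
    lowerCased.get(key.toLowerCase(Locale.ROOT))

  // The proposed addition: expose entries with their original casing.
  def getOriginalMap: Map[String, String] = original
}

val options = new CaseInsensitiveStringMapSketch(
  Map("mapreduce.input.pathFilter.class" -> "com.example.TestFileFilter"))
options.get("MAPREDUCE.INPUT.PATHFILTER.CLASS")  // Some(...), case-insensitive
options.getOriginalMap                           // keys keep their original casing
```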






[jira] [Assigned] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap

2019-03-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27162:
---

Assignee: Gengliang Wang

> Add new method getOriginalMap in CaseInsensitiveStringMap
> -
>
> Key: SPARK-27162
> URL: https://issues.apache.org/jira/browse/SPARK-27162
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Currently, DataFrameReader/DataFrameWriter support setting Hadoop
> configurations via the `.option()` method.
> E.g.
> ```
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   val df = spark.range(2)
>   df.write.orc(path + "/p=1")
>   df.write.orc(path + "/p=2")
>   assert(spark.read.orc(path).count() === 4)
>   val extraOptions = Map(
> "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
> "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
>   )
>   assert(spark.read.options(extraOptions).orc(path).count() === 2)
> }
> ```
> While Hadoop configurations are case-sensitive, the current data source V2
> APIs use `CaseInsensitiveStringMap` in TableProvider.
> To create Hadoop configurations correctly, I suggest adding a
> `getOriginalMap` method to `CaseInsensitiveStringMap`.






[jira] [Assigned] (SPARK-27198) Heartbeat interval mismatch in driver and executor

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27198:


Assignee: Apache Spark

> Heartbeat interval mismatch in driver and executor
> --
>
> Key: SPARK-27198
> URL: https://issues.apache.org/jira/browse/SPARK-27198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Ajith S
>Assignee: Apache Spark
>Priority: Major
>
> When the heartbeat interval is configured via *spark.executor.heartbeatInterval*
> without specifying units, the value is interpreted inconsistently: the driver
> treats it as seconds while the executor treats it as milliseconds.
>  
> [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613]
> vs
> [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858]
>  
>  
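
Until the two sides agree on a default unit, one way to sidestep the ambiguity is to always pass the interval with an explicit time unit. A minimal sketch using the standard SparkConf API:

```scala
import org.apache.spark.SparkConf

// Sketch: give the heartbeat interval an explicit unit so the driver and the
// executor cannot interpret a bare number differently.
val conf = new SparkConf()
  .setAppName("heartbeat-example")
  .set("spark.executor.heartbeatInterval", "10s")  // explicit seconds
  .set("spark.network.timeout", "120s")            // should stay well above the heartbeat interval
```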






[jira] [Assigned] (SPARK-27198) Heartbeat interval mismatch in driver and executor

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27198:


Assignee: (was: Apache Spark)

> Heartbeat interval mismatch in driver and executor
> --
>
> Key: SPARK-27198
> URL: https://issues.apache.org/jira/browse/SPARK-27198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Ajith S
>Priority: Major
>
> When the heartbeat interval is configured via *spark.executor.heartbeatInterval*
> without specifying units, the value is interpreted inconsistently: the driver
> treats it as seconds while the executor treats it as milliseconds.
>  
> [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613]
> vs
> [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858]
>  
>  






[jira] [Commented] (SPARK-27198) Heartbeat interval mismatch in driver and executor

2019-03-18 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795666#comment-16795666
 ] 

Ajith S commented on SPARK-27198:
-

will be working on this

> Heartbeat interval mismatch in driver and executor
> --
>
> Key: SPARK-27198
> URL: https://issues.apache.org/jira/browse/SPARK-27198
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Ajith S
>Priority: Major
>
> When the heartbeat interval is configured via *spark.executor.heartbeatInterval*
> without specifying units, the value is interpreted inconsistently: the driver
> treats it as seconds while the executor treats it as milliseconds.
>  
> [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613]
> vs
> [https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858]
>  
>  






[jira] [Created] (SPARK-27198) Heartbeat interval mismatch in driver and executor

2019-03-18 Thread Ajith S (JIRA)
Ajith S created SPARK-27198:
---

 Summary: Heartbeat interval mismatch in driver and executor
 Key: SPARK-27198
 URL: https://issues.apache.org/jira/browse/SPARK-27198
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.3
Reporter: Ajith S


When the heartbeat interval is configured via *spark.executor.heartbeatInterval* 
without specifying units, the value is interpreted inconsistently: the driver 
treats it as seconds while the executor treats it as milliseconds.

 
[https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/SparkConf.scala#L613]

vs

[https://github.com/apache/spark/blob/v2.4.1-rc8/core/src/main/scala/org/apache/spark/executor/Executor.scala#L858]

 

 






[jira] [Resolved] (SPARK-27195) Add AvroReadSchemaSuite

2019-03-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27195.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24135

> Add AvroReadSchemaSuite
> ---
>
> Key: SPARK-27195
> URL: https://issues.apache.org/jira/browse/SPARK-27195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> The reader schema is said to be evolved (or projected) when it changes after 
> the data has been written by writers. Apache Spark file-based data sources have 
> test coverage for that. This issue aims to add `AvroReadSchemaSuite` to 
> ensure minimal consistency among file-based data sources and prevent a 
> future regression in the Avro data source.
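
As a rough illustration of the scenario such a suite exercises (not the suite itself), reading Avro data back with an evolved reader schema that appends a column might look like the sketch below, assuming the external spark-avro module is on the classpath and a hypothetical output path:

```scala
import org.apache.spark.sql.types._

// Sketch: write Avro data with one schema, then read it back with an evolved
// reader schema that adds a column; the column absent from the files is
// expected to surface as null.
val path = "/tmp/avro-read-schema-demo"   // hypothetical path
spark.range(3).selectExpr("id", "cast(id as string) as name")
  .write.format("avro").mode("overwrite").save(path)

val evolvedSchema = new StructType()
  .add("id", LongType)
  .add("name", StringType)
  .add("score", DoubleType)   // column added after the data was written

spark.read.format("avro").schema(evolvedSchema).load(path).show()
```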






[jira] [Assigned] (SPARK-27195) Add AvroReadSchemaSuite

2019-03-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27195:
-

Assignee: Dongjoon Hyun

> Add AvroReadSchemaSuite
> ---
>
> Key: SPARK-27195
> URL: https://issues.apache.org/jira/browse/SPARK-27195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> The reader schema is said to be evolved (or projected) when it changes after 
> the data has been written by writers. Apache Spark file-based data sources have 
> test coverage for that. This issue aims to add `AvroReadSchemaSuite` to 
> ensure minimal consistency among file-based data sources and prevent a 
> future regression in the Avro data source.






[jira] [Updated] (SPARK-27195) Add AvroReadSchemaSuite

2019-03-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27195:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25603

> Add AvroReadSchemaSuite
> ---
>
> Key: SPARK-27195
> URL: https://issues.apache.org/jira/browse/SPARK-27195
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> The reader schema is said to be evolved (or projected) when it changes after 
> the data has been written by writers. Apache Spark file-based data sources have 
> test coverage for that. This issue aims to add `AvroReadSchemaSuite` to 
> ensure minimal consistency among file-based data sources and prevent a 
> future regression in the Avro data source.






[jira] [Assigned] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27197:


Assignee: Apache Spark

> Add ReadNestedSchemaTest for file-based data sources
> 
>
> Key: SPARK-27197
> URL: https://issues.apache.org/jira/browse/SPARK-27197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue adds test coverage to the schema evolution suite for adding and 
> hiding nested columns.






[jira] [Assigned] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources

2019-03-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27197:
-

Assignee: Dongjoon Hyun

> Add ReadNestedSchemaTest for file-based data sources
> 
>
> Key: SPARK-27197
> URL: https://issues.apache.org/jira/browse/SPARK-27197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue adds test coverage to the schema evolution suite for adding and 
> hiding nested columns.






[jira] [Assigned] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27197:


Assignee: (was: Apache Spark)

> Add ReadNestedSchemaTest for file-based data sources
> 
>
> Key: SPARK-27197
> URL: https://issues.apache.org/jira/browse/SPARK-27197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue adds test coverage to the schema evolution suite for adding and 
> hiding nested columns.






[jira] [Created] (SPARK-27197) Add ReadNestedSchemaTest for file-based data sources

2019-03-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27197:
-

 Summary: Add ReadNestedSchemaTest for file-based data sources
 Key: SPARK-27197
 URL: https://issues.apache.org/jira/browse/SPARK-27197
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue adds test coverage to the schema evolution suite for adding and 
hiding nested columns.
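
For context, a minimal sketch of the "add a nested column" case (JSON chosen for simplicity, hypothetical path, not the actual test code):

```scala
import org.apache.spark.sql.types._

// Sketch: write rows with a struct column, then read them back with a schema
// whose struct gained an extra nested field; the field absent from the files
// is expected to come back as null.
val path = "/tmp/nested-schema-demo"   // hypothetical path
spark.sql("SELECT id, named_struct('a', id) AS s FROM range(3)")
  .write.mode("overwrite").json(path)

val evolved = new StructType()
  .add("id", LongType)
  .add("s", new StructType()
    .add("a", LongType)
    .add("b", LongType))   // nested column added after the data was written

spark.read.schema(evolved).json(path).show()
```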






[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-03-18 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795575#comment-16795575
 ] 

Ajith S commented on SPARK-27194:
-

From the logs, it currently looks like task 200.0 and its re-attempt 200.1 both 
expect the same file name, part-00200-blah-blah.c000.snappy.parquet (refer to 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol#getFilename).

Maybe we should include taskId_attemptId in the part file name so that re-run 
tasks do not conflict with files left behind by older failed attempts.

cc [~srowen] [~cloud_fan] [~dongjoon], any thoughts?
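
To make that suggestion concrete, a hypothetical variant of the part-file naming scheme that folds the attempt number in might look like this (a sketch only, not the actual HadoopMapReduceCommitProtocol code):

```scala
// Hypothetical sketch: include the task attempt number in the part-file name so
// a re-attempt never collides with a file left behind by a failed attempt.
def partFileName(split: Int, attemptNumber: Int, jobId: String, ext: String): String =
  f"part-$split%05d-$attemptNumber-$jobId$ext"

partFileName(200, 0, "blah-blah", ".c000.snappy.parquet")
// part-00200-0-blah-blah.c000.snappy.parquet
partFileName(200, 1, "blah-blah", ".c000.snappy.parquet")
// part-00200-1-blah-blah.c000.snappy.parquet  (re-attempt writes a different file)
```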

> Job failures when task attempts do not clean up spark-staging parquet files
> ---
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Reza Safi
>Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for 
> exceeding memory limits), the subsequent task attempts for the tasks that 
> were running on that container all fail with a FileAlreadyExistsException. 
> The original task attempt does not seem to successfully call abortTask (or at 
> least its "best effort" delete is unsuccessful) and clean up the parquet file 
> it was writing to, so when later task attempts try to write to the same 
> spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
> t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>  19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 
> 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 
> exited caused by one of the running tasks) Reason: Container killed by YARN 
> for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> The task is re-attempted on a different executor and fails because the 
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
> already exists:
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
> (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client 17.161.235.91 already exists
> The job fails when the configured task attempts (spark.task.maxFailures) 
> have failed with the same error:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 
> in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 
> 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  ...
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client i.p.a.d already exists
>  
> SPARK-26682 wasn't the root cause here, since there wasn't any stage 
> reattempt.
> This issue seems to happen when 
> spark.sql.sources.partitionOverwriteMode=dynamic. 
>  




[jira] [Updated] (SPARK-27192) spark.task.cpus should be less than or equal to spark.executor.cores when using static executor allocation

2019-03-18 Thread Lijia Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lijia Liu updated SPARK-27192:
--
Description: 
When dynamic executor allocation is used, if we set spark.executor.cores smaller 
than spark.task.cpus, an exception is thrown as follows:

'''spark.executor.cores must not be < spark.task.cpus'''

But if dynamic executor allocation is not enabled, Spark will hang when a new job 
is submitted, because TaskSchedulerImpl will not schedule a task on an executor 
whose available cores are fewer than spark.task.cpus. See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]

So spark.task.cpus should be checked when the task scheduler starts.

Reproduce:

$SPARK_HOME/bin/spark-shell --conf spark.task.cpus=2  --master local[1]

scala> sc.parallelize(1 to 9).collect

  was:
When dynamic executor allocation is used, if we set spark.executor.cores smaller 
than spark.task.cpus, an exception is thrown as follows:

'''spark.executor.cores must not be < spark.task.cpus'''

But if dynamic executor allocation is not enabled, Spark will hang when a new job 
is submitted, because TaskSchedulerImpl will not schedule a task on an executor 
whose available cores are fewer than spark.task.cpus. See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]

So spark.task.cpus should be checked when the task scheduler starts.


> spark.task.cpus should be less than or equal to spark.executor.cores when 
> using static executor allocation
> 
>
> Key: SPARK-27192
> URL: https://issues.apache.org/jira/browse/SPARK-27192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Lijia Liu
>Priority: Major
>
> When dynamic executor allocation is used, if we set spark.executor.cores 
> smaller than spark.task.cpus, an exception is thrown as follows:
> '''spark.executor.cores must not be < spark.task.cpus'''
> But if dynamic executor allocation is not enabled, Spark will hang when a new 
> job is submitted, because TaskSchedulerImpl will not schedule a task on an 
> executor whose available cores are fewer than spark.task.cpus. See 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]
> So spark.task.cpus should be checked when the task scheduler starts.
> Reproduce:
> $SPARK_HOME/bin/spark-shell --conf spark.task.cpus=2  --master local[1]
> scala> sc.parallelize(1 to 9).collect
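
A minimal sketch of the kind of up-front validation being proposed, written against SparkConf rather than the actual TaskSchedulerImpl internals:

```scala
import org.apache.spark.SparkConf

// Sketch: fail fast at scheduler start-up if a single task can never fit on an
// executor, instead of hanging once a job is submitted. For local[N] masters
// the available core count is N; this simplified check only covers the case
// where spark.executor.cores is set explicitly.
def validateTaskCpus(conf: SparkConf): Unit = {
  val taskCpus = conf.getInt("spark.task.cpus", 1)
  val executorCores = conf.getInt("spark.executor.cores", 1)
  require(taskCpus <= executorCores,
    s"spark.task.cpus ($taskCpus) must not be greater than the cores " +
      s"available per executor ($executorCores)")
}
```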






[jira] [Updated] (SPARK-27188) FileStreamSink: provide a new option to have retention on output files

2019-03-18 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-27188:
-
Summary: FileStreamSink: provide a new option to have retention on output 
files  (was: FileStreamSink: provide a new option to disable metadata log)

> FileStreamSink: provide a new option to have retention on output files
> --
>
> Key: SPARK-27188
> URL: https://issues.apache.org/jira/browse/SPARK-27188
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-24295 we noted that various end users are struggling with huge 
> FileStreamSink metadata logs. Unfortunately, given that arbitrary readers 
> leverage the metadata log to determine which files can safely be read (to 
> ensure 'exactly-once'), pruning the metadata log is not trivial to implement.
> While we may be able to check for deleted output files in FileStreamSink and 
> get rid of them when compacting metadata, that operation would add overhead 
> to the running query. (I'll try to address this via another issue though.)
> Back to the issue, 'exactly-once' via leveraging the metadata is only possible 
> when the output directory is read by Spark; in other cases it provides a weaker 
> guarantee. I think we could provide this as a workaround to mitigate the issue.






[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2019-03-18 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795519#comment-16795519
 ] 

Jungtaek Lim commented on SPARK-24295:
--

[~iqbal_khattra] [~alfredo-gimenez-bv]

I would really appreciate it if you could review SPARK-27188 and see whether 
it works for your cases. Thanks in advance!

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading data from the FileStreamSink output dir, as Spark 
> defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
>  






[jira] [Updated] (SPARK-27188) FileStreamSink: provide a new option to have retention on output files

2019-03-18 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-27188:
-
Description: 
In SPARK-24295 we noted that various end users are struggling with huge 
FileStreamSink metadata logs. Unfortunately, given that arbitrary readers 
leverage the metadata log to determine which files can safely be read (to 
ensure 'exactly-once'), pruning the metadata log is not trivial to implement.

While we may be able to check for deleted output files in FileStreamSink and get 
rid of them when compacting metadata, that operation would add overhead to the 
running query. (I'll try to address this via another issue though.)

We can still get a time-to-live (TTL) for output files from end users and filter 
out expired files from the metadata so that it does not grow linearly. 
Filtered-out files will also no longer be seen by reader queries that leverage 
File(Stream)Source.

  was:
In SPARK-24295 we noted that various end users are struggling with huge 
FileStreamSink metadata logs. Unfortunately, given that arbitrary readers 
leverage the metadata log to determine which files can safely be read (to 
ensure 'exactly-once'), pruning the metadata log is not trivial to implement.

While we may be able to check for deleted output files in FileStreamSink and get 
rid of them when compacting metadata, that operation would add overhead to the 
running query. (I'll try to address this via another issue though.)

Back to the issue, 'exactly-once' via leveraging the metadata is only possible 
when the output directory is read by Spark; in other cases it provides a weaker 
guarantee. I think we could provide this as a workaround to mitigate the issue.


> FileStreamSink: provide a new option to have retention on output files
> --
>
> Key: SPARK-27188
> URL: https://issues.apache.org/jira/browse/SPARK-27188
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In SPARK-24295 we noted that various end users are struggling with huge 
> FileStreamSink metadata logs. Unfortunately, given that arbitrary readers 
> leverage the metadata log to determine which files can safely be read (to 
> ensure 'exactly-once'), pruning the metadata log is not trivial to implement.
> While we may be able to check for deleted output files in FileStreamSink and 
> get rid of them when compacting metadata, that operation would add overhead 
> to the running query. (I'll try to address this via another issue though.)
> We can still get a time-to-live (TTL) for output files from end users and 
> filter out expired files from the metadata so that it does not grow linearly. 
> Filtered-out files will also no longer be seen by reader queries that leverage 
> File(Stream)Source.
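
If such an option is added, usage might look roughly like the sketch below; the option name "retention" is purely an assumption here, since this issue does not yet fix the name or format:

```scala
import org.apache.spark.sql.streaming.Trigger

// Hypothetical sketch only: the "retention" option name is an assumption.
// The idea is that output files older than the TTL are dropped from the
// FileStreamSink metadata during compaction so it stops growing linearly.
val input = spark.readStream.format("rate").load()   // any streaming source
val query = input.writeStream
  .format("parquet")
  .option("path", "/tmp/stream-output")              // hypothetical path
  .option("checkpointLocation", "/tmp/stream-chk")   // hypothetical path
  .option("retention", "7d")                         // hypothetical TTL option
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```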






[jira] [Commented] (SPARK-27178) k8s test failing due to missing nss library in dockerfile

2019-03-18 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795494#comment-16795494
 ] 

shane knapp commented on SPARK-27178:
-

[~vanzin] just merged https://github.com/apache/spark/pull/24137

marking as resolved...  we should be g2g for now.  i'll create a new jira to 
discuss the potential pinning of these dockerfiles to a specific image version.

> k8s test failing due to missing nss library in dockerfile
> -
>
> Key: SPARK-27178
> URL: https://issues.apache.org/jira/browse/SPARK-27178
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> while performing some tests on our existing minikube and k8s infrastructure, 
> i noticed that the integration tests were failing.  i dug in and discovered 
> the following message buried at the end of the stacktrace:
> {noformat}
>   Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
>   at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
>   at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218)
>   ... 81 more
> {noformat}
> after i added the 'nss' package to 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, 
> everything worked.
> i will also check and see if this is failing on 2.4...
> tbh, i have no idea why this literally started failing today and not earlier. 
>  the only recent change to this file that i can find is 
> https://issues.apache.org/jira/browse/SPARK-26995






[jira] [Comment Edited] (SPARK-27178) k8s test failing due to missing nss library in dockerfile

2019-03-18 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795494#comment-16795494
 ] 

shane knapp edited comment on SPARK-27178 at 3/18/19 11:54 PM:
---

[~vanzin] just merged https://github.com/apache/spark/pull/24137

we should be g2g for now.  i'll create a new jira to discuss the potential 
pinning of these dockerfiles to a specific image version.


was (Author: shaneknapp):
[~vanzin] just merged https://github.com/apache/spark/pull/24137

marking as resolved...  we should be g2g for now.  i'll create a new jira to 
discuss the potential pinning of these dockerfiles to a specific image version.

> k8s test failing due to missing nss library in dockerfile
> -
>
> Key: SPARK-27178
> URL: https://issues.apache.org/jira/browse/SPARK-27178
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> while performing some tests on our existing minikube and k8s infrastructure, 
> i noticed that the integration tests were failing.  i dug in and discovered 
> the following message buried at the end of the stacktrace:
> {noformat}
>   Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
>   at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
>   at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218)
>   ... 81 more
> {noformat}
> after i added the 'nss' package to 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, 
> everything worked.
> i will also check and see if this is failing on 2.4...
> tbh, i have no idea why this literally started failing today and not earlier. 
>  the only recent change to this file that i can find is 
> https://issues.apache.org/jira/browse/SPARK-26995






[jira] [Resolved] (SPARK-27178) k8s test failing due to missing nss library in dockerfile

2019-03-18 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27178.

   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.2

> k8s test failing due to missing nss library in dockerfile
> -
>
> Key: SPARK-27178
> URL: https://issues.apache.org/jira/browse/SPARK-27178
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> while performing some tests on our existing minikube and k8s infrastructure, 
> i noticed that the integration tests were failing.  i dug in and discovered 
> the following message buried at the end of the stacktrace:
> {noformat}
>   Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
>   at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
>   at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218)
>   ... 81 more
> {noformat}
> after i added the 'nss' package to 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, 
> everything worked.
> i will also check and see if this is failing on 2.4...
> tbh, i have no idea why this literally started failing today and not earlier. 
>  the only recent change to this file that i can find is 
> https://issues.apache.org/jira/browse/SPARK-26995






[jira] [Commented] (SPARK-27178) k8s test failing due to missing nss library in dockerfile

2019-03-18 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795487#comment-16795487
 ] 

shane knapp commented on SPARK-27178:
-

https://github.com/apache/spark/pull/24111 merged to master.  holding off on 
the 2.4 fix.

> k8s test failing due to missing nss library in dockerfile
> -
>
> Key: SPARK-27178
> URL: https://issues.apache.org/jira/browse/SPARK-27178
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> while performing some tests on our existing minikube and k8s infrastructure, 
> i noticed that the integration tests were failing.  i dug in and discovered 
> the following message buried at the end of the stacktrace:
> {noformat}
>   Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
>   at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
>   at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218)
>   ... 81 more
> {noformat}
> after i added the 'nss' package to 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, 
> everything worked.
> i will also check and see if this is failing on 2.4...
> tbh, i have no idea why this literally started failing today and not earlier. 
>  the only recent change to this file that i can find is 
> https://issues.apache.org/jira/browse/SPARK-26995






[jira] [Commented] (SPARK-20787) PySpark can't handle datetimes before 1900

2019-03-18 Thread AdrianC (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795477#comment-16795477
 ] 

AdrianC commented on SPARK-20787:
-

Hi,

We are seeing this issue in pyspark 2.2.1/ python 2.7 while trying to port a 
legacy system to pyspark (AWS Glue). 

I'm trying to follow the pull request comments, but am I to understand that 
this issue is not being fixed right now?

> PySpark can't handle datetimes before 1900
> --
>
> Key: SPARK-20787
> URL: https://issues.apache.org/jira/browse/SPARK-20787
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Keith Bourgoin
>Priority: Major
>
> When trying to put a datetime before 1900 into a DataFrame, it throws an 
> error because of the use of time.mktime.
> {code}
> Python 2.7.13 (default, Mar  8 2017, 17:29:55)
> Type "copyright", "credits" or "license" for more information.
> IPython 5.3.0 -- An enhanced Interactive Python.
> ? -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help  -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/05/17 12:45:59 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/05/17 12:46:02 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to Spark version 2.1.0  (ASCII banner omitted)
> Using Python version 2.7.13 (default, Mar  8 2017 17:29:55)
> SparkSession available as 'spark'.
> In [1]: import datetime as dt
> In [2]: 
> sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count()
> 17/05/17 12:46:16 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 174, in main
> process()
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 169, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 268, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, 
> in toInternal
> return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
>   File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, 
> in 
> return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", 
> line 436, in toInternal
> return self.dataType.toInternal(obj)
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", 
> line 191, in toInternal
> else time.mktime(dt.timetuple()))
> ValueError: year out of range
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.

[jira] [Created] (SPARK-27196) Beginning offset 115204574 is after the ending offset 115204516 for topic

2019-03-18 Thread Prasanna Talakanti (JIRA)
Prasanna Talakanti created SPARK-27196:
--

 Summary: Beginning offset 115204574 is after the ending offset 
115204516 for topic 
 Key: SPARK-27196
 URL: https://issues.apache.org/jira/browse/SPARK-27196
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
 Environment: Spark : 2.3.0

Sparks Kafka: spark-streaming-kafka-0-10_2.3.0

Kafka Client: org.apache.kafka.kafka-clients: 0.11.0.1
Reporter: Prasanna Talakanti


We are getting this issue in production and the Spark consumer is dying because 
of an offset issue.

We observed the following error in Kafka Broker

--

[2019-03-18 14:40:14,100] WARN Unable to reconnect to ZooKeeper service, 
session 0x1692e9ff4410004 has expired (org.apache.zookeeper.ClientCnxn)
[2019-03-18 14:40:14,100] INFO Unable to reconnect to ZooKeeper service, 
session 0x1692e9ff4410004 has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)

---

 

The Spark job died with the following error:

ERROR 2019-03-18 07:40:57,178 7924 org.apache.spark.executor.Executor [Executor 
task launch worker for task 16] Exception in task 27.0 in stage 0.0 (TID 16)
java.lang.AssertionError: assertion failed: Beginning offset 115204574 is after 
the ending offset 115204516 for topic  partition 37. You either 
provided an invalid fromOffset, or the Kafka topic has been damaged
 at scala.Predef$.assert(Predef.scala:170)
 at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:175)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)






[jira] [Commented] (SPARK-27178) k8s test failing due to missing nss library in dockerfile

2019-03-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795414#comment-16795414
 ] 

Apache Spark commented on SPARK-27178:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/24137

> k8s test failing due to missing nss library in dockerfile
> -
>
> Key: SPARK-27178
> URL: https://issues.apache.org/jira/browse/SPARK-27178
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> while performing some tests on our existing minikube and k8s infrastructure, 
> i noticed that the integration tests were failing.  i dug in and discovered 
> the following message buried at the end of the stacktrace:
> {noformat}
>   Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
>   at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
>   at sun.security.pkcs11.SunPKCS11.(SunPKCS11.java:218)
>   ... 81 more
> {noformat}
> after i added the 'nss' package to 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile, 
> everything worked.
> i will also check and see if this is failing on 2.4...
> tbh, i have no idea why this literally started failing today and not earlier. 
>  the only recent change to this file that i can find is 
> https://issues.apache.org/jira/browse/SPARK-26995






[jira] [Commented] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user

2019-03-18 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795382#comment-16795382
 ] 

nirav patel commented on SPARK-26844:
-

[~hyukjin.kwon] Yes, the Parquet file is corrupt. It has a newline character in a 
field value somewhere. Is it possible for Spark to provide more information when 
it fails to read such a file?

> Parquet Reader exception - ArrayIndexOutOfBound should give more information 
> to user
> 
>
> Key: SPARK-26844
> URL: https://issues.apache.org/jira/browse/SPARK-26844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: nirav patel
>Priority: Minor
>
> I get the following error while reading a Parquet file which has primitive 
> datatypes (INT32, binary)
>  
>  
> spark.read.format("parquet").load(path).show() // error happens here
>  
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163)
> at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733)
> at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
>  
>  
> Point: if an ArrayIndexOutOfBoundsException is raised on a column/field, Spark 
> should say which particular column/field it is. It helps in troubleshooting.
>  
> e.g. I get following error while reading same file using Drill reader.
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
> Error reading page data File: /.../../part-00016-0-m-00016.parquet 
> *Column: GROUP_NAME* Row Group Start: 5539 Fragment 0:0 
> I also get more specific information in Drillbit.log






[jira] [Updated] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user

2019-03-18 Thread nirav patel (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-26844:

Description: 
I get the following error while reading a Parquet file which has primitive 
datatypes (INT32, binary).

The Parquet file is potentially corrupt; it has a newline character in some field 
value.

 

spark.read.format("parquet").load(path).show() // error happens here

 

Caused by: java.lang.ArrayIndexOutOfBoundsException

at java.lang.System.arraycopy(Native Method)

at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163)

at 
org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733)

at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)

at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)

at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)

at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)

at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)

at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)

at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)

at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)

at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)

at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)

at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:108)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

 

 

 

Point: if an ArrayIndexOutOfBoundsException is raised on a column/field, Spark 
should say which particular column/field it is. It helps in troubleshooting.

 

e.g. I get the following error while reading the same file with the Drill reader.

org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Error 
reading page data File: /.../../part-00016-0-m-00016.parquet *Column: 
GROUP_NAME* Row Group Start: 5539 Fragment 0:0 

I also get more specific information in Drillbit.log
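Until Spark itself reports the offending column, a minimal workaround sketch (assuming a PySpark shell and a hypothetical file path) is to scan each column separately, so the failing column shows up by elimination:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-column-probe").getOrCreate()
path = "/path/to/part-00016-0-m-00016.parquet"  # hypothetical path

df = spark.read.parquet(path)
for name in df.columns:
    try:
        # foreach forces every value of this single column to be decoded
        df.select(name).foreach(lambda _: None)
        print("column %s: OK" % name)
    except Exception as err:
        # the ArrayIndexOutOfBoundsException surfaces here as a Py4J error
        print("column %s: failed (%s)" % (name, type(err).__name__))
{code}

Reading columns one at a time is slow, but it narrows the failure down to a single field, similar to what the Drill error message already provides.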

  was:
I get following error while reading parquet file which has primitive datatypes 
(INT32, binary)

 

 

spark.read.format("parquet").load(path).show() // error happens here

 

Caused by: java.lang.ArrayIndexOutOfBoundsException

at java.lang.System.arraycopy(Native Method)

at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163)

at 
org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733)

at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.r

[jira] [Assigned] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27088:


Assignee: Apache Spark

> Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in 
> RuleExecutor
> -
>
> Key: SPARK-27088
> URL: https://issues.apache.org/jira/browse/SPARK-27088
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Minor
>
> Similar to SPARK-25415, which has made log level for plan changes by each 
> rule configurable, we can make log level for plan changes by each batch 
> configurable too and can reuse the same configuration: 
> "spark.sql.optimizer.planChangeLog.level".
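For reference, a minimal sketch of enabling the existing rule-level plan-change logging from PySpark, assuming the configuration key keeps the name used in this ticket; the proposal extends the same key to batch-level changes:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("plan-change-log-demo")
         .config("spark.sql.optimizer.planChangeLog.level", "WARN")
         .getOrCreate())

# Any query triggers the optimizer; plan changes are then logged at WARN level.
spark.range(10).filter("id > 5").selectExpr("id + 1 AS id1").collect()
{code}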



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27088:


Assignee: (was: Apache Spark)

> Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in 
> RuleExecutor
> -
>
> Key: SPARK-27088
> URL: https://issues.apache.org/jira/browse/SPARK-27088
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maryann Xue
>Priority: Minor
>
> Similar to SPARK-25415, which has made log level for plan changes by each 
> rule configurable, we can make log level for plan changes by each batch 
> configurable too and can reuse the same configuration: 
> "spark.sql.optimizer.planChangeLog.level".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27195) Add AvroReadSchemaSuite

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27195:


Assignee: (was: Apache Spark)

> Add AvroReadSchemaSuite
> ---
>
> Key: SPARK-27195
> URL: https://issues.apache.org/jira/browse/SPARK-27195
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> The reader schema is said to be evolved (or projected) when it is changed after 
> the data is written by writers. Apache Spark file-based data sources have 
> test coverage for that. This issue aims to add `AvroReadSchemaSuite` to 
> ensure the minimal consistency among file-based data sources and prevent a 
> future regression in the Avro data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27195) Add AvroReadSchemaSuite

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27195:


Assignee: Apache Spark

> Add AvroReadSchemaSuite
> ---
>
> Key: SPARK-27195
> URL: https://issues.apache.org/jira/browse/SPARK-27195
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> The reader schema is said to be evolved (or projected) when it is changed after 
> the data is written by writers. Apache Spark file-based data sources have 
> test coverage for that. This issue aims to add `AvroReadSchemaSuite` to 
> ensure the minimal consistency among file-based data sources and prevent a 
> future regression in the Avro data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27195) Add AvroReadSchemaSuite

2019-03-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27195:
-

 Summary: Add AvroReadSchemaSuite
 Key: SPARK-27195
 URL: https://issues.apache.org/jira/browse/SPARK-27195
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


The reader schema is said to be evolved (or projected) when it is changed after 
the data is written by writers. Apache Spark file-based data sources have 
test coverage for that. This issue aims to add `AvroReadSchemaSuite` to ensure 
the minimal consistency among file-based data sources and prevent a future 
regression in the Avro data source.
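A minimal sketch of the scenario such a suite exercises (shown with Parquet purely for illustration, since the Avro source needs the external spark-avro package; the path is hypothetical): data written with one schema is read back with a schema that adds a column, which must come back as null:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("read-schema-evolution").getOrCreate()
path = "/tmp/read_schema_demo"  # hypothetical location

# Writer schema: (id, name)
spark.range(3).selectExpr("id", "cast(id as string) as name") \
    .write.mode("overwrite").parquet(path)

# Reader schema adds a column the writer never produced; it is filled with null.
evolved = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("added_later", StringType()),  # new column in the reader schema
])
spark.read.schema(evolved).parquet(path).show()
{code}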



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-03-18 Thread Reza Safi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Safi updated SPARK-27194:
--
Description: 
When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on a different executor and fails because the 
part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
already exists:

19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
(TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
failed while writing rows.
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client 17.161.235.91 already exists

The job fails when all the configured task attempts (spark.task.maxFailures) 
have failed with the same error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in 
stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 
(TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
failed while writing rows.
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
 ...
 Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client i.p.a.d already exists

 

SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt.

This issue seems to happen when 
spark.sql.sources.partitionOverwriteMode=dynamic. 
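For context, a minimal sketch of the write pattern under which this was observed (paths and columns are made up, and this does not by itself reproduce the container kill):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-partition-overwrite-demo")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

df = spark.range(100).selectExpr("id", "cast(id % 3 as string) as dt")
# Each task writes its part file under a .spark-staging-* directory before
# commit; a re-attempted task that finds its part file already present fails
# with the FileAlreadyExistsException shown above.
df.write.mode("overwrite").partitionBy("dt").parquet("/tmp/tmp_supply_demo")
{code}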

 

  was:
When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on 

[jira] [Updated] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-03-18 Thread Reza Safi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Safi updated SPARK-27194:
--
Description: 
When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on a different executor and fails because the 
part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
already exists:

19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
(TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
failed while writing rows.
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client 17.161.235.91 already exists

The job fails when all the configured task attempts (spark.task.maxFailures) 
have failed with the same error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in 
stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 
(TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
failed while writing rows.
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
 ...
 Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client i.p.a.d already exists

 

SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt.

This seems to happen when

spark.sql.sources.partitionOverwriteMode=dynamic. 

 

  was:
When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on a d

[jira] [Updated] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-03-18 Thread Reza Safi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Safi updated SPARK-27194:
--
Description: 
When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on a different executor and fails because the 
part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
already exists:

19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
(TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
failed while writing rows.
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client 17.161.235.91 already exists

The job fails when all the configured task attempts (spark.task.maxFailures) 
have failed with the same error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in 
stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 
(TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
failed while writing rows.
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
 ...
 Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client i.p.a.d already exists

 

SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt.

This seems to happen when dynamicPartitionOverwrite=dynamic. 

 

  was:
When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on a different executor

[jira] [Created] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2019-03-18 Thread Reza Safi (JIRA)
Reza Safi created SPARK-27194:
-

 Summary: Job failures when task attempts do not clean up 
spark-staging parquet files
 Key: SPARK-27194
 URL: https://issues.apache.org/jira/browse/SPARK-27194
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.3.2, 2.3.1
Reporter: Reza Safi


When a container fails for some reason (for example when killed by yarn for 
exceeding memory limits), the subsequent task attempts for the tasks that were 
running on that container all fail with a FileAlreadyExistsException. The 
original task attempt does not seem to successfully call abortTask (or at least 
its "best effort" delete is unsuccessful) and clean up the parquet file it was 
writing to, so when later task attempts try to write to the same spark-staging 
directory using the same file name, the job fails.

Here is what transpires in the logs:

The container where task 200.0 is running is killed and the task is lost:

19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 
(TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited 
caused by one of the running tasks) Reason: Container killed by YARN for 
exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.

The task is re-attempted on a different executor and fails because the 
part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
already exists:

19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
(TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client 17.161.235.91 already exists

The job fails when all the configured task attempts (spark.task.maxFailures) 
have failed with the same error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in 
stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 
(TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
...
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
/user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
 for client i.p.a.d already exists

 

SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt.

This seems to happen when dynamicPartitionOverwrite=true. 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Mrinal Kanti Sardar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mrinal Kanti Sardar resolved SPARK-27191.
-
   Resolution: Not A Bug
Fix Version/s: 2.3.0

Explained

> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
> Fix For: 2.3.0
>
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Mrinal Kanti Sardar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795170#comment-16795170
 ] 

Mrinal Kanti Sardar commented on SPARK-27191:
-

Absolutely. I, somehow, missed `unionByName`. Thanks for the explanation. Will 
close this issue.

> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795154#comment-16795154
 ] 

Liang-Chi Hsieh commented on SPARK-27191:
-

Thanks for pinging me and giving the answer [~yumwang] [~dkbiswal].

As [~dkbiswal] said, {{union}} resolves columns by position, so the behavior 
in the description is expected. I think the documentation of {{union}} explains 
it now.

> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795166#comment-16795166
 ] 

Dilip Biswal commented on SPARK-27191:
--

[~viirya] Thank you very much.

> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795150#comment-16795150
 ] 

Dilip Biswal commented on SPARK-27191:
--

Hello [~mrinal10449],

The Jira you have referred to 
[link-22335|https://issues.apache.org/jira/browse/SPARK-22335 ], actually 
hasn't resulted in a code change. As a fix, [~viirya] has improved the 
documentation of the union API by clarifying that union api resolves the 
columns by their positions and not by name. Here is the link to the 
[PR|https://github.com/apache/spark/pull/19570/files]. The recommended method 
for your use case is to use 'unionByName'.
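A minimal sketch of that recommendation, using the same dataframes as in the report below; union() matches columns by position, while unionByName() matches them by name:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-by-name-demo").getOrCreate()

df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])

# Columns are aligned by name, so col1/col2 keep their meaning in both rows.
df_1.unionByName(df_2).show()
# +----+----+
# |col1|col2|
# +----+----+
# | 1aa|1bbb|
# | 2aa|2bbb|
# +----+----+
{code}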



> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795150#comment-16795150
 ] 

Dilip Biswal edited comment on SPARK-27191 at 3/18/19 4:09 PM:
---

Hello [~mrinal10449],

The Jira you have referred to 
[link-22335|https://issues.apache.org/jira/browse/SPARK-22335 ], actually 
hasn't resulted in a code change. As a fix, [~viirya] has improved the 
documentation of the union API by clarifying that union api resolves the 
columns by their positions and not by name. Here is the link to the 
[PR|https://github.com/apache/spark/pull/19570/files]. The recommended method 
for your use case is to use 'unionByName' instead.




was (Author: dkbiswal):
Hello [~mrinal10449],

The Jira you have referred to 
[link-22335|https://issues.apache.org/jira/browse/SPARK-22335 ], actually 
hasn't resulted in a code change. As a fix, [~viirya] has improved the 
documentation of the union API by clarifying that union api resolves the 
columns by their positions and not by name. Here is the link to the 
[PR|https://github.com/apache/spark/pull/19570/files]. The recommended method 
for your use case is to use 'unionByName'.



> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-27193:

Priority: Trivial  (was: Major)

> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Trivial
>
> When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
> lines. However, CodeFormatter currently cannot handle multi-line comments:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
> \* Codegend pipeline for stage (id=1)
> \* *(1) Project [(id#0L + 1) AS (id + 1)#3L]
> \* +- *(1) Filter (id#0L = 1)
> \*+- *(1) Range (0, 10, step=1, splits=4)
> \*/
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-18 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-27112.
--
   Resolution: Fixed
 Assignee: Parth Gandhi
Fix Version/s: 3.0.0

Fixed by https://github.com/apache/spark/pull/24072

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there exist 
> two independent deadlocks, each of which occurs under different 
> circumstances. The screenshots for these two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}
>  
> Also make sure that the following configs are set when launching the above 
> spark job:
> *spark.blacklist.enabled=true*
> *spark.blacklist.killBlacklistedExecutors=true*
> *spark.blacklist.application.maxFailedTasksPerExecutor=1*
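For completeness, a PySpark sketch of the same reproduction idea under those configs (not the original job; the input loop and failure injection are simplified, and the Scala version above remains the reference):

{code:python}
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("blacklist-deadlock-repro-sketch")
         .config("spark.blacklist.enabled", "true")
         .config("spark.blacklist.killBlacklistedExecutors", "true")
         .config("spark.blacklist.application.maxFailedTasksPerExecutor", "1")
         .getOrCreate())

def fail_first_attempt(partition):
    # Fail only the first attempt of partition 5 so blacklisting kicks in
    # and the blacklisted executor gets killed.
    ctx = TaskContext.get()
    if ctx.partitionId() == 5 and ctx.attemptNumber() == 0:
        raise Exception("Fail for blacklisting")
    return partition

while True:  # keep the job running, as in the Scala example above
    spark.sparkContext.range(0, 1000, numSlices=10) \
        .mapPartitions(fail_first_attempt).count()
{code}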



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread wuyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-27193:
-
Description: 
When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
lines. However, CodeFormatter currently cannot handle multi-line comments:

 
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ /**
\* Codegend pipeline for stage (id=1)
\* *(1) Project [(id#0L + 1) AS (id + 1)#3L]
\* +- *(1) Filter (id#0L = 1)
\*+- *(1) Range (0, 10, step=1, splits=4)
\*/
/* 006 */ // codegenStageId=1
/* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
org.apache.spark.sql.execution.BufferedRowIterator {

  was:
When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
lines. However, CodeFormatter currently cannot handle multi-line comments:

 
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ /**
* Codegend pipeline for stage (id=1)
* *(1) Project [(id#0L + 1) AS (id + 1)#3L]
* +- *(1) Filter (id#0L = 1)
*+- *(1) Range (0, 10, step=1, splits=4)
*/
/* 006 */ // codegenStageId=1
/* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
org.apache.spark.sql.execution.BufferedRowIterator {


> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
> lines. However, CodeFormatter currently cannot handle multi-line comments:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
> \* Codegend pipeline for stage (id=1)
> \* *(1) Project [(id#0L + 1) AS (id + 1)#3L]
> \* +- *(1) Filter (id#0L = 1)
> \*+- *(1) Range (0, 10, step=1, splits=4)
> \*/
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27193:


Assignee: (was: Apache Spark)

> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
> lines. However, CodeFormatter currently cannot handle multi-line comments:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
> * Codegend pipeline for stage (id=1)
> * *(1) Project [(id#0L + 1) AS (id + 1)#3L]
> * +- *(1) Filter (id#0L = 1)
> *+- *(1) Range (0, 10, step=1, splits=4)
> */
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27193:


Assignee: Apache Spark

> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
> lines. However, CodeFormatter currently cannot handle multi-line comments:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
> * Codegend pipeline for stage (id=1)
> * *(1) Project [(id#0L + 1) AS (id + 1)#3L]
> * +- *(1) Filter (id#0L = 1)
> *+- *(1) Range (0, 10, step=1, splits=4)
> */
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread wuyi (JIRA)
wuyi created SPARK-27193:


 Summary: CodeFormatter should format multi comment lines correctly
 Key: SPARK-27193
 URL: https://issues.apache.org/jira/browse/SPARK-27193
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: wuyi


When `spark.sql.codegen.comments` is enabled, there will be multiple comment 
lines. However, CodeFormatter currently cannot handle multi-line comments:

 
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ /**
* Codegend pipeline for stage (id=1)
* *(1) Project [(id#0L + 1) AS (id + 1)#3L]
* +- *(1) Filter (id#0L = 1)
*+- *(1) Range (0, 10, step=1, splits=4)
*/
/* 006 */ // codegenStageId=1
/* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
org.apache.spark.sql.execution.BufferedRowIterator {
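To see that comment block in context, the generated code can be dumped from a PySpark session; a minimal sketch, assuming the EXPLAIN CODEGEN command and that spark.sql.codegen.comments is set before the session starts:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("codegen-comments-demo")
         .config("spark.sql.codegen.comments", "true")
         .getOrCreate())

# Dump the whole-stage-codegen source for the plan shown in the description.
result = spark.sql(
    "EXPLAIN CODEGEN SELECT id + 1 FROM range(0, 10, 1, 4) WHERE id = 1")
print(result.collect()[0][0])
{code}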



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27189) Add Executor level memory usage metrics to the metrics system

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27189:


Assignee: (was: Apache Spark)

> Add Executor level memory usage metrics to the metrics system
> -
>
> Key: SPARK-27189
> URL: https://issues.apache.org/jira/browse/SPARK-27189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: Example_dashboard_Spark_Memory_Metrics.PNG
>
>
> This proposes to add instrumentation of memory usage via the Spark 
> Dropwizard/Codahale metrics system. Memory usage metrics are available via 
> the Executor metrics, recently implemented as detailed in 
> https://issues.apache.org/jira/browse/SPARK-23206. 
> Making memory usage metrics available via the Spark Dropwizard metrics system 
> allows improving Spark performance dashboards and studying memory usage, as in 
> the attached example graph.
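A minimal sketch of wiring a Dropwizard sink from PySpark (here the built-in ConsoleSink; a Graphite or other sink used for dashboards is configured with the same spark.metrics.conf.* properties):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metrics-sink-demo")
         .config("spark.metrics.conf.*.sink.console.class",
                 "org.apache.spark.metrics.sink.ConsoleSink")
         .config("spark.metrics.conf.*.sink.console.period", "10")
         .config("spark.metrics.conf.*.sink.console.unit", "seconds")
         .getOrCreate())

# Generate some executor activity so the reported metrics move.
spark.range(10 ** 7).selectExpr("sum(id)").collect()
{code}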



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27189) Add Executor level memory usage metrics to the metrics system

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27189:


Assignee: Apache Spark

> Add Executor level memory usage metrics to the metrics system
> -
>
> Key: SPARK-27189
> URL: https://issues.apache.org/jira/browse/SPARK-27189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
> Attachments: Example_dashboard_Spark_Memory_Metrics.PNG
>
>
> This proposes to add instrumentation of memory usage via the Spark 
> Dropwizard/Codahale metrics system. Memory usage metrics are available via 
> the Executor metrics, recently implemented as detailed in 
> https://issues.apache.org/jira/browse/SPARK-23206. 
> Making memory usage metrics available via the Spark Dropwizard metrics system 
> allows improving Spark performance dashboards and studying memory usage, as in 
> the attached example graph.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27189) Add Executor level memory usage metrics to the metrics system

2019-03-18 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-27189:

Summary: Add Executor level memory usage metrics to the metrics system  
(was: Add Executor level metrics to the metrics system)

> Add Executor level memory usage metrics to the metrics system
> -
>
> Key: SPARK-27189
> URL: https://issues.apache.org/jira/browse/SPARK-27189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: Example_dashboard_Spark_Memory_Metrics.PNG
>
>
> This proposes to add instrumentation of memory usage via the Spark 
> Dropwizard/Codahale metrics system. Memory usage metrics are available via 
> the Executor metrics, recently implemented as detailed in 
> https://issues.apache.org/jira/browse/SPARK-23206. 
> Making memory usage metrics available via the Spark Dropwizard metrics system 
> allows improving Spark performance dashboards and studying memory usage, as in 
> the attached example graph.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27192) spark.task.cpus should be less or equal than spark.task.cpus when use static executor allocation

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27192:


Assignee: (was: Apache Spark)

> spark.task.cpus should be less or equal than spark.task.cpus when use static 
> executor allocation
> 
>
> Key: SPARK-27192
> URL: https://issues.apache.org/jira/browse/SPARK-27192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Lijia Liu
>Priority: Major
>
> When dynamic executor allocation is used, if we set spark.executor.cores smaller 
> than spark.task.cpus, an exception will be thrown as follows:
> '''spark.executor.cores must not be < spark.task.cpus'''
> But if dynamic executor allocation is not enabled, Spark will hang when a new 
> job is submitted, because TaskSchedulerImpl will not schedule a task on an 
> executor whose available cores are fewer than 
> spark.task.cpus. See 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]
> So spark.task.cpus should be checked when the task scheduler starts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27192) spark.task.cpus should be less or equal than spark.task.cpus when use static executor allocation

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27192:


Assignee: Apache Spark

> spark.task.cpus should be less or equal than spark.task.cpus when use static 
> executor allocation
> 
>
> Key: SPARK-27192
> URL: https://issues.apache.org/jira/browse/SPARK-27192
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Lijia Liu
>Assignee: Apache Spark
>Priority: Major
>
> When dynamic executor allocation is used, if we set spark.executor.cores smaller 
> than spark.task.cpus, an exception will be thrown as follows:
> '''spark.executor.cores must not be < spark.task.cpus'''
> But if dynamic executor allocation is not enabled, Spark will hang when a new 
> job is submitted, because TaskSchedulerImpl will not schedule a task on an 
> executor whose available cores are fewer than 
> spark.task.cpus. See 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]
> So spark.task.cpus should be checked when the task scheduler starts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27192) spark.task.cpus should be less or equal than spark.task.cpus when use static executor allocation

2019-03-18 Thread Lijia Liu (JIRA)
Lijia Liu created SPARK-27192:
-

 Summary: spark.task.cpus should be less than or equal to 
spark.executor.cores when static executor allocation is used
 Key: SPARK-27192
 URL: https://issues.apache.org/jira/browse/SPARK-27192
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.0, 2.2.0
Reporter: Lijia Liu


When dynamic executor allocation is used, if spark.executor.cores is set smaller 
than spark.task.cpus, an exception is thrown:

'''spark.executor.cores must not be < spark.task.cpus'''

But if dynamic executor allocation is not enabled, Spark hangs when a new job is 
submitted, because TaskSchedulerImpl will not schedule a task on an executor whose 
available cores are smaller than spark.task.cpus. See 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]

So spark.task.cpus should also be checked when the task scheduler starts.
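
For illustration, a minimal spark-shell-style sketch of the kind of check proposed here (a hypothetical stand-alone helper, not the actual TaskSchedulerImpl code; the real check would live in the scheduler start-up path):

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical helper: fail fast instead of hanging when no executor can ever
// satisfy spark.task.cpus. Defaults are simplified for this sketch.
def validateTaskCpus(conf: SparkConf): Unit = {
  val taskCpus = conf.getInt("spark.task.cpus", 1)
  val executorCores = conf.getInt("spark.executor.cores", 1)
  require(taskCpus <= executorCores,
    s"spark.task.cpus ($taskCpus) must be <= spark.executor.cores ($executorCores), " +
      "otherwise no executor can ever schedule a task")
}
{code}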



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Mrinal Kanti Sardar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mrinal Kanti Sardar updated SPARK-27191:

Summary: union of dataframes depends on order of the columns in 2.4.0  
(was: union of dataframes depends on order of the columns)

> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27191) union of dataframes depends on order of the columns in 2.4.0

2019-03-18 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795018#comment-16795018
 ] 

Yuming Wang commented on SPARK-27191:
-

cc [~viirya]

> union of dataframes depends on order of the columns in 2.4.0
> 
>
> Key: SPARK-27191
> URL: https://issues.apache.org/jira/browse/SPARK-27191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mrinal Kanti Sardar
>Priority: Major
>
> Thought this issue was resolved in 2.3.0 according to 
> https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
> 2.4.0.
> {code:java}
> >>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
> >>> df_1.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> +----+----+
> >>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
> >>> df_2.show()
> +----+----+
> |col2|col1|
> +----+----+
> |2bbb| 2aa|
> +----+----+
> >>> df_u = df_1.union(df_2)
> >>> df_u.show()
> +----+----+
> |col1|col2|
> +----+----+
> | 1aa|1bbb|
> |2bbb| 2aa|
> +----+----+
> >>> spark.version
> '2.4.0'
> >>>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27191) union of dataframes depends on order of the columns

2019-03-18 Thread Mrinal Kanti Sardar (JIRA)
Mrinal Kanti Sardar created SPARK-27191:
---

 Summary: union of dataframes depends on order of the columns
 Key: SPARK-27191
 URL: https://issues.apache.org/jira/browse/SPARK-27191
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Mrinal Kanti Sardar


Thought this issue was resolved in 2.3.0 according to 
https://issues.apache.org/jira/browse/SPARK-22335 but I still faced this in 
2.4.0.
{code:java}
>>> df_1 = spark.createDataFrame([["1aa", "1bbb"]], ["col1", "col2"])
>>> df_1.show()
+----+----+
|col1|col2|
+----+----+
| 1aa|1bbb|
+----+----+

>>> df_2 = spark.createDataFrame([["2bbb", "2aa"]], ["col2", "col1"])
>>> df_2.show()
+----+----+
|col2|col1|
+----+----+
|2bbb| 2aa|
+----+----+

>>> df_u = df_1.union(df_2)
>>> df_u.show()
+----+----+
|col1|col2|
+----+----+
| 1aa|1bbb|
|2bbb| 2aa|
+----+----+

>>> spark.version
'2.4.0'
>>>
{code}
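
For reference, a minimal spark-shell sketch (assuming Spark >= 2.3, where Dataset.unionByName is available) of resolving columns by name instead of by position, which is what union itself is documented to do:

{code:scala}
// union() resolves columns by position; unionByName() resolves them by name
val df1 = Seq(("1aa", "1bbb")).toDF("col1", "col2")
val df2 = Seq(("2bbb", "2aa")).toDF("col2", "col1")
df1.unionByName(df2).show()
// +----+----+
// |col1|col2|
// +----+----+
// | 1aa|1bbb|
// | 2aa|2bbb|
// +----+----+
{code}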



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27190) Add DataSourceV2 capabilities for streaming

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27190:


Assignee: Wenchen Fan  (was: Apache Spark)

> Add DataSourceV2 capabilities for streaming
> ---
>
> Key: SPARK-27190
> URL: https://issues.apache.org/jira/browse/SPARK-27190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27089) Loss of precision during decimal division

2019-03-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794984#comment-16794984
 ] 

Marco Gaido commented on SPARK-27089:
-

You can set: {{spark.sql.decimalOperations.allowPrecisionLoss}} to {{false}} 
and you will get the original behavior. Please see SPARK-22036 for more details.
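
For example, in spark-shell (a sketch; the config key is the one referenced above and the query is the one from the report):

{code:scala}
// Restore the pre-2.3 result for this query by disabling precision loss.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val").show()
{code}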

> Loss of precision during decimal division
> -
>
> Key: SPARK-27089
> URL: https://issues.apache.org/jira/browse/SPARK-27089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: ylo0ztlmtusq
>Priority: Major
>
> Spark loses decimal places when dividing decimal numbers.
>  
> Expected behavior (In Spark 2.2.3 or before)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> 19/03/07 21:23:51 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> +----------------+
> |             val|
> +----------------+
> |0.33333333333333|
> +----------------+
> {code}
>  
> Current behavior (In Spark 2.3.2 and later)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> +----------------+
> |             val|
> +----------------+
> |0.33333300000000|
> +----------------+
> {code}
>  
> Seems to be caused by {{promote_precision(38, 6)}}
>  
> {code:java}
> scala> spark.sql(sql).explain(true)
> == Parsed Logical Plan ==
> Project [cast((cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))) as 
> decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Analyzed Logical Plan ==
> val: decimal(38,14)
> Project [cast(CheckOverflow((promote_precision(cast(cast(3 as decimal(38,14)) 
> as decimal(38,14))) / promote_precision(cast(cast(9 as decimal(38,14)) as 
> decimal(38,14)))), DecimalType(38,6)) as decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Optimized Logical Plan ==
> Project [0.33333300000000 AS val#20]
> +- OneRowRelation
> == Physical Plan ==
> *(1) Project [0.33333300000000 AS val#20]
> +- Scan OneRowRelation[]
> {code}
>  
> Source https://stackoverflow.com/q/55046492



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27189) Add Executor level metrics to the metrics system

2019-03-18 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-27189:

Attachment: Example_dashboard_Spark_Memory_Metrics.PNG

> Add Executor level metrics to the metrics system
> 
>
> Key: SPARK-27189
> URL: https://issues.apache.org/jira/browse/SPARK-27189
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: Example_dashboard_Spark_Memory_Metrics.PNG
>
>
> This proposes to add instrumentation of memory usage via the Spark 
> Dropwizard/Codahale metrics system. Memory usage metrics are available via 
> the Executor metrics, recently implemented as detailed in 
> https://issues.apache.org/jira/browse/SPARK-23206. 
> Making memory usage metrics available via the Spark Dropwizard metrics system 
> allows us to improve Spark performance dashboards and to study memory usage, as in 
> the attached example graph.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27190) Add DataSourceV2 capabilities for streaming

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27190:


Assignee: Apache Spark  (was: Wenchen Fan)

> Add DataSourceV2 capabilities for streaming
> ---
>
> Key: SPARK-27190
> URL: https://issues.apache.org/jira/browse/SPARK-27190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27190) Add DataSourceV2 capabilities for streaming

2019-03-18 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27190:
---

 Summary: Add DataSourceV2 capabilities for streaming
 Key: SPARK-27190
 URL: https://issues.apache.org/jira/browse/SPARK-27190
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27189) Add Executor level metrics to the metrics system

2019-03-18 Thread Luca Canali (JIRA)
Luca Canali created SPARK-27189:
---

 Summary: Add Executor level metrics to the metrics system
 Key: SPARK-27189
 URL: https://issues.apache.org/jira/browse/SPARK-27189
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Luca Canali
 Attachments: Example_dashboard_Spark_Memory_Metrics.PNG

This proposes to add instrumentation of memory usage via the Spark 
Dropwizard/Codahale metrics system. Memory usage metrics are available via the 
Executor metrics, recently implemented as detailed in 
https://issues.apache.org/jira/browse/SPARK-23206. 
Making memory usage metrics available via the Spark Dropwizard metrics system 
allows us to improve Spark performance dashboards and to study memory usage, as in the 
attached example graph.
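
For context, a minimal Dropwizard/Codahale sketch of exposing a memory value as a gauge (this uses the underlying library directly, not Spark's internal metrics Source plumbing, which is what the proposal would actually extend):

{code:scala}
import java.lang.management.ManagementFactory
import com.codahale.metrics.{Gauge, MetricRegistry}

val registry = new MetricRegistry()  // Spark wires its own registry per metrics source

// Expose current JVM heap usage as a gauge, similar in spirit to the executor
// memory metrics this issue proposes to publish through the metrics system.
registry.register("executor.heapMemoryUsed", new Gauge[Long] {
  override def getValue: Long =
    ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed
})
{code}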



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21492) Memory leak in SortMergeJoin

2019-03-18 Thread Xiaoju Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794962#comment-16794962
 ] 

Xiaoju Wu commented on SPARK-21492:
---

Any updates? Has there been any discussion of a general fix instead of a hack in 
SMJ?

> Memory leak in SortMergeJoin
> 
>
> Key: SPARK-21492
> URL: https://issues.apache.org/jira/browse/SPARK-21492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.3.1, 3.0.0
>Reporter: Zhan Zhang
>Priority: Major
>
> In SortMergeJoin, if the iterator is not exhausted, there will be a memory leak 
> caused by the Sort. The memory is not released until the task ends, and cannot 
> be used by other operators, causing a performance drop or OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object

2019-03-18 Thread hehuiyuan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hehuiyuan updated SPARK-27184:
--
Description: 
In org.apache.spark.internal.config we define the variables FILES and 
JARS; we can use them instead of the string literals "spark.jars" and "spark.files".

private[spark] val JARS = ConfigBuilder("spark.jars")
 .stringConf
 .toSequence
 .createWithDefault(Nil)

private[spark] val FILES = ConfigBuilder("spark.files")
 .stringConf
 .toSequence
 .createWithDefault(Nil)

 

 

 

  was:
In the org.apache.spark.internal.object,we define the variables of FILES and 
JARS,we can use them instead of "spark.jars" and "spark.files".

private[spark] val JARS = ConfigBuilder("spark.jars")
 .stringConf
 .toSequence
 .createWithDefault(Nil)

private[spark] val FILES = ConfigBuilder("spark.files")
 .stringConf
 .toSequence
 .createWithDefault(Nil)

 

 

 


> Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in 
> config object
> 
>
> Key: SPARK-27184
> URL: https://issues.apache.org/jira/browse/SPARK-27184
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hehuiyuan
>Priority: Minor
>
> In org.apache.spark.internal.config we define the variables FILES and 
> JARS; we can use them instead of the string literals "spark.jars" and "spark.files".
> private[spark] val JARS = ConfigBuilder("spark.jars")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
> private[spark] val FILES = ConfigBuilder("spark.files")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
>  
>  
>  
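
For illustration, a rough sketch of the intended substitution (assuming code inside Spark with access to these internal config entries; the entry names are the ones defined above):

{code:scala}
// Assumes code inside Spark with a SparkConf in scope as `sparkConf`.
// Before: raw string keys scattered through the code.
val jarsBefore = sparkConf.getOption("spark.jars")

// After: the typed entries defined in org.apache.spark.internal.config.
import org.apache.spark.internal.config
val jarsAfter: Seq[String] = sparkConf.get(config.JARS)
{code}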



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27063) Spark on K8S Integration Tests timeouts are too short for some test clusters

2019-03-18 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27063.
---
Resolution: Won't Fix

> Spark on K8S Integration Tests timeouts are too short for some test clusters
> 
>
> Key: SPARK-27063
> URL: https://issues.apache.org/jira/browse/SPARK-27063
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Priority: Minor
>
> As noted during development for SPARK-26729, there are a couple of integration 
> test timeouts that are too short when running on slower clusters, e.g. 
> developers' laptops, small CI clusters, etc.
> [~skonto] confirmed that he has also experienced this behaviour in the 
> discussion on [PR 
> 23846|https://github.com/apache/spark/pull/23846#discussion_r262564938]
> We should raise the defaults of these timeouts as an initial step and, longer term, 
> consider making the timeouts themselves configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25837) Web UI does not respect spark.ui.retainedJobs in some instances

2019-03-18 Thread Xiaoju Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794932#comment-16794932
 ] 

Xiaoju Wu commented on SPARK-25837:
---

Did you verify this fix with the repro case above? I tried it and found the 
issue is still there: the cleanup still got backed up, though it was better than the 
version without this fix.

> Web UI does not respect spark.ui.retainedJobs in some instances
> ---
>
> Key: SPARK-25837
> URL: https://issues.apache.org/jira/browse/SPARK-25837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
> Environment: Reproduction Environment:
> Spark 2.3.1
> Dataproc 1.3-deb9
> 1x master 4 vCPUs, 15 GB
> 2x workers 4 vCPUs, 15 GB
>  
>Reporter: Patrick Brown
>Assignee: Patrick Brown
>Priority: Minor
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
> Attachments: Screen Shot 2018-10-23 at 4.40.51 PM (1).png
>
>
> Expected Behavior: Web UI only displays 1 completed job and remains 
> responsive.
> Actual Behavior: Both during job execution and for a non-trivial amount of time 
> after all jobs have completed, the UI retains many completed jobs, causing 
> limited responsiveness.
>  
> To reproduce:
>  
>  > spark-shell --conf spark.ui.retainedJobs=1
>   
>  scala> import scala.concurrent._
>  scala> import scala.concurrent.ExecutionContext.Implicits.global
>  scala> for (i <- 0 until 5) { Future { println(sc.parallelize(0 until i).collect.length) } }
>   
>  
>  
> The attached screenshot shows the state of the web UI after running the repro 
> code: you can see the UI is displaying some 43k completed jobs (which takes a long 
> time to load). After a few minutes of inactivity this will clear out; however, 
> in an application which continues to submit jobs every once in a while, the 
> issue persists.
>  
> The issue seems to appear when running multiple jobs at once as well as in 
> sequence for a while, and may also have something to do with high master 
> CPU usage (hence the collect in the repro code). My rough guess would be that 
> whatever is managing the cleanup of completed jobs gets overwhelmed (on the 
> master during the repro, htop reported almost full CPU usage across all 4 cores).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.

2019-03-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26811:
---

Assignee: Ryan Blue

> Add DataSourceV2 capabilities to check support for batch append, overwrite, 
> truncate during analysis.
> -
>
> Key: SPARK-26811
> URL: https://issues.apache.org/jira/browse/SPARK-26811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.

2019-03-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26811.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24012
[https://github.com/apache/spark/pull/24012]

> Add DataSourceV2 capabilities to check support for batch append, overwrite, 
> truncate during analysis.
> -
>
> Key: SPARK-26811
> URL: https://issues.apache.org/jira/browse/SPARK-26811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27186) Optimize SortShuffleWriter writing process

2019-03-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27186:

Fix Version/s: (was: 3.0.0)

> Optimize SortShuffleWriter writing process
> --
>
> Key: SPARK-27186
> URL: https://issues.apache.org/jira/browse/SPARK-27186
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: wangjiaochun
>Priority: Minor
>
> If the records iterator passed to SortShuffleWriter.write is empty, the method 
> should return directly instead of continuing; refer to how 
> BypassMergeSortShuffleWriter.write handles this case. This saves the cost of 
> creating an ExternalSorter instance and a temporary file. 
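
For illustration, a hypothetical writer sketch of the proposed early return (not the real SortShuffleWriter API; it only shows the shape of the change, mirroring what BypassMergeSortShuffleWriter.write already does):

{code:scala}
class SketchShuffleWriter(numPartitions: Int) {
  var partitionLengths: Array[Long] = Array.empty

  def write(records: Iterator[(Int, Array[Byte])]): Unit = {
    if (!records.hasNext) {
      // Nothing to write: skip creating an ExternalSorter and a temporary spill file.
      partitionLengths = new Array[Long](numPartitions)
      return
    }
    // Otherwise fall through to the usual (expensive) sort-and-spill path.
    sortAndSpill(records)
  }

  private def sortAndSpill(records: Iterator[(Int, Array[Byte])]): Unit = {
    records.foreach(_ => ())  // placeholder for the real sort/spill/merge work
  }
}
{code}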



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27188) FileStreamSink: provide a new option to disable metadata log

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27188:


Assignee: Apache Spark

> FileStreamSink: provide a new option to disable metadata log
> 
>
> Key: SPARK-27188
> URL: https://issues.apache.org/jira/browse/SPARK-27188
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> From SPARK-24295 we indicated that various end users are struggling with huge 
> FileStreamSink metadata logs. Unfortunately, given that we have arbitrary 
> readers which leverage the metadata log to determine which files can be safely read 
> (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement.
> While we may be able to check for deleted output files in 
> FileStreamSink and get rid of them when compacting metadata, that operation 
> would add overhead to the running query. (I'll try to address this 
> via another issue though.)
> Back to the issue: 'exactly-once' via leveraging metadata is only possible 
> when the output directory is being read by Spark, and for other cases it can only 
> provide a weaker guarantee. I think we could provide this option as a workaround to 
> mitigate the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27188) FileStreamSink: provide a new option to disable metadata log

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27188:


Assignee: (was: Apache Spark)

> FileStreamSink: provide a new option to disable metadata log
> 
>
> Key: SPARK-27188
> URL: https://issues.apache.org/jira/browse/SPARK-27188
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> From SPARK-24295 we indicated that various end users are struggling with huge 
> FileStreamSink metadata logs. Unfortunately, given that we have arbitrary 
> readers which leverage the metadata log to determine which files can be safely read 
> (to ensure 'exactly-once'), pruning the metadata log is not trivial to implement.
> While we may be able to check for deleted output files in 
> FileStreamSink and get rid of them when compacting metadata, that operation 
> would add overhead to the running query. (I'll try to address this 
> via another issue though.)
> Back to the issue: 'exactly-once' via leveraging metadata is only possible 
> when the output directory is being read by Spark, and for other cases it can only 
> provide a weaker guarantee. I think we could provide this option as a workaround to 
> mitigate the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API

2019-03-18 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794904#comment-16794904
 ] 

Gabor Somogyi commented on SPARK-27124:
---

[~hyukjin.kwon] thanks for your time in the discussion.

> Expose org.apache.spark.sql.avro.SchemaConverters as developer API
> --
>
> Key: SPARK-27124
> URL: https://issues.apache.org/jira/browse/SPARK-27124
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to 
> convert schemas between Spark SQL and Avro. This is reachable from the Scala side 
> but not from PySpark. I suggest adding this as a developer API to ease 
> development for PySpark users.
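
For reference, a minimal Scala sketch of the conversion this refers to (assuming the spark-avro module is on the classpath; SchemaConverters.toSqlType takes an Avro Schema and returns the corresponding Catalyst type):

{code:scala}
import org.apache.avro.SchemaBuilder
import org.apache.spark.sql.avro.SchemaConverters

// Build a small Avro schema and convert it to the corresponding Catalyst type.
val avroSchema = SchemaBuilder.record("rec").fields()
  .requiredString("name")
  .requiredInt("age")
  .endRecord()

val sqlType = SchemaConverters.toSqlType(avroSchema)
println(sqlType.dataType.simpleString)  // struct<name:string,age:int>
{code}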



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hi Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, however I'm not really sure of it.  Appreciate if someone can 
point in the right direction on where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 

  was:
Hi Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, however I'm not really sure of it. And is it safe to move or 
delete this files ? Appreciate if someone can point in the right direction on 
where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hi Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? I'm 
> assuming they could be part of a jar file that is executed whenever a Spark 
> job is completed, however I'm not really sure of it.  I would appreciate it if 
> someone could point me in the right direction on where I could check for these files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
>  These files were tagged as potentially sensitive files that could possibly be 
> exploited; as a precautionary measure, can we restrict or remove these files 
> from the website?
>  Your help is highly appreciated.
>  
> Best Regards,
> JG
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27188) FileStreamSink: provide a new option to disable metadata log

2019-03-18 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-27188:


 Summary: FileStreamSink: provide a new option to disable metadata 
log
 Key: SPARK-27188
 URL: https://issues.apache.org/jira/browse/SPARK-27188
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


From SPARK-24295 we indicated that various end users are struggling with huge 
FileStreamSink metadata logs. Unfortunately, given that we have arbitrary 
readers which leverage the metadata log to determine which files can be safely read 
(to ensure 'exactly-once'), pruning the metadata log is not trivial to implement.

While we may be able to check for deleted output files in 
FileStreamSink and get rid of them when compacting metadata, that operation 
would add overhead to the running query. (I'll try to address this via 
another issue though.)

Back to the issue: 'exactly-once' via leveraging metadata is only possible when 
the output directory is being read by Spark, and for other cases it can only provide 
a weaker guarantee. I think we could provide this option as a workaround to mitigate 
the issue.
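
As a rough sketch of what such an option could look like from the user side (the option name below is purely hypothetical, and df is assumed to be a streaming DataFrame; whatever name is agreed in the PR would apply):

{code:scala}
// Hypothetical switch (name invented for illustration): ask the file sink to skip
// its metadata log, trading 'exactly-once' for easier non-Spark consumption.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/sink-output")              // illustrative path
  .option("checkpointLocation", "/tmp/sink-ckpt")  // illustrative path
  .option("metadataLog.enabled", "false")          // hypothetical option name
  .start()
{code}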



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is running, but i could also be wrong. And is it safe to move or delete 
this files ? Appreciate if someone can point in the right direction on where I 
could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 

  was:
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? Previous 
findings shows that they are part of spark history jar files, but i couldn't 
exactly pinpoint on what is the exact jar file that this file is coming from. 
And is it safe to move or delete this files ? Appreciate if someone can point 
in the right direction on where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine what spark service or component serves 
> the following files please refer below and where can I locate this files? I'm 
> assuming it could be part of jar file that is being executed whenever a spark 
> job is running, but i could also be wrong. And is it safe to move or delete 
> this files ? Appreciate if someone can point in the right direction on where 
> I could check for this files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
>  These files were tagged as potentially sensitive files that could possibly be 
> exploited; as a precautionary measure, can we restrict or remove these files 
> from the website?
>  Your help is highly appreciated.
>  
> Best Regards,
> JG
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hi Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, however I'm not really sure of it. And is it safe to move or 
delete this files ? Appreciate if someone can point in the right direction on 
where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 

  was:
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, however I'm not really sure of it. And is it safe to move or 
delete this files ? Appreciate if someone can point in the right direction on 
where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hi Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? I'm 
> assuming they could be part of a jar file that is executed whenever a Spark 
> job is completed, however I'm not really sure of it. And is it safe to move 
> or delete these files? I would appreciate it if someone could point me in the 
> right direction on where I could check for these files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
>  These files were tagged as potentially sensitive files that could possibly be 
> exploited; as a precautionary measure, can we restrict or remove these files 
> from the website?
>  Your help is highly appreciated.
>  
> Best Regards,
> JG
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, however I'm not really sure of it. And is it safe to move or 
delete this files ? Appreciate if someone can point in the right direction on 
where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 

  was:
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, but i could also be wrong. And is it safe to move or delete 
this files ? Appreciate if someone can point in the right direction on where I 
could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? I'm 
> assuming they could be part of a jar file that is executed whenever a Spark 
> job is completed, however I'm not really sure of it. And is it safe to move 
> or delete these files? I would appreciate it if someone could point me in the 
> right direction on where I could check for these files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
>  These files were tagged as potentially sensitive files that could possibly be 
> exploited; as a precautionary measure, can we restrict or remove these files 
> from the website?
>  Your help is highly appreciated.
>  
> Best Regards,
> JG
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is completed, but i could also be wrong. And is it safe to move or delete 
this files ? Appreciate if someone can point in the right direction on where I 
could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 

  was:
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? I'm 
assuming it could be part of jar file that is being executed whenever a spark 
job is running, but i could also be wrong. And is it safe to move or delete 
this files ? Appreciate if someone can point in the right direction on where I 
could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? I'm 
> assuming they could be part of a jar file that is executed whenever a Spark 
> job is completed, but I could also be wrong. And is it safe to move or delete 
> these files? I would appreciate it if someone could point me in the right direction 
> on where I could check for these files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
>  These files were tagged as potentially sensitive files that could possibly be 
> exploited; as a precautionary measure, can we restrict or remove these files 
> from the website?
>  Your help is highly appreciated.
>  
> Best Regards,
> JG
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27186) Optimize SortShuffleWriter writing process

2019-03-18 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794881#comment-16794881
 ] 

Yuming Wang commented on SPARK-27186:
-

Please avoid setting {{Fix Version/s}}; it is usually reserved for committers.

> Optimize SortShuffleWriter writing process
> --
>
> Key: SPARK-27186
> URL: https://issues.apache.org/jira/browse/SPARK-27186
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: wangjiaochun
>Priority: Minor
> Fix For: 3.0.0
>
>
> If the records iterator passed to SortShuffleWriter.write is empty, the method 
> should return directly instead of continuing; refer to how 
> BypassMergeSortShuffleWriter.write handles this case. This saves the cost of 
> creating an ExternalSorter instance and a temporary file. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API

2019-03-18 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794883#comment-16794883
 ] 

Hyukjin Kwon commented on SPARK-27124:
--

Yea, thanks [~gsomogyi]. I'll keep my eyes on the mailing list to see if we need to 
discuss this API later!

> Expose org.apache.spark.sql.avro.SchemaConverters as developer API
> --
>
> Key: SPARK-27124
> URL: https://issues.apache.org/jira/browse/SPARK-27124
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to 
> convert schemas between Spark SQL and Avro. This is reachable from the Scala side 
> but not from PySpark. I suggest adding this as a developer API to ease 
> development for PySpark users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? Previous 
findings shows that they are part of spark history jar files, but i couldn't 
exactly pinpoint on what is the exact jar file that this file is coming from. 
And is it safe to move or delete this files ? Appreciate if someone can point 
in the right direction on where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

 This files was tag as possible sensitive files and possibly can be be 
exploited, as a precautionary measure can we restrict or remove this file from 
the website.

 Your help is highly appreciated.

 

Best Regards,

JG

 

  was:
Hello Everyone,

 Is there any way I could determine what spark service or component serves the 
following files please refer below and where can I locate this files? Previous 
findings shows that they are part of spark history jar files, but i couldn't 
exactly pinpoint on what is the exact jar file that this file is coming from. 
And is it safe to move or delete this files ? Appreciate if someone can point 
in the right direction on where I could check for this files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

Your help is highly appreciated.

Best Regards,

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? 
> Previous findings show that they are part of the Spark history jar files, but I 
> couldn't pinpoint exactly which jar file these files are coming from. And is it 
> safe to move or delete these files? I would appreciate it if someone could point 
> me in the right direction on where I could check for these 
> files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
>  These files were tagged as potentially sensitive files that could possibly be 
> exploited; as a precautionary measure, can we restrict or remove these files 
> from the website?
>  Your help is highly appreciated.
>  
> Best Regards,
> JG
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Issue Type: Bug  (was: Question)

> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? 
> Previous findings show that they are part of the Spark history jar files, but I 
> couldn't pinpoint exactly which jar file these files are coming from. And is it 
> safe to move or delete these files? I would appreciate it if someone could point 
> me in the right direction on where I could check for these 
> files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
> Your help is highly appreciated.
> Best Regards,
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hello Everyone,

 Is there any way I could determine which Spark service or component serves the 
following files (please refer below), and where can I locate these files? 
Previous findings show that they are part of the Spark history jar files, but I 
couldn't exactly pinpoint the jar file these files come from. Is it safe to 
move or delete these files? I would appreciate it if someone could point me in 
the right direction on where to check for these files.

http:///history.7z
 http:///history.bak
 http:///history.bz2
 http:///history.cfg
 http:///history.csv
 http:///history.dump
 http:///history.gz
 http:///history.ini
 http:///history.jar
 http:///history.old
 http:///history.ost
 http:///history.pst
 http:///history.sh
 http:///history.sln
 http:///history.sql
 http:///history.sql.bz2
 http:///history.sql.gz
 http:///history.tar
 http:///history.tar.bz2
 http:///history.tar.gz
 http:///history.war
 http:///history.zip

Your help is highly appreciated.

Best Regards,

 

  was:
Hello Everyone,

 Is there any way I could determine which Spark service or component serves the 
following files (please refer below), and where can I locate these files? 
Previous findings show that they are part of the Spark history jar files, but I 
couldn't exactly pinpoint the jar file these files come from. Is it safe to 
move or delete these files?

http:///history.7z
http:///history.bak
http:///history.bz2
http:///history.cfg
http:///history.csv
http:///history.dump
http:///history.gz
http:///history.ini
http:///history.jar
http:///history.old
http:///history.ost
http:///history.pst
http:///history.sh
http:///history.sln
http:///history.sql
http:///history.sql.bz2
http:///history.sql.gz
http:///history.tar
http:///history.tar.bz2
http:///history.tar.gz
 http:///history.war
 http:///history.zip

Your help is highly appreciated.

 

Best Regards,

 


> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? 
> Previous findings show that they are part of the Spark history jar files, but 
> I couldn't exactly pinpoint the jar file these files come from. Is it safe to 
> move or delete these files? I would appreciate it if someone could point me in 
> the right direction on where to check for these files.
> http:///history.7z
>  http:///history.bak
>  http:///history.bz2
>  http:///history.cfg
>  http:///history.csv
>  http:///history.dump
>  http:///history.gz
>  http:///history.ini
>  http:///history.jar
>  http:///history.old
>  http:///history.ost
>  http:///history.pst
>  http:///history.sh
>  http:///history.sln
>  http:///history.sql
>  http:///history.sql.bz2
>  http:///history.sql.gz
>  http:///history.tar
>  http:///history.tar.bz2
>  http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
> Your help is highly appreciated.
> Best Regards,
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27187:
-
Description: 
Hello Everyone,

 Is there any way I could determine which Spark service or component serves the 
following files (please refer below), and where can I locate these files? 
Previous findings show that they are part of the Spark history jar files, but I 
couldn't exactly pinpoint the jar file these files come from. Is it safe to 
move or delete these files?

http:///history.7z
http:///history.bak
http:///history.bz2
http:///history.cfg
http:///history.csv
http:///history.dump
http:///history.gz
http:///history.ini
http:///history.jar
http:///history.old
http:///history.ost
http:///history.pst
http:///history.sh
http:///history.sln
http:///history.sql
http:///history.sql.bz2
http:///history.sql.gz
http:///history.tar
http:///history.tar.bz2
http:///history.tar.gz
 http:///history.war
 http:///history.zip

Your help is highly appreciated.

 

Best Regards,

 

> What spark jar files serves the following files ..
> --
>
> Key: SPARK-27187
> URL: https://issues.apache.org/jira/browse/SPARK-27187
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Hello Everyone,
>  Is there any way I could determine which Spark service or component serves 
> the following files (please refer below), and where can I locate these files? 
> Previous findings show that they are part of the Spark history jar files, but 
> I couldn't exactly pinpoint the jar file these files come from. Is it safe to 
> move or delete these files?
> http:///history.7z
> http:///history.bak
> http:///history.bz2
> http:///history.cfg
> http:///history.csv
> http:///history.dump
> http:///history.gz
> http:///history.ini
> http:///history.jar
> http:///history.old
> http:///history.ost
> http:///history.pst
> http:///history.sh
> http:///history.sln
> http:///history.sql
> http:///history.sql.bz2
> http:///history.sql.gz
> http:///history.tar
> http:///history.tar.bz2
> http:///history.tar.gz
>  http:///history.war
>  http:///history.zip
> Your help is highly appreciated.
>  
> Best Regards,
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27186) Optimize SortShuffleWriter writing process

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27186:


Assignee: Apache Spark

> Optimize SortShuffleWriter writing process
> --
>
> Key: SPARK-27186
> URL: https://issues.apache.org/jira/browse/SPARK-27186
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: wangjiaochun
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.0.0
>
>
> If the records iterator passed to SortShuffleWriter.write is empty, the 
> method should return directly instead of continuing, mirroring the handling 
> in BypassMergeSortShuffleWriter.write; there is no benefit in creating an 
> ExternalSorter instance and a temporary file for empty input.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27187) What spark jar files serves the following files ..

2019-03-18 Thread Jerry Garcia (JIRA)
Jerry Garcia created SPARK-27187:


 Summary: What spark jar files serves the following files ..
 Key: SPARK-27187
 URL: https://issues.apache.org/jira/browse/SPARK-27187
 Project: Spark
  Issue Type: Question
  Components: Spark Submit
Affects Versions: 1.6.2
Reporter: Jerry Garcia






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27186) Optimize SortShuffleWriter writing process

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27186:


Assignee: (was: Apache Spark)

> Optimize SortShuffleWriter writing process
> --
>
> Key: SPARK-27186
> URL: https://issues.apache.org/jira/browse/SPARK-27186
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: wangjiaochun
>Priority: Minor
> Fix For: 3.0.0
>
>
> If the records iterator passed to SortShuffleWriter.write is empty, the 
> method should return directly instead of continuing, mirroring the handling 
> in BypassMergeSortShuffleWriter.write; there is no benefit in creating an 
> ExternalSorter instance and a temporary file for empty input.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27186) Optimize SortShuffleWriter writing process

2019-03-18 Thread wangjiaochun (JIRA)
wangjiaochun created SPARK-27186:


 Summary: Optimize SortShuffleWriter writing process
 Key: SPARK-27186
 URL: https://issues.apache.org/jira/browse/SPARK-27186
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: wangjiaochun
 Fix For: 3.0.0


If the records iterator passed to SortShuffleWriter.write is empty, the method 
should return directly instead of continuing, mirroring the handling in 
BypassMergeSortShuffleWriter.write; there is no benefit in creating an 
ExternalSorter instance and a temporary file for empty input.
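
For illustration, a minimal, self-contained Scala sketch of the control-flow 
change described above; the names are stand-ins for the real 
SortShuffleWriter/ExternalSorter machinery, not actual Spark code:

{code}
// Short-circuit on empty input: skip the expensive sorter/temp-file setup and
// report empty output, the way BypassMergeSortShuffleWriter already does.
object EmptyInputShortCircuit {

  // Stand-in for the costly path (ExternalSorter allocation, spills, files).
  private def sortAndWrite(records: Iterator[(Int, String)]): Array[Long] =
    records.toSeq.sortBy(_._1).map(_._2.length.toLong).toArray

  def write(records: Iterator[(Int, String)], numPartitions: Int): Array[Long] =
    if (!records.hasNext) new Array[Long](numPartitions) // early return, no sorter
    else sortAndWrite(records)

  def main(args: Array[String]): Unit = {
    println(write(Iterator.empty, 4).mkString(","))                // 0,0,0,0
    println(write(Iterator(2 -> "bb", 1 -> "a"), 4).mkString(",")) // 1,2
  }
}
{code}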



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26961:


Assignee: Apache Spark

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2019-03-13-19-53-52-390.png
>
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that no worker container could connect to the driver 
> after startup and that the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *Jstack output for deadlock part is showing below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.(URL.java:599)
>  at java.net.URL.(URL.java:490)
>  at java.net.URL.(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.spark_project.jetty.server.Server.handle(Server.java:534)
>  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>  at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>  at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>  at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> "ForkJoinPool-1-worker-57":
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
>  - waiting to lock <0x0005b7991168> 
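
The dump above is truncated in the archive, but the cycle it reports is a 
classic two-lock ordering inversion: one thread holds the Hadoop Configuration 
monitor and wants the MutableURLClassLoader monitor, while another holds the 
class loader and wants the Configuration. A minimal, self-contained Scala 
sketch of that pattern (plain Object monitors standing in for the real 
classes; running it will hang, which is the point):

{code}
object LockOrderInversion {
  // Stand-ins for the two monitors in the jstack output above.
  private val configLock      = new Object // ~ org.apache.hadoop.conf.Configuration
  private val classLoaderLock = new Object // ~ org.apache.spark.util.MutableURLClassLoader

  def main(args: Array[String]): Unit = {
    // Thread A: Configuration first, then the class loader.
    val a = new Thread(() => configLock.synchronized {
      Thread.sleep(100)
      classLoaderLock.synchronized(println("A finished"))
    })
    // Thread B: class loader first, then Configuration -- opposite order.
    val b = new Thread(() => classLoaderLock.synchronized {
      Thread.sleep(100)
      configLock.synchronized(println("B finished"))
    })
    a.start(); b.start()
    a.join(); b.join() // never returns: each thread waits on the other's lock
  }
}
{code}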

[jira] [Assigned] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26961:


Assignee: (was: Apache Spark)

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
> Attachments: image-2019-03-13-19-53-52-390.png
>
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that no worker container could connect to the driver 
> after startup and that the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *Jstack output for deadlock part is showing below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.(URL.java:599)
>  at java.net.URL.(URL.java:490)
>  at java.net.URL.(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.spark_project.jetty.server.Server.handle(Server.java:534)
>  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>  at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>  at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>  at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> "ForkJoinPool-1-worker-57":
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
>  - waiting to lock <0x0005b7991168> (a 
> org.apache.spark.ut

[jira] [Resolved] (SPARK-27124) Expose org.apache.spark.sql.avro.SchemaConverters as developer API

2019-03-18 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-27124.
---
Resolution: Won't Do

OK, based on the discussion this is resolved as Won't Do. At least users may 
get some clue from this thread.

> Expose org.apache.spark.sql.avro.SchemaConverters as developer API
> --
>
> Key: SPARK-27124
> URL: https://issues.apache.org/jira/browse/SPARK-27124
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> org.apache.spark.sql.avro.SchemaConverters provides extremely useful APIs to 
> convert schemas between Spark SQL and Avro. It is reachable from the Scala 
> side but not from PySpark. I suggest adding it as a developer API to ease 
> development for PySpark users.
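
For anyone landing here from a search, a rough Scala-side usage sketch of the 
converter in question (built against the bundled spark-avro module; the method 
names and signatures below are written from memory and may differ between 
Spark versions, so treat them as assumptions to verify):

{code}
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters

object AvroSchemaRoundTrip {
  def main(args: Array[String]): Unit = {
    // Avro record schema -> Spark SQL type.
    val avroJson =
      """{"type":"record","name":"User","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":["null","int"],"default":null}
        |]}""".stripMargin
    val avroSchema = new Schema.Parser().parse(avroJson)
    val sqlType = SchemaConverters.toSqlType(avroSchema) // SchemaType(dataType, nullable)
    println(sqlType.dataType)

    // And back: Catalyst DataType -> Avro schema.
    val backToAvro = SchemaConverters.toAvroType(sqlType.dataType, nullable = false)
    println(backToAvro.toString(true))
  }
}
{code}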



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27185) mapPartition to replace map to speedUp Dataset's toLocalIterator process

2019-03-18 Thread angerszhu (JIRA)
angerszhu created SPARK-27185:
-

 Summary: mapPartition to replace map to speedUp Dataset's 
toLocalIterator process
 Key: SPARK-27185
 URL: https://issues.apache.org/jira/browse/SPARK-27185
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.0.0
Reporter: angerszhu


In my case, I use Dataset's toLocalIterator function, and I found that the 
underlying code can be improved: it can be changed from map to 
mapPartitionsInternal to speed up the process of decoding data to InternalRow.
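
A minimal, runnable sketch of the general pattern being proposed (the real 
change targets Dataset.toLocalIterator's internal decoding, which uses private 
APIs; the Decoder class below is a hypothetical stand-in for the per-row 
deserializer):

{code}
import org.apache.spark.sql.SparkSession

object MapVsMapPartitions {
  // Hypothetical stand-in for something expensive to build per record,
  // such as a row deserializer/decoder.
  class Decoder { def decode(i: Int): String = s"row-$i" }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

    // map: one Decoder constructed per record.
    val perRecord = rdd.map(i => new Decoder().decode(i))

    // mapPartitions: one Decoder per partition, reused for every record in it.
    val perPartition = rdd.mapPartitions { iter =>
      val decoder = new Decoder()
      iter.map(decoder.decode)
    }

    println((perRecord.count(), perPartition.count()))
    spark.stop()
  }
}
{code}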



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort

2019-03-18 Thread Xiaoju Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16794805#comment-16794805
 ] 

Xiaoju Wu commented on SPARK-23375:
---

But one of your test cases conflicts with what I talked about above:

test("sort should not be removed when there is a node which doesn't guarantee 
any order") {
  val orderedPlan = testRelation.select('a, 'b).orderBy('a.asc)
  val groupedAndResorted = orderedPlan.groupBy('a)(sum('a)).orderBy('a.asc)
  val optimized = Optimize.execute(groupedAndResorted.analyze)
  val correctAnswer = groupedAndResorted.analyze
  comparePlans(optimized, correctAnswer)
}

Why is it designed like this? In my opinion, since Aggregate won't pass the 
ordering up, the Sort below it is useless.

 

> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
> operator on an already sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.
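
To see whether the duplicated Sort survives on a given Spark version, a small 
self-contained check along these lines can be used (a DataFrame analogue of 
the SQL above; written from memory, so verify the plan output yourself):

{code}
import org.apache.spark.sql.SparkSession

object RedundantSortCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sort-check").getOrCreate()
    import spark.implicits._

    val table1 = Seq((3, "c"), (1, "a"), (2, "b")).toDF("a", "b")
    val query = table1.orderBy("a").select("a", "b").orderBy("a")

    // Without the optimizer rule (pre-2.4) two Sort operators show up in the
    // optimized/physical plan; with the fix only one should remain.
    query.explain(true)
    spark.stop()
  }
}
{code}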



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27184:


Assignee: (was: Apache Spark)

> Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in 
> config object
> 
>
> Key: SPARK-27184
> URL: https://issues.apache.org/jira/browse/SPARK-27184
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hehuiyuan
>Priority: Minor
>
> In the org.apache.spark.internal.config package object, we define the JARS 
> and FILES variables; we can use them instead of the string literals 
> "spark.jars" and "spark.files".
> private[spark] val JARS = ConfigBuilder("spark.jars")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
> private[spark] val FILES = ConfigBuilder("spark.files")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object

2019-03-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27184:


Assignee: Apache Spark

> Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in 
> config object
> 
>
> Key: SPARK-27184
> URL: https://issues.apache.org/jira/browse/SPARK-27184
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hehuiyuan
>Assignee: Apache Spark
>Priority: Minor
>
> In the org.apache.spark.internal.config package object, we define the JARS 
> and FILES variables; we can use them instead of the string literals 
> "spark.jars" and "spark.files".
> private[spark] val JARS = ConfigBuilder("spark.jars")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
> private[spark] val FILES = ConfigBuilder("spark.files")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object

2019-03-18 Thread hehuiyuan (JIRA)
hehuiyuan created SPARK-27184:
-

 Summary: Replace "spark.jars" & "spark.files" with the variables 
of JARS & FILES in config object
 Key: SPARK-27184
 URL: https://issues.apache.org/jira/browse/SPARK-27184
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: hehuiyuan


In the org.apache.spark.internal.config package object, we define the JARS and 
FILES variables; we can use them instead of the string literals "spark.jars" 
and "spark.files".

private[spark] val JARS = ConfigBuilder("spark.jars")
 .stringConf
 .toSequence
 .createWithDefault(Nil)

private[spark] val FILES = ConfigBuilder("spark.files")
 .stringConf
 .toSequence
 .createWithDefault(Nil)
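
A sketch of the kind of call-site change this implies; note that both the 
constants above and SparkConf.get(ConfigEntry) are private[spark], so code 
like this only compiles inside Spark itself, and the object and method names 
below are made up for illustration:

{code}
package org.apache.spark.deploy

import org.apache.spark.SparkConf
import org.apache.spark.internal.config.{FILES, JARS}

private[spark] object ConfigEntryUsageSketch {
  // Before: raw string key, manual default handling and splitting.
  def jarsBefore(conf: SparkConf): Seq[String] =
    conf.get("spark.jars", "").split(",").filter(_.nonEmpty).toSeq

  // After: the ConfigEntry carries the Seq[String] type and the Nil default.
  def jarsAfter(conf: SparkConf): Seq[String] = conf.get(JARS)
  def filesAfter(conf: SparkConf): Seq[String] = conf.get(FILES)
}
{code}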

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27184) Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in config object

2019-03-18 Thread hehuiyuan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hehuiyuan updated SPARK-27184:
--
External issue URL:   (was: https://aaa)

> Replace "spark.jars" & "spark.files" with the variables of JARS & FILES in 
> config object
> 
>
> Key: SPARK-27184
> URL: https://issues.apache.org/jira/browse/SPARK-27184
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hehuiyuan
>Priority: Minor
>
> In the org.apache.spark.internal.config package object, we define the JARS 
> and FILES variables; we can use them instead of the string literals 
> "spark.jars" and "spark.files".
> private[spark] val JARS = ConfigBuilder("spark.jars")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
> private[spark] val FILES = ConfigBuilder("spark.files")
>  .stringConf
>  .toSequence
>  .createWithDefault(Nil)
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


