[jira] [Assigned] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23815:


Assignee: (was: Apache Spark)

> Spark writer dynamic partition overwrite mode fails to write output on multi 
> level partition
> 
>
> Key: SPARK-23815
> URL: https://issues.apache.org/jira/browse/SPARK-23815
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Minor
>
> Spark introduced a new writer mode in SPARK-20236 that overwrites only the 
> affected partitions. While using this feature in our production cluster, we 
> found a bug when writing multi-level partitions on HDFS.
> A simple test case to reproduce the issue:
> val df = Seq(("1", "2", "3")).toDF("col1", "col2", "col3")
> df.write.partitionBy("col1", "col2").mode("overwrite").save("/my/hdfs/location")
> If the HDFS location "/my/hdfs/location" does not exist, no output is written.
> This appears to be caused by the job-commit change that SPARK-20236 made in 
> HadoopMapReduceCommitProtocol.
> During job commit, the output has been written to the staging dir 
> /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls 
> fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to 
> /my/hdfs/location/col1=1/col2=2. In our case this operation fails on HDFS 
> because /my/hdfs/location/col1=1 does not exist: an HDFS rename cannot create 
> more than one missing directory level.
> This does not happen in the unit tests added with SPARK-20236, which use the 
> local file system.
> We are proposing a fix: when cleaning the current partition dir 
> /my/hdfs/location/col1=1/col2=2 before the rename, if the delete fails 
> (because /my/hdfs/location/col1=1/col2=2 may not exist), call mkdirs to 
> create the parent dir /my/hdfs/location/col1=1 (if it does not exist) so the 
> subsequent rename can succeed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23815:


Assignee: Apache Spark

> Spark writer dynamic partition overwrite mode fails to write output on multi 
> level partition
> 
>
> Key: SPARK-23815
> URL: https://issues.apache.org/jira/browse/SPARK-23815
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Assignee: Apache Spark
>Priority: Minor
>






[jira] [Commented] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418452#comment-16418452
 ] 

Apache Spark commented on SPARK-23815:
--

User 'fangshil' has created a pull request for this issue:
https://github.com/apache/spark/pull/20931

> Spark writer dynamic partition overwrite mode fails to write output on multi 
> level partition
> 
>
> Key: SPARK-23815
> URL: https://issues.apache.org/jira/browse/SPARK-23815
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Minor
>






[jira] [Created] (SPARK-23815) Spark writer dynamic partition overwrite mode fails to write output on multi level partition

2018-03-28 Thread Fangshi Li (JIRA)
Fangshi Li created SPARK-23815:
--

 Summary: Spark writer dynamic partition overwrite mode fails to 
write output on multi level partition
 Key: SPARK-23815
 URL: https://issues.apache.org/jira/browse/SPARK-23815
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Fangshi Li


Spark introduced a new writer mode in SPARK-20236 that overwrites only the 
affected partitions. While using this feature in our production cluster, we 
found a bug when writing multi-level partitions on HDFS.


A simple test case to reproduce the issue:
val df = Seq(("1", "2", "3")).toDF("col1", "col2", "col3")
df.write.partitionBy("col1", "col2").mode("overwrite").save("/my/hdfs/location")


If the HDFS location "/my/hdfs/location" does not exist, no output is written.


This appears to be caused by the job-commit change that SPARK-20236 made in 
HadoopMapReduceCommitProtocol.


During job commit, the output has been written to the staging dir 
/my/hdfs/location/.spark-staging.xxx/col1=1/col2=2, and the code then calls 
fs.rename to rename /my/hdfs/location/.spark-staging.xxx/col1=1/col2=2 to 
/my/hdfs/location/col1=1/col2=2. In our case this operation fails on HDFS 
because /my/hdfs/location/col1=1 does not exist: an HDFS rename cannot create 
more than one missing directory level.

This does not happen in the unit tests added with SPARK-20236, which use the 
local file system.


We are proposing a fix: when cleaning the current partition dir 
/my/hdfs/location/col1=1/col2=2 before the rename, if the delete fails 
(because /my/hdfs/location/col1=1/col2=2 may not exist), call mkdirs to create 
the parent dir /my/hdfs/location/col1=1 (if it does not exist) so the 
subsequent rename can succeed.
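The proposed delete/mkdirs/rename commit sequence can be sketched as follows. This is a hypothetical illustration using java.nio.file on the local file system as a stand-in for the HDFS FileSystem API (the class and method names here are invented, not Spark's); like an HDFS rename, Files.move fails when the destination's parent directory is missing, which is exactly the failure mode described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the proposed commit-time fix (local-filesystem stand-in for HDFS).
public class PartitionCommitSketch {
    // Move a staged partition directory into its final location.
    static void commitPartition(Path staged, Path finalDir) throws IOException {
        // Clean the destination partition dir if it exists (overwrite
        // semantics). On HDFS this would be a recursive fs.delete; this
        // sketch only handles the case where the dir is absent or empty.
        Files.deleteIfExists(finalDir);
        // The fix: ensure the parent (e.g. .../col1=1) exists before the
        // rename, because one rename cannot create multiple missing levels.
        Files.createDirectories(finalDir.getParent());
        Files.move(staged, finalDir);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("commit-sketch");
        // Staged output: base/.spark-staging/col1=1/col2=2/part-00000
        Path staged = base.resolve(".spark-staging/col1=1/col2=2");
        Files.createDirectories(staged);
        Files.writeString(staged.resolve("part-00000"), "1,2,3");
        // base/col1=1 does not exist yet; without the createDirectories call
        // above, the move below would fail with NoSuchFileException.
        Path finalDir = base.resolve("col1=1/col2=2");
        commitPartition(staged, finalDir);
        System.out.println(Files.exists(finalDir.resolve("part-00000")));
    }
}
```

Running the sketch prints true, confirming the staged file survives the two-level rename once the parent is created first.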






[jira] [Commented] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418434#comment-16418434
 ] 

Li Yuanjian commented on SPARK-23811:
-

 

The scenario can be reproduced by the test case below, added to 
{{DAGSchedulerSuite}}:
{code:java}
/**
 * This tests the case where the original task succeeds after its speculative
 * attempt got a FetchFailed.
 */
test("[SPARK-23811] Fetch failed task should kill other attempt") {
  // Create 3 RDDs with shuffle dependencies on each other: rddA <--- rddB <--- rddC
  val rddA = new MyRDD(sc, 2, Nil)
  val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
  val shuffleIdA = shuffleDepA.shuffleId

  val rddB = new MyRDD(sc, 2, List(shuffleDepA), tracker = mapOutputTracker)
  val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))

  val rddC = new MyRDD(sc, 2, List(shuffleDepB), tracker = mapOutputTracker)

  submit(rddC, Array(0, 1))

  // Complete both tasks in rddA.
  assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostA", 2)),
    (Success, makeMapStatus("hostB", 2))))

  // The first task succeeds.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(0), Success, makeMapStatus("hostB", 2)))

  // The second task's speculative attempt fails first, while the task itself
  // is still running. This may be caused by an ExecutorLost.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1),
    FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0, "ignored"),
    null))
  // Check the currently missing partition.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 1)
  val missingPartition =
    mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get(0)

  // The second task itself soon succeeds.
  runEvent(makeCompletionEvent(
    taskSets(1).tasks(1), Success, makeMapStatus("hostB", 2)))
  // No missing partitions here; this causes the child stage to never succeed.
  assert(mapOutputTracker.findMissingPartitions(shuffleDepB.shuffleId).get.size === 0)
}
{code}
 

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> This is a bug caused by the abnormal scenario described below:
>  # ShuffleMapTask 1.0 is running; this task fetches data from ExecutorA.
>  # ExecutorA is lost, which triggers 
> `mapOutputTracker.removeOutputsOnExecutor(execId)` and changes the 
> shuffleStatus.
>  # The speculative ShuffleMapTask 1.1 starts and immediately gets a 
> FetchFailed.
>  # ShuffleMapTask 1 is the last task of its stage, so the stage can never 
> succeed because the DAGScheduler finds no missing tasks to resubmit.
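The bookkeeping race above can be modeled with a toy sketch. This greatly simplifies Spark's MapOutputTracker (the class, fields, and method here are invented for illustration): the speculative attempt's FetchFailed marks the partition missing, then the original attempt's late Success re-registers the output, so by the time the failed stage is resubmitted there are no missing partitions left to schedule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the SPARK-23811 race (not Spark's real MapOutputTracker).
public class FetchFailedRace {
    // shuffle partition id -> whether a map output is registered
    static Map<Integer, Boolean> outputs = new HashMap<>();

    static Set<Integer> findMissingPartitions() {
        Set<Integer> missing = new TreeSet<>();
        for (Map.Entry<Integer, Boolean> e : outputs.entrySet())
            if (!e.getValue()) missing.add(e.getKey());
        return missing;
    }

    public static void main(String[] args) {
        outputs.put(0, true);   // task 0 succeeded
        outputs.put(1, false);  // task 1's output was removed on executor loss
        // Speculative attempt of task 1 hits FetchFailed: partition 1 is
        // missing, and the stage is marked for resubmission.
        System.out.println("after FetchFailed: missing=" + findMissingPartitions());
        // The original attempt of task 1 then reports a late Success and
        // re-registers its output.
        outputs.put(1, true);
        // The resubmitted stage now sees no missing partitions, so it runs no
        // tasks, yet the child stage is still waiting on the retry.
        System.out.println("after late Success: missing=" + findMissingPartitions());
    }
}
```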






[jira] [Assigned] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23811:


Assignee: (was: Apache Spark)

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>






[jira] [Commented] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418431#comment-16418431
 ] 

Apache Spark commented on SPARK-23811:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/20930

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>






[jira] [Assigned] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23811:


Assignee: Apache Spark

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Assignee: Apache Spark
>Priority: Major
> Attachments: 1.png, 2.png
>
>






[jira] [Commented] (SPARK-23812) DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4

2018-03-28 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418430#comment-16418430
 ] 

Yuming Wang commented on SPARK-23812:
-

[~wangtao93] Would you like to work on this?

> DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4
> --
>
> Key: SPARK-23812
> URL: https://issues.apache.org/jira/browse/SPARK-23812
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: wangtao93
>Priority: Minor
>
> The dfs command is already supported, but SqlBase.g4 still lists it in 
> unsupportedHiveNativeCommands.






[jira] [Updated] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.

2018-03-28 Thread bharath kumar avusherla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-23814:

Description: 
When the file name contains a colon and the data contains a newline character, 
reading with spark.read.option("multiLine","true").csv("s3n://Directory/") 
throws *java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz*. If we remove 
option("multiLine","true"), it works fine even though the file name contains a 
colon. It also works fine if we apply *option("multiLine","true")* to any 
other file whose name does not contain a colon. But when both are present (a 
colon in the file name and a newline in the data), it fails.
{quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  at org.apache.hadoop.fs.Path.<init>(Path.java:93)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:253)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
  at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
  at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
  at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)
  at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
  at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)
  at scala.Option.orElse(Option.scala:289)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.<init>(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 86 more
{quote}
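The URISyntaxException above can be reproduced outside Spark and Hadoop. This is a minimal, hypothetical illustration of the JDK check that Hadoop's Path.initialize ultimately trips over (Hadoop's own path parsing is not shown): java.net.URI's multi-argument constructor rejects any URI that has a scheme but a relative path, and a colon appearing before the first slash in a file name can cause the prefix to be taken as a scheme.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Demonstrates java.net.URI rejecting a relative path when a scheme is set.
// The "s3n" scheme and the file name are illustrative values only.
public class ColonFileName {
    public static void main(String[] args) {
        try {
            // A non-null scheme combined with a path that does not start
            // with '/' fails the URI constructor's path check.
            new URI("s3n", null, "2017-08-01T00:00:00Z.csv.gz", null, null);
        } catch (URISyntaxException e) {
            System.out.println(e.getReason());
        }
    }
}
```

This prints the same reason string seen in the stack trace, "Relative path in absolute URI", which is why the error only surfaces on the code path that builds scheme-qualified URIs from the raw file names.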


[jira] [Updated] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.

2018-03-28 Thread bharath kumar avusherla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-23814:


[jira] [Updated] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.

2018-03-28 Thread bharath kumar avusherla (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla updated SPARK-23814:

Description: 
When the file name has colon and new line character in data, while reading 
using spark.read.option("multiLine","true").csv("sn://Directory/") function. It 
is throwing *"**java.lang.IllegalArgumentException: 
java.net.URISyntaxException: Relative path in absolute URI: 
2017-08-01T00:00:00Z.csv.gz"* error. If we remove the 
option("multiLine","true"), it is working just fine though the file name has 
colon in it. It is working fine, If i apply this option 
*option("multiLine","true")* on any other file which doesn't have colon in it. 
But when both are present (colon in file name and new line in the data), it's 
not working.
{quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz

  at org.apache.hadoop.fs.Path.initialize(Path.java:205)

  at org.apache.hadoop.fs.Path.(Path.java:171)

  at org.apache.hadoop.fs.Path.(Path.java:93)

  at org.apache.hadoop.fs.Globber.glob(Globber.java:253)

  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)

  at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)

  at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)

  at 
org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)

  at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)

  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)

  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)

  at 
org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)

  at 
org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)

  at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)

  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)

  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)

  at scala.Option.orElse(Option.scala:289)

  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)

  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)

  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)

  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)

  ... 48 elided

Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
2017-08-01T00:00:00Z.csv.gz

  at java.net.URI.checkPath(URI.java:1823)

  at java.net.URI.(URI.java:745)

  at org.apache.hadoop.fs.Path.initialize(Path.java:202)

  ... 86 more
{quote}

  was:
When the file name contains a colon and the data contains a newline character, reading 
the file with spark.read.option("multiLine","true").csv("sn://Directory/") throws 
*"java.lang.IllegalArgumentException: 
java.net.URISyntaxException: Relative path in absolute URI: 
2017-08-01T00:00:00Z.csv.gz"*. If we remove 
option("multiLine","true"), it works fine even though the file name contains a 
colon. It also works fine if we apply *option("multiLine","true")* to any other 
file whose name does not contain a colon. But when both are present (a colon in 
the file name and a newline in the data), it does not work.
{quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz

  at 

[jira] [Created] (SPARK-23814) Couldn't read file with colon in name and new line character in one of the field.

2018-03-28 Thread bharath kumar avusherla (JIRA)
bharath kumar avusherla created SPARK-23814:
---

 Summary: Couldn't read file with colon in name and new line 
character in one of the field.
 Key: SPARK-23814
 URL: https://issues.apache.org/jira/browse/SPARK-23814
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell
Affects Versions: 2.2.0
Reporter: bharath kumar avusherla


When the file name contains a colon and the data contains a newline character, reading 
the file with spark.read.option("multiLine","true").csv("sn://Directory/") throws 
*"java.lang.IllegalArgumentException: 
java.net.URISyntaxException: Relative path in absolute URI: 
2017-08-01T00:00:00Z.csv.gz"*. If we remove 
option("multiLine","true"), it works fine even though the file name contains a 
colon. It also works fine if we apply *option("multiLine","true")* to any other 
file whose name does not contain a colon. But when both are present (a colon in 
the file name and a newline in the data), it does not work.
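The failure mode can be illustrated outside Spark with a toy parser. This is a simplified sketch inferred from the stack trace below, not Hadoop's verbatim code: Hadoop's Path treats the text before the first colon as a URI scheme when no slash precedes it, so the bare timestamped file name is handed to java.net.URI as an "absolute" URI with a relative path.

```python
def mistaken_for_scheme(name: str) -> bool:
    # Simplified rule: a ':' that appears before any '/' makes the prefix
    # look like a URI scheme; java.net.URI then rejects the remainder as a
    # "relative path in an absolute URI".
    colon, slash = name.find(":"), name.find("/")
    return colon != -1 and (slash == -1 or colon < slash)

print(mistaken_for_scheme("2017-08-01T00:00:00Z.csv.gz"))      # colon in bare name
print(mistaken_for_scheme("2017-08-01.csv.gz"))                # no colon, parses fine
print(mistaken_for_scheme("dir/2017-08-01T00:00:00Z.csv.gz"))  # slash comes first
```

This also explains why the failure needs multiLine: only that code path lists files in a way that rebuilds paths from bare names.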
{quote}java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Relative path in absolute URI: 2017-08-01T00:00:00Z.csv.gz

  at org.apache.hadoop.fs.Path.initialize(Path.java:205)

  at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:171)

  at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:93)

  at org.apache.hadoop.fs.Globber.glob(Globber.java:253)

  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)

  at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)

  at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)

  at 
org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)

  at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:46)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)

  at scala.Option.getOrElse(Option.scala:121)

  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)

  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)

  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)

  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)

  at 
org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:224)

  at 
org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)

  at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)

  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)

  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:177)

  at scala.Option.orElse(Option.scala:289)

  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:176)

  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)

  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)

  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)

  ... 48 elided

Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
2017-08-01T00:00:00Z.csv.gz

  at java.net.URI.checkPath(URI.java:1823)

  at java.net.URI.&lt;init&gt;(URI.java:745)

  at org.apache.hadoop.fs.Path.initialize(Path.java:202)

  ... 86 more
{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2018-03-28 Thread yuliang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418410#comment-16418410
 ] 

yuliang commented on SPARK-5928:


!image-2018-03-29-11-52-32-075.png|width=741,height=189!
 
From the picture, the shuffle read size is over 2 GB, but the job did not fail.

Can anyone explain why?
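For context, the 2 GB limit comes from sizes that must fit in a signed 32-bit integer (the Netty frame length in the quoted stack trace is capped at Integer.MAX_VALUE). A quick back-of-the-envelope check on the repro program quoted below, in plain Python:

```python
# The repro builds one partition of ~3 GB: 1e6 records of 3e3 random bytes
# each. Any size tracked as a signed 32-bit int cannot represent it.
INT32_MAX = 2**31 - 1                   # 2147483647, the 2 GB frame limit
records = int(1e6)
bytes_per_record = int(3e3)
payload = records * bytes_per_record    # raw data, before serialization overhead

print(payload)               # 3000000000
print(payload > INT32_MAX)   # True: the block cannot fit in one frame
```

The frame size actually logged in the quoted trace, 3021252889, is this payload plus serialization overhead.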

 

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Major
> Attachments: image-2018-03-29-11-52-32-075.png
>
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   

[jira] [Updated] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23811:

Attachment: 2.png

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png, 2.png
>
>
> This is a bug caused by the abnormal scenario described below:
>  # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA.
>  # ExecutorA is lost, which triggers 
> `mapOutputTracker.removeOutputsOnExecutor(execId)`; the shuffleStatus changes.
>  # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
>  # ShuffleMapTask 1 is the last task of its stage, so this stage will never 
> succeed, because there is no missing task the DAGScheduler can find.






[jira] [Updated] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Li Yuanjian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Yuanjian updated SPARK-23811:

Attachment: 1.png

> Same tasks' FetchFailed event comes before Success will cause child stage 
> never succeed
> ---
>
> Key: SPARK-23811
> URL: https://issues.apache.org/jira/browse/SPARK-23811
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Li Yuanjian
>Priority: Major
> Attachments: 1.png
>
>
> This is a bug caused by the abnormal scenario described below:
>  # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA.
>  # ExecutorA is lost, which triggers 
> `mapOutputTracker.removeOutputsOnExecutor(execId)`; the shuffleStatus changes.
>  # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
>  # ShuffleMapTask 1 is the last task of its stage, so this stage will never 
> succeed, because there is no missing task the DAGScheduler can find.






[jira] [Updated] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2018-03-28 Thread yuliang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuliang updated SPARK-5928:
---
Attachment: image-2018-03-29-11-52-32-075.png

> Remote Shuffle Blocks cannot be more than 2 GB
> --
>
> Key: SPARK-5928
> URL: https://issues.apache.org/jira/browse/SPARK-5928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Priority: Major
> Attachments: image-2018-03-29-11-52-32-075.png
>
>
> If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
> exception.  The tasks get retried a few times and then eventually the job 
> fails.
> Here is an example program which can cause the exception:
> {code}
> val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
>   val n = 3e3.toInt
>   val arr = new Array[Byte](n)
>   //need to make sure the array doesn't compress to something small
>   scala.util.Random.nextBytes(arr)
>   arr
> }
> rdd.map { x => (1, x)}.groupByKey().count()
> {code}
> Note that you can't trigger this exception in local mode, it only happens on 
> remote fetches.   I triggered these exceptions running with 
> {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
> {noformat}
> 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
> imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
> imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
> org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
> 2147483647: 3021252889 - discarded
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
> length exceeds 2147483647: 3021252889 - discarded
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
>   at 
> io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>  

[jira] [Updated] (SPARK-23813) [SparkSQL] the result is different between hive and spark when use PARSE_URL()

2018-03-28 Thread Cui Xixin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cui Xixin updated SPARK-23813:
--
Description: 
 I am working on replacing HiveQL with Spark SQL and found that parse_url 
performs differently. 
 For example:
  "select parse_url('http://spark.apache.org/path?query=1 =1','HOST') from 
dual limit 1"
       In Hive the result is "spark.apache.org", but in Spark SQL it is "null". 
I then found SPARK-16826 
(https://issues.apache.org/jira/browse/SPARK-16826): the implementation was 
changed for better performance, but this also introduced the difference. In fact, 
the root cause is the space after "query=1". Could we fix it by, for example, 
encoding the URL string before new URI() and decoding the result?

  was:
 i am working on replacing hivesql with sparksql now, i fount parse_url perform 
differently, 
for example:
 "select parse_url('http://spark.apache.org/path?query=1 =1','HOST') from 
dual limit 1"
      in hive,the result is "spark.apache.org" ,but in sparksql is "null", then 
i fount SPARK-16826, 
https://issues.apache.org/jira/browse/SPARK-16826 ,the implementation has been 
changed for better performence, but also lead to the difference. in fact, the 
main reason is the 'space' after "query=1", so if we can fix it, for example, 
encode the sql before new URI() and decode the result?


> [SparkSQL] the result is different between hive and spark when use 
> PARSE_URL() 
> ---
>
> Key: SPARK-23813
> URL: https://issues.apache.org/jira/browse/SPARK-23813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Cui Xixin
>Priority: Minor
>
>  I am working on replacing HiveQL with Spark SQL and found that parse_url 
> performs differently. 
>  For example:
>   "select parse_url('http://spark.apache.org/path?query=1 =1','HOST') 
> from dual limit 1"
>        In Hive the result is "spark.apache.org", but in Spark SQL it is 
> "null". I then found SPARK-16826 
> (https://issues.apache.org/jira/browse/SPARK-16826): the implementation was 
> changed for better performance, but this also introduced the difference. In 
> fact, the root cause is the space after "query=1". Could we fix it by, for 
> example, encoding the URL string before new URI() and decoding the result?
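The encode-before-parse idea proposed above can be sketched in plain Python. Python's urllib is more lenient than java.net.URI, which rejects the raw space outright (that is why Spark's parse_url returns null), but the shape of the fix is the same; the escaping rules below are an illustrative assumption, not Spark's actual patch.

```python
from urllib.parse import quote, urlsplit

url = "http://spark.apache.org/path?query=1 =1"

# Percent-encode the raw URL before handing it to a strict parser: with the
# URI delimiters kept safe, only the space becomes %20, and the host can
# still be recovered afterwards.
encoded = quote(url, safe=":/?&=")
print(encoded)
print(urlsplit(encoded).hostname)  # spark.apache.org
```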






[jira] [Updated] (SPARK-23813) [SparkSQL] the result is different between hive and spark when use PARSE_URL()

2018-03-28 Thread Cui Xixin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cui Xixin updated SPARK-23813:
--
Description: 
 i am working on replacing hivesql with sparksql now, i fount parse_url perform 
differently, 
for example:
 "select parse_url('http://spark.apache.org/path?query=1 =1','HOST') from 
dual limit 1"
      in hive,the result is "spark.apache.org" ,but in sparksql is "null", then 
i fount SPARK-16826, 
https://issues.apache.org/jira/browse/SPARK-16826 ,the implementation has been 
changed for better performence, but also lead to the difference. in fact, the 
main reason is the 'space' after "query=1", so if we can fix it, for example, 
encode the sql before new URI() and decode the result?

> [SparkSQL] the result is different between hive and spark when use 
> PARSE_URL() 
> ---
>
> Key: SPARK-23813
> URL: https://issues.apache.org/jira/browse/SPARK-23813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Cui Xixin
>Priority: Minor
>
>  I am working on replacing HiveQL with Spark SQL and found that parse_url 
> performs differently. 
> For example:
>  "select parse_url('http://spark.apache.org/path?query=1 =1','HOST') 
> from dual limit 1"
>       In Hive the result is "spark.apache.org", but in Spark SQL it is 
> "null". I then found SPARK-16826 
> (https://issues.apache.org/jira/browse/SPARK-16826): the implementation was 
> changed for better performance, but this also introduced the difference. In 
> fact, the root cause is the space after "query=1". Could we fix it by, for 
> example, encoding the URL string before new URI() and decoding the result?






[jira] [Created] (SPARK-23813) [SparkSQL] the result is different between hive and spark when use PARSE_URL()

2018-03-28 Thread Cui Xixin (JIRA)
Cui Xixin created SPARK-23813:
-

 Summary: [SparkSQL] the result is different between hive and spark 
when use PARSE_URL() 
 Key: SPARK-23813
 URL: https://issues.apache.org/jira/browse/SPARK-23813
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Cui Xixin









[jira] [Created] (SPARK-23812) DFS should be removed from unsupportedHiveNativeCommands in SqlBase.g4

2018-03-28 Thread wangtao93 (JIRA)
wangtao93 created SPARK-23812:
-

 Summary: DFS should be removed from unsupportedHiveNativeCommands 
in SqlBase.g4
 Key: SPARK-23812
 URL: https://issues.apache.org/jira/browse/SPARK-23812
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: wangtao93


The dfs command is already supported, but SqlBase.g4 still lists it in 
unsupportedHiveNativeCommands.






[jira] [Created] (SPARK-23811) Same tasks' FetchFailed event comes before Success will cause child stage never succeed

2018-03-28 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-23811:
---

 Summary: Same tasks' FetchFailed event comes before Success will 
cause child stage never succeed
 Key: SPARK-23811
 URL: https://issues.apache.org/jira/browse/SPARK-23811
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.0
Reporter: Li Yuanjian


This is a bug caused by the abnormal scenario described below:
 # ShuffleMapTask 1.0 is running; this task will fetch data from ExecutorA.
 # ExecutorA is lost, which triggers 
`mapOutputTracker.removeOutputsOnExecutor(execId)`; the shuffleStatus changes.
 # Speculative ShuffleMapTask 1.1 starts and gets a FetchFailed immediately.
 # ShuffleMapTask 1 is the last task of its stage, so this stage will never 
succeed, because there is no missing task the DAGScheduler can find.






[jira] [Commented] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-28 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418298#comment-16418298
 ] 

Saisai Shao commented on SPARK-23807:
-

Hi [~ste...@apache.org], this is a duplicate of SPARK-23534. I think we can 
convert it into a subtask of SPARK-23534; what do you think?

> Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and 
> binding
> ---
>
> Key: SPARK-23807
> URL: https://issues.apache.org/jira/browse/SPARK-23807
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3, and particular Hadoop 3.1 adds:
>  * Java 8 as the minimum (and currently sole) supported Java version
>  * A new "hadoop-cloud-storage" module intended to be a minimal dependency 
> POM for all the cloud connectors in the version of hadoop built against
>  * The ability to declare a committer for any FileOutputFormat which 
> supercedes the classic FileOutputCommitter -in both a job and for a specific 
> FS URI
>  * A shaded client JAR, though not yet one complete enough for spark.
>  * Lots of other features and fixes.
> The basic work of building spark with hadoop 3 is one of just doing the build 
> with {{-Dhadoop.version=3.x.y}}; however that
>  * Doesn't build on SBT (dependency resolution of zookeeper JAR)
>  * Misses the new cloud features
> The ZK dependency can be fixed everywhere by explicitly declaring the ZK 
> artifact, instead of relying on Curator to pull it in; this needs a profile 
> to declare the right ZK version, obviously.
> To use the cloud features, the hadoop-3 profile should declare that the 
> spark-hadoop-cloud module depends on, and only on, the 
> hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud 
> storage, plus a source package which is only built and tested when building 
> against Hadoop 3.1+
>  
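A minimal sketch of what such a Maven profile could look like. The profile id, version numbers, and module wiring below are illustrative assumptions, not the actual change:

```xml
<!-- Hypothetical hadoop-3 profile: pin the ZooKeeper artifact explicitly
     instead of relying on Curator's transitive version, and pull in the
     hadoop-cloud-storage dependency POM for the cloud connectors. -->
<profile>
  <id>hadoop-3</id>
  <properties>
    <hadoop.version>3.1.0</hadoop.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>${zookeeper.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-cloud-storage</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</profile>
```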






[jira] [Assigned] (SPARK-23675) Title add spark logo, use spark logo image

2018-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23675:
-

Assignee: guoxiaolongzte

> Title add spark logo, use spark logo image
> --
>
> Key: SPARK-23675
> URL: https://issues.apache.org/jira/browse/SPARK-23675
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: flink.png, kafka.png, nifi.png, spark_fix_after.png, 
> spark_fix_before.png, storm.png, storm.png, yarn.png, yarn.png
>
>
> Add the Spark logo image to the page title. Other big data system UIs show 
> their logos, so I think Spark should add it too.
> spark fix before: !spark_fix_before.png!
>  
> spark fix after: !spark_fix_after.png!
>  
> reference kafka ui: !kafka.png!
>  
> reference storm ui: !storm.png!
>  
> reference yarn ui: !yarn.png!
>  
> reference nifi ui: !nifi.png!
>  
> reference flink ui: !flink.png!
>  
>  






[jira] [Resolved] (SPARK-23675) Title add spark logo, use spark logo image

2018-03-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23675.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20818
[https://github.com/apache/spark/pull/20818]

> Title add spark logo, use spark logo image
> --
>
> Key: SPARK-23675
> URL: https://issues.apache.org/jira/browse/SPARK-23675
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: flink.png, kafka.png, nifi.png, spark_fix_after.png, 
> spark_fix_before.png, storm.png, storm.png, yarn.png, yarn.png
>
>
> Add the Spark logo image to the page title. Other big data system UIs show 
> their logos, so I think Spark should add it too.
> spark fix before: !spark_fix_before.png!
>  
> spark fix after: !spark_fix_after.png!
>  
> reference kafka ui: !kafka.png!
>  
> reference storm ui: !storm.png!
>  
> reference yarn ui: !yarn.png!
>  
> reference nifi ui: !nifi.png!
>  
> reference flink ui: !flink.png!
>  
>  






[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql

2018-03-28 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418282#comment-16418282
 ] 

Maryann Xue commented on SPARK-20169:
-

This was related to the use of "withColumnRenamed", which adds a "Project" 
on top of the original relation. Later, when getting the 
"outputPartitioning" of the "Project", it returned HashPartitioning("dst"), 
while it should have returned HashPartitioning("src") since the column had been 
renamed. As a result, an "Exchange" node was missing between the two outermost 
HashAggregate nodes, and that is why the aggregate result was incorrect.
{code:java}
test("SPARK-20169") {
  val e = Seq((1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)).toDF("src", "dst")
  val r = Seq((1), (2), (3), (4)).toDF("src")
  val r1 = e.join(r, "src" :: 
Nil).groupBy("dst").count().withColumnRenamed("dst", "src")
  val jr = e.join(r1, "src" :: Nil)
  val r2 = jr.groupBy("dst").count
  r2.explain()
  r2.show
}
{code}

The physical plan resulted from this bug turned out to be
{code}
== Physical Plan ==
*(2) HashAggregate(keys=[dst#181], functions=[count(1)])
+- *(2) HashAggregate(keys=[dst#181], functions=[partial_count(1)])
   +- *(2) Project [dst#181]
  +- *(2) BroadcastHashJoin [src#180], [src#197], Inner, BuildLeft
 :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint)))
 :  +- LocalTableScan [src#180, dst#181]
 +- *(2) HashAggregate(keys=[dst#181], functions=[])
+- Exchange hashpartitioning(dst#181, 5)
   +- *(1) HashAggregate(keys=[dst#181], functions=[])
  +- *(1) Project [dst#181]
 +- *(1) BroadcastHashJoin [src#180], [src#187], Inner, 
BuildRight
:- LocalTableScan [src#180, dst#181]
+- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- LocalTableScan [src#187]
{code}

The correct physical plan should be
{code}
== Physical Plan ==
*(3) HashAggregate(keys=[dst#181], functions=[count(1)])
+- Exchange hashpartitioning(dst#181, 5)
   +- *(2) HashAggregate(keys=[dst#181], functions=[partial_count(1)])
      +- *(2) Project [dst#181]
         +- *(2) BroadcastHashJoin [src#180], [src#197], Inner, BuildLeft
            :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
            :  +- LocalTableScan [src#180, dst#181]
            +- *(2) HashAggregate(keys=[dst#181], functions=[])
               +- Exchange hashpartitioning(dst#181, 5)
                  +- *(1) HashAggregate(keys=[dst#181], functions=[])
                     +- *(1) Project [dst#181]
                        +- *(1) BroadcastHashJoin [src#180], [src#187], Inner, BuildRight
                           :- LocalTableScan [src#180, dst#181]
                           +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
                              +- LocalTableScan [src#187]
{code}

This is the same issue as SPARK-23368, and the PR for SPARK-23368 fixes it.
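A toy model (plain Python, not Catalyst code) of the point above: a Project that renames a column must report its child's hash partitioning under the new attribute name, because reporting the stale name can let a downstream aggregate on that name skip a required Exchange.

```python
def project_output_partitioning(child_partitioning, aliases):
    """child_partitioning: column names the child is hash-partitioned on.
    aliases: mapping of child column -> renamed output column."""
    # A rename moves no rows, so the hash partitioning survives, but it must
    # be reported under the NEW name. Reporting the old name (the bug) makes
    # the planner skip an Exchange that is actually required.
    return tuple(aliases.get(c, c) for c in child_partitioning)

# r1 = ...withColumnRenamed("dst", "src"): the child is partitioned on "dst",
# so the Project's output partitioning should be on "src".
print(project_output_partitioning(("dst",), {"dst": "src"}))  # -> ('src',)
```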

> Groupby Bug with Sparksql
> -
>
> Key: SPARK-20169
> URL: https://issues.apache.org/jira/browse/SPARK-20169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bin Wu
>Priority: Major
>
> We found a potential bug in the Catalyst optimizer which cannot correctly 
> process "groupBy". You can reproduce it with the following simple example:
> =
> from pyspark.sql.functions import *
> #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
> e = spark.read.csv("graph.csv", header=True)
> r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
> r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
> jr = e.join(r1, 'src')
> jr.show()
> r2 = jr.groupBy('dst').count()
> r2.show()
> =
> FYI, "graph.csv" contains exactly the same data as the commented line.
> You can find that jr is:
> |src|dst|count|
> |  3|  1|1|
> |  1|  4|3|
> |  1|  3|3|
> |  1|  2|3|
> |  4|  1|1|
> |  2|  1|1|
> But, after the last groupBy, the 3 rows with dst = 1 are not grouped together:
> |dst|count|
> |  1|1|
> |  4|1|
> |  3|1|
> |  2|1|
> |  1|1|
> |  1|1|
> If we build jr directly from the raw data (commented line), this error will 
> not show up. So we suspect that there is a bug in the Catalyst optimizer 
> when multiple joins and groupBy's are being optimized.
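For reference, a plain-Python check (independent of Spark) of what the final groupBy should produce from the jr table shown above:

```python
from collections import Counter

# (src, dst) rows of jr as shown in the report
jr = [(3, 1), (1, 4), (1, 3), (1, 2), (4, 1), (2, 1)]

# groupBy('dst').count(): dst=1 occurs three times and must land in ONE row,
# not the three separate dst=1 rows the buggy plan produces.
expected = dict(Counter(dst for _, dst in jr))
print(expected)
```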



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23772:


Assignee: Apache Spark

> Provide an option to ignore column of all null values or empty map/array 
> during JSON schema inference
> -
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Major
>
> It is common to convert data from a JSON source to a structured format 
> periodically. In the initial batch of JSON data, if a field's values are 
> always null, Spark infers this field as StringType. However, in the second 
> batch a non-null value appears in this field and its type turns out not to 
> be StringType. The schema merge then fails because of the inconsistency.
> This also applies to empty arrays and empty objects. My proposal is to 
> provide an option in the Spark JSON source to omit those fields until we 
> see a non-null value.
> This is similar to SPARK-12436 but the proposed solution is different.
> cc: [~rxin] [~smilegator]
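A toy sketch (plain Python, not Spark's actual inference code) of the failure mode described above: an all-null field defaults to a string type in the first batch, then conflicts with the type inferred from the second batch.

```python
import json

def infer_type(values):
    """Toy type inference: an all-null field defaults to 'string' (the problem)."""
    types = {type(v).__name__ for v in values if v is not None}
    if not types:
        return "string"   # all nulls -> StringType by default
    return types.pop() if len(types) == 1 else "conflict"

batch1 = [json.loads(s) for s in ['{"a": null}', '{"a": null}']]
batch2 = [json.loads(s) for s in ['{"a": 1}']]

t1 = infer_type([r["a"] for r in batch1])  # 'string' (all nulls)
t2 = infer_type([r["a"] for r in batch2])  # 'int'
# Merging 'string' with 'int' fails; the proposed option would omit field
# "a" from batch1's schema until a non-null value is observed.
print(t1, t2)
```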






[jira] [Assigned] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23772:


Assignee: (was: Apache Spark)

> Provide an option to ignore column of all null values or empty map/array 
> during JSON schema inference
> -
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> It is common to convert data from a JSON source to a structured format 
> periodically. In the initial batch of JSON data, if a field's values are 
> always null, Spark infers this field as StringType. However, in the second 
> batch a non-null value appears in this field and its type turns out not to 
> be StringType. The schema merge then fails because of the inconsistency.
> This also applies to empty arrays and empty objects. My proposal is to 
> provide an option in the Spark JSON source to omit those fields until we 
> see a non-null value.
> This is similar to SPARK-12436 but the proposed solution is different.
> cc: [~rxin] [~smilegator]






[jira] [Commented] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418280#comment-16418280
 ] 

Apache Spark commented on SPARK-23772:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20929

> Provide an option to ignore column of all null values or empty map/array 
> during JSON schema inference
> -
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> It is common to convert data from a JSON source to a structured format 
> periodically. In the initial batch of JSON data, if a field's values are 
> always null, Spark infers this field as StringType. However, in the second 
> batch a non-null value appears in this field and its type turns out not to 
> be StringType. The schema merge then fails because of the inconsistency.
> This also applies to empty arrays and empty objects. My proposal is to 
> provide an option in the Spark JSON source to omit those fields until we 
> see a non-null value.
> This is similar to SPARK-12436 but the proposed solution is different.
> cc: [~rxin] [~smilegator]






[jira] [Issue Comment Deleted] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better

2018-03-28 Thread dciborow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dciborow updated SPARK-23810:
-
Comment: was deleted

(was: test comment to see if i get an email...)

> Matrix Multiplication is so bad, file I/O to local python is better
> ---
>
> Key: SPARK-23810
> URL: https://issues.apache.org/jira/browse/SPARK-23810
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: dciborow
>Priority: Minor
>
> I am trying to multiply two matrices. One is 130k by 30k. The second is 30k 
> by 30k.
> Running this leads to heartbeat timeouts, Java heap space and garbage 
> collection errors.
> {{rdd.toBlockMatrix.multiply(rightRdd.toBlockMatrix).toIndexedRowMatrix()}}
> I have also tried the following, which fails on the toLocalMatrix call.
> val userMatrix = new CoordinateMatrix(userRDD).toIndexedRowMatrix()
>  val itemMatrix = new 
> CoordinateMatrix(itemRDD).toBlockMatrix().toLocalMatrix()
> val itemMatrixBC = session.sparkContext.broadcast(itemMatrix)
>  val userToItemMatrix = userMatrix
>  .multiply(itemMatrixBC.value)
>  .rows.map(index => (index.index.toInt, index.vector))
>  
> I have instead gotten this operation "working" by saving the input 
> DataFrames to Parquet (they start as DataFrames before the .rdd call needed 
> to get them to work with the matrix types), then loading them into 
> python/pandas, using numpy for the matrix multiplication, saving back to 
> Parquet, and rereading back into Spark.
>  
> Python -
> import pandas as pd
> import numpy as np
> X = pd.read_parquet('./items-parquet', engine='pyarrow')
> #Xp = np.stack(X.jaccardList)
> Xp = pd.DataFrame(np.stack(X.jaccardList), X.itemID)
> Xrows = pd.DataFrame(index=range(0, X.itemID.max()+1))
> Xpp = Xrows.join(Xp).fillna(0)
> Y = pd.read_parquet('./users-parquet',engine='pyarrow')
> Yp = np.stack(Y.flatList)
> Z = np.matmul(Yp, Xpp)
> Zp = pd.DataFrame(Z)
> Zp.columns = list(map(str, Zp.columns))
> Zpp = pd.DataFrame()
> Zpp['id'] = Zp.index
> Zpp['ratings'] = Zp.values.tolist()
> Zpp.to_parquet("sampleout.parquet",engine='pyarrow')
>  
> Scala -
> import sys.process._
>  val result = "python matmul.py".!
>  val pythonOutput = 
> userDataFrame.sparkSession.read.parquet("./sampleout.parquet")
>  
> I can provide code and the data to reproduce, but could use some 
> instructions on how to set that up. This is based on the MovieLens 20mil 
> dataset, or I can provide access to my data in Azure.
>  
>  






[jira] [Updated] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better

2018-03-28 Thread dciborow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dciborow updated SPARK-23810:
-
Priority: Minor  (was: Major)

> Matrix Multiplication is so bad, file I/O to local python is better
> ---
>
> Key: SPARK-23810
> URL: https://issues.apache.org/jira/browse/SPARK-23810
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: dciborow
>Priority: Minor
>
> I am trying to multiply two matrices. One is 130k by 30k. The second is 30k 
> by 30k.
> Running this leads to heartbeat timeouts, Java heap space and garbage 
> collection errors.
> {{rdd.toBlockMatrix.multiply(rightRdd.toBlockMatrix).toIndexedRowMatrix()}}
> I have also tried the following, which fails on the toLocalMatrix call.
> val userMatrix = new CoordinateMatrix(userRDD).toIndexedRowMatrix()
>  val itemMatrix = new 
> CoordinateMatrix(itemRDD).toBlockMatrix().toLocalMatrix()
> val itemMatrixBC = session.sparkContext.broadcast(itemMatrix)
>  val userToItemMatrix = userMatrix
>  .multiply(itemMatrixBC.value)
>  .rows.map(index => (index.index.toInt, index.vector))
>  
> I have instead gotten this operation "working" by saving the input 
> DataFrames to Parquet (they start as DataFrames before the .rdd call needed 
> to get them to work with the matrix types), then loading them into 
> python/pandas, using numpy for the matrix multiplication, saving back to 
> Parquet, and rereading back into Spark.
>  
> Python -
> import pandas as pd
> import numpy as np
> X = pd.read_parquet('./items-parquet', engine='pyarrow')
> #Xp = np.stack(X.jaccardList)
> Xp = pd.DataFrame(np.stack(X.jaccardList), X.itemID)
> Xrows = pd.DataFrame(index=range(0, X.itemID.max()+1))
> Xpp = Xrows.join(Xp).fillna(0)
> Y = pd.read_parquet('./users-parquet',engine='pyarrow')
> Yp = np.stack(Y.flatList)
> Z = np.matmul(Yp, Xpp)
> Zp = pd.DataFrame(Z)
> Zp.columns = list(map(str, Zp.columns))
> Zpp = pd.DataFrame()
> Zpp['id'] = Zp.index
> Zpp['ratings'] = Zp.values.tolist()
> Zpp.to_parquet("sampleout.parquet",engine='pyarrow')
>  
> Scala -
> import sys.process._
>  val result = "python matmul.py".!
>  val pythonOutput = 
> userDataFrame.sparkSession.read.parquet("./sampleout.parquet")
>  
> I can provide code and the data to reproduce, but could use some 
> instructions on how to set that up. This is based on the MovieLens 20mil 
> dataset, or I can provide access to my data in Azure.
>  
>  






[jira] [Commented] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better

2018-03-28 Thread dciborow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418244#comment-16418244
 ] 

dciborow commented on SPARK-23810:
--

test comment to see if i get an email...

> Matrix Multiplication is so bad, file I/O to local python is better
> ---
>
> Key: SPARK-23810
> URL: https://issues.apache.org/jira/browse/SPARK-23810
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: dciborow
>Priority: Major
>
> I am trying to multiply two matrices. One is 130k by 30k. The second is 30k 
> by 30k.
> Running this leads to heartbeat timeouts, Java heap space and garbage 
> collection errors.
> {{rdd.toBlockMatrix.multiply(rightRdd.toBlockMatrix).toIndexedRowMatrix()}}
> I have also tried the following, which fails on the toLocalMatrix call.
> val userMatrix = new CoordinateMatrix(userRDD).toIndexedRowMatrix()
>  val itemMatrix = new 
> CoordinateMatrix(itemRDD).toBlockMatrix().toLocalMatrix()
> val itemMatrixBC = session.sparkContext.broadcast(itemMatrix)
>  val userToItemMatrix = userMatrix
>  .multiply(itemMatrixBC.value)
>  .rows.map(index => (index.index.toInt, index.vector))
>  
> I have instead gotten this operation "working" by saving the input 
> DataFrames to Parquet (they start as DataFrames before the .rdd call needed 
> to get them to work with the matrix types), then loading them into 
> python/pandas, using numpy for the matrix multiplication, saving back to 
> Parquet, and rereading back into Spark.
>  
> Python -
> import pandas as pd
> import numpy as np
> X = pd.read_parquet('./items-parquet', engine='pyarrow')
> #Xp = np.stack(X.jaccardList)
> Xp = pd.DataFrame(np.stack(X.jaccardList), X.itemID)
> Xrows = pd.DataFrame(index=range(0, X.itemID.max()+1))
> Xpp = Xrows.join(Xp).fillna(0)
> Y = pd.read_parquet('./users-parquet',engine='pyarrow')
> Yp = np.stack(Y.flatList)
> Z = np.matmul(Yp, Xpp)
> Zp = pd.DataFrame(Z)
> Zp.columns = list(map(str, Zp.columns))
> Zpp = pd.DataFrame()
> Zpp['id'] = Zp.index
> Zpp['ratings'] = Zp.values.tolist()
> Zpp.to_parquet("sampleout.parquet",engine='pyarrow')
>  
> Scala -
> import sys.process._
>  val result = "python matmul.py".!
>  val pythonOutput = 
> userDataFrame.sparkSession.read.parquet("./sampleout.parquet")
>  
> I can provide code and the data to reproduce, but could use some 
> instructions on how to set that up. This is based on the MovieLens 20mil 
> dataset, or I can provide access to my data in Azure.
>  
>  






[jira] [Created] (SPARK-23810) Matrix Multiplication is so bad, file I/O to local python is better

2018-03-28 Thread dciborow (JIRA)
dciborow created SPARK-23810:


 Summary: Matrix Multiplication is so bad, file I/O to local python 
is better
 Key: SPARK-23810
 URL: https://issues.apache.org/jira/browse/SPARK-23810
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.2.0
Reporter: dciborow


I am trying to multiply two matrices. One is 130k by 30k. The second is 30k by 
30k.

Running this leads to heartbeat timeouts, Java heap space and garbage 
collection errors.

{{rdd.toBlockMatrix.multiply(rightRdd.toBlockMatrix).toIndexedRowMatrix()}}

I have also tried the following, which fails on the toLocalMatrix call.

val userMatrix = new CoordinateMatrix(userRDD).toIndexedRowMatrix()
 val itemMatrix = new CoordinateMatrix(itemRDD).toBlockMatrix().toLocalMatrix()

val itemMatrixBC = session.sparkContext.broadcast(itemMatrix)
 val userToItemMatrix = userMatrix
 .multiply(itemMatrixBC.value)
 .rows.map(index => (index.index.toInt, index.vector))

 

I have instead gotten this operation "working" by saving the input DataFrames 
to Parquet (they start as DataFrames before the .rdd call needed to get them 
to work with the matrix types), then loading them into python/pandas, using 
numpy for the matrix multiplication, saving back to Parquet, and rereading 
back into Spark.

 

Python -

import pandas as pd
import numpy as np

X = pd.read_parquet('./items-parquet', engine='pyarrow')
#Xp = np.stack(X.jaccardList)
Xp = pd.DataFrame(np.stack(X.jaccardList), X.itemID)
Xrows = pd.DataFrame(index=range(0, X.itemID.max()+1))
Xpp = Xrows.join(Xp).fillna(0)

Y = pd.read_parquet('./users-parquet',engine='pyarrow')
Yp = np.stack(Y.flatList)

Z = np.matmul(Yp, Xpp)
Zp = pd.DataFrame(Z)
Zp.columns = list(map(str, Zp.columns))

Zpp = pd.DataFrame()
Zpp['id'] = Zp.index
Zpp['ratings'] = Zp.values.tolist()

Zpp.to_parquet("sampleout.parquet",engine='pyarrow')

 

Scala -

import sys.process._
 val result = "python matmul.py".!
 val pythonOutput = 
userDataFrame.sparkSession.read.parquet("./sampleout.parquet")

 

I can provide code and the data to reproduce, but could use some instructions 
on how to set that up. This is based on the MovieLens 20mil dataset, or I can 
provide access to my data in Azure.
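A miniature, self-contained version of the pandas/numpy fallback above (synthetic shapes instead of the 130k x 30k data, no Parquet I/O; all names here are stand-ins for the ones in the script):

```python
import numpy as np
import pandas as pd

# Stand-ins: 4 items with an identity item-item matrix, 6 user rows.
item_rows = pd.DataFrame(np.eye(4), index=[0, 1, 2, 3])  # like Xp
all_items = pd.DataFrame(index=range(4))                 # like Xrows
X = all_items.join(item_rows).fillna(0)                  # like Xpp: reindex + fill gaps

Y = np.arange(24, dtype=float).reshape(6, 4)             # like Yp (user rows)
Z = np.matmul(Y, X.to_numpy())                           # user-to-item scores
print(Z.shape)  # (6, 4)
```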

 

 






[jira] [Commented] (SPARK-23809) Active SparkSession should be set by getOrCreate

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418211#comment-16418211
 ] 

Apache Spark commented on SPARK-23809:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/20927

> Active SparkSession should be set by getOrCreate
> 
>
> Key: SPARK-23809
> URL: https://issues.apache.org/jira/browse/SPARK-23809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Eric Liang
>Priority: Minor
>
> Currently, the active SparkSession is set inconsistently (e.g., in 
> createDataFrame, prior to query execution). Many places in Spark also 
> incorrectly query the active session when they should be calling 
> activeSession.getOrElse(defaultSession).
> The semantics here can be cleaned up if we also set the active session when 
> the default session is set.






[jira] [Assigned] (SPARK-23809) Active SparkSession should be set by getOrCreate

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23809:


Assignee: Apache Spark

> Active SparkSession should be set by getOrCreate
> 
>
> Key: SPARK-23809
> URL: https://issues.apache.org/jira/browse/SPARK-23809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the active SparkSession is set inconsistently (e.g., in 
> createDataFrame, prior to query execution). Many places in Spark also 
> incorrectly query the active session when they should be calling 
> activeSession.getOrElse(defaultSession).
> The semantics here can be cleaned up if we also set the active session when 
> the default session is set.






[jira] [Assigned] (SPARK-23809) Active SparkSession should be set by getOrCreate

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23809:


Assignee: (was: Apache Spark)

> Active SparkSession should be set by getOrCreate
> 
>
> Key: SPARK-23809
> URL: https://issues.apache.org/jira/browse/SPARK-23809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Eric Liang
>Priority: Minor
>
> Currently, the active SparkSession is set inconsistently (e.g., in 
> createDataFrame, prior to query execution). Many places in Spark also 
> incorrectly query the active session when they should be calling 
> activeSession.getOrElse(defaultSession).
> The semantics here can be cleaned up if we also set the active session when 
> the default session is set.






[jira] [Created] (SPARK-23809) Active SparkSession should be set by getOrCreate

2018-03-28 Thread Eric Liang (JIRA)
Eric Liang created SPARK-23809:
--

 Summary: Active SparkSession should be set by getOrCreate
 Key: SPARK-23809
 URL: https://issues.apache.org/jira/browse/SPARK-23809
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Eric Liang


Currently, the active SparkSession is set inconsistently (e.g., in 
createDataFrame, prior to query execution). Many places in Spark also 
incorrectly query the active session when they should be calling 
activeSession.getOrElse(defaultSession).

The semantics here can be cleaned up if we also set the active session when the 
default session is set.
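A toy model (hypothetical, not Spark's actual implementation) of the proposed semantics, where getOrCreate sets the active session whenever it establishes the default:

```python
class SparkSessionModel:
    """Stand-in for SparkSession, tracking default and active slots."""
    _default = None
    _active = None

    @classmethod
    def get_or_create(cls):
        if cls._active is not None:
            return cls._active
        if cls._default is None:
            cls._default = cls()
        cls._active = cls._default  # proposed fix: also set the active session
        return cls._default

s = SparkSessionModel.get_or_create()
print(s is SparkSessionModel._active and s is SparkSessionModel._default)
```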






[jira] [Commented] (SPARK-23808) Test spark sessions should set default session

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418142#comment-16418142
 ] 

Apache Spark commented on SPARK-23808:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20926

> Test spark sessions should set default session
> --
>
> Key: SPARK-23808
> URL: https://issues.apache.org/jira/browse/SPARK-23808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> SparkSession.getOrCreate() ensures that the session it returns is set as a 
> default. Test code (TestSparkSession and TestHiveSparkSession) shortcuts 
> around this method, and thus a default is never set. We need to set it.






[jira] [Assigned] (SPARK-23808) Test spark sessions should set default session

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23808:


Assignee: Apache Spark

> Test spark sessions should set default session
> --
>
> Key: SPARK-23808
> URL: https://issues.apache.org/jira/browse/SPARK-23808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> SparkSession.getOrCreate() ensures that the session it returns is set as a 
> default. Test code (TestSparkSession and TestHiveSparkSession) shortcuts 
> around this method, and thus a default is never set. We need to set it.






[jira] [Assigned] (SPARK-23808) Test spark sessions should set default session

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23808:


Assignee: (was: Apache Spark)

> Test spark sessions should set default session
> --
>
> Key: SPARK-23808
> URL: https://issues.apache.org/jira/browse/SPARK-23808
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> SparkSession.getOrCreate() ensures that the session it returns is set as a 
> default. Test code (TestSparkSession and TestHiveSparkSession) shortcuts 
> around this method, and thus a default is never set. We need to set it.






[jira] [Created] (SPARK-23808) Test spark sessions should set default session

2018-03-28 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23808:
---

 Summary: Test spark sessions should set default session
 Key: SPARK-23808
 URL: https://issues.apache.org/jira/browse/SPARK-23808
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Jose Torres


SparkSession.getOrCreate() ensures that the session it returns is set as a 
default. Test code (TestSparkSession and TestHiveSparkSession) shortcuts around 
this method, and thus a default is never set. We need to set it.






[jira] [Commented] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418087#comment-16418087
 ] 

Apache Spark commented on SPARK-22941:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20925

> Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
> ---
>
> Key: SPARK-22941
> URL: https://issues.apache.org/jira/browse/SPARK-22941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> {{SparkSubmit}} can now be called by the {{SparkLauncher}} library (see 
> SPARK-11035). But if the caller provides incorrect or inconsistent parameters 
> to the app, {{SparkSubmit}} will print errors to the output and call 
> {{System.exit}}, which is not very user-friendly in this code path.
> We should modify {{SparkSubmit}} to be more friendly when called this way, 
> while still maintaining the old behavior when called from the command line.
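A sketch of the general pattern being proposed (hypothetical names, not Spark's actual API): one flag selects between the exit-on-error CLI behavior and raising an exception a library caller can catch.

```python
import sys

class SparkSubmitError(Exception):
    """Raised instead of exiting when a library caller drives submission."""

def submit(args, exit_on_error=True):
    # exit_on_error=True keeps the old command-line behavior;
    # a SparkLauncher-style caller would pass False and handle the exception.
    if not args:
        if exit_on_error:
            print("Error: no arguments given", file=sys.stderr)
            sys.exit(1)
        raise SparkSubmitError("no arguments given")
    return "submitted"

try:
    submit([], exit_on_error=False)
except SparkSubmitError as e:
    print("caught:", e)
```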






[jira] [Assigned] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22941:


Assignee: Apache Spark

> Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
> ---
>
> Key: SPARK-22941
> URL: https://issues.apache.org/jira/browse/SPARK-22941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Major
>
> {{SparkSubmit}} can now be called by the {{SparkLauncher}} library (see 
> SPARK-11035). But if the caller provides incorrect or inconsistent parameters 
> to the app, {{SparkSubmit}} will print errors to the output and call 
> {{System.exit}}, which is not very user-friendly in this code path.
> We should modify {{SparkSubmit}} to be more friendly when called this way, 
> while still maintaining the old behavior when called from the command line.






[jira] [Assigned] (SPARK-22941) Allow SparkSubmit to throw exceptions instead of exiting / printing errors.

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22941:


Assignee: (was: Apache Spark)

> Allow SparkSubmit to throw exceptions instead of exiting / printing errors.
> ---
>
> Key: SPARK-22941
> URL: https://issues.apache.org/jira/browse/SPARK-22941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> {{SparkSubmit}} can now be called by the {{SparkLauncher}} library (see 
> SPARK-11035). But if the caller provides incorrect or inconsistent parameters 
> to the app, {{SparkSubmit}} will print errors to the output and call 
> {{System.exit}}, which is not very user-friendly in this code path.
> We should modify {{SparkSubmit}} to be more friendly when called this way, 
> while still maintaining the old behavior when called from the command line.






[jira] [Assigned] (SPARK-23806) Broadcast. unpersist can cause fatal exception when used with dynamic allocation

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23806:


Assignee: Apache Spark

> Broadcast. unpersist can cause fatal exception when used with dynamic 
> allocation
> 
>
> Key: SPARK-23806
> URL: https://issues.apache.org/jira/browse/SPARK-23806
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Major
>
> Very similar to https://issues.apache.org/jira/browse/SPARK-22618 . But this 
> could also apply to Broadcast.unpersist.
>  
> 2018-03-24 05:29:17,836 [Spark Context Cleaner] ERROR 
> org.apache.spark.ContextCleaner - Error cleaning broadcast 85710 
> org.apache.spark.SparkException: Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) at 
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at 
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:152)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306)
>  at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>  at 
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60)
>  at 
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238) 
> at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
>  at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
>  at scala.Option.foreach(Option.scala:257) at 
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
>  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1286) at 
> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
>  at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) 
> Caused by: java.io.IOException: Failed to send RPC 7228115282075984867 to 
> /10.10.10.10:53804: java.nio.channels.ClosedChannelException at 
> org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
>  at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
>  at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
>  at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
>  at 
> io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122) 
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852)
>  at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738)
>  at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35)
>  at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062)
>  at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116)
>  at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>  at java.lang.Thread.run(Thread.java:745) Caused by: 
> java.nio.channels.ClosedChannelException
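The failure mode above — the ContextCleaner thread dying because removeBroadcast hit a closed channel on an executor that dynamic allocation had already torn down — suggests treating broadcast cleanup as best-effort. The sketch below illustrates only the shape of that log-and-continue pattern; `remove_broadcast` and the other names are hypothetical stand-ins, not Spark's actual API.

```python
# Sketch of best-effort cleanup: an RPC failure against one dead executor
# should not kill the whole cleaner thread. Names are hypothetical stand-ins.

class ClosedChannelError(IOError):
    """Raised when the executor's RPC channel is already closed."""

def remove_broadcast(executor_alive: bool) -> None:
    # Stand-in for the remove-broadcast RPC: it fails if the executor
    # was removed by dynamic allocation before the cleanup message arrived.
    if not executor_alive:
        raise ClosedChannelError("Failed to send RPC: closed channel")

def clean_broadcast(executors: list) -> int:
    """Attempt cleanup on every executor; count successes, swallow RPC errors."""
    cleaned = 0
    for alive in executors:
        try:
            remove_broadcast(alive)
            cleaned += 1
        except ClosedChannelError:
            # Log and continue: the block disappeared with the executor anyway.
            pass
    return cleaned
```

Whether Spark itself should swallow such failures is exactly what this issue debates; the sketch only shows the pattern, not the project's chosen fix.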



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23806:


Assignee: (was: Apache Spark)

> Broadcast.unpersist can cause fatal exception when used with dynamic 
> allocation
> 
>
> Key: SPARK-23806
> URL: https://issues.apache.org/jira/browse/SPARK-23806
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> Very similar to https://issues.apache.org/jira/browse/SPARK-22618, but this 
> could also apply to Broadcast.unpersist.
>  






[jira] [Commented] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417881#comment-16417881
 ] 

Apache Spark commented on SPARK-23806:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/20924

> Broadcast.unpersist can cause fatal exception when used with dynamic 
> allocation
> 
>
> Key: SPARK-23806
> URL: https://issues.apache.org/jira/browse/SPARK-23806
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Priority: Major
>
> Very similar to https://issues.apache.org/jira/browse/SPARK-22618, but this 
> could also apply to Broadcast.unpersist.
>  






[jira] [Updated] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation

2018-03-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-23806:
--
Description: 
Very similar to https://issues.apache.org/jira/browse/SPARK-22618, but this 
could also apply to Broadcast.unpersist.

 


  was:
Very similar to https://issues.apache.org/jira/browse/SPARK-2261 . But this 
could also apply to Broadcast.unpersist.

 


[jira] [Commented] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417864#comment-16417864
 ] 

Apache Spark commented on SPARK-23807:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/20923

> Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and 
> binding
> ---
>
> Key: SPARK-23807
> URL: https://issues.apache.org/jira/browse/SPARK-23807
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Major
>
> Hadoop 3, and in particular Hadoop 3.1, adds:
>  * Java 8 as the minimum (and currently sole) supported Java version
>  * A new "hadoop-cloud-storage" module intended to be a minimal dependency 
> POM for all the cloud connectors in the version of Hadoop built against
>  * The ability to declare a committer for any FileOutputFormat which 
> supersedes the classic FileOutputCommitter, both for a job and for a 
> specific FS URI
>  * A shaded client JAR, though not yet one complete enough for Spark
>  * Lots of other features and fixes.
> The basic work of building Spark with Hadoop 3 is just a matter of doing the 
> build with {{-Dhadoop.version=3.x.y}}; however, that:
>  * Doesn't build on SBT (dependency resolution of the ZooKeeper JAR)
>  * Misses the new cloud features
> The ZK dependency can be fixed everywhere by explicitly declaring the ZK 
> artifact instead of relying on Curator to pull it in; this needs a profile 
> to declare the right ZK version.
> To use the cloud features in Spark, the hadoop-3 profile should declare that 
> the spark-hadoop-cloud module depends on (and only on) the 
> hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud 
> storage, plus a source package which is only built and tested when building 
> against Hadoop 3.1+.
>  






[jira] [Assigned] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23807:


Assignee: Apache Spark

> Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and 
> binding
> ---
>
> Key: SPARK-23807
> URL: https://issues.apache.org/jira/browse/SPARK-23807
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Major
>






[jira] [Assigned] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23807:


Assignee: (was: Apache Spark)

> Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and 
> binding
> ---
>
> Key: SPARK-23807
> URL: https://issues.apache.org/jira/browse/SPARK-23807
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Major
>






[jira] [Commented] (SPARK-23096) Migrate rate source to v2

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417848#comment-16417848
 ] 

Apache Spark commented on SPARK-23096:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20922

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-23096) Migrate rate source to v2

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417787#comment-16417787
 ] 

Apache Spark commented on SPARK-23096:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20921

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417751#comment-16417751
 ] 

Marcelo Vanzin commented on SPARK-23782:


All the information you're trying to protect can be found through other means: 
looking at the event log dir in HDFS, looking at the cluster manager's UI, or 
just running "ps" on the cluster machines.

Different users seeing different listings can be confusing. "Hey I was trying 
to find your jobs for blah on the SHS and I don't see them."

Again, knowing the existence of users and the fact that they run jobs is not a 
security problem. You cannot see what those jobs do.

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all the applications to all the users, even though 
> they have no permission to read them. They cannot read the details of the 
> applications they cannot access, but still anybody can list all the 
> applications submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> When clicking on an application, the proper permissions are applied, so each 
> user can open only the applications they have read permission for.
> Instead, each user should see in the listing only the applications they have 
> permission to read, and should not be able to see applications they lack 
> permission for.
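The behavior proposed above amounts to applying the same read-permission check already enforced on an application's detail page to the listing itself. The following is an illustrative model only, with hypothetical names rather than the History Server's actual classes or ACL logic:

```python
# Hypothetical model of SHS listing filtering: an application is visible to a
# user only if the user owns it, is in its view ACLs, or is an admin.
from dataclasses import dataclass, field

@dataclass
class AppInfo:
    app_id: str
    name: str
    owner: str
    view_acls: set = field(default_factory=set)

def can_read(user: str, app: AppInfo, admins: set) -> bool:
    # Same predicate the detail page would use.
    return user == app.owner or user in app.view_acls or user in admins

def visible_listing(user: str, apps: list, admins: set) -> list:
    """Listing as proposed: only the apps the user could also open."""
    return [a for a in apps if can_read(user, a, admins)]
```

Under this model, the table from the description would show `u1` only `app-123456790`, while `admin` would still see all three rows.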






[jira] [Created] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups, cloud-storage artifacts and binding

2018-03-28 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-23807:
--

 Summary: Add Hadoop 3 profile with relevant POM fix ups, 
cloud-storage artifacts and binding
 Key: SPARK-23807
 URL: https://issues.apache.org/jira/browse/SPARK-23807
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Steve Loughran


Hadoop 3, and in particular Hadoop 3.1, adds:
 * Java 8 as the minimum (and currently sole) supported Java version
 * A new "hadoop-cloud-storage" module intended to be a minimal dependency POM 
for all the cloud connectors in the version of Hadoop built against
 * The ability to declare a committer for any FileOutputFormat which supersedes 
the classic FileOutputCommitter, both for a job and for a specific FS URI
 * A shaded client JAR, though not yet one complete enough for Spark
 * Lots of other features and fixes.

The basic work of building Spark with Hadoop 3 is just a matter of doing the 
build with {{-Dhadoop.version=3.x.y}}; however, that:
 * Doesn't build on SBT (dependency resolution of the ZooKeeper JAR)
 * Misses the new cloud features

The ZK dependency can be fixed everywhere by explicitly declaring the ZK 
artifact instead of relying on Curator to pull it in; this needs a profile to 
declare the right ZK version.

To use the cloud features in Spark, the hadoop-3 profile should declare that 
the spark-hadoop-cloud module depends on (and only on) the 
hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud 
storage, plus a source package which is only built and tested when building 
against Hadoop 3.1+.

 






[jira] [Created] (SPARK-23806) Broadcast.unpersist can cause fatal exception when used with dynamic allocation

2018-03-28 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-23806:
-

 Summary: Broadcast.unpersist can cause fatal exception when used 
with dynamic allocation
 Key: SPARK-23806
 URL: https://issues.apache.org/jira/browse/SPARK-23806
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Thomas Graves


Very similar to https://issues.apache.org/jira/browse/SPARK-2261, but this 
could also apply to Broadcast.unpersist.

 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-28 Thread Nathan Kleyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417519#comment-16417519
 ] 

Nathan Kleyn edited comment on SPARK-23801 at 3/28/18 3:14 PM:
---

[~kiszk] Sure, we'll have our best go at trying to narrow down the job to 
something that fails - many thanks for the speedy reply!

Since upgrading to Spark v2.3.0 a lot of the stages are listed as 
"ThreadPoolExecutor" now, meaning it's really difficult to actually see what 
they correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?


was (Author: nathankleyn):
[~kiszk] We'll have our best go at trying to narrow down the job to something 
that fails. Since upgrading to Spark v2.3.0 a lot of the stages are listed as 
"ThreadPoolExecutor" now, meaning it's really difficult to actually see what 
they correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
> Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump, but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 

[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-28 Thread Nathan Kleyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417519#comment-16417519
 ] 

Nathan Kleyn commented on SPARK-23801:
--

[~kiszk] We'll have our best go at trying to narrow down the job to something 
that fails. Since upgrading to Spark v2.3.0 a lot of the stages are listed as 
"ThreadPoolExecutor" now, meaning it's really difficult to actually see what 
they correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
> Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump, but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
> 0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
> 0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
> 0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
> 0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
> 0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
> 0x7f1464f2c360:   7f1464f2c9d0 
> 0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
> 

[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 3:00 PM:
--

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
, which appears not to clean up _UnsafeRow_ coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

Before worker dies:

!screen_shot_2018-03-20_at_15.23.29.png!

Heap dump of worker running for some time:

!Screen Shot 2018-03-28 at 16.44.20.png!
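The retention pattern described above can be illustrated with a minimal, self-contained sketch (hypothetical class and method names, plain Java collections, not Spark's actual implementation): a cache keyed by batch version that only ever gains entries grows without bound unless old versions are evicted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pattern described in the comment above:
// a cache keyed by batch version that only ever adds entries. Unless
// old versions are evicted, retained state grows with every batch.
public class StateCacheSketch {
    // loosely analogous to: private lazy val loadedMaps = new mutable.HashMap[Long, MapType]
    private final Map<Long, Map<String, String>> loadedMaps = new HashMap<>();

    public void commitBatch(long version, Map<String, String> state) {
        // every committed version is kept; nothing removes older ones
        loadedMaps.put(version, state);
    }

    public int retainedVersions() {
        return loadedMaps.size();
    }

    public static void main(String[] args) {
        StateCacheSketch cache = new StateCacheSketch();
        for (long v = 0; v < 100; v++) {
            Map<String, String> state = new HashMap<>();
            state.put("key", "value-" + v);
            cache.commitBatch(v, state);
        }
        // grows linearly with the number of batches
        System.out.println(cache.retainedVersions()); // 100
    }
}
```

Whether the real provider fails to evict, or evicts on a schedule that cannot keep up, is exactly what the heap dumps above are probing.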


was (Author: akorzhuev):
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
, which appears not to clean up _UnsafeRow_ coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, 
> screen_shot_2018-03-20_at_15.23.29.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> 

[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Korzhuev updated SPARK-23682:

Attachment: screen_shot_2018-03-20_at_15.23.29.png

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png, 
> screen_shot_2018-03-20_at_15.23.29.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {code:java}
> session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();
> {code}
> Analyzing the heap dump I found that most of the memory used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}}
>  that is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196]
>  
> On the first glance it looks normal since that is how Spark keeps aggregation 
> keys in memory. However I did my testing by renaming files in source folder, 
> so that they could be picked up by spark again. Since input records are the 
> same all further rows should be rejected as duplicates and memory consumption 
> shouldn't increase but it's not true. Moreover, GC time took more than 30% of 
> total processing time.
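The reporter's reasoning, that re-processing identical records should leave deduplication state unchanged, can be sketched with a hypothetical simplified model (plain Java collections, not Spark's stateful dropDuplicates implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the expected dropDuplicates() behaviour: once a
// (testId, testName) key has been seen, replaying the same records should
// not grow the deduplication state. The report says Spark's memory use
// grows anyway, which is what suggests a leak rather than new state.
public class DedupExpectation {
    private final Set<String> seenKeys = new HashSet<>();

    // returns true if the record is new, false if it is a duplicate
    public boolean accept(String testId, String testName) {
        return seenKeys.add(testId + "|" + testName);
    }

    public int stateSize() {
        return seenKeys.size();
    }

    public static void main(String[] args) {
        DedupExpectation dedup = new DedupExpectation();
        // first pass over the input: every record is new
        dedup.accept("1", "a");
        dedup.accept("2", "b");
        // second pass over the same (renamed) files: all duplicates
        boolean isNew = dedup.accept("1", "a");
        System.out.println(isNew);             // false
        System.out.println(dedup.stateSize()); // 2 -- state did not grow
    }
}
```

Under this model, memory should plateau after the first pass over the data; the observed unbounded growth is what makes the report a plausible leak.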





[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:56 PM:
--

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
, which appears not to clean up _UnsafeRow_ coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!


was (Author: akorzhuev):
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
, which appears not to clean up _UnsafeRow_s coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> 

[jira] [Comment Edited] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev edited comment on SPARK-23682 at 3/28/18 2:55 PM:
--

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:
{code:java}
private lazy val loadedMaps = new mutable.HashMap[Long, MapType]{code}
, which appears not to clean up _UnsafeRow_s coming from:
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!


was (Author: akorzhuev):
I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:

 

_private lazy val loadedMaps = new mutable.HashMap[Long, MapType]_

 

, which appears not to clean up _UnsafeRow_s coming from:

 
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage and finally executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> 

[jira] [Commented] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417474#comment-16417474
 ] 

Andrew Korzhuev commented on SPARK-23682:
-

I can confirm that _HDFSBackedStateStoreProvider_ is leaking memory. Tested on:
 * AWS S3 checkpoint
 * Spark 2.3.0 on k8s
 * Structured stream - stream join

I managed to track the leak down to

[https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L252]:

 

_private lazy val loadedMaps = new mutable.HashMap[Long, MapType]_

 

, which appears not to clean up _UnsafeRow_s coming from:

 
{code:java}
type MapType = java.util.concurrent.ConcurrentHashMap[UnsafeRow, 
UnsafeRow]{code}
I noticed that memory leaks slower if data is buffered to disk:
{code:java}
spark.hadoop.fs.s3a.fast.upload   true
spark.hadoop.fs.s3a.fast.upload.bufferdisk
{code}
It also seems that the state persisted to S3 is never cleaned up, as both the
number of objects and the total volume grow indefinitely.

!Screen Shot 2018-03-28 at 16.44.20.png!

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and eventually the executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> 

[jira] [Commented] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation

2018-03-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417466#comment-16417466
 ] 

Thomas Graves commented on SPARK-22618:
---

I'll file a separate Jira for it and put up a pr

> RDD.unpersist can cause fatal exception when used with dynamic allocation
> -
>
> Key: SPARK-22618
> URL: https://issues.apache.org/jira/browse/SPARK-22618
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Brad
>Assignee: Brad
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you use rdd.unpersist() with dynamic allocation, then an executor can be 
> deallocated while your rdd is being removed, which will throw an uncaught 
> exception killing your job. 
> I looked into different ways of preventing this error from occurring but 
> couldn't come up with anything that wouldn't require a big change. I propose 
> the best fix is just to catch and log IOExceptions in unpersist() so they 
> don't kill your job. This will match the effective behavior when executors 
> are lost from dynamic allocation in other parts of the code.
> In the worst-case scenario I think this could lead to RDD partitions getting 
> left on executors after they were unpersisted, but this is probably better 
> than the whole job failing. I think in most cases the IOException would be 
> due to the executor dying for some reason, which is effectively the same 
> result as unpersisting the rdd from that executor anyway.
> I noticed this exception in a job that loads a 100GB dataset on a cluster 
> where we use dynamic allocation heavily. Here is the relevant stack trace:
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.spark.SparkException: Exception thrown 
> in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
> at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806)
> at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62)
> at 
> com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$.com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially(SuiteKickoff.scala:78)
>  
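The fix proposed above (catch and log IOExceptions in unpersist() rather than letting them kill the job) can be sketched roughly as follows. This is an illustrative simplification, not Spark's actual API: the {{BlockManager}} interface and method names here are hypothetical stand-ins for {{BlockManagerMaster.removeRdd}}.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hedged sketch of the proposed fix: instead of letting an IOException from a
// dead executor propagate out of unpersist() and fail the job, catch it and
// log a warning, matching how executor loss is tolerated elsewhere under
// dynamic allocation.
public class SafeUnpersist {
    private static final Logger log = Logger.getLogger("SafeUnpersist");

    interface BlockManager {
        void removeRdd(int rddId) throws IOException; // may fail if the executor is gone
    }

    static void unpersist(BlockManager master, int rddId) {
        try {
            master.removeRdd(rddId);
        } catch (IOException e) {
            // The executor likely died; its blocks are gone anyway, so log and continue.
            log.warning("Failed to remove RDD " + rddId + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        BlockManager flaky = rddId -> { throw new IOException("Connection reset by peer"); };
        unpersist(flaky, 42); // logs a warning instead of failing the job
        System.out.println("job continues");
    }
}
```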

[jira] [Updated] (SPARK-23682) Memory issue with Spark structured streaming

2018-03-28 Thread Andrew Korzhuev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Korzhuev updated SPARK-23682:

Attachment: Screen Shot 2018-03-28 at 16.44.20.png

> Memory issue with Spark structured streaming
> 
>
> Key: SPARK-23682
> URL: https://issues.apache.org/jira/browse/SPARK-23682
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
> Environment: EMR 5.9.0 with Spark 2.2.0 and Hadoop 2.7.3
> |spark.blacklist.decommissioning.enabled|true|
> |spark.blacklist.decommissioning.timeout|1h|
> |spark.cleaner.periodicGC.interval|10min|
> |spark.default.parallelism|18|
> |spark.dynamicAllocation.enabled|false|
> |spark.eventLog.enabled|true|
> |spark.executor.cores|3|
> |spark.executor.extraJavaOptions|-verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'|
> |spark.executor.id|driver|
> |spark.executor.instances|3|
> |spark.executor.memory|22G|
> |spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version|2|
> |spark.hadoop.parquet.enable.summary-metadata|false|
> |spark.hadoop.yarn.timeline-service.enabled|false|
> |spark.jars| |
> |spark.master|yarn|
> |spark.memory.fraction|0.9|
> |spark.memory.storageFraction|0.3|
> |spark.memory.useLegacyMode|false|
> |spark.rdd.compress|true|
> |spark.resourceManager.cleanupExpiredHost|true|
> |spark.scheduler.mode|FIFO|
> |spark.serializer|org.apache.spark.serializer.KryoSerializer|
> |spark.shuffle.service.enabled|true|
> |spark.speculation|false|
> |spark.sql.parquet.filterPushdown|true|
> |spark.sql.parquet.mergeSchema|false|
> |spark.sql.warehouse.dir|hdfs:///user/spark/warehouse|
> |spark.stage.attempt.ignoreOnDecommissionFetchFailure|true|
> |spark.submit.deployMode|client|
> |spark.yarn.am.cores|1|
> |spark.yarn.am.memory|2G|
> |spark.yarn.am.memoryOverhead|1G|
> |spark.yarn.executor.memoryOverhead|3G|
>Reporter: Yuriy Bondaruk
>Priority: Major
>  Labels: Memory, memory, memory-leak
> Attachments: Screen Shot 2018-03-07 at 21.52.17.png, Screen Shot 
> 2018-03-10 at 18.53.49.png, Screen Shot 2018-03-28 at 16.44.20.png, Screen 
> Shot 2018-03-28 at 16.44.20.png, Screen Shot 2018-03-28 at 16.44.20.png, 
> Spark executors GC time.png, image-2018-03-22-14-46-31-960.png
>
>
> It seems like there is an issue with memory in structured streaming. A stream 
> with aggregation (dropDuplicates()) and data partitioning constantly 
> increases memory usage, and eventually the executors fail with exit code 137:
> {quote}ExecutorLostFailure (executor 2 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> container_1520214726510_0001_01_03 on host: 
> ip-10-0-1-153.us-west-2.compute.internal. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal{quote}
> Stream creation looks something like this:
> {quote}session
> .readStream()
> .schema(inputSchema)
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv("s3://test-bucket/input")
> .as(Encoders.bean(TestRecord.class))
> .flatMap(mf, Encoders.bean(TestRecord.class))
> .dropDuplicates("testId", "testName")
> .withColumn("year", 
> functions.date_format(dataset.col("testTimestamp").cast(DataTypes.DateType), 
> ""))
> .writeStream()
> .option("path", "s3://test-bucket/output")
> .option("checkpointLocation", "s3://test-bucket/checkpoint")
> .trigger(Trigger.ProcessingTime(60, TimeUnit.SECONDS))
> .partitionBy("year")
> .format("parquet")
> .outputMode(OutputMode.Append())
> .queryName("test-stream")
> .start();{quote}
> Analyzing the heap dump, I found that most of the memory is used by 
> {{org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider}},
>  which is referenced from 
> [StateStore|https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala#L196].
> At first glance this looks normal, since that is how Spark keeps aggregation 
> keys in memory. However, I did my testing by renaming files in the source 
> folder so that they would be picked up by Spark again. Since the input records 
> are the same, all further rows should be rejected as duplicates and memory 
> consumption shouldn't increase, but that is not the case. Moreover, GC time 
> took more than 30% of the total processing time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-


[jira] [Commented] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417429#comment-16417429
 ] 

Apache Spark commented on SPARK-23040:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/20920

> BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator 
> or ordering is specified
> 
>
> Key: SPARK-23040
> URL: https://issues.apache.org/jira/browse/SPARK-23040
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 2.4.0
>
>
> For example, if ordering is specified, the returned iterator is an 
> CompletionIterator
> {code:java}
> dep.keyOrdering match {
>   case Some(keyOrd: Ordering[K]) =>
> // Create an ExternalSorter to sort the data.
> val sorter =
>   new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), 
> serializer = dep.serializer)
> sorter.insertAll(aggregatedIter)
> context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
> context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
> 
> context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
> CompletionIterator[Product2[K, C], Iterator[Product2[K, 
> C]]](sorter.iterator, sorter.stop())
>   case None =>
> aggregatedIter
> }
> {code}
> However, the sorter consumes the aggregatedIter (which may be interruptible) 
> in sorter.insertAll, and then creates an iterator which isn't interruptible.
> The problem with this is that the Spark task cannot be cancelled on stage 
> failure (without interruptThread enabled, which is disabled by default), 
> wasting executor resources.
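The idea behind making the returned iterator interruptible can be sketched as follows. This is an illustrative simplification, not Spark's actual classes: the {{TaskContext}} here is a hypothetical stand-in, and Spark would throw a {{TaskKilledException}} where this sketch throws a plain RuntimeException.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hedged sketch: wrap the sorter's output in an iterator that checks the
// task's kill flag on each hasNext(), so a cancelled task stops consuming
// the sorted output instead of running to completion.
public class InterruptibleIteratorSketch {
    static class TaskContext {
        volatile boolean interrupted = false;
    }

    static <T> Iterator<T> interruptible(TaskContext ctx, Iterator<T> delegate) {
        return new Iterator<T>() {
            @Override public boolean hasNext() {
                if (ctx.interrupted) {
                    throw new RuntimeException("task killed"); // stand-in for TaskKilledException
                }
                return delegate.hasNext();
            }
            @Override public T next() { return delegate.next(); }
        };
    }

    public static void main(String[] args) {
        TaskContext ctx = new TaskContext();
        Iterator<Integer> it = interruptible(ctx, Arrays.asList(1, 2, 3).iterator());
        System.out.println(it.next()); // 1
        ctx.interrupted = true;        // simulate task cancellation
        try {
            it.hasNext();
        } catch (RuntimeException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```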



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x

2018-03-28 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417293#comment-16417293
 ] 

Darek commented on SPARK-19076:
---

It's done and has passed all the tests; it just needs to be merged in. I'm not 
sure who can merge it. I have been asking for a while now, but no one is 
willing to step in and merge it. If you know anyone who can merge it, that 
would be a great help. Thanks

> Upgrade Hive dependence to Hive 2.x
> ---
>
> Key: SPARK-19076
> URL: https://issues.apache.org/jira/browse/SPARK-19076
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dapeng Sun
>Priority: Major
>
> Currently upstream Spark depends on Hive 1.2.1 to build its package. Hive 2.0 
> was released in February 2016, and Hive 2.0.1 and 2.1.0 have also been 
> available for a long time; on the Spark side, it would be better to support 
> Hive 2.0 and above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x

2018-03-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417291#comment-16417291
 ] 

Maciej Bryński commented on SPARK-19076:


So it's not resolved but in progress

> Upgrade Hive dependence to Hive 2.x
> ---
>
> Key: SPARK-19076
> URL: https://issues.apache.org/jira/browse/SPARK-19076
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dapeng Sun
>Priority: Major
>
> Currently upstream Spark depends on Hive 1.2.1 to build its package. Hive 2.0 
> was released in February 2016, and Hive 2.0.1 and 2.1.0 have also been 
> available for a long time; on the Spark side, it would be better to support 
> Hive 2.0 and above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x

2018-03-28 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417284#comment-16417284
 ] 

Darek commented on SPARK-19076:
---

[PR 20659|https://github.com/apache/spark/pull/20659] for this issue already 
exists, just needs to be merged into master.

> Upgrade Hive dependence to Hive 2.x
> ---
>
> Key: SPARK-19076
> URL: https://issues.apache.org/jira/browse/SPARK-19076
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dapeng Sun
>Priority: Major
>
> Currently upstream Spark depends on Hive 1.2.1 to build its package. Hive 2.0 
> was released in February 2016, and Hive 2.0.1 and 2.1.0 have also been 
> available for a long time; on the Spark side, it would be better to support 
> Hive 2.0 and above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23765) Supports line separator for json datasource

2018-03-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23765.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20877
[https://github.com/apache/spark/pull/20877]

> Supports line separator for json datasource
> ---
>
> Key: SPARK-23765
> URL: https://issues.apache.org/jira/browse/SPARK-23765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Same as its parent but this JIRA is specific to JSON datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23765) Supports line separator for json datasource

2018-03-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23765:
---

Assignee: Hyukjin Kwon

> Supports line separator for json datasource
> ---
>
> Key: SPARK-23765
> URL: https://issues.apache.org/jira/browse/SPARK-23765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Same as its parent but this JIRA is specific to JSON datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23565) Improved error message for when the number of sources for a query changes

2018-03-28 Thread Patrick McGloin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417206#comment-16417206
 ] 

Patrick McGloin commented on SPARK-23565:
-

I would like to work on this.

> Improved error message for when the number of sources for a query changes
> -
>
> Key: SPARK-23565
> URL: https://issues.apache.org/jira/browse/SPARK-23565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Patrick McGloin
>Priority: Minor
>
> If you change the number of sources for a Structured Streaming query, you 
> will get an assertion error because the number of sources in the checkpoint 
> does not match the number of sources in the query that is starting. This can 
> happen if, for example, you add a union to the input of the query. The 
> behavior is of course correct, but the error is a bit cryptic and requires 
> investigation.
> Suggestion for a more informative error message =>
> The number of sources for this query has changed. There are [x] sources in 
> the checkpoint offsets and now there are [y] sources requested by the query. 
> Cannot continue.
> This is the current message.
> 02-03-2018 13:14:22 ERROR StreamExecution:91 - Query ORPositionsState to 
> Kafka [id = 35f71e63-dbd0-49e9-98b2-a4c72a7da80e, runId = 
> d4439aca-549c-4ef6-872e-29fbfde1df78] terminated with error 
> java.lang.AssertionError: assertion failed at 
> scala.Predef$.assert(Predef.scala:156) at 
> org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:38)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$populateStartOffsets(StreamExecution.scala:429)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:297)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
>  at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
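The suggested check could replace the bare assertion in {{OffsetSeq.toStreamProgress}} with a comparison that reports both counts. A minimal sketch (the class and method names here are illustrative, not Spark's):

```java
// Hedged sketch of the suggested improvement: instead of a bare
// assert(checkpointSources == querySources), throw an exception whose
// message states both numbers, as proposed in the ticket.
public class SourceCountCheck {
    static void checkSourceCount(int checkpointSources, int querySources) {
        if (checkpointSources != querySources) {
            throw new IllegalStateException(
                "The number of sources for this query has changed. There are "
                + checkpointSources + " sources in the checkpoint offsets and now there are "
                + querySources + " sources requested by the query. Cannot continue.");
        }
    }

    public static void main(String[] args) {
        try {
            checkSourceCount(1, 2); // e.g. a union was added to the query input
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```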



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23782) SHS should not show applications to user without read permission

2018-03-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417149#comment-16417149
 ] 

Marco Gaido commented on SPARK-23782:
-

[~vanzin] sorry, but I cannot see any usability issue with having different 
users list different things. This is normal in every system where users are 
present: each user sees only what they can access. Otherwise it is useless to 
have users at all, if they can all see and do the same things.

I commented on the PR since some people commented there too, with some use 
cases I consider critical for this.
Thanks.
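The behavior being argued for, each user listing only the applications they can read, amounts to an ACL filter over the application summaries before the page is rendered. A minimal sketch in plain Scala (the `AppSummary` case class and `canRead` predicate are hypothetical stand-ins, not the actual SHS API):

```scala
case class AppSummary(id: String, name: String, owner: String)

// Hypothetical ACL check: admins can read everything, other users
// can read only the applications they submitted themselves.
def canRead(viewer: String, admins: Set[String], app: AppSummary): Boolean =
  admins.contains(viewer) || app.owner == viewer

val apps = Seq(
  AppSummary("app-123456789", "The Admin App", "admin"),
  AppSummary("app-123456790", "u1 secret app", "u1"),
  AppSummary("app-123456791", "u2 secret app", "u2"))

// u1 lists only their own application; admin lists all three.
val visibleToU1 = apps.filter(canRead("u1", Set("admin"), _))
assert(visibleToU1.map(_.id) == Seq("app-123456790"))
assert(apps.count(canRead("admin", Set("admin"), _)) == 3)
```

In the real History Server the ACL information would come from the application's UI ACL configuration rather than a simple owner comparison.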

> SHS should not show applications to user without read permission
> 
>
> Key: SPARK-23782
> URL: https://issues.apache.org/jira/browse/SPARK-23782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> The History Server shows all applications to all users, even when they have 
> no permission to read them. Users cannot open the details of applications 
> they cannot access, but anybody can still list all the applications 
> submitted by all users.
> For instance, if we have an admin user {{admin}} and two normal users {{u1}} 
> and {{u2}}, and each of them submitted one application, all of them can see 
> in the main page of SHS:
> ||App ID||App Name|| ... ||Spark User|| ... ||
> |app-123456789|The Admin App| .. |admin| ... |
> |app-123456790|u1 secret app| .. |u1| ... |
> |app-123456791|u2 secret app| .. |u2| ... |
> When clicking on an application, the proper permissions are applied, and 
> each user can see only the applications they have read permission for. 
> Instead, each user should be able to list only the applications they have 
> permission to read; applications they lack permission for should not be 
> visible at all.






[jira] [Assigned] (SPARK-23805) support vector-size validation and Inference

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23805:


Assignee: Apache Spark

> support vector-size validation and Inference
> 
>
> Key: SPARK-23805
> URL: https://issues.apache.org/jira/browse/SPARK-23805
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> I think it may be meaningful to unify the usage of \{{AttributeGroup}} and 
> support vector-size validation and inference across algorithms.
> My thoughts are:
>  * In \{{transformSchema}}, validate the input vector-size if possible. If 
> the input vector-size can be obtained from the schema, check it.
>  ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will 
> require the input vector-size to be at least 4.
>  ** Suppose a \{{PCAModel}} trained with vectors of length 10: 
> \{{transformSchema}} will require the vector-size to be 10.
>  * In \{{transformSchema}}, infer the output vector-size if possible.
>  ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will 
> return a schema with output vector-size=4.
>  ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will 
> return a schema with output vector-size=4.
>  * In \{{transform}}, infer the output vector-size if possible.
>  * In \{{fit}}, obtain the input vector-size from the schema if possible. This 
> can help eliminate redundant \{{first}} jobs.
>  
> The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate the 
> idea. Since the validation and inference are quite algorithm-specific, we may 
> need to separate the task into several small subtasks.
> What do you think? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
>  
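The proposed \{{transformSchema}} check can be sketched in plain Scala. The helper below is hypothetical; in Spark the optional size would be read from the vector column's \{{AttributeGroup}} metadata, and may be absent:

```scala
// Hypothetical sketch of the proposed check: validate only when the
// input vector size is actually recorded in the schema; otherwise skip.
def validateVectorSize(sizeFromSchema: Option[Int], minSize: Int): Unit =
  sizeFromSchema.foreach { n =>
    require(n >= minSize,
      s"Input vector size $n is smaller than the required minimum $minSize")
  }

validateVectorSize(Some(10), 4) // size known and large enough: passes
validateVectorSize(None, 4)     // size not recorded in the schema: skipped
```

The same Option-based pattern covers the \{{fit}} case: when the size is present in the metadata, a \{{first}} job over the data to discover the dimension is unnecessary.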






[jira] [Assigned] (SPARK-23805) support vector-size validation and Inference

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23805:


Assignee: (was: Apache Spark)

> support vector-size validation and Inference
> 
>
> Key: SPARK-23805
> URL: https://issues.apache.org/jira/browse/SPARK-23805
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> I think it may be meaningful to unify the usage of \{{AttributeGroup}} and 
> support vector-size validation and inference across algorithms.
> My thoughts are:
>  * In \{{transformSchema}}, validate the input vector-size if possible. If 
> the input vector-size can be obtained from the schema, check it.
>  ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will 
> require the input vector-size to be at least 4.
>  ** Suppose a \{{PCAModel}} trained with vectors of length 10: 
> \{{transformSchema}} will require the vector-size to be 10.
>  * In \{{transformSchema}}, infer the output vector-size if possible.
>  ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will 
> return a schema with output vector-size=4.
>  ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will 
> return a schema with output vector-size=4.
>  * In \{{transform}}, infer the output vector-size if possible.
>  * In \{{fit}}, obtain the input vector-size from the schema if possible. This 
> can help eliminate redundant \{{first}} jobs.
>  
> The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate the 
> idea. Since the validation and inference are quite algorithm-specific, we may 
> need to separate the task into several small subtasks.
> What do you think? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
>  






[jira] [Commented] (SPARK-23805) support vector-size validation and Inference

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417088#comment-16417088
 ] 

Apache Spark commented on SPARK-23805:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/20918

> support vector-size validation and Inference
> 
>
> Key: SPARK-23805
> URL: https://issues.apache.org/jira/browse/SPARK-23805
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> I think it may be meaningful to unify the usage of \{{AttributeGroup}} and 
> support vector-size validation and inference across algorithms.
> My thoughts are:
>  * In \{{transformSchema}}, validate the input vector-size if possible. If 
> the input vector-size can be obtained from the schema, check it.
>  ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will 
> require the input vector-size to be at least 4.
>  ** Suppose a \{{PCAModel}} trained with vectors of length 10: 
> \{{transformSchema}} will require the vector-size to be 10.
>  * In \{{transformSchema}}, infer the output vector-size if possible.
>  ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will 
> return a schema with output vector-size=4.
>  ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will 
> return a schema with output vector-size=4.
>  * In \{{transform}}, infer the output vector-size if possible.
>  * In \{{fit}}, obtain the input vector-size from the schema if possible. This 
> can help eliminate redundant \{{first}} jobs.
>  
> The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate the 
> idea. Since the validation and inference are quite algorithm-specific, we may 
> need to separate the task into several small subtasks.
> What do you think? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]
>  






[jira] [Created] (SPARK-23805) support vector-size validation and Inference

2018-03-28 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-23805:


 Summary: support vector-size validation and Inference
 Key: SPARK-23805
 URL: https://issues.apache.org/jira/browse/SPARK-23805
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: zhengruifeng


I think it may be meaningful to unify the usage of \{{AttributeGroup}} and 
support vector-size validation and inference across algorithms.

My thoughts are:
 * In \{{transformSchema}}, validate the input vector-size if possible. If the 
input vector-size can be obtained from the schema, check it.
 ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will require 
the input vector-size to be at least 4.
 ** Suppose a \{{PCAModel}} trained with vectors of length 10: 
\{{transformSchema}} will require the vector-size to be 10.
 * In \{{transformSchema}}, infer the output vector-size if possible.
 ** Suppose a \{{PCA}} estimator with k=4: \{{transformSchema}} will return 
a schema with output vector-size=4.
 ** Suppose a \{{PCAModel}} trained with k=4: \{{transformSchema}} will 
return a schema with output vector-size=4.
 * In \{{transform}}, infer the output vector-size if possible.
 * In \{{fit}}, obtain the input vector-size from the schema if possible. This can 
help eliminate redundant \{{first}} jobs.

The current PR only modifies \{{PCA}} and \{{MaxAbsScaler}} to illustrate the 
idea. Since the validation and inference are quite algorithm-specific, we may 
need to separate the task into several small subtasks.

What do you think? [~srowen] [~yanboliang] [~WeichenXu123] [~mlnick]

 






[jira] [Assigned] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23705:


Assignee: (was: Apache Spark)

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Khoa Tran
>Priority: Minor
>  Labels: beginner, easyfix, features, newbie, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> // code placeholder
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
> def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>   val colNames: Seq[String] = col1 +: cols
>   RelationalGroupedDataset(
> toDF(), colNames.map(colName => resolve(colName)), 
> RelationalGroupedDataset.GroupByType)
> }
> {code}
> We should append a `.distinct` to `colNames` when it is used in `groupBy`.
>  
> Not sure whether the community agrees with this, or whether it is up to the 
> users to perform the distinct operation themselves.
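The deduplication itself is just \{{Seq.distinct}}, which keeps the first occurrence of each element in order. A minimal sketch of the proposed change to the column-name sequence (column names here are made up for illustration):

```scala
// Mirrors the colNames construction in Dataset.groupBy, with the
// proposed .distinct appended to drop repeated column names.
val col1 = "a"
val cols = Seq("b", "a", "b")
val colNames: Seq[String] = (col1 +: cols).distinct
assert(colNames == Seq("a", "b"))
```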






[jira] [Assigned] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-03-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23705:


Assignee: Apache Spark

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Khoa Tran
>Assignee: Apache Spark
>Priority: Minor
>  Labels: beginner, easyfix, features, newbie, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> // code placeholder
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
> def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>   val colNames: Seq[String] = col1 +: cols
>   RelationalGroupedDataset(
> toDF(), colNames.map(colName => resolve(colName)), 
> RelationalGroupedDataset.GroupByType)
> }
> {code}
> We should append a `.distinct` to `colNames` when it is used in `groupBy`.
>  
> Not sure whether the community agrees with this, or whether it is up to the 
> users to perform the distinct operation themselves.






[jira] [Commented] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings

2018-03-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417064#comment-16417064
 ] 

Apache Spark commented on SPARK-23705:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/20917

> dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
> --
>
> Key: SPARK-23705
> URL: https://issues.apache.org/jira/browse/SPARK-23705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Khoa Tran
>Priority: Minor
>  Labels: beginner, easyfix, features, newbie, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> // code placeholder
> package org.apache.spark.sql
> .
> .
> .
> class Dataset[T] private[sql](
> .
> .
> .
> def groupBy(col1: String, cols: String*): RelationalGroupedDataset = {
>   val colNames: Seq[String] = col1 +: cols
>   RelationalGroupedDataset(
> toDF(), colNames.map(colName => resolve(colName)), 
> RelationalGroupedDataset.GroupByType)
> }
> {code}
> We should append a `.distinct` to `colNames` when it is used in `groupBy`.
>  
> Not sure whether the community agrees with this, or whether it is up to the 
> users to perform the distinct operation themselves.


