[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153194#comment-16153194
 ] 

Yanbo Liang commented on SPARK-21727:
-

[~neilalex] Please feel free to take this task. Thanks.

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a Stack Overflow question, but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.
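
For comparison, here is a minimal Scala sketch of the same shape of data (an integer column plus an array-of-double column), which collects fine through the Scala API. This is only an illustrative aid added here, not part of the report above, and it assumes a local SparkSession.

{code}
import org.apache.spark.sql.SparkSession

// Build the analogous DataFrame directly in Scala: one int column, one array<double> column.
val spark = SparkSession.builder().master("local[*]").appName("array-column-check").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Seq.fill(20)(0.0)),
  (2, Seq.fill(20)(0.0)),
  (3, Seq.fill(20)(0.0)),
  (4, Seq.fill(20)(0.0))
).toDF("indices", "data")

df.printSchema()   // data: array (element: double)
df.collect()       // succeeds here; the SparkR path in the report fails at the equivalent step
{code}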



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21913) `withDatabase` should drop database with CASCADE

2017-09-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21913.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> `withDatabase` should drop database with CASCADE
> 
>
> Key: SPARK-21913
> URL: https://issues.apache.org/jira/browse/SPARK-21913
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, it fails if the database is not empty. It would be great if we 
> dropped the database cleanly with CASCADE.
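
A minimal sketch of the idea (the real helper lives in Spark's SQL test utilities; the method shape here is assumed, and it only illustrates the intended CASCADE behaviour):

{code}
// Run a block of test code, then drop the databases it created even if they still contain tables.
def withDatabase(dbNames: String*)(f: => Unit)(implicit spark: org.apache.spark.sql.SparkSession): Unit = {
  try f finally {
    dbNames.foreach { name =>
      // CASCADE drops the database together with any tables left inside it.
      spark.sql(s"DROP DATABASE IF EXISTS $name CASCADE")
    }
  }
}
{code}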



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21913) `withDatabase` should drop database with CASCADE

2017-09-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21913:

Component/s: (was: SQL)

> `withDatabase` should drop database with CASCADE
> 
>
> Key: SPARK-21913
> URL: https://issues.apache.org/jira/browse/SPARK-21913
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, it fails if the database is not empty. It would be great if we 
> dropped the database cleanly with CASCADE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21921) Remove `spark.sql.parquet.cacheMetadata`

2017-09-05 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21921:
-

 Summary: Remove `spark.sql.parquet.cacheMetadata`
 Key: SPARK-21921
 URL: https://issues.apache.org/jira/browse/SPARK-21921
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 2.2.0
Reporter: Dongjoon Hyun
Priority: Trivial


Since `spark.sql.parquet.cacheMetadata` is no longer used, we should remove it 
from SQLConf and the documentation.
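
As a side note, a small Scala sketch of how a job owner could check whether their own configuration still sets the stale key before it disappears; the helper name is hypothetical and not part of Spark:

{code}
import org.apache.spark.sql.SparkSession

object StaleConfCheck {
  // Warn if the job still sets a key that Spark SQL no longer reads.
  def warnIfSet(spark: SparkSession): Unit = {
    val key = "spark.sql.parquet.cacheMetadata"
    if (spark.conf.getOption(key).isDefined) {
      println(s"$key is set but no longer read by Spark SQL; it can be dropped from the job config.")
    }
  }
}
{code}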



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21921) Remove `spark.sql.parquet.cacheMetadata`

2017-09-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-21921.
-
Resolution: Duplicate

Oops. I found the old one marked as 'Later'.

> Remove `spark.sql.parquet.cacheMetadata`
> 
>
> Key: SPARK-21921
> URL: https://issues.apache.org/jira/browse/SPARK-21921
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> Since `spark.sql.parquet.cacheMetadata` is no longer used, we should remove it 
> from SQLConf and the documentation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2017-09-05 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153226#comment-16153226
 ] 

Dongjoon Hyun edited comment on SPARK-13656 at 9/5/17 7:41 AM:
---

I think this is the time, `Later`. :)


was (Author: dongjoon):
I think this is the time, `Later`.

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153227#comment-16153227
 ] 

Apache Spark commented on SPARK-13656:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19129

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2017-09-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-13656:
---

I think this is the time, `Later`.

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13656:


Assignee: (was: Apache Spark)

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13656:


Assignee: Apache Spark

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21917) Remote http(s) resources is not supported in YARN mode

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21917:


Assignee: Apache Spark

> Remote http(s) resources is not supported in YARN mode
> --
>
> Key: SPARK-21917
> URL: https://issues.apache.org/jira/browse/SPARK-21917
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> In the current Spark, when submitting an application on YARN with remote 
> resources, e.g. {{./bin/spark-shell --jars 
> http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar
>  --master yarn-client -v}}, Spark fails with:
> {noformat}
> java.io.IOException: No FileSystem for scheme: http
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)
> {noformat}
> This is because {{YARN#client}} assumes resources must live on a 
> Hadoop-compatible FS, and likewise the NM 
> (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245)
>  only uses a Hadoop-compatible FS to download resources. As a result, Spark on 
> YARN fails to support remote http(s) resources.
> To solve this problem, there are several options:
> * Download remote http(s) resources to the local machine and add the downloaded 
> files to the dist cache. The downside of this option is that remote resources 
> are then uploaded again unnecessarily.
> * Filter out remote http(s) resources and add them via spark.jars or 
> spark.files, leveraging Spark's internal file server to distribute them. The 
> problem with this solution is that resources which must be available before the 
> application starts may not work.
> * Leverage Hadoop's http(s) file system support 
> (https://issues.apache.org/jira/browse/HADOOP-14383). This only works on Hadoop 
> 2.9+, and I think even implementing a similar file system in Spark would not 
> work.
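
A rough Scala sketch of the first option, as a hypothetical helper (not Spark's actual patch): download a remote http(s) resource to a local file so it can then be handed to the YARN distributed cache like any other local resource.

{code}
import java.io.File
import java.net.URI
import java.nio.file.{Files, StandardCopyOption}

// Fetch the remote resource and return the local copy; the caller can then distribute
// the local path exactly as it would a file:// resource.
def downloadToLocal(uri: URI, targetDir: File): File = {
  val target = new File(targetDir, new File(uri.getPath).getName)
  val in = uri.toURL.openStream()
  try {
    Files.copy(in, target.toPath, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }
  target
}
{code}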



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21917) Remote http(s) resources is not supported in YARN mode

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21917:


Assignee: (was: Apache Spark)

> Remote http(s) resources is not supported in YARN mode
> --
>
> Key: SPARK-21917
> URL: https://issues.apache.org/jira/browse/SPARK-21917
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark, when submitting an application on YARN with remote 
> resources, e.g. {{./bin/spark-shell --jars 
> http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar
>  --master yarn-client -v}}, Spark fails with:
> {noformat}
> java.io.IOException: No FileSystem for scheme: http
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)
> {noformat}
> This is because {{YARN#client}} assumes resources must live on a 
> Hadoop-compatible FS, and likewise the NM 
> (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245)
>  only uses a Hadoop-compatible FS to download resources. As a result, Spark on 
> YARN fails to support remote http(s) resources.
> To solve this problem, there are several options:
> * Download remote http(s) resources to the local machine and add the downloaded 
> files to the dist cache. The downside of this option is that remote resources 
> are then uploaded again unnecessarily.
> * Filter out remote http(s) resources and add them via spark.jars or 
> spark.files, leveraging Spark's internal file server to distribute them. The 
> problem with this solution is that resources which must be available before the 
> application starts may not work.
> * Leverage Hadoop's http(s) file system support 
> (https://issues.apache.org/jira/browse/HADOOP-14383). This only works on Hadoop 
> 2.9+, and I think even implementing a similar file system in Spark would not 
> work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21917) Remote http(s) resources is not supported in YARN mode

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153260#comment-16153260
 ] 

Apache Spark commented on SPARK-21917:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/19130

> Remote http(s) resources is not supported in YARN mode
> --
>
> Key: SPARK-21917
> URL: https://issues.apache.org/jira/browse/SPARK-21917
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark, when submitting an application on YARN with remote 
> resources, e.g. {{./bin/spark-shell --jars 
> http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar
>  --master yarn-client -v}}, Spark fails with:
> {noformat}
> java.io.IOException: No FileSystem for scheme: http
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)
> {noformat}
> This is because {{YARN#client}} assumes resources must live on a 
> Hadoop-compatible FS, and likewise the NM 
> (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245)
>  only uses a Hadoop-compatible FS to download resources. As a result, Spark on 
> YARN fails to support remote http(s) resources.
> To solve this problem, there are several options:
> * Download remote http(s) resources to the local machine and add the downloaded 
> files to the dist cache. The downside of this option is that remote resources 
> are then uploaded again unnecessarily.
> * Filter out remote http(s) resources and add them via spark.jars or 
> spark.files, leveraging Spark's internal file server to distribute them. The 
> problem with this solution is that resources which must be available before the 
> application starts may not work.
> * Leverage Hadoop's http(s) file system support 
> (https://issues.apache.org/jira/browse/HADOOP-14383). This only works on Hadoop 
> 2.9+, and I think even implementing a similar file system in Spark would not 
> work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)
zhoukang created SPARK-21922:


 Summary: When executor failed and task metrics have not send to 
driver,the status will always be 'RUNNING' and the duration will be 
'CurrentTime - launchTime'
 Key: SPARK-21922
 URL: https://issues.apache.org/jira/browse/SPARK-21922
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: zhoukang


As the title describes, and below is an example:





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Description: 
As the title describes, and below is an example:
!duration-notfixed.png|Before fixed!


  was:
As the title describes, and below is an example:




> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: duration-fixed.png, duration-notfixed.png
>
>
> As the title describes, and below is an example:
> !duration-notfixed.png|Before fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Attachment: duration-fixed.png
duration-notfixed.png

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: duration-fixed.png, duration-notfixed.png
>
>
> As the title describes, and below is an example:



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21917) Remote http(s) resources is not supported in YARN mode

2017-09-05 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153141#comment-16153141
 ] 

Saisai Shao edited comment on SPARK-21917 at 9/5/17 8:36 AM:
-

I'm inclined to choose option 1; the only overhead is re-uploading the resource, 
the fix is restricted to SparkSubmit, and other code keeps working 
transparently.

What's your opinion [~tgraves] [~vanzin]?


was (Author: jerryshao):
I'm inclined to choose option 1; the only overhead is re-uploading the resource, 
the fix is restricted to SparkSubmit, and all other code keeps working 
transparently.

What's your opinion [~tgraves] [~vanzin]?

> Remote http(s) resources is not supported in YARN mode
> --
>
> Key: SPARK-21917
> URL: https://issues.apache.org/jira/browse/SPARK-21917
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the current Spark, when submitting an application on YARN with remote 
> resources, e.g. {{./bin/spark-shell --jars 
> http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar
>  --master yarn-client -v}}, Spark fails with:
> {noformat}
> java.io.IOException: No FileSystem for scheme: http
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)
> {noformat}
> This is because {{YARN#client}} assumes resources must live on a 
> Hadoop-compatible FS, and likewise the NM 
> (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245)
>  only uses a Hadoop-compatible FS to download resources. As a result, Spark on 
> YARN fails to support remote http(s) resources.
> To solve this problem, there are several options:
> * Download remote http(s) resources to the local machine and add the downloaded 
> files to the dist cache. The downside of this option is that remote resources 
> are then uploaded again unnecessarily.
> * Filter out remote http(s) resources and add them via spark.jars or 
> spark.files, leveraging Spark's internal file server to distribute them. The 
> problem with this solution is that resources which must be available before the 
> application starts may not work.
> * Leverage Hadoop's http(s) file system support 
> (https://issues.apache.org/jira/browse/HADOOP-14383). This only works on Hadoop 
> 2.9+, and I think even implementing a similar file system in Spark would not 
> work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Description: 
As the title describes, and below is an example:
!duration-notfixed.png|Before fixed!
We can fix the duration by using the modification time of the event log:
!duration-fixed.png|After fixed!

  was:
As the title describes, and below is an example:
!duration-notfixed.png|Before fixed!



> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: duration-fixed.png, duration-notfixed.png
>
>
> As the title describes, and below is an example:
> !duration-notfixed.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !duration-fixed.png|After fixed!
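
A rough Scala sketch of the proposed fallback (hypothetical helper, not the actual patch): when a task never reported a finish time because its executor died, approximate the end of the run with the event log's modification time instead of the current time.

{code}
import java.io.File

// Prefer the reported finish time; otherwise fall back to the last write to the event log,
// which bounds the task's lifetime far better than System.currentTimeMillis().
def approximateDurationMs(launchTime: Long, finishTime: Long, eventLog: File): Long = {
  if (finishTime > 0) finishTime - launchTime
  else eventLog.lastModified() - launchTime
}
{code}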



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Attachment: (was: duration-fixed.png)

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> As the title describes, and below is an example:
> !duration-notfixed.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !duration-fixed.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Attachment: (was: duration-notfixed.png)

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> As the title describes, and below is an example:
> !duration-notfixed.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !duration-fixed.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153367#comment-16153367
 ] 

Sean Owen commented on SPARK-21919:
---

It does look like a problem. From R's survreg I get:

{code}
survreg(formula = Surv(data$label, data$censor) ~ data$feature1 + 
data$feature2, dist = "weibull")
                 Value Std. Error       z        p
(Intercept)    3.29140      0.295 11.1737 5.49e-29
data$feature1 -0.06581      0.245 -0.2688 7.88e-01
data$feature2  0.00327      0.123  0.0265 9.79e-01
Log(scale)    -2.20858      0.642 -3.4390 5.84e-04

Scale= 0.11 
{code}

[~yanboliang] I think you originally created this; does it ring any bells?

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> I took the example directly from the Spark ML documentation.
> {code}
> training = spark.createDataFrame([
> (1.218, 1.0, Vectors.dense(1.560, -0.605)),
> (2.949, 0.0, Vectors.dense(0.346, 2.158)),
> (3.627, 0.0, Vectors.dense(1.380, 0.231)),
> (0.273, 1.0, Vectors.dense(0.520, 1.151)),
> (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
> quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result is:
> Coefficients: [-0.496304411053,0.198452172529]
> Intercept: 2.6380898963056327
> Scale: 1.5472363533632303
> ||label||censor||features||prediction||quantiles||
> |1.218|1.0|[1.56,-0.605]|5.718985621018951|[1.160322990805951,4.99546058340675]|
> |2.949|0.0|[0.346,2.158]|18.07678210850554|[3.66759199449632,15.789837303662042]|
> |3.627|0.0|[1.38,0.231]|7.381908879359964|[1.4977129086101573,6.4480027195054905]|
> |0.273|1.0|[0.52,1.151]|13.577717814884505|[2.754778414791513,11.859962351993202]|
> |4.199|0.0|[0.795,-0.226]|9.013087597344805|[1.828662187733188,7.8728164067854856]|
> But if we change the value of every label to label + 20, as in:
> {code}
> training = spark.createDataFrame([
> (21.218, 1.0, Vectors.dense(1.560, -0.605)),
> (22.949, 0.0, Vectors.dense(0.346, 2.158)),
> (23.627, 0.0, Vectors.dense(1.380, 0.231)),
> (20.273, 1.0, Vectors.dense(0.520, 1.151)),
> (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
>  quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result changes to:
> Coefficients: [23.9932020748,3.18105314757]
> Intercept: 7.35052273751137
> Scale: 7698609960.724161
> ||label ||censor||features  ||prediction   ||quantiles||
> |21.218|1.0   |[1.56,-0.605] |4.0912442688237169E18|[0.0,0.0]|
> |22.949|0.0   |[0.346,2.158] |6.011158613411288E9  |[0.0,0.0]|
> |23.627|0.0   |[1.38,0.231]  |7.7835948690311181E17|[0.0,0.0]|
> |20.273|1.0   |[0.52,1.151]  |1.5880852723124176E10|[0.0,0.0]|
> |24.199|0.0   |[0.795,-0.226]|1.4590190884193677E11|[0.0,0.0]|
> Can someone please explain this exponential blow-up in the predictions? As I 
> understand it, the AFT prediction is the time at which the failure event will 
> occur, so I don't understand why it changes exponentially with the value of the 
> label.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-21920) DataFrame Fail To Find The Column Name

2017-09-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21920.
---
Resolution: Invalid

> DataFrame Fail To Find The Column Name
> --
>
> Key: SPARK-21920
> URL: https://issues.apache.org/jira/browse/SPARK-21920
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: abhijit nag
>Priority: Minor
>
> I am getting an issue like "sql.AnalysisException: cannot resolve 
> column_name".
> I wrote a simple query as below:
> DataFrame df = df1
>   .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
>   .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
> Exception found: 
> resolved attribute(s) MERCH_ID#738 missing from 
> MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project 
> [MERCH_ID#738,MERCHANT#737];
> The problem was solved by the following code:
> DataFrame df = df1.alias("df1")
>   .join(df2.alias("df2"), 
> functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
>   .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
> A similar kind of issue appears rarely, but I want to know the root cause of 
> this problem. 
> Is it a bug in Spark 1.6 or something else?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21920) DataFrame Fail To Find The Column Name

2017-09-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21920:
--
   Flags:   (was: Important)
Priority: Minor  (was: Critical)

Questions should go to the mailing list. It looks like you are selecting a column 
associated with the original df1/df2 rather than with the joined DF, which is not the same.
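
For what it's worth, a minimal Scala sketch of the alias-based disambiguation (made-up column data; it only illustrates the pattern already shown in the description):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("alias-demo").getOrCreate()
import spark.implicits._

val df1 = Seq(("m1", "shopA")).toDF("MERCH_ID", "MERCHANT")
val df2 = Seq(("shopA", "NYC")).toDF("MERCHANT", "MER_LOC")

// Aliasing both sides makes every column reference unambiguous after the join.
df1.alias("a")
  .join(df2.alias("b"), col("a.MERCHANT") === col("b.MERCHANT"))
  .select(col("a.MERCH_ID"), col("b.MER_LOC"))
  .show()
{code}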

> DataFrame Fail To Find The Column Name
> --
>
> Key: SPARK-21920
> URL: https://issues.apache.org/jira/browse/SPARK-21920
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: abhijit nag
>Priority: Minor
>
> I am getting an issue like "sql.AnalysisException: cannot resolve 
> column_name".
> I wrote a simple query as below:
> DataFrame df = df1
>   .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
>   .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
> Exception found: 
> resolved attribute(s) MERCH_ID#738 missing from 
> MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project 
> [MERCH_ID#738,MERCHANT#737];
> The problem was solved by the following code:
> DataFrame df = df1.alias("df1")
>   .join(df2.alias("df2"), 
> functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
>   .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
> A similar kind of issue appears rarely, but I want to know the root cause of 
> this problem. 
> Is it a bug in Spark 1.6 or something else?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Attachment: fixed01.png
fixed02.png
notfixed01.png
notfixed02.png

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, and below is an example:
> !duration-notfixed.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !duration-fixed.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21922:
-
Description: 
As the title describes, and below is an example:
!notfixed01.png|Before fixed!
!notfixed02.png|Before fixed!
We can fix the duration by using the modification time of the event log:
!fixed01.png|After fixed!
!fixed02.png|After fixed!

  was:
As the title describes, and below is an example:
!duration-notfixed.png|Before fixed!
We can fix the duration by using the modification time of the event log:
!duration-fixed.png|After fixed!


> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, and below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21903) Upgrade scalastyle to 1.0.0

2017-09-05 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21903.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19116
[https://github.com/apache/spark/pull/19116]

> Upgrade scalastyle to 1.0.0
> ---
>
> Key: SPARK-21903
> URL: https://issues.apache.org/jira/browse/SPARK-21903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> 1.0.0 fixes issues with import order, explicit types for public methods, the 
> line-length limit, and comment validation:
> {code}
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16:
>  Are you sure you want to println? If yes, wrap the code block with
> [error]   // scalastyle:off println
> [error]   println(...)
> [error]   // scalastyle:on println
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49:
>  File line length exceeds 100 characters
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21:
>  Are you sure you want to println? If yes, wrap the code block with
> [error]   // scalastyle:off println
> [error]   println(...)
> [error]   // scalastyle:on println
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2:
>  Insert a space after the start of the comment
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43:
>  JavaDStream should come before JavaDStreamLike.
> {code}
> I am also interested in the {{org.scalastyle.scalariform.OverrideJavaChecker}} 
> feature, added in 0.9.0, which we previously had to work around in order to 
> check SPARK-16877.
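
As a small illustration of two of the checks above (generic example code, not from the Spark tree): intentional console output must be wrapped in scalastyle suppression comments, and public methods need explicit return types.

{code}
object StyleExamples {
  // "Public method must have explicit type": declare the return type instead of inferring it.
  def banner(name: String): String = s"=== $name ==="

  def printBanner(name: String): Unit = {
    // scalastyle:off println
    println(banner(name))   // intentional console output, so the println check is switched off
    // scalastyle:on println
  }
}
{code}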



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21903) Upgrade scalastyle to 1.0.0

2017-09-05 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21903:


Assignee: Hyukjin Kwon

> Upgrade scalastyle to 1.0.0
> ---
>
> Key: SPARK-21903
> URL: https://issues.apache.org/jira/browse/SPARK-21903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> 1.0.0 fixes issues with import order, explicit types for public methods, the 
> line-length limit, and comment validation:
> {code}
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/Main.scala:50:16:
>  Are you sure you want to println? If yes, wrap the code block with
> [error]   // scalastyle:off println
> [error]   println(...)
> [error]   // scalastyle:on println
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:49:
>  File line length exceeds 100 characters
> [error] 
> .../spark/repl/scala-2.11/src/main/scala/org/apache/spark/repl/SparkILoop.scala:22:21:
>  Are you sure you want to println? If yes, wrap the code block with
> [error]   // scalastyle:off println
> [error]   println(...)
> [error]   // scalastyle:on println
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:35:6:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:51:6:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:93:15:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:98:15:
>  Public method must have explicit type
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:47:2:
>  Insert a space after the start of the comment
> [error] 
> .../spark/streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala:26:43:
>  JavaDStream should come before JavaDStreamLike.
> {code}
> I am also interested in the {{org.scalastyle.scalariform.OverrideJavaChecker}} 
> feature, added in 0.9.0, which we previously had to work around in order to 
> check SPARK-16877.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21922:


Assignee: Apache Spark

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: Apache Spark
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, and below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21922:


Assignee: (was: Apache Spark)

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, and below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153455#comment-16153455
 ] 

Apache Spark commented on SPARK-21922:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19132

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, and below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21902) BlockManager.doPut will hide actually exception when exception thrown in finally block

2017-09-05 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21902:
-
Description: 
As the log below shows, the actual exception will be hidden when removeBlockInternal 
throws an exception.
{code:java}
2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting 
block broadcast_110 failed due to an exception
2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: 
Failed to create a new broadcast in 1 attempts
java.io.IOException: Failed to create local dir in 
/tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
at 
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
at 
org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
I want to print the original exception first for troubleshooting. Or maybe we 
should not throw an exception when removing blocks.
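
A minimal Scala sketch (not BlockManager's actual code) of the pattern being asked for here: if cleanup in the finally block also throws, keep the original exception as the one that propagates and attach the cleanup failure to it, so the root cause is not hidden.

{code}
// Illustrative only: `body` stands for the put logic, `cleanup` for removeBlockInternal.
def runPreservingOriginal[T](body: => T)(cleanup: => Unit): T = {
  var original: Throwable = null
  try {
    body
  } catch {
    case t: Throwable =>
      original = t
      throw t
  } finally {
    try {
      cleanup
    } catch {
      case t: Throwable if original != null =>
        // Record the cleanup failure on the original exception and rethrow the
        // original, instead of letting the cleanup failure replace it.
        original.addSuppressed(t)
        throw original
      // If there was no original exception, the cleanup failure propagates as usual.
    }
  }
}
{code}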


  was:
As logging below, actually exception was hidden when removeBlockInternal throw 
an exception.
{code:java}
2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting 
block broadcast_110 failed due to an exception
2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: 
Failed to create a new broadcast in 1 attempts
java.io.IOException: Failed to create local dir in 
/tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
at 
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
at 
org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
at 
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
   

[jira] [Assigned] (SPARK-21902) BlockManager.doPut will hide actually exception when exception thrown in finally block

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21902:


Assignee: (was: Apache Spark)

> BlockManager.doPut will hide the actual exception when an exception is thrown 
> in the finally block
> --
>
> Key: SPARK-21902
> URL: https://issues.apache.org/jira/browse/SPARK-21902
> Project: Spark
>  Issue Type: Wish
>  Components: Block Manager
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> As the log below shows, the actual exception will be hidden when 
> removeBlockInternal throws an exception.
> {code:java}
> 2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting 
> block broadcast_110 failed due to an exception
> 2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: 
> Failed to create a new broadcast in 1 attempts
> java.io.IOException: Failed to create local dir in 
> /tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
> at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
> at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
> at 
> org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
> at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
> at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at 
> org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> I want to print the original exception first for troubleshooting. Or maybe we 
> should not throw an exception when removing blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21902) BlockManager.doPut will hide actually exception when exception thrown in finally block

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153463#comment-16153463
 ] 

Apache Spark commented on SPARK-21902:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19133

> BlockManager.doPut will hide the actual exception when an exception is thrown 
> in the finally block
> --
>
> Key: SPARK-21902
> URL: https://issues.apache.org/jira/browse/SPARK-21902
> Project: Spark
>  Issue Type: Wish
>  Components: Block Manager
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> As the log below shows, the actual exception will be hidden when 
> removeBlockInternal throws an exception.
> {code:java}
> 2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting 
> block broadcast_110 failed due to an exception
> 2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: 
> Failed to create a new broadcast in 1 attempts
> java.io.IOException: Failed to create local dir in 
> /tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
> at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
> at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
> at 
> org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
> at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
> at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at 
> org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> I want to print the original exception first for troubleshooting. Or maybe we 
> should not throw an exception when removing blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21902) BlockManager.doPut will hide actually exception when exception thrown in finally block

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21902:


Assignee: Apache Spark

> BlockManager.doPut will hide the actual exception when an exception is thrown 
> in the finally block
> --
>
> Key: SPARK-21902
> URL: https://issues.apache.org/jira/browse/SPARK-21902
> Project: Spark
>  Issue Type: Wish
>  Components: Block Manager
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: Apache Spark
>
> As the log below shows, the actual exception will be hidden when 
> removeBlockInternal throws an exception.
> {code:java}
> 2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting 
> block broadcast_110 failed due to an exception
> 2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: 
> Failed to create a new broadcast in 1 attempts
> java.io.IOException: Failed to create local dir in 
> /tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
> at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
> at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
> at 
> org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
> at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
> at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at 
> org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> I want to print the original exception first for troubleshooting. Or maybe we 
> should not throw an exception when removing blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21893:
--
Priority: Minor  (was: Major)

> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Priority: Minor
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21893:


Assignee: (was: Apache Spark)

> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Priority: Minor
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21893:


Assignee: Apache Spark

> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153503#comment-16153503
 ] 

Apache Spark commented on SPARK-21893:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19134

> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Priority: Minor
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21923) Avoid call reserveUnrollMemoryForThisTask every record

2017-09-05 Thread Xianyang Liu (JIRA)
Xianyang Liu created SPARK-21923:


 Summary: Avoid call reserveUnrollMemoryForThisTask every record
 Key: SPARK-21923
 URL: https://issues.apache.org/jira/browse/SPARK-21923
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Xianyang Liu


When Spark persists data to unsafe (off-heap) memory, we call the method 
`MemoryStore.putIteratorAsValues`, which needs to synchronize on the 
`memoryManager` for every record written. This is not necessary; we can request 
more memory at a time to reduce the synchronization overhead.

Test case:
{code}
import org.apache.spark.storage.StorageLevel

val start = System.currentTimeMillis()
sc.parallelize(0 until Integer.MAX_VALUE, 100)
  .persist(StorageLevel.OFF_HEAP)
  .count()
println(System.currentTimeMillis() - start)
{code}

Test result (elapsed time in ms, five runs):

before

|  27647  |  29108  |  28591  |  28264  |  27232  |

after

|  26868  |  26358  |  27767  |  26653  |  26693  |
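
A minimal, hypothetical Scala sketch (not MemoryStore's real code; the names are illustrative) of the amortized reservation idea: instead of going to the memory manager for every record, reserve a larger chunk up front and only go back to the manager when the chunk is exhausted, so the synchronized call happens far less often.

{code}
// `requestFromManager` stands in for the synchronized call on the memory manager
// (e.g. reserving unroll memory); everything else is per-task, unsynchronized state.
class ChunkedReservation(requestFromManager: Long => Boolean,
                         chunkBytes: Long = 1L << 20) {
  private var remaining = 0L  // bytes already reserved but not yet consumed

  /** Account for one record of `bytes`, topping up from the manager in big chunks. */
  def reserveForRecord(bytes: Long): Boolean = {
    val topUpOk =
      if (bytes <= remaining) true
      else {
        val toRequest = math.max(bytes - remaining, chunkBytes)
        val granted = requestFromManager(toRequest)  // the only synchronized call
        if (granted) remaining += toRequest
        granted
      }
    if (topUpOk) remaining -= bytes
    topUpOk
  }
}
{code}

The trade-off is that a task may hold up to one unused chunk of reserved memory, which is released when the task finishes unrolling.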




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21923) Avoid call reserveUnrollMemoryForThisTask every record

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21923:


Assignee: (was: Apache Spark)

> Avoid calling reserveUnrollMemoryForThisTask for every record
> --
>
> Key: SPARK-21923
> URL: https://issues.apache.org/jira/browse/SPARK-21923
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> When Spark persists data to unsafe (off-heap) memory, we call the method 
> `MemoryStore.putIteratorAsValues`, which needs to synchronize on the 
> `memoryManager` for every record written. This is not necessary; we can 
> request more memory at a time to reduce the synchronization overhead.
> Test case:
> {code}
> import org.apache.spark.storage.StorageLevel
> val start = System.currentTimeMillis()
> sc.parallelize(0 until Integer.MAX_VALUE, 100)
>   .persist(StorageLevel.OFF_HEAP)
>   .count()
> println(System.currentTimeMillis() - start)
> {code}
> Test result (elapsed time in ms, five runs):
> before
> |  27647  |  29108  |  28591  |  28264  |  27232  |
> after
> |  26868  |  26358  |  27767  |  26653  |  26693  |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21923) Avoid call reserveUnrollMemoryForThisTask every record

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21923:


Assignee: Apache Spark

> Avoid calling reserveUnrollMemoryForThisTask for every record
> --
>
> Key: SPARK-21923
> URL: https://issues.apache.org/jira/browse/SPARK-21923
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Apache Spark
>
> When Spark persists data to unsafe (off-heap) memory, we call the method 
> `MemoryStore.putIteratorAsValues`, which needs to synchronize on the 
> `memoryManager` for every record written. This is not necessary; we can 
> request more memory at a time to reduce the synchronization overhead.
> Test case:
> {code}
> import org.apache.spark.storage.StorageLevel
> val start = System.currentTimeMillis()
> sc.parallelize(0 until Integer.MAX_VALUE, 100)
>   .persist(StorageLevel.OFF_HEAP)
>   .count()
> println(System.currentTimeMillis() - start)
> {code}
> Test result (elapsed time in ms, five runs):
> before
> |  27647  |  29108  |  28591  |  28264  |  27232  |
> after
> |  26868  |  26358  |  27767  |  26653  |  26693  |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21923) Avoid call reserveUnrollMemoryForThisTask every record

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153590#comment-16153590
 ] 

Apache Spark commented on SPARK-21923:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19135

> Avoid calling reserveUnrollMemoryForThisTask for every record
> --
>
> Key: SPARK-21923
> URL: https://issues.apache.org/jira/browse/SPARK-21923
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> When Spark persists data to unsafe (off-heap) memory, we call the method 
> `MemoryStore.putIteratorAsValues`, which needs to synchronize on the 
> `memoryManager` for every record written. This is not necessary; we can 
> request more memory at a time to reduce the synchronization overhead.
> Test case:
> {code}
> import org.apache.spark.storage.StorageLevel
> val start = System.currentTimeMillis()
> sc.parallelize(0 until Integer.MAX_VALUE, 100)
>   .persist(StorageLevel.OFF_HEAP)
>   .count()
> println(System.currentTimeMillis() - start)
> {code}
> Test result (elapsed time in ms, five runs):
> before
> |  27647  |  29108  |  28591  |  28264  |  27232  |
> after
> |  26868  |  26358  |  27767  |  26653  |  26693  |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21923) Avoid call reserveUnrollMemoryForThisTask every record

2017-09-05 Thread Xianyang Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-21923:
-
Description: 
When Spark persists data to unsafe (off-heap) memory, we call the method 
`MemoryStore.putIteratorAsBytes`, which needs to synchronize on the 
`memoryManager` for every record written. This is not necessary; we can request 
more memory at a time to reduce the synchronization overhead.

Test case:
{code}
import org.apache.spark.storage.StorageLevel

val start = System.currentTimeMillis()
sc.parallelize(0 until Integer.MAX_VALUE, 100)
  .persist(StorageLevel.OFF_HEAP)
  .count()
println(System.currentTimeMillis() - start)
{code}

Test result (elapsed time in ms, five runs):

before

|  27647  |  29108  |  28591  |  28264  |  27232  |

after

|  26868  |  26358  |  27767  |  26653  |  26693  |


  was:
When Spark persist data to Unsafe memory, we call  the method 
`MemoryStore.putIteratorAsValues`, which need synchronize the `memoryManager` 
for every record write. This implementation is not necessary, we can apply for 
more memory at a time to reduce unnecessary synchronization.

Test case:
```scala
val start = System.currentTimeMillis()
val data = sc.parallelize(0 until Integer.MAX_VALUE, 100)
  .persist(StorageLevel.OFF_HEAP)
  .count()

println(System.currentTimeMillis() - start)

```

Test result:

before

|  27647  |  29108  |  28591  |  28264  |  27232  |

after

|  26868  |  26358  |  27767  |  26653  |  26693  |



> Avoid calling reserveUnrollMemoryForThisTask for every record
> --
>
> Key: SPARK-21923
> URL: https://issues.apache.org/jira/browse/SPARK-21923
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>
> When Spark persists data to unsafe (off-heap) memory, we call the method 
> `MemoryStore.putIteratorAsBytes`, which needs to synchronize on the 
> `memoryManager` for every record written. This is not necessary; we can 
> request more memory at a time to reduce the synchronization overhead.
> Test case:
> {code}
> import org.apache.spark.storage.StorageLevel
> val start = System.currentTimeMillis()
> sc.parallelize(0 until Integer.MAX_VALUE, 100)
>   .persist(StorageLevel.OFF_HEAP)
>   .count()
> println(System.currentTimeMillis() - start)
> {code}
> Test result (elapsed time in ms, five runs):
> before
> |  27647  |  29108  |  28591  |  28264  |  27232  |
> after
> |  26868  |  26358  |  27767  |  26653  |  26693  |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-21888:
--
Description: While running Spark on YARN in cluster mode, there is currently 
no way to add any config files to the client classpath. For example, suppose 
you want to run an application that uses HBase. Unless we copy the necessary 
HBase config files into the Spark conf folder, we cannot specify their 
locations on the client-side classpath, which we could do earlier by setting 
the environment variable "SPARK_CLASSPATH".  (was: While running Spark on YARN 
in cluster mode, there is currently no way to add any config files, jars etc. 
to the client classpath. For example, suppose you want to run an application 
that uses HBase. Unless we copy the necessary HBase config files into the Spark 
conf folder, we cannot specify their locations on the client-side classpath, 
which we could do earlier by setting the environment variable 
"SPARK_CLASSPATH".)

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> any config files to the client classpath. For example, suppose you want to run 
> an application that uses HBase. Unless we copy the necessary HBase config 
> files into the Spark conf folder, we cannot specify their locations on the 
> client-side classpath, which we could do earlier by setting the environment 
> variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21924) Bug in Structured Streaming Documentation

2017-09-05 Thread Riccardo Corbella (JIRA)
Riccardo Corbella created SPARK-21924:
-

 Summary: Bug in Structured Streaming Documentation
 Key: SPARK-21924
 URL: https://issues.apache.org/jira/browse/SPARK-21924
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.2.0
Reporter: Riccardo Corbella
Priority: Trivial


On the Structured Streaming documentation page, in the text immediately after 
the image "Watermarking in Windowed Grouped Aggregation with Update Mode", 
there is the following erroneous sentence: "For example, the data (12:09, cat) 
is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 
12:20." It should be updated as follows: "For example, the data (12:09, cat) 
is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 
12:15."
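
A quick, illustrative Scala check (not Spark code) of the corrected sentence: with 10-minute windows sliding every 5 minutes, an event at 12:09 falls into the 12:00 - 12:10 and 12:05 - 12:15 windows.

{code}
// Enumerate the sliding windows (start, end), in minutes since midnight,
// that contain a given event time. Window starts are multiples of the slide.
def windowsFor(eventMin: Int, lengthMin: Int = 10, slideMin: Int = 5): Seq[(Int, Int)] = {
  val firstStart = math.max(0, (eventMin - lengthMin + slideMin) / slideMin * slideMin)
  (firstStart to eventMin by slideMin).map(s => (s, s + lengthMin))
}

def hhmm(m: Int): String = f"${m / 60}%02d:${m % 60}%02d"

windowsFor(12 * 60 + 9).foreach { case (s, e) => println(s"${hhmm(s)} - ${hhmm(e)}") }
// prints:
// 12:00 - 12:10
// 12:05 - 12:15
{code}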




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang commented on SPARK-21866:
-

I would generally support this effort. For Spark, providing a general image 
storage format and data source is good to have. It would let users try 
different deep neural network models conveniently. AFAIK, lots of users would 
be interested in applying existing deep neural models to their own datasets, 
that is to say, model inference, which Spark can run in a distributed way. 
Thanks for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above, {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, I'd support following other Spark SQL data 
sources to the greatest extent. Even if we don't use UDT, a familiar API would 
help more users adopt it.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.
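
A short Scala sketch of the "familiar data source API" shape being suggested, by analogy with the existing {{libsvm}} source. The {{"image"}} format name and its options are part of the proposal under discussion, not an existing Spark API.

{code}
// Existing pattern: the libsvm source is addressed through the generic reader.
val libsvmDf = spark.read.format("libsvm").load("path/to/data.libsvm")

// Hypothetical image source following the same pattern; the format name and
// options are assumptions taken from the proposed signature above.
val imagesDf = spark.read
  .format("image")
  .option("dropImageFailures", "true")
  .option("sampleRatio", "0.1")
  .load("hdfs:///data/images")
{code}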

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the sta

[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:37 PM:
-

I would generally support this effort. For Spark, providing a general image 
storage format and data source is good to have. It would let users try 
different deep neural network models conveniently. AFAIK, lots of users would 
be interested in applying existing deep neural models to their own datasets, 
that is to say, model inference, which Spark can run in a distributed way. 
Thanks for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above, {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data sources to 
the greatest extent? Even if we don't use UDT, a familiar API would help more 
users adopt it. For example, an API like the following would be very friendly 
to Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.


was (Author: yanboliang):
I would support this effort generally. For Spark, to provide a general image 
storage format and data source is good to have. This can let users try 
different deep neural network models convenience. AFAIK, lots of users would be 
interested in applying existing deep neural models to their own dataset, that 
is to say, model inference, which can be distributed running by Spark. Thanks 
for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above: {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, I'd support to follow other Spark SQL data 
source to the greatest extent. Even we don't use UDT, a familiar API would make 
more users to adopt it.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstand, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image p

[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:38 PM:
-

I would generally support this effort. For Spark, providing a general image 
storage format and data source is good to have. It would let users try 
different deep neural network models conveniently. AFAIK, lots of users would 
be interested in applying existing deep neural models to their own datasets, 
that is to say, model inference, which Spark can run in a distributed way. 
Thanks for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above, {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data sources to 
the greatest extent? Even if we don't use UDT, a familiar API would help more 
users adopt it. For example, the following API would be friendlier to Spark 
users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.


was (Author: yanboliang):
I would support this effort generally. For Spark, to provide a general image 
storage format and data source is good to have. This can let users try 
different deep neural network models convenience. AFAIK, lots of users would be 
interested in applying existing deep neural models to their own dataset, that 
is to say, model inference, which can be distributed running by Spark. Thanks 
for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above: {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data source to 
the greatest extent? Even we don't use UDT, a familiar API would make more 
users to adopt it. For example, an API like following would be very friendly to 
Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstand, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PN

[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:39 PM:
-

I would generally support this effort. For Spark, providing a general image 
storage format and data source is good to have. It would let users try 
different deep neural network models conveniently. AFAIK, lots of users would 
be interested in applying existing deep neural models to their own datasets, 
that is to say, model inference, which Spark can run in a distributed way. 
Thanks for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above, {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data sources to 
the greatest extent? Even if we don't use UDT, a familiar API would help more 
users adopt it. For example, the following API would be friendlier to Spark 
users. Is there any obstacle to implementing it like this?
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.


was (Author: yanboliang):
I would support this effort generally. For Spark, to provide a general image 
storage format and data source is good to have. This can let users try 
different deep neural network models convenience. AFAIK, lots of users would be 
interested in applying existing deep neural models to their own dataset, that 
is to say, model inference, which can be distributed running by Spark. Thanks 
for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above: {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data source to 
the greatest extent? Even we don't use UDT, a familiar API would make more 
users to adopt it. For example, the following API would be more friendly to 
Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstand, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compres

[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153672#comment-16153672
 ] 

Marco Gaido commented on SPARK-21918:
-

hive.server2.enable.doAs=true is currently not supported in STS.

> HiveClient shouldn't share a Hive object between different threads
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
>     if (clientLoader.cachedHive != null) {
>       clientLoader.cachedHive.asInstanceOf[Hive]
>     } else {
>       val c = Hive.get(conf)
>       clientLoader.cachedHive = c
>       c
>     }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> single thread, so that the metastore client in Hive is associated with the 
> right user.
> We can fix this by passing the parent thread's Hive object to the child thread 
> when running the SQL.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.
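
A minimal Scala sketch (not HiveClientImpl's actual code; it assumes Hive's {{Hive.get(HiveConf)}} API) of the per-thread alternative the description argues for: keep one Hive client per thread instead of a single cached instance shared across threads, so each impersonated session keeps its own metastore client.

{code}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.metadata.Hive

object PerThreadHiveClient {
  // One Hive client per thread; a child thread spawned for a session would
  // either inherit the parent's instance explicitly or build its own from the
  // session's configuration.
  private val threadLocalHive = new ThreadLocal[Hive] {
    override def initialValue(): Hive = Hive.get(new HiveConf())
  }

  def client: Hive = threadLocalHive.get()
}
{code}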



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153672#comment-16153672
 ] 

Marco Gaido edited comment on SPARK-21918 at 9/5/17 1:54 PM:
-

{{hive.server2.enable.doAs=true}} is currently not supported in STS.


was (Author: mgaido):
hive.server2.enable.doAs=true is currently not supported in STS.

> HiveClient shouldn't share a Hive object between different threads
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
>     if (clientLoader.cachedHive != null) {
>       clientLoader.cachedHive.asInstanceOf[Hive]
>     } else {
>       val c = Hive.get(conf)
>       clientLoader.cachedHive = c
>       c
>     }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> single thread, so that the metastore client in Hive is associated with the 
> right user.
> We can fix this by passing the parent thread's Hive object to the child thread 
> when running the SQL.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153709#comment-16153709
 ] 

Yanbo Liang commented on SPARK-15689:
-

[~cloud_fan] Great doc! I like the direction this is going in, especially for 
columnar read interface, schema inference interface and schema evolution. 
Thanks.

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface that depends on 
> DataFrame/SQLContext, making the data source API's compatibility depend on 
> the upper-level API. The current data source API is also row-oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Hu Liu, (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153713#comment-16153713
 ] 

Hu Liu, commented on SPARK-21918:
-

[~mgaido]
It seems that doAs works correctly in the Hive Thrift server 
(HiveSessionImplwithUGI), which runs SQL via Spark.


> HiveClient shouldn't share a Hive object between different threads
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift Server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
>     if (clientLoader.cachedHive != null) {
>       clientLoader.cachedHive.asInstanceOf[Hive]
>     } else {
>       val c = Hive.get(conf)
>       clientLoader.cachedHive = c
>       c
>     }
>   }
> {code}
> But in impersonation mode, we should share the Hive object only within a 
> single thread, so that the metastore client in Hive is associated with the 
> right user.
> We can fix this by passing the parent thread's Hive object to the child thread 
> when running the SQL.
> I already have an initial patch for review, and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153739#comment-16153739
 ] 

Thomas Graves commented on SPARK-21888:
---

[~mgaido]  I don't think that is true unless something went into master that 
I'm not aware of.  It doesn't work with 2.2 for sure.  We need the 
file/directory to get into the classpath of the client submitting the 
application.  --files does work to get the driver and executors to load it.  
If I'm missing something, please let me know.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> any config files to the client classpath. For example, suppose you want to run 
> an application that uses HBase. Unless we copy the necessary config files 
> required by HBase into the Spark conf folder, we cannot specify their exact 
> locations on the client classpath, which we could do earlier by setting the 
> environment variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153744#comment-16153744
 ] 

Thomas Graves commented on SPARK-21888:
---

Also note that you can do this in client mode by using the driver extra 
classpath option, for example:
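
(Hypothetical paths and class names; in client mode the driver runs in the 
submitting JVM, so the driver extra classpath also covers the client side.)

{noformat}
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-class-path /etc/hbase/conf \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.MyHBaseApp myapp.jar
{noformat}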

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> any config files to the client classpath. For example, suppose you want to run 
> an application that uses HBase. Unless we copy the necessary config files 
> required by HBase into the Spark conf folder, we cannot specify their exact 
> locations on the client classpath, which we could do earlier by setting the 
> environment variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20978) CSV emits NPE when the number of tokens is less than given schema and corrupt column is given

2017-09-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20978.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19113
[https://github.com/apache/spark/pull/19113]

> CSV emits NPE when the number of tokens is less than given schema and corrupt 
> column is given
> -
>
> Key: SPARK-20978
> URL: https://issues.apache.org/jira/browse/SPARK-20978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Currently, if the number of tokens is less than the given schema expects, the 
> CSV datasource throws an NPE as below:
> {code}
> scala> spark.read.schema("a string, b string, unparsed 
> string").option("columnNameOfCorruptRecord", 
> "unparsed").csv(Seq("a").toDS).show()
> 17/06/05 13:59:26 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
>   at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:50)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:64)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$apply$4.apply(DataFrameReader.scala:471)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$apply$4.apply(DataFrameReader.scala:471)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {code}
> If the corrupt-record column is not given, it works as below:
> {code}
> scala> spark.read.schema("a string, b string, unparsed 
> string").csv(Seq("a").toDS).show()
> +---+++
> |  a|   b|unparsed|
> +---+++
> |  a|null|null|
> +---+++
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20978) CSV emits NPE when the number of tokens is less than given schema and corrupt column is given

2017-09-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20978:
---

Assignee: Hyukjin Kwon

> CSV emits NPE when the number of tokens is less than given schema and corrupt 
> column is given
> -
>
> Key: SPARK-20978
> URL: https://issues.apache.org/jira/browse/SPARK-20978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Currently, if the number of tokens is less than the given schema expects, the 
> CSV datasource throws an NPE as below:
> {code}
> scala> spark.read.schema("a string, b string, unparsed 
> string").option("columnNameOfCorruptRecord", 
> "unparsed").csv(Seq("a").toDS).show()
> 17/06/05 13:59:26 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
> java.lang.NullPointerException
>   at 
> scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89)
>   at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:50)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser$$anonfun$2.apply(FailureSafeParser.scala:43)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:64)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$apply$4.apply(DataFrameReader.scala:471)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$apply$4.apply(DataFrameReader.scala:471)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {code}
> If the corrupt-record column is not given, it works as below:
> {code}
> scala> spark.read.schema("a string, b string, unparsed 
> string").csv(Seq("a").toDS).show()
> +---+++
> |  a|   b|unparsed|
> +---+++
> |  a|null|null|
> +---+++
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15689) Data source API v2

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15689:


Assignee: Apache Spark

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, which ties its compatibility to the upper-level API. 
> The current data source API is also row-oriented only and has to go through an 
> expensive conversion from external data types to internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153834#comment-16153834
 ] 

Apache Spark commented on SPARK-15689:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19136

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, which ties its compatibility to the upper-level API. 
> The current data source API is also row-oriented only and has to go through an 
> expensive conversion from external data types to internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15689) Data source API v2

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15689:


Assignee: (was: Apache Spark)

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, which ties its compatibility to the upper-level API. 
> The current data source API is also row-oriented only and has to go through an 
> expensive conversion from external data types to internal data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21924) Bug in Structured Streaming Documentation

2017-09-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153847#comment-16153847
 ] 

Sean Owen commented on SPARK-21924:
---

Agree, go ahead and make a pull request.

> Bug in Structured Streaming Documentation
> -
>
> Key: SPARK-21924
> URL: https://issues.apache.org/jira/browse/SPARK-21924
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Priority: Trivial
>
> Under the structured streaming documentation page, more precisely in text 
> immediately after the image "Watermarking in Windowed Grouped Aggregation 
> with Update Mode" there's the following erroneous sentence: "For example, the 
> data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 
> 12:15 and 12:10 - 12:20.". It should be updated as follows: "For example, the 
> the data (12:09, cat) is out of order and late, and it falls in windows 12:00 
> - 12:10 and 12:05 - 12:15."



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21924) Bug in Structured Streaming Documentation

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153856#comment-16153856
 ] 

Apache Spark commented on SPARK-21924:
--

User 'riccardocorbella' has created a pull request for this issue:
https://github.com/apache/spark/pull/19137

> Bug in Structured Streaming Documentation
> -
>
> Key: SPARK-21924
> URL: https://issues.apache.org/jira/browse/SPARK-21924
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Priority: Trivial
>
> Under the structured streaming documentation page, more precisely in text 
> immediately after the image "Watermarking in Windowed Grouped Aggregation 
> with Update Mode" there's the following erroneous sentence: "For example, the 
> data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 
> 12:15 and 12:10 - 12:20.". It should be updated as follows: "For example, 
> the data (12:09, cat) is out of order and late, and it falls in windows 12:00 
> - 12:10 and 12:05 - 12:15."



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21924) Bug in Structured Streaming Documentation

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21924:


Assignee: Apache Spark

> Bug in Structured Streaming Documentation
> -
>
> Key: SPARK-21924
> URL: https://issues.apache.org/jira/browse/SPARK-21924
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Assignee: Apache Spark
>Priority: Trivial
>
> Under the structured streaming documentation page, more precisely in text 
> immediately after the image "Watermarking in Windowed Grouped Aggregation 
> with Update Mode" there's the following erroneous sentence: "For example, the 
> data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 
> 12:15 and 12:10 - 12:20.". It should be updated as follows: "For example, 
> the data (12:09, cat) is out of order and late, and it falls in windows 12:00 
> - 12:10 and 12:05 - 12:15."



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21924) Bug in Structured Streaming Documentation

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21924:


Assignee: (was: Apache Spark)

> Bug in Structured Streaming Documentation
> -
>
> Key: SPARK-21924
> URL: https://issues.apache.org/jira/browse/SPARK-21924
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Priority: Trivial
>
> Under the structured streaming documentation page, more precisely in text 
> immediately after the image "Watermarking in Windowed Grouped Aggregation 
> with Update Mode" there's the following erroneous sentence: "For example, the 
> data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 
> 12:15 and 12:10 - 12:20.". It should be updated as follows: "For example, 
> the data (12:09, cat) is out of order and late, and it falls in windows 12:00 
> - 12:10 and 12:05 - 12:15."



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21924) Bug in Structured Streaming Documentation

2017-09-05 Thread Riccardo Corbella (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riccardo Corbella updated SPARK-21924:
--
Description: Under the structured streaming documentation page, more 
precisely in text immediately after the image "Watermarking in Windowed Grouped 
Aggregation with Update Mode" there's the following erroneous sentence: "For 
example, the data (12:09, cat) is out of order and late, and it falls in 
windows 12:05 - 12:15 and 12:10 - 12:20.". It should be updated as following 
"For example, the data (12:09, cat) is out of order and late, and it falls in 
windows 12:00 - 12:10 and 12:05 - 12:15." because 12:09 cannot fall in the 
12:10 - 12:20 window.  (was: Under the structured streaming documentation page, 
more precisely in text immediately after the image "Watermarking in Windowed 
Grouped Aggregation with Update Mode" there's the following erroneous sentence: 
"For example, the data (12:09, cat) is out of order and late, and it falls in 
windows 12:05 - 12:15 and 12:10 - 12:20.". It should be updated as following 
"For example, the data (12:09, cat) is out of order and late, and it falls in 
windows 12:00 - 12:10 and 12:05 - 12:15."
)

> Bug in Structured Streaming Documentation
> -
>
> Key: SPARK-21924
> URL: https://issues.apache.org/jira/browse/SPARK-21924
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Priority: Trivial
>
> Under the structured streaming documentation page, more precisely in text 
> immediately after the image "Watermarking in Windowed Grouped Aggregation 
> with Update Mode" there's the following erroneous sentence: "For example, the 
> data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 
> 12:15 and 12:10 - 12:20.". It should be updated as follows: "For example, 
> the data (12:09, cat) is out of order and late, and it falls in windows 12:00 
> - 12:10 and 12:05 - 12:15." because 12:09 cannot fall in the 12:10 - 12:20 
> window.
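A quick way to sanity-check the corrected window assignment (a sketch, assuming 
a SparkSession named {{spark}}; 10-minute windows sliding every 5 minutes):

{code}
import java.sql.Timestamp
import org.apache.spark.sql.functions.window
import spark.implicits._

val events = Seq((Timestamp.valueOf("2017-09-05 12:09:00"), "cat")).toDF("time", "word")
// An event at 12:09 lands in [12:00, 12:10) and [12:05, 12:15), not [12:10, 12:20),
// so show() prints two rows, one per overlapping window.
events.select(window($"time", "10 minutes", "5 minutes"), $"word").show(false)
{code}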



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153888#comment-16153888
 ] 

Marco Gaido commented on SPARK-21888:
-

[~tgraves] Sorry, I misread. Of course, this doesn't add it to the client, only 
to the driver and the executors. But in the example you gave, i.e. writing to 
HBase, I can't see why you would need it: it is enough to load the conf in the 
driver and the executors.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> any config files to the client classpath. For example, suppose you want to run 
> an application that uses HBase. Unless we copy the necessary config files 
> required by HBase into the Spark conf folder, we cannot specify their exact 
> locations on the client classpath, which we could do earlier by setting the 
> environment variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-05 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153955#comment-16153955
 ] 

Takuya Ueshin commented on SPARK-21190:
---

[~bryanc] Thank you for your suggestion. {{**kwargs}} might be a good way to 
provide the {{size}} hint and other metadata in the future. I'm not sure yet how 
to inspect the UDF to check whether it accepts kwargs, though.
Should we make {{**kwargs}} required? If not, users can still define a 
0-parameter UDF without it, and we can't determine how to handle it.

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-05 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153978#comment-16153978
 ] 

Takuya Ueshin commented on SPARK-21190:
---

[~leif] Thank you for your proposal.
I'm sorry, but I couldn't see a big difference between passing only a size 
parameter and passing a Series/DataFrame that has an index but no content, and I 
suspect an empty Series/DataFrame would confuse users more.

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issue

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-05 Thread Leif Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153986#comment-16153986
 ] 

Leif Walsh commented on SPARK-21190:


I think the size parameter is confusing: if a 1-or-more-parameter UDF gets 
called with some number of Series or DataFrame objects, but a 0-parameter UDF 
switches to getting a size scalar, it's less consistent. In fact, the most 
consistent interface would be for all UDFs to get called with a single 
DataFrame parameter, containing all the columns, already aligned. However, I 
somewhat prefer the idea of a UDF on multiple columns getting separate 
Series objects, as this makes the UDF's function signature look more similar to 
how it will be called. 

You can use the {{inspect}} module in the Python standard library to find out 
what parameters a function accepts, but I'd like to warn you against using a 
magic {{**kwargs}} parameter; that's a fairly non-intuitive API and is also 
somewhat brittle. 

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  

[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154006#comment-16154006
 ] 

Thomas Graves commented on SPARK-21888:
---

The client needs to get the HBase credentials for secure HBase to send along 
with the job when it is run in cluster mode.  If it doesn't load the jars and 
the hbase-site.xml, it won't get the credentials to send along, and the 
driver/executors won't be able to talk to HBase.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on YARN in cluster mode, there is currently no way to add 
> any config files to the client classpath. For example, suppose you want to run 
> an application that uses HBase. Unless we copy the necessary config files 
> required by HBase into the Spark conf folder, we cannot specify their exact 
> locations on the client classpath, which we could do earlier by setting the 
> environment variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154033#comment-16154033
 ] 

Marco Gaido commented on SPARK-21918:
-

What do you mean by "works correctly"? Actually, all the jobs are executed as 
the user who started the STS.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the Spark Thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true.
> The root cause is that the Hive object is shared between different threads in 
> HiveClientImpl:
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should only share the Hive object within a 
> thread, so that the metastore client in Hive is associated with the right 
> user.
> To fix it, we can pass the Hive object of the parent thread to the child 
> thread when running the SQL.
> I already have an initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20922) Unsafe deserialization in Spark LauncherConnection

2017-09-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154075#comment-16154075
 ] 

Sean Owen commented on SPARK-20922:
---

This came up again today and our security folks also suggested this should be a 
CVE. I can work on this but feel free to supply text as a summary.

> Unsafe deserialization in Spark LauncherConnection
> --
>
> Key: SPARK-20922
> URL: https://issues.apache.org/jira/browse/SPARK-20922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Aditya Sharad
>Assignee: Marcelo Vanzin
>  Labels: security
> Fix For: 2.0.3, 2.1.2, 2.2.0, 2.3.0
>
> Attachments: spark-deserialize-master.zip
>
>
> The {{run()}} method of the class 
> {{org.apache.spark.launcher.LauncherConnection}} performs unsafe 
> deserialization of data received by its socket. This makes Spark applications 
> launched programmatically using the {{SparkLauncher}} framework potentially 
> vulnerable to remote code execution by an attacker with access to any user 
> account on the local machine. Such an attacker could send a malicious 
> serialized Java object to multiple ports on the local machine, and if this 
> port matches the one (randomly) chosen by the Spark launcher, the malicious 
> object will be deserialized. By making use of gadget chains in code present 
> on the Spark application classpath, the deserialization process can lead to 
> RCE or privilege escalation.
> This vulnerability is identified by the “Unsafe deserialization” rule on 
> lgtm.com:
> https://lgtm.com/projects/g/apache/spark/snapshot/80fdc2c9d1693f5b3402a79ca4ec76f6e422ff13/files/launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java#V58
>  
> Attached is a proof-of-concept exploit involving a simple 
> {{SparkLauncher}}-based application and a known gadget chain in the Apache 
> Commons Beanutils library referenced by Spark.
> See the readme file for demonstration instructions.
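For context, the usual mitigation for this class of issue is to restrict which 
classes a stream will deserialize. A generic sketch in Scala (not Spark's actual 
fix; class names and prefixes here are made up):

{code}
import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

// Only resolve classes whose names start with an allowed prefix; reject anything
// else before gadget-chain code on the classpath can be instantiated.
class RestrictedObjectInputStream(in: InputStream, allowedPrefixes: Seq[String])
  extends ObjectInputStream(in) {

  override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
    val name = desc.getName
    if (allowedPrefixes.exists(p => name.startsWith(p))) {
      super.resolveClass(desc)
    } else {
      throw new InvalidClassException(name, "class not allowed for deserialization")
    }
  }
}

// Example usage: only allow the launcher's own message classes and basic JDK types.
// val in = new RestrictedObjectInputStream(socket.getInputStream,
//   Seq("org.apache.spark.launcher.", "java.lang."))
{code}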



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21917) Remote http(s) resources is not supported in YARN mode

2017-09-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154097#comment-16154097
 ] 

Marcelo Vanzin commented on SPARK-21917:


I think the optimal way would be to try to add things to the cache and, if that 
fails, fall back to download + re-upload.

The sucky part is that there's no reliable way to do it. The libraries 
available on the client side may support different file systems than the 
libraries available on the NM; so if you have the http fs in your classpath, 
but the NM does not, the container localizer would probably fail.

#1 works but it also penalizes those who are running YARN 2.9 or any other 
future version where that support exists.

So perhaps a compromise could be:

- by default, assume that client and NM libraries are "in sync"; if 
{{FileSystem.get()}} does not complain, assume the NM can also download files 
from that scheme. If it does, then download the file and re-upload it to HDFS.
- add a config option where users can blacklist schemes and force those to 
download and re-upload.
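
A rough sketch of that compromise (a hypothetical helper, not the actual 
implementation; it only illustrates "check the scheme on the client, otherwise 
download and re-upload"):

{code}
// Hypothetical helper: try the scheme with the client-side FileSystem libraries;
// if that fails (or the scheme is blacklisted by config), fall back to
// download + re-upload.
import java.io.IOException
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DistCacheResolver {

  def resolveForDistCache(
      uri: URI,
      hadoopConf: Configuration,
      blacklistedSchemes: Set[String]): Path = {
    val scheme = Option(uri.getScheme).getOrElse("file")
    val clientKnowsScheme =
      try {
        // If the client-side libraries can resolve this scheme, assume the NM can too.
        FileSystem.get(uri, hadoopConf)
        true
      } catch {
        case _: IOException => false // e.g. "No FileSystem for scheme: http"
      }

    if (clientKnowsScheme && !blacklistedSchemes.contains(scheme)) {
      new Path(uri) // add to the YARN distributed cache as-is
    } else {
      downloadAndReuploadToHdfs(uri, hadoopConf) // fallback path
    }
  }

  // Placeholder for the download-locally-then-upload-to-HDFS step.
  private def downloadAndReuploadToHdfs(uri: URI, conf: Configuration): Path = ???
}
{code}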


> Remote http(s) resources is not supported in YARN mode
> --
>
> Key: SPARK-21917
> URL: https://issues.apache.org/jira/browse/SPARK-21917
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In current Spark, when submitting an application on YARN with remote 
> resources, e.g. {{./bin/spark-shell --jars 
> http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar
>  --master yarn-client -v}}, Spark fails with:
> {noformat}
> java.io.IOException: No FileSystem for scheme: http
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599)
>   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173)
> {noformat}
> This is because {{YARN#client}} assumes resources must be on a Hadoop-compatible 
> FS; the NM 
> (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245)
>  will likewise only use a Hadoop-compatible FS to download resources. This makes 
> Spark on YARN fail to support remote http(s) resources.
> To solve this problem, there are several options:
> * Download remote http(s) resources locally and add the downloaded resources 
> to the dist cache. The downside of this option is that remote resources will 
> be uploaded again unnecessarily.
> * Filter remote http(s) resources and add them with spark.jars or 
> spark.files, to leverage Spark's internal file server to distribute remote 
> http(s) resources. The problem with this solution is that some resources which 
> need to be available before the application starts may not work.
> * Leverage Hadoop's support for an http(s) file system 
> (https://issues.apache.org/jira/browse/HADOOP-14383). This only works in 
> Hadoop 2.9+, and I think even if we implement a similar one in Spark it will 
> not work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To 

[jira] [Created] (SPARK-21925) Update trigger interval documentation in docs with behavior change in Spark 2.2

2017-09-05 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-21925:
---

 Summary: Update trigger interval documentation in docs with 
behavior change in Spark 2.2
 Key: SPARK-21925
 URL: https://issues.apache.org/jira/browse/SPARK-21925
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Structured Streaming
Affects Versions: 2.2.0
Reporter: Burak Yavuz


I shall update the documentation in Apache afterwards.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21925) Update trigger interval documentation in docs with behavior change in Spark 2.2

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154106#comment-16154106
 ] 

Apache Spark commented on SPARK-21925:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/19138

> Update trigger interval documentation in docs with behavior change in Spark 
> 2.2
> ---
>
> Key: SPARK-21925
> URL: https://issues.apache.org/jira/browse/SPARK-21925
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>
> I shall update the documentation in Apache afterwards.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21925) Update trigger interval documentation in docs with behavior change in Spark 2.2

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21925:


Assignee: (was: Apache Spark)

> Update trigger interval documentation in docs with behavior change in Spark 
> 2.2
> ---
>
> Key: SPARK-21925
> URL: https://issues.apache.org/jira/browse/SPARK-21925
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>
> I shall update the documentation in Apache afterwards.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21925) Update trigger interval documentation in docs with behavior change in Spark 2.2

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21925:


Assignee: Apache Spark

> Update trigger interval documentation in docs with behavior change in Spark 
> 2.2
> ---
>
> Key: SPARK-21925
> URL: https://issues.apache.org/jira/browse/SPARK-21925
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> I shall update the documentation in Apache afterwards.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154132#comment-16154132
 ] 

Timothy Hunter commented on SPARK-21866:


[~yanboliang] Thank you for the comments. Regarding your questions:

1. making {{image}} part of {{ml}} or not: I do not have a strong preference, 
but I think that image support is more general than machine learning.

2. there is no obstacle, but that would create a dependency between the core 
({{spark.read}}) and an external module. This sort of dependency inversion is 
not great design, as any change in a sub-package will have API repercussions 
in the core of Spark. The SQL team is already struggling with such issues.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
>

[jira] [Commented] (SPARK-21890) ObtainCredentials does not pass creds to addDelegationTokens

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154190#comment-16154190
 ] 

Apache Spark commented on SPARK-21890:
--

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/19140

> ObtainCredentials does not pass creds to addDelegationTokens
> 
>
> Key: SPARK-21890
> URL: https://issues.apache.org/jira/browse/SPARK-21890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sanket Reddy
>
> I observed this while running an Oozie job trying to connect to HBase via 
> Spark.
> It looks like the creds are not being passed in the call at 
> https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53
>  for the 2.2 release.
> More info as to why it fails on a secure grid:
> The Oozie client gets the necessary tokens the application needs before 
> launching. It passes those tokens along to the Oozie launcher job (an MR job), 
> which then actually calls the Spark client to launch the Spark app and passes 
> the tokens along.
> The Oozie launcher job cannot get any more tokens because all it has is tokens 
> (you can't get tokens with tokens; you need a TGT or a keytab).
> The error here occurs because the launcher job runs the Spark client to submit 
> the Spark job, but the Spark client doesn't see that it already has the HDFS 
> tokens, so it tries to get more, which ends with the exception.
> SPARK-19021 generalized the HDFS credentials provider and, in doing so, stopped 
> passing the existing credentials into the call that obtains tokens, so the 
> client doesn't realize it already has the necessary tokens.
> Stack trace:
> Warning: Skip remote jar 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/schintap/spark_oozie/apps/lib/spark-starter-2.0-SNAPSHOT-jar-with-dependencies.jar.
> Failing Oozie Launcher, Main class 
> [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, 
> Delegation Token can be issued only with kerberos or web authentication
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1471)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1408)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Inv

[jira] [Resolved] (SPARK-21925) Update trigger interval documentation in docs with behavior change in Spark 2.2

2017-09-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-21925.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.2.1

Issue resolved by pull request 19138
[https://github.com/apache/spark/pull/19138]

> Update trigger interval documentation in docs with behavior change in Spark 
> 2.2
> ---
>
> Key: SPARK-21925
> URL: https://issues.apache.org/jira/browse/SPARK-21925
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
> Fix For: 2.2.1, 3.0.0
>
>
> I will update the Apache documentation afterwards.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9776) Another instance of Derby may have already booted the database

2017-09-05 Thread Holger Brandl (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154270#comment-16154270
 ] 

Holger Brandl commented on SPARK-9776:
--

I'm still having this issue with spark-2.2.0-bin-hadoop2.7. I've downloaded the 
binary distribution, started a local cluster with 
`$SPARK_HOME/sbin/start-all.sh` and tried to launch two spark-shells in 
parallel with `$SPARK_HOME/bin/spark-shell --master spark://foo.local:7077`. 
The first one starts up nicely, the second crashes miserably with the error 
from above. Is there any workaround/setting to prevent that problem?


> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
> error, though the same works with spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21876) Idling Executors that never handled any tasks are not cleared from BlockManager after being removed

2017-09-05 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154303#comment-16154303
 ] 

Imran Rashid commented on SPARK-21876:
--

Thanks for reporting this [~julie.zhang].

I got a little stuck at first following how GC would trigger a message to the 
lost executor.  Since the executor never ran any tasks, it wouldn't have any 
active blocks -- but I guess cleaning a broadcast always results in a message 
to all executors it thinks are alive.  Does that sound correct?

> Idling Executors that never handled any tasks are not cleared from 
> BlockManager after being removed
> ---
>
> Key: SPARK-21876
> URL: https://issues.apache.org/jira/browse/SPARK-21876
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Julie Zhang
>
> This happens when 'spark.dynamicAllocation.enabled' is set to 'true'. We 
> use YARN as our resource manager.
> 1) Executor A is launched, but no task has been submitted to it;
> 2) After 'spark.dynamicAllocation.executorIdleTimeout' seconds, executor A 
> is removed (ExecutorAllocationManager.scala schedule(): 294);
> 3) The scheduler gets notified that executor A has been lost (in our case, 
> YarnSchedulerBackend.scala: 209).
> In the TaskSchedulerImpl.scala method executorLost(executorId: String, 
> reason: ExecutorLossReason), the assumption in the None 
> case (TaskSchedulerImpl.scala: 548) that the executor has already been removed 
> is not always valid. As a result, the DAGScheduler and BlockManagerMaster are 
> never notified about the loss of executor A.
> When GC eventually happens, the ContextCleaner will try to clean up 
> unreferenced objects. Because executor A was not removed from the 
> blockManagerIdByExecutor map, BlockManagerMasterEndpoint will send out 
> requests to clean the references to the non-existent executor, producing a 
> lot of error messages like this in the driver log:
> ERROR [2017-08-08 00:00:23,596] 
> org.apache.spark.network.client.TransportClient: Failed to send RPC xxx to 
> xxx/xxx:x: java.nio.channels.ClosedChannelException
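
For reference, the conditions described above arise under a configuration along these lines (a minimal PySpark sketch; the idle timeout value is illustrative, and dynamic allocation on YARN also requires the external shuffle service):

{code}
from pyspark.sql import SparkSession

# Illustrative settings only; the timeout value is not taken from the report.
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .config("spark.shuffle.service.enabled", "true")  # needed for dynamic allocation on YARN
         .getOrCreate())
{code}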



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-09-05 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-21926:
---

 Summary: Some transformers in spark.ml.feature fail when trying to 
transform streaming dataframes
 Key: SPARK-21926
 URL: https://issues.apache.org/jira/browse/SPARK-21926
 Project: Spark
  Issue Type: Bug
  Components: ML, Structured Streaming
Affects Versions: 2.2.0
Reporter: Bago Amirbekian


We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
a) Re-design VectorUDT metadata to support missing metadata for some elements 
(this might be a good thing to do anyway; see SPARK-19141).
b) Drop metadata in a streaming context.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.
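
For orientation, the general pattern these cases arise in is applying a configured spark.ml transformer to a streaming DataFrame for prediction, along these lines (a minimal PySpark sketch; the path, schema and column names are placeholders, not taken from this report):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Placeholder schema and file source; a real job might read from Kafka instead.
schema = StructType([
    StructField("a", DoubleType()),
    StructField("b", DoubleType()),
])
streaming_df = spark.readStream.schema(schema).json("/tmp/streaming-input")

# Transforming a streaming DataFrame for prediction; the failures described
# above show up when an input column is a metadata-less VectorUDT column
# (case 1) or lacks the metadata OneHotEncoder expects (case 2).
assembler = VectorAssembler(inputCols=["a", "b"], outputCol="features")
assembled = assembler.transform(streaming_df)

query = (assembled.writeStream
         .format("console")
         .outputMode("append")
         .start())
{code}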



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154336#comment-16154336
 ] 

Marcelo Vanzin commented on SPARK-18085:


Ok I think I was able to reproduce that locally.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-05 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154361#comment-16154361
 ] 

Bryan Cutler commented on SPARK-21190:
--

Thanks [~ueshin], I think having an optional {{kwargs}} at least would make 
things consistent across all types of {{pandas_udf}}. Here is how you could 
check that a function has a {{**}} arg that would accept keyword arguments:

{noformat}
In [1]: import inspect

In [2]: def f(a, b, **kwargs):
   ...: print(a, b, kwargs)

In [3]: def g(a, b):
   ...: print(a, b) 

In [4]: inspect.getargspec(f)
Out[4]: ArgSpec(args=['a', 'b'], varargs=None, keywords='kwargs', defaults=None)

In [5]: inspect.getargspec(g)
Out[5]: ArgSpec(args=['a', 'b'], varargs=None, keywords=None, defaults=None)

In [6]: if inspect.getargspec(f).keywords is not None:
   ...: f(1, 2, size=3)
   ...: 
(1, 2, {'size': 3})
{noformat}

If the user defines a 0-parameter UDF without a {{**kwargs}}, then it is 
probably best to raise an error if the returned size isn't right, although it 
would be possible to repeat and/or slice...
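
Following on from the {{inspect}} check above, the worker-side handling could look roughly like this ({{call_batch_udf}} and {{size}} are illustrative names, not an existing API):

{code}
import inspect

def call_batch_udf(func, size):
    """Call a 0-parameter pandas_udf function, forwarding the requested batch
    size only when the function declares a ** argument; otherwise validate
    the length of the result."""
    if inspect.getargspec(func).keywords is not None:
        # The function accepts keyword arguments, so pass the size along.
        return func(size=size)
    result = func()
    if len(result) != size:
        raise ValueError("expected %d rows but the UDF returned %d"
                         % (size, len(result)))
    return result
{code}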

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose that initially we require the 
> number of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input["c"] = input["a"] + input["b"]
>   input["d"] = input["a"] - input["b"]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input["a"] + input["b"]
>  
> my_func = udf(my_func_that_returns_one_column)

[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154363#comment-16154363
 ] 

Marcelo Vanzin commented on SPARK-18085:


That should be fixed if you sync to my repo's 'shs-ng/HEAD'.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Kris Mok (JIRA)
Kris Mok created SPARK-21927:


 Summary: Spark pom.xml's dependency management is broken
 Key: SPARK-21927
 URL: https://issues.apache.org/jira/browse/SPARK-21927
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
 Environment: Apache Spark current master (commit 
12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
Reporter: Kris Mok


When building the current Spark master just now (commit 
12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot of 
warning messages such as the following. It looks like the dependency management in 
the POMs has somehow broken recently.

{code:none}
.../workspace/apache-spark/master (master) $ build/sbt clean package
Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.16.jar
[info] Loading project definition from .../workspace/apache-spark/master/project
[info] Updating {file:.../workspace/apache-spark/master/project/}master-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading 
https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
 ...
[info] [SUCCESSFUL ] 
org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
[info] downloading 
https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
 ...
[info] [SUCCESSFUL ] 
org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
[info] Done updating.
[warn] Found version conflict(s) in library dependencies; some are suspected to 
be binary incompatible:
[warn] 
[warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
1.0-beta-6
[warn] +- org.apache.maven:maven-compat:3.0.4(depends 
on 2.2)
[warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
on 2.2)
[warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
(scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
[warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
on 2.2)
[warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
on 2.2)
[warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
on 2.2)
[warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
on 1.0-beta-6)
[warn] 
[warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
2.0.6, 2.1, 1.5.5}
[warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
on 3.0)
[warn] +- org.apache.maven:maven-compat:3.0.4(depends 
on 2.0.6)
[warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-core:3.0.4  (depends 
on 2.0.6)
[warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
on 2.0.6)
[warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
on 2.0.7)
[warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
on 2.0.7)
[warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
on 2.0.7)
[warn] +- org.apache.maven:maven-model:3.0.4 (depends 
on 2.0.7)
[warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
on 2.0.7)
[warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
on 2.0.7)
[warn] 
[warn] * cglib:cglib is evicted completely
[warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
on 2.2.2)
[warn] 
[warn] * asm:asm is evicted completely
[warn] +- cglib:cglib:2.2.2  (depends 
on 3.3.1)
[warn] 
[warn] Run 'evicted' to see detailed eviction warnings
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Kris Mok (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154426#comment-16154426
 ] 

Kris Mok commented on SPARK-21927:
--

[~sowen] Could you please take a look and see if a recent change to the POM (e.g. 
SPARK-14280) could have caused this issue? Thanks!

> Spark pom.xml's dependency management is broken
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. It looks like the dependency 
> management in the POMs has somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> on 2.2.2)
> [warn] 
> [warn] * asm:asm is evicted completely
> [warn] +- cglib:cglib:2.2.2  (depends 
> on 3.3.1)
> [warn] 
> [warn] Run 'evicted' to see detailed eviction warnings
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apac

[jira] [Commented] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154469#comment-16154469
 ] 

Sean Owen commented on SPARK-21927:
---

It's probably the scalastyle update today but why is this filed as a bug? 
Nothing appears to be an error here. 

> Spark pom.xml's dependency management is broken
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. It looks like the dependency 
> management in the POMs has somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> on 2.2.2)
> [warn] 
> [warn] * asm:asm is evicted completely
> [warn] +- cglib:cglib:2.2.2  (depends 
> on 3.3.1)
> [warn] 
> [warn] Run 'evicted' to see detailed eviction warnings
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21928) ML LogisticRegression training occasionally produces java.lang.ClassNotFoundException when attempting to load custom Kryo registrator class

2017-09-05 Thread John Brock (JIRA)
John Brock created SPARK-21928:
--

 Summary: ML LogisticRegression training occasionally produces 
java.lang.ClassNotFoundException when attempting to load custom Kryo 
registrator class
 Key: SPARK-21928
 URL: https://issues.apache.org/jira/browse/SPARK-21928
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: John Brock


I unfortunately can't reliably reproduce this bug; it happens only 
occasionally, when training a logistic regression model with very large 
datasets. The training will often proceed through several {{treeAggregate}} 
calls without any problems, and then suddenly workers will start running into 
this {{java.lang.ClassNotFoundException}}.

After doing some debugging, it seems that whenever this error happens, Spark is 
trying to use the {{sun.misc.Launcher$AppClassLoader}} {{ClassLoader}} instance 
instead of the usual {{org.apache.spark.util.MutableURLClassLoader}}. 
{{MutableURLClassLoader}} can see my custom Kryo registrator, but the 
{{AppClassLoader}} instance can't.

When this error does pop up, it's usually accompanied by the task seeming to 
hang, and I need to kill Spark manually.

I'm running a Spark application in cluster mode via spark-submit, and I have a 
custom Kryo registrator. The JAR is built with {{sbt assembly}}.
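
For context, a custom Kryo registrator is typically wired in through configuration along these lines (a minimal PySpark-style sketch; the registrator class name is a placeholder, and the registrator itself is a JVM class shipped in the assembly JAR):

{code}
from pyspark.sql import SparkSession

# Minimal sketch; com.example.MyKryoRegistrator is a placeholder, not the
# reporter's actual registrator class.
spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryo.registrator", "com.example.MyKryoRegistrator")
         .getOrCreate())
{code}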

Exception message:

{noformat}
17/08/29 22:39:04 ERROR TransportRequestHandler: Error opening block 
StreamChunkId{streamId=542074019336, chunkIndex=0} for request from 
/10.0.29.65:34332
org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:139)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:292)
at 
org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:277)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:186)
at 
org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:169)
at 
org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1382)
at 
org.apache.spark.storage.BlockManager$$anonfun$dropFromMemory$3.apply(BlockManager.scala:1377)
at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1377)
at 
org.apache.spark.storage.memory.MemoryStore.org$apache$spark$storage$memory$MemoryStore$$dropBlock$1(MemoryStore.scala:524)
at 
org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:545)
at 
org.apache.spark.storage.memory.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:539)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.storage.memory.MemoryStore.evictBlocksToFreeSpace(MemoryStore.scala:539)
at 
org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:92)
at 
org.apache.spark.memory.StorageMemoryPool.acquireMemory(StorageMemoryPool.scala:73)
at 
org.apache.spark.memory.StaticMemoryManager.acquireStorageMemory(StaticMemoryManager.scala:72)
at 
org.apache.spark.storage.memory.MemoryStore.putBytes(MemoryStore.scala:147)
at 
org.apache.spark.storage.BlockManager.maybeCacheDiskBytesInMemory(BlockManager.scala:1143)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:594)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:559)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:559)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:353)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:89)
at 
org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:125)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(

[jira] [Updated] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21927:

Issue Type: Improvement  (was: Bug)

> Spark pom.xml's dependency management is broken
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. It looks like the dependency 
> management in the POMs has somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> on 2.2.2)
> [warn] 
> [warn] * asm:asm is evicted completely
> [warn] +- cglib:cglib:2.2.2  (depends 
> on 3.3.1)
> [warn] 
> [warn] Run 'evicted' to see detailed eviction warnings
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21927:

Affects Version/s: (was: 2.2.1)
   2.3.0

> Spark pom.xml's dependency management is broken
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. It looks like the dependency 
> management in the POMs has somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> on 2.2.2)
> [warn] 
> [warn] * asm:asm is evicted completely
> [warn] +- cglib:cglib:2.2.2  (depends 
> on 3.3.1)
> [warn] 
> [warn] Run 'evicted' to see detailed eviction warnings
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21927) Spark pom.xml's dependency management is broken

2017-09-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154575#comment-16154575
 ] 

Yin Huai commented on SPARK-21927:
--

My worry is that it may mask actual issues related to dependencies. For 
example, the dependency resolvers may pick a version that is not specified in 
our pom.

> Spark pom.xml's dependency management is broken
> ---
>
> Key: SPARK-21927
> URL: https://issues.apache.org/jira/browse/SPARK-21927
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
> Environment: Apache Spark current master (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d)
>Reporter: Kris Mok
>
> When building the current Spark master just now (commit 
> 12ab7f7e89ec9e102859ab3b710815d3058a2e8d), I noticed the build prints a lot 
> of warning messages such as the following. It looks like the dependency 
> management in the POMs has somehow broken recently.
> {code:none}
> .../workspace/apache-spark/master (master) $ build/sbt clean package
> Attempting to fetch sbt
> Launching sbt from build/sbt-launch-0.13.16.jar
> [info] Loading project definition from 
> .../workspace/apache-spark/master/project
> [info] Updating 
> {file:.../workspace/apache-spark/master/project/}master-build...
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle-sbt-plugin_2.10_0.13/1.0.0/scalastyle-sbt-plugin-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle-sbt-plugin;1.0.0!scalastyle-sbt-plugin.jar (239ms)
> [info] downloading 
> https://repo1.maven.org/maven2/org/scalastyle/scalastyle_2.10/1.0.0/scalastyle_2.10-1.0.0.jar
>  ...
> [info] [SUCCESSFUL ] 
> org.scalastyle#scalastyle_2.10;1.0.0!scalastyle_2.10.jar (465ms)
> [info] Done updating.
> [warn] Found version conflict(s) in library dependencies; some are suspected 
> to be binary incompatible:
> [warn] 
> [warn] * org.apache.maven.wagon:wagon-provider-api:2.2 is selected over 
> 1.0-beta-6
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-file:2.2  (depends 
> on 2.2)
> [warn] +- org.spark-project:sbt-pom-reader:1.0.0-spark 
> (scalaVersion=2.10, sbtVersion=0.13) (depends on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-shared4:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http:2.2  (depends 
> on 2.2)
> [warn] +- org.apache.maven.wagon:wagon-http-lightweight:2.2  (depends 
> on 2.2)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 1.0-beta-6)
> [warn] 
> [warn] * org.codehaus.plexus:plexus-utils:3.0 is selected over {2.0.7, 
> 2.0.6, 2.1, 1.5.5}
> [warn] +- org.apache.maven.wagon:wagon-provider-api:2.2  (depends 
> on 3.0)
> [warn] +- org.apache.maven:maven-compat:3.0.4(depends 
> on 2.0.6)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.3.0 (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-artifact:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-core:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.sonatype.plexus:plexus-sec-dispatcher:1.3  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-embedder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-settings-builder:3.0.4  (depends 
> on 2.0.6)
> [warn] +- org.apache.maven:maven-model-builder:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.sonatype.aether:aether-connector-wagon:1.13.1  (depends 
> on 2.0.7)
> [warn] +- org.sonatype.sisu:sisu-inject-plexus:2.2.3 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-model:3.0.4 (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-aether-provider:3.0.4   (depends 
> on 2.0.7)
> [warn] +- org.apache.maven:maven-repository-metadata:3.0.4   (depends 
> on 2.0.7)
> [warn] 
> [warn] * cglib:cglib is evicted completely
> [warn] +- org.sonatype.sisu:sisu-guice:3.0.3 (depends 
> on 2.2.2)
> [warn] 
> [warn] * asm:asm is evicted completely
> [warn] +- cglib:cglib:2.2.2  (depends 
> on 3.3.1)
> [warn] 
> [warn] Run 'evicted' to see detailed eviction warnings
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional

[jira] [Commented] (SPARK-21384) Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails

2017-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154627#comment-16154627
 ] 

Apache Spark commented on SPARK-21384:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/19141

> Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails
> ---
>
> Key: SPARK-21384
> URL: https://issues.apache.org/jira/browse/SPARK-21384
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: holdenk
>
> In making the updated version of Spark 2.2 + YARN it seems that the auto 
> packaging of JARS based on SPARK_HOME isn't quite working (which results in a 
> warning anyways). You can see the build failure in travis at 
> https://travis-ci.org/holdenk/spark-testing-base/builds/252656109 (I've 
> reproed it locally).
> This results in an exception like:
> {code}
> 17/07/12 03:14:11 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/07/12 03:14:11 WARN NMAuditLogger: USER=travis OPERATION=Container 
> Finished - Failed   TARGET=ContainerImplRESULT=FAILURE  
> DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
> APPID=application_1499829231193_0001
> CONTAINERID=container_1499829231193_0001_01_01
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11]
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11_tmp]
> 17/07/12 03:14:13 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$00

[jira] [Assigned] (SPARK-21384) Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21384:


Assignee: Apache Spark

> Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails
> ---
>
> Key: SPARK-21384
> URL: https://issues.apache.org/jira/browse/SPARK-21384
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: holdenk
>Assignee: Apache Spark
>
> In making the updated version of Spark 2.2 + YARN it seems that the auto 
> packaging of JARS based on SPARK_HOME isn't quite working (which results in a 
> warning anyways). You can see the build failure in travis at 
> https://travis-ci.org/holdenk/spark-testing-base/builds/252656109 (I've 
> reproed it locally).
> This results in an exception like:
> {code}
> 17/07/12 03:14:11 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/07/12 03:14:11 WARN NMAuditLogger: USER=travis OPERATION=Container 
> Finished - Failed   TARGET=ContainerImplRESULT=FAILURE  
> DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
> APPID=application_1499829231193_0001
> CONTAINERID=container_1499829231193_0001_01_01
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11]
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11_tmp]
> 17/07/12 03:14:13 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:36

[jira] [Assigned] (SPARK-21384) Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails

2017-09-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21384:


Assignee: (was: Apache Spark)

> Spark 2.2 + YARN without spark.yarn.jars / spark.yarn.archive fails
> ---
>
> Key: SPARK-21384
> URL: https://issues.apache.org/jira/browse/SPARK-21384
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: holdenk
>
> While updating to Spark 2.2 + YARN, it seems that the automatic packaging of 
> JARs based on SPARK_HOME isn't quite working (which results in a warning 
> anyway). You can see the build failure on Travis at 
> https://travis-ci.org/holdenk/spark-testing-base/builds/252656109 (I've 
> reproduced it locally).
> This results in an exception like:
> {code}
> 17/07/12 03:14:11 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 17/07/12 03:14:11 WARN NMAuditLogger: USER=travis OPERATION=Container 
> Finished - Failed   TARGET=ContainerImplRESULT=FAILURE  
> DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
> APPID=application_1499829231193_0001
> CONTAINERID=container_1499829231193_0001_01_01
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11]
> 17/07/12 03:14:11 WARN DefaultContainerExecutor: delete returned false for 
> path: 
> [/home/travis/build/holdenk/spark-testing-base/target/com.holdenkarau.spark.testing.YARNCluster/com.holdenkarau.spark.testing.YARNCluster-localDir-nm-0_0/usercache/travis/filecache/11_tmp]
> 17/07/12 03:14:13 WARN ResourceLocalizationService: { 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip,
>  1499829249000, ARCHIVE, null } failed: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-0dc9dd59-dd7f-48fc-be2c-11a1bbd57d70/__spark_libs__8035392745283841054.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.

[jira] [Created] (SPARK-21929) Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source

2017-09-05 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21929:
-

 Summary: Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC 
data source
 Key: SPARK-21929
 URL: https://issues.apache.org/jira/browse/SPARK-21929
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Dongjoon Hyun


SPARK-19261 implemented `ADD COLUMNS` in Spark 2.2, but the ORC data source is not 
supported due to its limitations.
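
For reference, a minimal Scala sketch of the statement this ticket asks to support. 
The table and column names are made up for the example, and in 2.2 the ORC format 
may need a Hive-enabled session:

{code}
// Hypothetical example: ADD COLUMNS has worked for Parquet-backed tables since
// SPARK-19261, but is currently not supported for the ORC data source.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-add-columns")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE t_orc (id INT) USING ORC")
// Expected to succeed once this ticket is resolved.
spark.sql("ALTER TABLE t_orc ADD COLUMNS (value STRING)")
{code}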



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21930) When the number of attempts to restart the receiver is greater than 0, Spark does nothing in the 'else' branch

2017-09-05 Thread liuxianjiao (JIRA)
liuxianjiao created SPARK-21930:
---

 Summary: When the number of attempts to restart the receiver is greater 
than 0, Spark does nothing in the 'else' branch
 Key: SPARK-21930
 URL: https://issues.apache.org/jira/browse/SPARK-21930
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: liuxianjiao
Priority: Trivial


When the number of attempts to restart the receiver is greater than 0, Spark does 
nothing in the 'else' branch. So I think we should log a trace message to let users 
know why.
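
The ticket is terse, so the following self-contained Scala sketch is only a generic 
illustration of the idea of logging in an otherwise-empty else branch; it is not the 
actual Spark source, and the names and logging calls are made up:

{code}
// Illustrative only: when no further restart will be attempted, emit a log line
// in the 'else' branch instead of doing nothing silently.
object ReceiverRestartSketch {
  def maybeRestart(receiverId: Int, attempts: Int, maxAttempts: Int): Unit = {
    if (attempts < maxAttempts) {
      println(s"INFO  Restarting receiver $receiverId (attempt ${attempts + 1} of $maxAttempts)")
      // ... schedule the restart here ...
    } else {
      // Previously silent; a trace/warn message makes the behaviour visible to users.
      println(s"WARN  Not restarting receiver $receiverId: $attempts attempts already made")
    }
  }

  def main(args: Array[String]): Unit = {
    maybeRestart(receiverId = 0, attempts = 3, maxAttempts = 3)
  }
}
{code}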



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21912) ORC/Parquet table should create invalid column names

2017-09-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21912:
--
Summary: ORC/Parquet table should create invalid column names  (was: 
Creating ORC datasource table should check invalid column names)

> ORC/Parquet table should create invalid column names
> 
>
> Key: SPARK-21912
> URL: https://issues.apache.org/jira/browse/SPARK-21912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> Currently, users encounter job abortions while creating ORC data source tables 
> with invalid column names. We had better prevent this by raising an 
> AnalysisException, as Parquet data source tables do.
> {code}
> scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
> 17/09/04 13:28:21 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: Error: : expected at the position 8 of 
> 'struct' but ' ' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
> ...
> 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete 
> file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0
> 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
> 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.SparkException: Task failed while writing rows.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21912) ORC/Parquet table should not create invalid column names

2017-09-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21912:
--
Summary: ORC/Parquet table should not create invalid column names  (was: 
ORC/Parquet table should create invalid column names)

> ORC/Parquet table should not create invalid column names
> 
>
> Key: SPARK-21912
> URL: https://issues.apache.org/jira/browse/SPARK-21912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> Currently, users encounter job abortions while creating ORC data source tables 
> with invalid column names. We had better prevent this by raising an 
> AnalysisException, as Parquet data source tables do.
> {code}
> scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
> 17/09/04 13:28:21 ERROR Utils: Aborting task
> java.lang.IllegalArgumentException: Error: : expected at the position 8 of 
> 'struct' but ' ' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
> ...
> 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete 
> file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0
> 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
> 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.SparkException: Task failed while writing rows.
> {code}
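
Below is a hedged Scala sketch of the kind of up-front check the ticket proposes. 
It is not the actual Spark patch; the character list and messages are illustrative, 
and inside Spark the real fix would raise AnalysisException, matching the Parquet path:

{code}
// Illustrative pre-write validation: reject column names ORC cannot serialize
// before the write job starts, instead of failing tasks mid-write.
object OrcColumnNameCheck {
  // Characters that break ORC's struct type-string parsing (illustrative list).
  private val invalidChars: Set[Char] = Set(' ', ',', ';', '{', '}', '(', ')', '\n', '\t', '=')

  def checkFieldNames(names: Seq[String]): Unit = {
    names.foreach { name =>
      if (name.exists(invalidChars.contains)) {
        // In Spark itself this would be an AnalysisException, as for Parquet.
        throw new IllegalArgumentException(
          s"Column name '$name' contains invalid character(s) for the ORC data source. " +
            "Please use an alias to rename it.")
      }
    }
  }
}
{code}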



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


