[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-15 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753728#comment-15753728
 ] 

Navya Krishnappa commented on SPARK-18877:
--

Precision and scale vary depending on the decimal values in the column. Suppose 
the source file contains 

Amount (column name)
9.03E+12
1.19E+11
24335739714
1.71E+11

then Spark infers the Amount column as decimal(3,-9) and throws the exception 
mentioned below:

Caused by: java.lang.IllegalArgumentException: requirement failed: Decimal 
precision 4 exceeds max precision 3
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:112)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:425)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:264)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
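
For reference, a minimal sketch that reproduces the inference described above, 
assuming the sample values are saved with their header as amounts.csv (a 
hypothetical file name):

{code}
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("amounts.csv")

df.printSchema()  // Amount is reported as decimal(3,-9), as described above
df.show()         // the subsequent cast fails with the requirement error
{code}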



> Unable to read given csv data. Exception: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading the below-mentioned CSV data, even though the maximum decimal 
> precision is 38, the following exception is thrown:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence

2016-12-15 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-18845:
---
Assignee: Andrew Ray

> PageRank has incorrect initialization value that leads to slow convergence
> --
>
> Key: SPARK-18845
> URL: https://issues.apache.org/jira/browse/SPARK-18845
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2
>Reporter: Andrew Ray
>Assignee: Andrew Ray
> Fix For: 2.2.0
>
>
> All variants of PageRank in GraphX have an incorrect initialization value that 
> leads to slow convergence. In the current implementations, ranks are seeded 
> with the reset probability when they should be 1. This appears to have been 
> introduced a long time ago in 
> https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
> This also hides the fact that source vertices (vertices with no incoming 
> edges) are not updated, because source vertices generally* have a PageRank 
> equal to the reset probability. Therefore both need to be fixed at once.
> PR will be added shortly
> *when there are no sinks -- but that's a separate bug
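
As a usage-level illustration of the affected API (hypothetical edge-list path, 
with `sc` being the SparkContext): before the fix, the convergence-based variant 
below starts from resetProb instead of 1.0, so reaching a given tolerance takes 
more iterations.

{code}
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // assumed input
val ranks = graph.pageRank(tol = 1e-4, resetProb = 0.15).vertices
{code}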



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence

2016-12-15 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-18845.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16271
[https://github.com/apache/spark/pull/16271]

> PageRank has incorrect initialization value that leads to slow convergence
> --
>
> Key: SPARK-18845
> URL: https://issues.apache.org/jira/browse/SPARK-18845
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2
>Reporter: Andrew Ray
> Fix For: 2.2.0
>
>
> All variants of PageRank in GraphX have an incorrect initialization value that 
> leads to slow convergence. In the current implementations, ranks are seeded 
> with the reset probability when they should be 1. This appears to have been 
> introduced a long time ago in 
> https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
> This also hides the fact that source vertices (vertices with no incoming 
> edges) are not updated, because source vertices generally* have a PageRank 
> equal to the reset probability. Therefore both need to be fixed at once.
> PR will be added shortly
> *when there are no sinks -- but that's a separate bug



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20

2016-12-15 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753716#comment-15753716
 ] 

Navya Krishnappa commented on SPARK-18877:
--

I'm reading the file through the CSV reader (.csv(sourceFile)) and I'm not 
setting any precision or scale; Spark automatically detects the precision and 
scale for the values in the source file, and the detected precision and scale 
vary depending on the decimal values in the column. 

Stack trace:

Caused by: java.lang.IllegalArgumentException: requirement failed: Decimal 
precision 28 exceeds max precision 20
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:112)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:425)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:264)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 common frames omitted
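
A minimal sketch of a workaround, assuming the sample column from the issue 
description is saved with its header as sourceFile.csv (hypothetical name): 
supplying an explicit schema with a wide-enough DecimalType bypasses the 
inference.

{code}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val schema = StructType(Seq(StructField("Decimal", DecimalType(38, 0))))

val df = spark.read
  .option("header", "true")
  .schema(schema)         // no inference, so no narrow precision/scale
  .csv("sourceFile.csv")  // hypothetical path
{code}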

> Unable to read given csv data. Exception: java.lang.IllegalArgumentException: 
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When reading the below-mentioned CSV data, even though the maximum decimal 
> precision is 38, the following exception is thrown:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows

2016-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753671#comment-15753671
 ] 

Apache Spark commented on SPARK-18895:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16305
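
The "Wrong FS: file://C:\..." failures quoted below come from building file URIs 
by string concatenation with Windows paths; a minimal sketch of the distinction 
(hypothetical temp-file name):

{code}
import java.io.File
import org.apache.hadoop.fs.Path

val tmpFile = new File(sys.props("java.io.tmpdir"), "InputOutputMetricsSuite.txt")

// "file://" + an absolute Windows path yields file://C:\..., which Hadoop's local
// filesystem rejects with "Wrong FS ... expected: file:///", as in the logs below.
val badUri = "file://" + tmpFile.getAbsolutePath

// Building the URI from the File (e.g. file:/C:/...) gives a well-formed file URI
// that Path/checkPath accept on Windows and elsewhere.
val goodUri = tmpFile.toURI.toString
val path = new Path(goodUri)
{code}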

> Fix resource-closing-related and path-related test failures in identified 
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> There are several tests failing due to resource-closing-related and 
> path-related problems on Windows, as shown below.
> - {{RPackageUtilsSuite}}:
> {code}
> - build an R package from a jar end to end *** FAILED *** (1 second, 625 
> milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - faulty R package shows documentation *** FAILED *** (359 milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - SparkR zipping works properly *** FAILED *** (47 milliseconds)
>   java.util.regex.PatternSyntaxException: Unknown character property name {r} 
> near index 4
> C:\projects\spark\target\tmp\1481729429282-0
> ^
>   at java.util.regex.Pattern.error(Pattern.java:1955)
>   at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
> {code}
> - {{InputOutputMetricsSuite}}:
> {code}
> - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics for new Hadoop API with coalesce *** FAILED *** (0 
> milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
> - input metrics when reading text file *** FAILED *** (110 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - simple *** FAILED *** (125 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - more stages *** FAILED *** (110 
> milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 

[jira] [Assigned] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18895:


Assignee: (was: Apache Spark)

> Fix resource-closing-related and path-related test failures in identified 
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> There are several tests failing due to resource-closing-related and 
> path-related problems on Windows, as shown below.
> - {{RPackageUtilsSuite}}:
> {code}
> - build an R package from a jar end to end *** FAILED *** (1 second, 625 
> milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - faulty R package shows documentation *** FAILED *** (359 milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - SparkR zipping works properly *** FAILED *** (47 milliseconds)
>   java.util.regex.PatternSyntaxException: Unknown character property name {r} 
> near index 4
> C:\projects\spark\target\tmp\1481729429282-0
> ^
>   at java.util.regex.Pattern.error(Pattern.java:1955)
>   at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
> {code}
> - {{InputOutputMetricsSuite}}:
> {code}
> - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics for new Hadoop API with coalesce *** FAILED *** (0 
> milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
> - input metrics when reading text file *** FAILED *** (110 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - simple *** FAILED *** (125 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - more stages *** FAILED *** (110 
> milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> 

[jira] [Assigned] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18895:


Assignee: Apache Spark

> Fix resource-closing-related and path-related test failures in identified 
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> There are several tests failing due to resource-closing-related and 
> path-related problems on Windows, as shown below.
> - {{RPackageUtilsSuite}}:
> {code}
> - build an R package from a jar end to end *** FAILED *** (1 second, 625 
> milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - faulty R package shows documentation *** FAILED *** (359 milliseconds)
>   java.io.IOException: Unable to delete file: 
> C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
> - SparkR zipping works properly *** FAILED *** (47 milliseconds)
>   java.util.regex.PatternSyntaxException: Unknown character property name {r} 
> near index 4
> C:\projects\spark\target\tmp\1481729429282-0
> ^
>   at java.util.regex.Pattern.error(Pattern.java:1955)
>   at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
> {code}
> - {{InputOutputMetricsSuite}}:
> {code}
> - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics for new Hadoop API with coalesce *** FAILED *** (0 
> milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)
> - input metrics when reading text file *** FAILED *** (110 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - simple *** FAILED *** (125 milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records read - more stages *** FAILED *** (110 
> milliseconds)
>   java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
> - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: Wrong FS: 
> file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
>   at 
> 

[jira] [Created] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows

2016-12-15 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18895:


 Summary: Fix resource-closing-related and path-related test 
failures in identified ones on Windows
 Key: SPARK-18895
 URL: https://issues.apache.org/jira/browse/SPARK-18895
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Reporter: Hyukjin Kwon
Priority: Minor


There are several tests failing due to resource-closing-related and 
path-related problems on Windows, as shown below.

- {{RPackageUtilsSuite}}:

{code}
- build an R package from a jar end to end *** FAILED *** (1 second, 625 
milliseconds)
  java.io.IOException: Unable to delete file: 
C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar
  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
  at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
  at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)

- faulty R package shows documentation *** FAILED *** (359 milliseconds)
  java.io.IOException: Unable to delete file: 
C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar
  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
  at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
  at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)

- SparkR zipping works properly *** FAILED *** (47 milliseconds)
  java.util.regex.PatternSyntaxException: Unknown character property name {r} 
near index 4

C:\projects\spark\target\tmp\1481729429282-0

^
  at java.util.regex.Pattern.error(Pattern.java:1955)
  at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781)
{code}


- {{InputOutputMetricsSuite}}:

{code}
- input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics with cache and coalesce *** FAILED *** (109 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics for new Hadoop API with coalesce *** FAILED *** (0 milliseconds)
  java.lang.IllegalArgumentException: Wrong FS: 
file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt,
 expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
  at 
org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)

- input metrics when reading text file *** FAILED *** (110 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics on records read - simple *** FAILED *** (125 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics on records read - more stages *** FAILED *** (110 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)

- input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds)
  java.lang.IllegalArgumentException: Wrong FS: 
file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt,
 expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
  at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462)
  at 
org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114)

- input metrics on records read with cache *** FAILED *** (93 milliseconds)
  java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
  at 

[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753588#comment-15753588
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

Just to check - Is your Spark installation built with Hive support (i.e. with 
-Phive) ?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18892) Alias percentile_approx approx_percentile

2016-12-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18892.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> Alias percentile_approx approx_percentile
> -
>
> Key: SPARK-18892
> URL: https://issues.apache.org/jira/browse/SPARK-18892
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.1, 2.2.0
>
>
> percentile_approx is the name used in Hive, and approx_percentile is the name 
> used in Presto. approx_percentile is actually more consistent with our 
> approx_count_distinct. Given that the cost of aliasing SQL functions is low 
> (a one-liner), it'd be better to just alias them so they are easier to use.
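
For illustration, once the alias is in, both names resolve to the same aggregate 
(hypothetical table and column):

{code}
spark.sql("SELECT approx_percentile(latency_ms, 0.95) FROM requests")
spark.sql("SELECT percentile_approx(latency_ms, 0.95) FROM requests")
{code}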



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2016-12-15 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753535#comment-15753535
 ] 

Sital Kedia commented on SPARK-18838:
-

cc - [~kayousterhout]

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job. For example, a significant delay in 
> receiving `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to 
> mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can cause 
> events to be dropped from the event queue, which might be fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of events per 
> listener. The queue used for the executor service will be bounded so that 
> memory does not grow indefinitely. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint.
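
A rough sketch of the proposed per-listener setup described above (illustrative 
only, not Spark's actual code): one single-threaded executor with a bounded 
queue per listener, so a slow listener only delays, or drops, its own events.

{code}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

class ListenerExecutor(capacity: Int) {
  // Bounded queue so memory cannot grow indefinitely; when full, events are dropped.
  private val queue = new ArrayBlockingQueue[Runnable](capacity)

  // Core = max = 1 thread guarantees in-order processing per listener.
  private val executor = new ThreadPoolExecutor(
    1, 1, 0L, TimeUnit.MILLISECONDS, queue, new ThreadPoolExecutor.DiscardPolicy())

  def post(event: => Unit): Unit =
    executor.execute(new Runnable { def run(): Unit = event })
}
{code}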



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18838) High latency of event processing for large jobs

2016-12-15 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-18838:

Description: 
Currently we are observing very high event processing delay in the driver's 
`ListenerBus` for large jobs with many tasks. Many critical components of the 
scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
on `ListenerBus` events, and this delay might hurt job performance significantly 
or even fail the job. For example, a significant delay in receiving 
`SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly 
remove an executor which is not idle.

The problem is that the event processor in `ListenerBus` is a single thread 
which loops through all the listeners for each event and processes each event 
synchronously 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
This single-threaded processor often becomes the bottleneck for large jobs. 
Also, if one of the listeners is very slow, all the listeners pay the price of 
the delay incurred by the slow listener. In addition, a slow listener can cause 
events to be dropped from the event queue, which might be fatal to the job.

To solve the above problems, we propose to get rid of the event queue and the 
single-threaded event processor. Instead, each listener will have its own 
dedicated single-threaded executor service. Whenever an event is posted, it will 
be submitted to the executor service of every listener. The single-threaded 
executor service guarantees in-order processing of events per listener. The 
queue used for the executor service will be bounded so that memory does not grow 
indefinitely. The downside of this approach is that a separate event queue per 
listener will increase the driver memory footprint. 




  was:
Currently we are observing the issue of very high event processing delay in 
driver's `ListenerBus` for large jobs with many tasks. Many critical component 
of the scheduler like `ExecutorAllocationManager`, `HeartbeatReceiver` depend 
on the `ListenerBus` events and these delay is causing job failure. For 
example, a significant delay in receiving the `SparkListenerTaskStart` might 
cause `ExecutorAllocationManager` manager to remove an executor which is not 
idle.  The event processor in `ListenerBus` is a single thread which loops 
through all the Listeners for each event and processes each event synchronously 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
 
The single threaded processor often becomes the bottleneck for large jobs.  In 
addition to that, if one of the Listener is very slow, all the listeners will 
pay the price of delay incurred by the slow listener. 

To solve the above problems, we plan to have a per listener single threaded 
executor service and separate event queue. That way we are not bottlenecked by 
the single threaded event processor and also critical listeners will not be 
penalized by the slow listeners. The downside of this approach is separate 
event queue per listener will increase the driver memory footprint. 





> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job. For example, a significant delay in 
> receiving `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to 
> mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can cause 
> events to be dropped from the event queue, which might be fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> 

[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753462#comment-15753462
 ] 

Felix Cheung commented on SPARK-18817:
--

I ran more of this but wasn't seeing derby.log or metastore_db

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753414#comment-15753414
 ] 

Felix Cheung commented on SPARK-18817:
--

It looks like javax.jdo.option.ConnectionURL can also be set in hive-site.xml?

In that case we should only change javax.jdo.option.ConnectionURL and 
spark.sql.default.warehouse.dir when they are not set in the conf or in 
hive-site.xml, and we need to handle both for a complete fix.



> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753304#comment-15753304
 ] 

lichenglin commented on SPARK-18893:


Spark 2.0 has disabled "alter table".

[https://issues.apache.org/jira/browse/SPARK-14118]

[https://issues.apache.org/jira/browse/SPARK-14130]

I think this is a very important feature for a data warehouse.

Could Spark support it first?

> Not support "alter table .. add columns .." 
> 
>
> Key: SPARK-18893
> URL: https://issues.apache.org/jira/browse/SPARK-18893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: zuotingbing
>
> When we upgraded Spark from version 1.5.2 to 2.0.1, all of our cases that 
> change a table using "alter table add columns" failed, but the official 
> documentation (http://spark.apache.org/docs/latest/sql-programming-guide.html) 
> says "All Hive DDL Functions, including: alter table" are supported.
> Is there any plan to support the SQL "alter table .. add/replace columns"?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18894:


Assignee: Apache Spark  (was: Tathagata Das)

> Event time watermark delay threshold specified in months or years gives 
> incorrect results
> -
>
> Key: SPARK-18894
> URL: https://issues.apache.org/jira/browse/SPARK-18894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> Internally we use CalendarInterval to parse the delay. Non-deterministic 
> intervals like "month" and "year" are handled in such a way that the generated 
> delay in milliseconds is 0 when delayThreshold is in months or years.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753298#comment-15753298
 ] 

Apache Spark commented on SPARK-18894:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16304

> Event time watermark delay threshold specified in months or years gives 
> incorrect results
> -
>
> Key: SPARK-18894
> URL: https://issues.apache.org/jira/browse/SPARK-18894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Internally we use CalendarInterval to parse the delay. Non-deterministic 
> intervals like "month" and "year" are handled in such a way that the generated 
> delay in milliseconds is 0 when delayThreshold is in months or years.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18894:


Assignee: Tathagata Das  (was: Apache Spark)

> Event time watermark delay threshold specified in months or years gives 
> incorrect results
> -
>
> Key: SPARK-18894
> URL: https://issues.apache.org/jira/browse/SPARK-18894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Internally we use CalendarInterval to parse the delay. Non-deterministic 
> intervals like "month" and "year" are handled in such a way that the generated 
> delay in milliseconds is 0 when delayThreshold is in months or years.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18894:
--
Priority: Critical  (was: Major)

> Event time watermark delay threshold specified in months or years gives 
> incorrect results
> -
>
> Key: SPARK-18894
> URL: https://issues.apache.org/jira/browse/SPARK-18894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Internally we use CalendarInterval to parse the delay. Non-deterministic 
> intervals like "month" and "year" are handled in such a way that the generated 
> delay in milliseconds is 0 when delayThreshold is in months or years.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18894:
--
Affects Version/s: 2.1.0
 Target Version/s: 2.1.0

> Event time watermark delay threshold specified in months or years gives 
> incorrect results
> -
>
> Key: SPARK-18894
> URL: https://issues.apache.org/jira/browse/SPARK-18894
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>
> Internally we use CalendarInterval to parse the delay. Non-deterministic 
> intervals like "month" and "year" are handled in such a way that the generated 
> delay in milliseconds is 0 when delayThreshold is in months or years.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results

2016-12-15 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18894:
-

 Summary: Event time watermark delay threshold specified in months 
or years gives incorrect results
 Key: SPARK-18894
 URL: https://issues.apache.org/jira/browse/SPARK-18894
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Internally we use CalendarInterval to parse the delay. Non-deterministic 
intervals like "month" and "year" are handled in such a way that the generated 
delay in milliseconds is 0 when delayThreshold is in months or years.
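
For illustration, a sketch of the kind of conversion a fix needs, assuming the 
Spark 2.x CalendarInterval layout (separate months and microseconds fields); 
treating a month as 31 days is an arbitrary choice here, not necessarily what 
the eventual PR does:

{code}
import java.util.concurrent.TimeUnit
import org.apache.spark.unsafe.types.CalendarInterval

def delayMs(interval: CalendarInterval): Long = {
  // interval.microseconds alone is 0 for "1 month" / "1 year", which is the bug;
  // the month component has to be folded in explicitly.
  val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
  TimeUnit.MICROSECONDS.toMillis(interval.microseconds) + interval.months * millisPerMonth
}
{code}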




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18272) Test topic addition for subscribePattern on Kafka DStream and Structured Stream

2016-12-15 Thread Bravo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753289#comment-15753289
 ] 

Bravo Zhang commented on SPARK-18272:
-

Does "subscribing topic by pattern with topic deletions" in KafkaSourceSuite 
already cover this case? It also has topic creation.

> Test topic addition for subscribePattern on Kafka DStream and Structured 
> Stream
> ---
>
> Key: SPARK-18272
> URL: https://issues.apache.org/jira/browse/SPARK-18272
> Project: Spark
>  Issue Type: Test
>  Components: DStreams, Structured Streaming
>Reporter: Cody Koeninger
>
> We've had reports of the following sequence:
> - create a subscribePattern stream that doesn't match any existing topics at 
> the time the stream starts
> - add a topic that matches the pattern
> - expect that messages from that topic show up, but they don't
> We don't seem to actually have tests that cover this case, so we should add 
> them.
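
For context, a Structured Streaming source of the kind described (hypothetical 
broker address and pattern); the expectation is that topics created later that 
match the pattern start contributing messages:

{code}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribePattern", "topic-.*")
  .option("startingOffsets", "earliest")
  .load()
{code}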



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18893) Not support "alter table .. add columns .."

2016-12-15 Thread zuotingbing (JIRA)
zuotingbing created SPARK-18893:
---

 Summary: Not support "alter table .. add columns .." 
 Key: SPARK-18893
 URL: https://issues.apache.org/jira/browse/SPARK-18893
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: zuotingbing


When we upgraded Spark from version 1.5.2 to 2.0.1, all of our cases that change 
a table using "alter table add columns" failed, but the official documentation 
(http://spark.apache.org/docs/latest/sql-programming-guide.html) says "All Hive 
DDL Functions, including: alter table" are supported.

Is there any plan to support the SQL "alter table .. add/replace columns"?
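
For reference, the kind of statement involved (illustrative table and column 
names), which Spark 2.0.x rejects as an unsupported operation:

{code}
spark.sql("ALTER TABLE sales ADD COLUMNS (region STRING COMMENT 'sales region')")
spark.sql("ALTER TABLE sales REPLACE COLUMNS (id BIGINT, region STRING)")
{code}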




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753227#comment-15753227
 ] 

Wenchen Fan commented on SPARK-18817:
-

The warehouse path will be created whether Hive support is enabled or not, but 
derby.log and metastore_db will only be created with Hive support. The simplest 
solution would be to disable Hive support by default. We can also change the 
location of metastore_db via "javax.jdo.option.ConnectionURL". I'm not sure how 
to do that on the R side; it may be tricky.
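
A minimal sketch of that configuration approach on the JVM side (Scala shown for 
concreteness; a SparkR session would presumably pass the same keys through its 
session config), pointing both locations at a temporary directory:

{code}
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical scratch location standing in for R's tempdir().
val tmp = Files.createTempDirectory("sparkr-scratch").toString

val spark = SparkSession.builder()
  .master("local[*]")
  // spark.sql.warehouse.dir is the Spark 2.x key for the warehouse location.
  .config("spark.sql.warehouse.dir", s"$tmp/spark-warehouse")
  // Point the embedded Derby metastore at the temp dir instead of the CWD.
  .config("javax.jdo.option.ConnectionURL",
    s"jdbc:derby:;databaseName=$tmp/metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()
{code}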

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753146#comment-15753146
 ] 

lichenglin edited comment on SPARK-14130 at 12/16/16 2:00 AM:
--

"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??

Even though it only works on hivecontext with specially fileformat




was (Author: licl):
"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For the alter column command, we have the following tokens:
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, even if Hive allows the alter column operation, we should 
> still throw an exception. For example, if renaming a column of a Hive Parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14130:
---
Comment: was deleted

(was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??

)

> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, even if Hive allows the alter column operation, we should 
> still throw an exception. For example, if renaming a column of a Hive Parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14130:
---
Comment: was deleted

(was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??

)

> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, even if Hive allows the alter column operation, we should 
> still throw an exception. For example, if renaming a column of a Hive Parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753147#comment-15753147
 ] 

lichenglin commented on SPARK-14130:


"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, even if Hive allows the alter column operation, we should 
> still throw an exception. For example, if renaming a column of a Hive Parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753148#comment-15753148
 ] 

lichenglin commented on SPARK-14130:


"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, even if Hive allows the alter column operation, we should 
> still throw an exception. For example, if renaming a column of a Hive Parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14130) [Table related commands] Alter column

2016-12-15 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753146#comment-15753146
 ] 

lichenglin commented on SPARK-14130:


"TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse.

Does spark have any plan to support  for it??



> [Table related commands] Alter column
> -
>
> Key: SPARK-14130
> URL: https://issues.apache.org/jira/browse/SPARK-14130
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> For alter column command, we have the following tokens.
> TOK_ALTERTABLE_RENAMECOL
> TOK_ALTERTABLE_ADDCOLS
> TOK_ALTERTABLE_REPLACECOLS
> For data source tables, we should throw exceptions. For Hive tables, we 
> should support them. *For Hive tables, we should check Hive's behavior to see 
> if there is any file format that does not support any of the above commands*. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java
>  is a good reference for Hive's behavior. 
> Also, for a Hive table stored in a given format, we need to make sure that 
> Spark can still read the table after an alter column operation. If we cannot 
> read the table, even if Hive allows the alter column operation, we should 
> still throw an exception. For example, if renaming a column of a Hive Parquet 
> table makes the renamed column inaccessible (we cannot read its values), we 
> should not allow the renaming operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18855) Add RDD flatten function

2016-12-15 Thread Linbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Linbo closed SPARK-18855.
-
Resolution: Unresolved

> Add RDD flatten function
> 
>
> Key: SPARK-18855
> URL: https://issues.apache.org/jira/browse/SPARK-18855
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Linbo
>Priority: Minor
>  Labels: flatten, rdd
>
> A new RDD flatten function would be similar to the flatten function of Scala 
> collections:
> {code:title=spark-shell|borderStyle=solid}
> scala> val rdd = sc.makeRDD(List(List(1, 2, 3), List(4, 5), List(6)))
> rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at 
> makeRDD at <console>:24
> scala> rdd.flatten.collect
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18855) Add RDD flatten function

2016-12-15 Thread Linbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753120#comment-15753120
 ] 

Linbo commented on SPARK-18855:
---

I tried several ways; the more "Spark" way is to create a 
TraversableRDDFunctions file with an implicit def 
rddToTraversableRDDFunctions[U](rdd: RDD[TraversableRDDFunctions[U]]) inside 
the RDD object. But it's hard to make this method generic because class RDD is 
invariant. I will close this issue; it would be more impactful for this to go 
on Dataset.
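
Just for illustration, a minimal sketch of the enrichment pattern described above (hypothetical names, not an actual patch; it relies on the same implicit-view trick that Scala's own flatten uses to work around invariance):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object FlattenSyntax {
  // Enrich any RDD[T] whose elements can be viewed as a collection of U; the
  // implicit view (e.g. List[Int] => TraversableOnce[Int]) sidesteps RDD's
  // invariance by letting the compiler pick U per call site.
  implicit class FlattenOps[T, U](rdd: RDD[T])(
      implicit asTraversable: T => TraversableOnce[U], ct: ClassTag[U]) {
    def flatten: RDD[U] = rdd.flatMap(asTraversable(_))
  }
}

// Usage sketch in spark-shell:
// import FlattenSyntax._
// sc.makeRDD(List(List(1, 2, 3), List(4, 5), List(6))).flatten.collect()
{code}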

> Add RDD flatten function
> 
>
> Key: SPARK-18855
> URL: https://issues.apache.org/jira/browse/SPARK-18855
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Linbo
>Priority: Minor
>  Labels: flatten, rdd
>
> A new RDD flatten function would be similar to the flatten function of Scala 
> collections:
> {code:title=spark-shell|borderStyle=solid}
> scala> val rdd = sc.makeRDD(List(List(1, 2, 3), List(4, 5), List(6)))
> rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at 
> makeRDD at <console>:24
> scala> rdd.flatten.collect
> res0: Array[Int] = Array(1, 2, 3, 4, 5, 6)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753093#comment-15753093
 ] 

Brendan Dwyer edited comment on SPARK-18817 at 12/16/16 1:45 AM:
-

I'm not sure the CRAN people would be okay with that. It might be enough to 
pass any automatic testing they have but it would still be against their 
policies.

{quote}
Limited exceptions may be allowed in interactive sessions if the package 
*obtains confirmation from the user*.
{quote}


was (Author: bdwyer):
I'm not sure the CRAN people would be okay with that. It might be enough to 
pass any automatic testing they have but it would still be against their 
policies.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753093#comment-15753093
 ] 

Brendan Dwyer edited comment on SPARK-18817 at 12/16/16 1:30 AM:
-

I'm not sure the CRAN people would be okay with that. It might be enough to 
pass any automatic testing they have but it would still be against their 
policies.


was (Author: bdwyer):
I'm not sure the CRAN people would be okay with that.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753095#comment-15753095
 ] 

Brendan Dwyer commented on SPARK-18817:
---

{code}
library("SparkR")
sparkR.session()
df <- as.DataFrame(iris)
{code}

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753093#comment-15753093
 ] 

Brendan Dwyer commented on SPARK-18817:
---

I'm not sure the CRAN people would be okay with that.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files

2016-12-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753089#comment-15753089
 ] 

Yanbo Liang commented on SPARK-18862:
-

Great! Will send PR soon.

> Split SparkR mllib.R into multiple files
> 
>
> Key: SPARK-18862
> URL: https://issues.apache.org/jira/browse/SPARK-18862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> SparkR mllib.R is getting bigger as we add more ML wrappers, so I'd like to 
> split it into multiple files to make it easier to maintain:
> * mllibClassification.R
> * mllibRegression.R
> * mllibClustering.R
> * mllibFeature.R
> or:
> * mllib/classification.R
> * mllib/regression.R
> * mllib/clustering.R
> * mllib/features.R
> By R convention, the first way is preferred, and I'm not sure whether R 
> supports the second layout (will check later). Please let me know your 
> preference. I think the start of a new release cycle is a good opportunity to 
> do this, since it will involve fewer conflicts. If this proposal is approved, 
> I can work on it.
> cc [~felixcheung] [~josephkb] [~mengxr] 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17807:


Assignee: Apache Spark

> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Assignee: Apache Spark
>Priority: Trivial
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17807:


Assignee: (was: Apache Spark)

> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Priority: Trivial
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753085#comment-15753085
 ] 

Apache Spark commented on SPARK-17807:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16303

> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Priority: Trivial
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753064#comment-15753064
 ] 

Felix Cheung commented on SPARK-18817:
--

Actually, I'm not seeing derby.log or metastore_db in the quick tests I ran:

{code}
> createOrReplaceTempView(a, "foo")
> sql("SELECT * from foo")
{code}

[~bdwyer] do you have the steps that create these files?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753048#comment-15753048
 ] 

Felix Cheung commented on SPARK-18817:
--

I tested this just now; I still see spark-warehouse when enableHiveSupport = FALSE.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753040#comment-15753040
 ] 

Felix Cheung commented on SPARK-18817:
--

We could, but we did ship 2.0 with it enabled by default.
Perhaps
{code}
enableHiveSupport = !interactive()
{code}
as default?


shouldn't derby.log and metastore_db go to the warehouse.dir?


> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18817:
-
Comment: was deleted

(was: We could, but we did ship 2.0 with it enabled by default.
Perhaps
{code}
enableHiveSupport = !interactive()
{code}
as default?


shouldn't derby.log and metastore_db go to the warehouse.dir?
)

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753041#comment-15753041
 ] 

Felix Cheung commented on SPARK-18817:
--

We could, but we did ship 2.0 with it enabled by default.
Perhaps
{code}
enableHiveSupport = !interactive()
{code}
as default?


shouldn't derby.log and metastore_db go to the warehouse.dir?


> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753041#comment-15753041
 ] 

Felix Cheung edited comment on SPARK-18817 at 12/16/16 1:03 AM:


We could, but we did ship 2.0 with it enabled by default.
Perhaps
{code}
sparkR.session <- function(
  master = "",
  appName = "SparkR",
  sparkHome = Sys.getenv("SPARK_HOME"),
  sparkConfig = list(),
  sparkJars = "",
  sparkPackages = "",
-  enableHiveSupport = TRUE,
+  enableHiveSupport = !interactive(),
  ...) {

{code}
as default?


shouldn't derby.log and metastore_db go to the warehouse.dir?



was (Author: felixcheung):
We could, but we did ship 2.0 with it enabled by default.
Perhaps
{code}
enableHiveSupport = !interactive()
{code}
as default?


shouldn't derby.log and metastore_db go to the warehouse.dir?


> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files

2016-12-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753032#comment-15753032
 ] 

Felix Cheung commented on SPARK-18862:
--

FYI I reorganized the vignettes based on what's discussed here.
https://github.com/apache/spark/pull/16301

> Split SparkR mllib.R into multiple files
> 
>
> Key: SPARK-18862
> URL: https://issues.apache.org/jira/browse/SPARK-18862
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>
> SparkR mllib.R is getting bigger as we add more ML wrappers, so I'd like to 
> split it into multiple files to make it easier to maintain:
> * mllibClassification.R
> * mllibRegression.R
> * mllibClustering.R
> * mllibFeature.R
> or:
> * mllib/classification.R
> * mllib/regression.R
> * mllib/clustering.R
> * mllib/features.R
> By R convention, the first way is preferred, and I'm not sure whether R 
> supports the second layout (will check later). Please let me know your 
> preference. I think the start of a new release cycle is a good opportunity to 
> do this, since it will involve fewer conflicts. If this proposal is approved, 
> I can work on it.
> cc [~felixcheung] [~josephkb] [~mengxr] 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18849) Vignettes final checks for Spark 2.1

2016-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753027#comment-15753027
 ] 

Apache Spark commented on SPARK-18849:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16301

> Vignettes final checks for Spark 2.1
> 
>
> Key: SPARK-18849
> URL: https://issues.apache.org/jira/browse/SPARK-18849
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Xiangrui Meng
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>
> Make a final pass over the vignettes and ensure the content is consistent.
> * remove "since version" because is not that useful for vignettes
> * re-order/group the list of ML algorithms so there exists a logical ordering
> * check for warning or error in output message
> * anything else that seems out of place



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18892) Alias percentile_approx approx_percentile

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18892:


Assignee: Reynold Xin  (was: Apache Spark)

> Alias percentile_approx approx_percentile
> -
>
> Key: SPARK-18892
> URL: https://issues.apache.org/jira/browse/SPARK-18892
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> percentile_approx is the name used in Hive, and approx_percentile is the name 
> used in Presto. approx_percentile is actually more consistent with our 
> approx_count_distinct. Given that the cost of aliasing SQL functions is low 
> (a one-liner), it'd be better to just alias them so they are easier to use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18892) Alias percentile_approx approx_percentile

2016-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753011#comment-15753011
 ] 

Apache Spark commented on SPARK-18892:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16300

> Alias percentile_approx approx_percentile
> -
>
> Key: SPARK-18892
> URL: https://issues.apache.org/jira/browse/SPARK-18892
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> percentile_approx is the name used in Hive, and approx_percentile is the name 
> used in Presto. approx_percentile is actually more consistent with our 
> approx_count_distinct. Given that the cost of aliasing SQL functions is low 
> (a one-liner), it'd be better to just alias them so they are easier to use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18892) Alias percentile_approx approx_percentile

2016-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18892:


Assignee: Apache Spark  (was: Reynold Xin)

> Alias percentile_approx approx_percentile
> -
>
> Key: SPARK-18892
> URL: https://issues.apache.org/jira/browse/SPARK-18892
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> percentile_approx is the name used in Hive, and approx_percentile is the name 
> used in Presto. approx_percentile is actually more consistent with our 
> approx_count_distinct. Given that the cost of aliasing SQL functions is low 
> (a one-liner), it'd be better to just alias them so they are easier to use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18892) Alias percentile_approx approx_percentile

2016-12-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18892:
---

 Summary: Alias percentile_approx approx_percentile
 Key: SPARK-18892
 URL: https://issues.apache.org/jira/browse/SPARK-18892
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


percentile_approx is the name used in Hive, and approx_percentile is the name 
used in Presto. approx_percentile is actually more consistent with our 
approx_count_distinct. Given that the cost of aliasing SQL functions is low 
(a one-liner), it'd be better to just alias them so they are easier to use.
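
As a sketch of what the alias buys users (hypothetical table and column names; the alias itself is only proposed here):

{code}
// Existing Hive-compatible name:
spark.sql("SELECT percentile_approx(price, 0.5) FROM sales")

// Presto-style alias proposed in this ticket; both would resolve to the same aggregate:
spark.sql("SELECT approx_percentile(price, 0.5) FROM sales")
{code}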




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752997#comment-15752997
 ] 

Marcelo Vanzin commented on SPARK-17807:


Reopening since this is a real issue (the dependency leaks when you depend on 
spark-core in maven and don't have scalatest as an explicit test dependency in 
your project).
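
For anyone hitting this downstream in the meantime, a hypothetical sbt workaround (coordinates illustrative, not part of the fix) is to exclude the leaked artifact explicitly:

{code}
// build.sbt: keep the transitively leaked scalatest off the compile classpath.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "2.0.0")
  .excludeAll(ExclusionRule(organization = "org.scalatest"))
{code}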

> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Priority: Trivial
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17807) Scalatest listed as compile dependency in spark-tags

2016-12-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-17807:


> Scalatest listed as compile dependency in spark-tags
> 
>
> Key: SPARK-17807
> URL: https://issues.apache.org/jira/browse/SPARK-17807
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tom Standard
>Priority: Trivial
>
> In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - 
> shouldn't this be in test scope?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752973#comment-15752973
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

In that case an easier fix might be to just disable Hive support by default? 
cc [~felixcheung]

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752951#comment-15752951
 ] 

Brendan Dwyer commented on SPARK-18817:
---

[~shivaram] it does not happen if I disable Hive.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752948#comment-15752948
 ] 

William Shen commented on SPARK-5632:
-

Thanks [~marmbrus]. 
I see that the backtick works in 1.5.0 as well (with the limitation on 
distinct, which is fixed in SPARK-15230). Hopefully this will get sorted out 
together with SPARK-18084. Thanks again for your help!

> not able to resolve dot('.') in field name
> --
>
> Key: SPARK-5632
> URL: https://issues.apache.org/jira/browse/SPARK-5632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
> Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
> Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
>Reporter: Lishu Liu
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.4.0
>
>
> My Cassandra table task_trace has a field sm.result, which contains a dot in 
> its name, so Spark SQL tried to look up sm instead of the full name 'sm.result'. 
> Here is my code: 
> {code}
> scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
> scala> val cc = new CassandraSQLContext(sc)
> scala> val task_trace = cc.jsonFile("/task_trace.json")
> scala> task_trace.registerTempTable("task_trace")
> scala> cc.setKeyspace("cerberus_data_v4")
> scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
> task_body.sm.result FROM task_trace WHERE task_id = 
> 'fff7304e-9984-4b45-b10c-0423a96745ce'")
> res: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[57] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> == Physical Plan ==
> java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
> cerberus_id, couponId, coupon_code, created, description, domain, expires, 
> message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
> sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
> validity
> {code}
> The full schema looks like this:
> {code}
> scala> task_trace.printSchema()
> root
>  \|-- received_datetime: long (nullable = true)
>  \|-- task_body: struct (nullable = true)
>  \|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|-- cerberus_id: string (nullable = true)
>  \|\|-- couponId: integer (nullable = true)
>  \|\|-- coupon_code: string (nullable = true)
>  \|\|-- created: string (nullable = true)
>  \|\|-- description: string (nullable = true)
>  \|\|-- domain: string (nullable = true)
>  \|\|-- expires: string (nullable = true)
>  \|\|-- message_id: string (nullable = true)
>  \|\|-- neverShowAfter: string (nullable = true)
>  \|\|-- neverShowBefore: string (nullable = true)
>  \|\|-- offerTitle: string (nullable = true)
>  \|\|-- screenshots: array (nullable = true)
>  \|\|\|-- element: string (containsNull = false)
>  \|\|-- sm.result: struct (nullable = true)
>  \|\|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|\|-- cerberus_id: string (nullable = true)
>  \|\|\|-- code: string (nullable = true)
>  \|\|\|-- couponId: integer (nullable = true)
>  \|\|\|-- created: string (nullable = true)
>  \|\|\|-- description: string (nullable = true)
>  \|\|\|-- domain: string (nullable = true)
>  \|\|\|-- expires: string (nullable = true)
>  \|\|\|-- message_id: string (nullable = true)
>  \|\|\|-- neverShowAfter: string (nullable = true)
>  \|\|\|-- neverShowBefore: string (nullable = true)
>  \|\|\|-- offerTitle: string (nullable = true)
>  \|\|\|-- result: struct (nullable = true)
>  \|\|\|\|-- post: struct (nullable = true)
>  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
>  \|\|\|\|\|\|-- ci: double (nullable = true)
>  \|\|\|\|\|\|-- value: boolean (nullable = true)
>  \|\|\|\|\|-- meta: struct (nullable = true)
>  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- exceptions: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: array (containsNull = 
> false)
>  \|\|\|\|\|\|\|

[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752929#comment-15752929
 ] 

Michael Armbrust commented on SPARK-5632:
-

Hmm, I agree that error is confusing.  It does work if you use 
[backticks|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880245/2840265927289860/latest.html]
 (at least with 2.1).

I think this falls into the general class of issues where we don't have 
consistent handling of strings that reference columns.  I'm going to link this 
ticket to [SPARK-18084] (which I've also targeted for investigation in the 2.2 
release).
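
For reference, a rough sketch of the backtick quoting meant here, written against the reporter's schema quoted below (untested as-is; exact resolution rules may vary by version):

{code}
val taskTrace = spark.read.json("/task_trace.json")

// Backticks make the dot part of the field name instead of a struct traversal:
taskTrace.select("task_body.`sm.result`")

taskTrace.createOrReplaceTempView("task_trace")
spark.sql("SELECT task_body.`sm.result` FROM task_trace")
{code}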

> not able to resolve dot('.') in field name
> --
>
> Key: SPARK-5632
> URL: https://issues.apache.org/jira/browse/SPARK-5632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
> Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
> Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
>Reporter: Lishu Liu
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.4.0
>
>
> My Cassandra table task_trace has a field sm.result, which contains a dot in 
> its name, so Spark SQL tried to look up sm instead of the full name 'sm.result'. 
> Here is my code: 
> {code}
> scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
> scala> val cc = new CassandraSQLContext(sc)
> scala> val task_trace = cc.jsonFile("/task_trace.json")
> scala> task_trace.registerTempTable("task_trace")
> scala> cc.setKeyspace("cerberus_data_v4")
> scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
> task_body.sm.result FROM task_trace WHERE task_id = 
> 'fff7304e-9984-4b45-b10c-0423a96745ce'")
> res: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[57] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> == Physical Plan ==
> java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
> cerberus_id, couponId, coupon_code, created, description, domain, expires, 
> message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
> sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
> validity
> {code}
> The full schema looks like this:
> {code}
> scala> task_trace.printSchema()
> root
>  \|-- received_datetime: long (nullable = true)
>  \|-- task_body: struct (nullable = true)
>  \|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|-- cerberus_id: string (nullable = true)
>  \|\|-- couponId: integer (nullable = true)
>  \|\|-- coupon_code: string (nullable = true)
>  \|\|-- created: string (nullable = true)
>  \|\|-- description: string (nullable = true)
>  \|\|-- domain: string (nullable = true)
>  \|\|-- expires: string (nullable = true)
>  \|\|-- message_id: string (nullable = true)
>  \|\|-- neverShowAfter: string (nullable = true)
>  \|\|-- neverShowBefore: string (nullable = true)
>  \|\|-- offerTitle: string (nullable = true)
>  \|\|-- screenshots: array (nullable = true)
>  \|\|\|-- element: string (containsNull = false)
>  \|\|-- sm.result: struct (nullable = true)
>  \|\|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|\|-- cerberus_id: string (nullable = true)
>  \|\|\|-- code: string (nullable = true)
>  \|\|\|-- couponId: integer (nullable = true)
>  \|\|\|-- created: string (nullable = true)
>  \|\|\|-- description: string (nullable = true)
>  \|\|\|-- domain: string (nullable = true)
>  \|\|\|-- expires: string (nullable = true)
>  \|\|\|-- message_id: string (nullable = true)
>  \|\|\|-- neverShowAfter: string (nullable = true)
>  \|\|\|-- neverShowBefore: string (nullable = true)
>  \|\|\|-- offerTitle: string (nullable = true)
>  \|\|\|-- result: struct (nullable = true)
>  \|\|\|\|-- post: struct (nullable = true)
>  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
>  \|\|\|\|\|\|-- ci: double (nullable = true)
>  \|\|\|\|\|\|-- value: boolean (nullable = true)
>  \|\|\|\|\|-- meta: struct (nullable = true)
>  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- exceptions: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
>  \|\|

[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-12-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18084:
-
Target Version/s: 2.2.0

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",
>  line 656, in text
> self._jwrite.text(path)
>   File 
> 
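
A workaround sketch for the error above (hedged; Scala, assuming a 2.x spark-shell where spark.implicits are in scope, and a hypothetical output path): promote the nested field to a top-level column before writing, since partitionBy only resolves top-level column names.

{code}
import org.apache.spark.sql.functions.{col, struct}

// Build a frame with a nested column a.b, analogous to the PySpark repro above.
val df = Seq(Tuple1(5)).toDF("b").select(struct(col("b")).alias("a"))

// partitionBy("a.b") fails as shown, but promoting the nested field first works.
df.select(col("a.b").alias("b"), col("a"))
  .write
  .partitionBy("b")
  .parquet("/tmp/test_partitioned")   // hypothetical path
{code}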

[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752907#comment-15752907
 ] 

Andrew Ash commented on SPARK-18278:


There are definitely challenges in building features that take longer than a 
release cycle (quarterly for Spark).

We could maintain a long-running feature branch for spark-k8s that lasts 
several months and then gets merged into Spark in a big-bang merge, with that 
feature branch living either on apache/spark or in some other 
community-accessible repo.  I don't think there are many practical differences 
between hosting the source in apache/spark vs. a different repo if neither is 
part of Apache releases.

Or we could merge many smaller commits for spark-k8s into the apache/spark 
master branch along the way and release as an experimental feature when release 
time comes.  This enables more continuous code review but has the risk of 
destabilizing the master branch if code reviews miss things.

Looking to past instances of large features spanning multiple release cycles 
(like SparkSQL and YARN integration), both of those had work happening 
primarily in-repo from what I can tell, and releases included large disclaimers 
in release notes for those experimental features.  That precedent seems to 
suggest Kubernetes integration should follow a similar path.

Personally I lean towards the approach of more, smaller commits into master 
rather than a long-running feature branch.  By reviewing PRs into the main repo 
as we go, the feature will be easier to review and will also get wider feedback 
as an experimental feature than a side branch or side repo would.  This also 
serves to include Apache committers from the start in understanding the 
codebase, rather than foisting a foreign codebase onto the project and hoping 
committers grok it well enough to hold the line on high-quality code reviews.  
Looking to the future where Kubernetes integration is potentially included in 
the mainline Apache release (like Mesos and YARN), it's best for contributors 
and committers to work together from the start for shared understanding.

Making an API for third-party cluster managers sounds great and like the easy, 
clean choice from a software engineering point of view, but I wonder how much 
value the practical benefits of a pluggable cluster manager actually bring the 
Apache project.  It seems like both Two Sigma and IBM have been able to 
maintain their proprietary schedulers without the benefits of the API we're 
considering building.  Who, and what workflows, are we aiming to support with 
an API?

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752902#comment-15752902
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

[~bdwyer] Does this still happen if you disable Hive? One way to test that is 
to stop the SparkSession and create one with `enableHiveSupport=F`.

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752896#comment-15752896
 ] 

William Shen commented on SPARK-5632:
-

Thank you [~marmbrus] for the speedy response!
However, I ran into the following issue in 1.5.0, which seems to be the same 
issue with resolving a dot in a field name.
{noformat}
scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val data = Seq((1,2)).toDF("column_1", "column.with.dot")
data: org.apache.spark.sql.DataFrame = [column_1: int, column.with.dot: int]

scala> data.select("column.with.dot").collect
org.apache.spark.sql.AnalysisException: cannot resolve 'column.with.dot' given 
input columns column_1, column.with.dot;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 

[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752898#comment-15752898
 ] 

Shivaram Venkataraman commented on SPARK-18817:
---

Yeah, I don't know how to avoid creating those two -- it doesn't look like it's 
configurable.

cc [~cloud_fan] [~rxin]

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite

2016-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18868.
--
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.2.0
   2.1.1

> Flaky Test: StreamingQueryListenerSuite
> ---
>
> Key: SPARK-18868
> URL: https://issues.apache.org/jira/browse/SPARK-18868
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> Example: 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752864#comment-15752864
 ] 

Joseph K. Bradley commented on SPARK-18844:
---

Note: Please don't set the Target Version or Fix Version.  Committers use those 
to track releases.  Thanks!

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.
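
As a stopgap, a minimal sketch (hypothetical helper, not part of MLlib; assumes an RDD of (score, label) pairs like the one the class is constructed from) of computing one of the requested metrics outside the class, since the internal counts are private:

{code}
import org.apache.spark.rdd.RDD

// Hypothetical helper: false positive rate at a single threshold, recomputed from the
// raw (score, label) pairs because BinaryClassificationMetrics keeps its counts private.
def fprAtThreshold(scoreAndLabels: RDD[(Double, Double)], threshold: Double): Double = {
  val negatives = scoreAndLabels.filter { case (_, label) => label == 0.0 }
  val negativeCount = negatives.count().toDouble
  val falsePositives = negatives.filter { case (score, _) => score >= threshold }.count().toDouble
  if (negativeCount == 0.0) 0.0 else falsePositives / negativeCount
}
{code}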



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18844:
--
Target Version/s:   (was: 2.0.3)

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18844:
--
Fix Version/s: (was: 2.0.2)

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18844:
--
Issue Type: New Feature  (was: Improvement)

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
> Fix For: 2.0.2
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18891) Support for specific collection types

2016-12-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18891:


 Summary: Support for specific collection types
 Key: SPARK-18891
 URL: https://issues.apache.org/jira/browse/SPARK-18891
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.3, 2.1.0
Reporter: Michael Armbrust
Priority: Critical


Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which forces 
users to define classes with only the most generic type.

An [example 
error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
{code}
case class SpecificCollection(aList: List[Int])
Seq(SpecificCollection(1 :: Nil)).toDS().collect()
{code}

{code}
java.lang.RuntimeException: Error while decoding: 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 98, Column 120: No applicable constructor/method found for actual 
parameters "scala.collection.Seq"; candidates are: 
"line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
{code}
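
A minimal workaround sketch until specific collection types are supported (assumes a 2.x spark-shell where spark.implicits are in scope): declare the field with the generic {{Seq}} type, which the current encoders do handle.

{code}
// Hedged workaround: use Seq instead of List for the field type.
case class GenericCollection(aList: Seq[Int])
Seq(GenericCollection(1 :: Nil)).toDS().collect()
{code}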



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-15 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752844#comment-15752844
 ] 

Brendan Dwyer commented on SPARK-18817:
---

I'm also seeing _derby.log_ and a folder named _metastore_db_ being created in 
my working directory when I run the following:

{code}
library("SparkR")
sparkR.session()
df <- as.DataFrame(iris)
{code}

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2016-12-15 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-18890:
---
Description: 
 As part of benchmarking this change: 
https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I 
found that moving task serialization from TaskSetManager (which happens as part 
of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to 
approximately a 10% reduction in job runtime for a job that counted 10,000 
partitions (that each had 1 int) using 20 machines.  Similar performance 
improvements were reported in the pull request linked above.  This would appear 
to be because the TaskSchedulerImpl thread is the bottleneck, so moving 
serialization to CGSB reduces runtime.  This change may *not* improve runtime 
(and could potentially worsen runtime) in scenarios where the CGSB thread is 
the bottleneck (e.g., if tasks are very large, so calling launch to send the 
tasks to the executor blocks on the network).

One benefit of implementing this change is that it makes it easier to 
parallelize the serialization of tasks (different tasks could be serialized by 
different threads).  Another benefit is that all of the serialization occurs in 
the same place (currently, the Task is serialized in TaskSetManager, and the 
TaskDescription is serialized in CGSB).

I'm not totally convinced we should fix this because it seems like there are 
better ways of reducing the serialization time (e.g., by re-using a single 
serialized object with the Task/jars/files and broadcasting it for each stage) 
but I wanted to open this JIRA to document the discussion.

cc [~witgo]



  was:
 As part of benchmarking this change: 
https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I 
found that moving task serialization from TaskSetManager (which happens as part 
of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to 
approximately a 10% reduction in job runtime for a job that counted 10,000 
partitions (that each had 1 int) using 20 machines.  Similar performance 
improvements were reported in the pull request linked above.  This would appear 
to be because the TaskSchedulerImpl thread is the bottleneck, so moving 
serialization to CGSB reduces runtime.  This change may *not* improve runtime 
(and could potentially worsen runtime) in scenarios where the CGSB thread is 
the bottleneck (e.g., if tasks are very large, so calling launch to send the 
tasks to the executor blocks on the network).

One benefit of implementing this change is that it makes it easier to 
parallelize the serialization of tasks (different tasks could be serialized by 
different threads).  Another benefit is that all of the serialization occurs in 
the same place (currently, the Task is serialized in TaskSetManager, and the 
TaskDescription is serialized in CGSB).

I'm not totally convinced we should fix this because it seems like there are 
better ways of reducing the serialization time (e.g., by re-using the Task 
object within a stage) but I wanted to open this JIRA to document the 
discussion.

cc [~witgo]




> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should 

[jira] [Created] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2016-12-15 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-18890:
--

 Summary: Do all task serialization in CoarseGrainedExecutorBackend 
thread (rather than TaskSchedulerImpl)
 Key: SPARK-18890
 URL: https://issues.apache.org/jira/browse/SPARK-18890
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Kay Ousterhout
Priority: Minor


 As part of benchmarking this change: 
https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I 
found that moving task serialization from TaskSetManager (which happens as part 
of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to 
approximately a 10% reduction in job runtime for a job that counted 10,000 
partitions (that each had 1 int) using 20 machines.  Similar performance 
improvements were reported in the pull request linked above.  This would appear 
to be because the TaskSchedulerImpl thread is the bottleneck, so moving 
serialization to CGSB reduces runtime.  This change may *not* improve runtime 
(and could potentially worsen runtime) in scenarios where the CGSB thread is 
the bottleneck (e.g., if tasks are very large, so calling launch to send the 
tasks to the executor blocks on the network).

One benefit of implementing this change is that it makes it easier to 
parallelize the serialization of tasks (different tasks could be serialized by 
different threads).  Another benefit is that all of the serialization occurs in 
the same place (currently, the Task is serialized in TaskSetManager, and the 
TaskDescription is serialized in CGSB).

I'm not totally convinced we should fix this because it seems like there are 
better ways of reducing the serialization time (e.g., by re-using the Task 
object within a stage) but I wanted to open this JIRA to document the 
discussion.

cc [~witgo]





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752771#comment-15752771
 ] 

Michael Armbrust commented on SPARK-5632:
-

If you expand the commit you'll see it's included in many tags.  The "fix 
version" here is 1.4, which means it was released with 1.4 and all subsequent 
versions.
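
For reference, a minimal sketch of the backtick-quoting workaround commonly used when a column name contains a literal dot (hedged; assumes Spark 1.6 or later and a spark-shell where spark.implicits are in scope):

{code}
// Backticks make the analyzer treat the whole name as a single column,
// while the unquoted form is parsed as nested-field access.
val data = Seq((1, 2)).toDF("column_1", "column.with.dot")
data.select("`column.with.dot`").show()
{code}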

> not able to resolve dot('.') in field name
> --
>
> Key: SPARK-5632
> URL: https://issues.apache.org/jira/browse/SPARK-5632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
> Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
> Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
>Reporter: Lishu Liu
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.4.0
>
>
> My Cassandra table task_trace has a field sm.result which contains a dot in the 
> name, so SQL tried to look up sm instead of the full name 'sm.result'. 
> Here is my code: 
> {code}
> scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
> scala> val cc = new CassandraSQLContext(sc)
> scala> val task_trace = cc.jsonFile("/task_trace.json")
> scala> task_trace.registerTempTable("task_trace")
> scala> cc.setKeyspace("cerberus_data_v4")
> scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
> task_body.sm.result FROM task_trace WHERE task_id = 
> 'fff7304e-9984-4b45-b10c-0423a96745ce'")
> res: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[57] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> == Physical Plan ==
> java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
> cerberus_id, couponId, coupon_code, created, description, domain, expires, 
> message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
> sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
> validity
> {code}
> The full schema looks like this:
> {code}
> scala> task_trace.printSchema()
> root
>  \|-- received_datetime: long (nullable = true)
>  \|-- task_body: struct (nullable = true)
>  \|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|-- cerberus_id: string (nullable = true)
>  \|\|-- couponId: integer (nullable = true)
>  \|\|-- coupon_code: string (nullable = true)
>  \|\|-- created: string (nullable = true)
>  \|\|-- description: string (nullable = true)
>  \|\|-- domain: string (nullable = true)
>  \|\|-- expires: string (nullable = true)
>  \|\|-- message_id: string (nullable = true)
>  \|\|-- neverShowAfter: string (nullable = true)
>  \|\|-- neverShowBefore: string (nullable = true)
>  \|\|-- offerTitle: string (nullable = true)
>  \|\|-- screenshots: array (nullable = true)
>  \|\|\|-- element: string (containsNull = false)
>  \|\|-- sm.result: struct (nullable = true)
>  \|\|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|\|-- cerberus_id: string (nullable = true)
>  \|\|\|-- code: string (nullable = true)
>  \|\|\|-- couponId: integer (nullable = true)
>  \|\|\|-- created: string (nullable = true)
>  \|\|\|-- description: string (nullable = true)
>  \|\|\|-- domain: string (nullable = true)
>  \|\|\|-- expires: string (nullable = true)
>  \|\|\|-- message_id: string (nullable = true)
>  \|\|\|-- neverShowAfter: string (nullable = true)
>  \|\|\|-- neverShowBefore: string (nullable = true)
>  \|\|\|-- offerTitle: string (nullable = true)
>  \|\|\|-- result: struct (nullable = true)
>  \|\|\|\|-- post: struct (nullable = true)
>  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
>  \|\|\|\|\|\|-- ci: double (nullable = true)
>  \|\|\|\|\|\|-- value: boolean (nullable = true)
>  \|\|\|\|\|-- meta: struct (nullable = true)
>  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- exceptions: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: array (containsNull = 
> false)
>  \|\|\|\|\|\|\|\|-- element: string (containsNull 
> = false)
>  \|\| 

[jira] [Commented] (SPARK-8425) Add blacklist mechanism for task scheduling

2016-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752760#comment-15752760
 ] 

Apache Spark commented on SPARK-8425:
-

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/16298

> Add blacklist mechanism for task scheduling
> ---
>
> Key: SPARK-8425
> URL: https://issues.apache.org/jira/browse/SPARK-8425
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Mao, Wei
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: DesignDocforBlacklistMechanism.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-15 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752747#comment-15752747
 ] 

William Shen commented on SPARK-5632:
-

Is this still targeted for 1.4.0 as indicated in JIRA (or was it released with 
1.4.0)? 
The git commit is tagged with v2.1.0-rc3; can someone confirm whether it has 
been moved to 2.1.0? 
Thanks!

> not able to resolve dot('.') in field name
> --
>
> Key: SPARK-5632
> URL: https://issues.apache.org/jira/browse/SPARK-5632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0
> Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
> Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
>Reporter: Lishu Liu
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 1.4.0
>
>
> My Cassandra table task_trace has a field sm.result which contains a dot in the 
> name, so SQL tried to look up sm instead of the full name 'sm.result'. 
> Here is my code: 
> {code}
> scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
> scala> val cc = new CassandraSQLContext(sc)
> scala> val task_trace = cc.jsonFile("/task_trace.json")
> scala> task_trace.registerTempTable("task_trace")
> scala> cc.setKeyspace("cerberus_data_v4")
> scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
> task_body.sm.result FROM task_trace WHERE task_id = 
> 'fff7304e-9984-4b45-b10c-0423a96745ce'")
> res: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[57] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> == Physical Plan ==
> java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
> cerberus_id, couponId, coupon_code, created, description, domain, expires, 
> message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
> sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
> validity
> {code}
> The full schema looks like this:
> {code}
> scala> task_trace.printSchema()
> root
>  \|-- received_datetime: long (nullable = true)
>  \|-- task_body: struct (nullable = true)
>  \|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|-- cerberus_id: string (nullable = true)
>  \|\|-- couponId: integer (nullable = true)
>  \|\|-- coupon_code: string (nullable = true)
>  \|\|-- created: string (nullable = true)
>  \|\|-- description: string (nullable = true)
>  \|\|-- domain: string (nullable = true)
>  \|\|-- expires: string (nullable = true)
>  \|\|-- message_id: string (nullable = true)
>  \|\|-- neverShowAfter: string (nullable = true)
>  \|\|-- neverShowBefore: string (nullable = true)
>  \|\|-- offerTitle: string (nullable = true)
>  \|\|-- screenshots: array (nullable = true)
>  \|\|\|-- element: string (containsNull = false)
>  \|\|-- sm.result: struct (nullable = true)
>  \|\|\|-- cerberus_batch_id: string (nullable = true)
>  \|\|\|-- cerberus_id: string (nullable = true)
>  \|\|\|-- code: string (nullable = true)
>  \|\|\|-- couponId: integer (nullable = true)
>  \|\|\|-- created: string (nullable = true)
>  \|\|\|-- description: string (nullable = true)
>  \|\|\|-- domain: string (nullable = true)
>  \|\|\|-- expires: string (nullable = true)
>  \|\|\|-- message_id: string (nullable = true)
>  \|\|\|-- neverShowAfter: string (nullable = true)
>  \|\|\|-- neverShowBefore: string (nullable = true)
>  \|\|\|-- offerTitle: string (nullable = true)
>  \|\|\|-- result: struct (nullable = true)
>  \|\|\|\|-- post: struct (nullable = true)
>  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
>  \|\|\|\|\|\|-- ci: double (nullable = true)
>  \|\|\|\|\|\|-- value: boolean (nullable = true)
>  \|\|\|\|\|-- meta: struct (nullable = true)
>  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- exceptions: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: string (containsNull = 
> false)
>  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
>  \|\|\|\|\|\|\|-- element: array (containsNull = 
> false)
>  \|\|\|\|\|\|\|\|-- element: string (containsNull 
> 

[jira] [Updated] (SPARK-17931) taskScheduler has some unneeded serialization

2016-12-15 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-17931:
---
Component/s: (was: Spark Core)
 Scheduler

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17931) taskScheduler has some unneeded serialization

2016-12-15 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-17931:
---
Description: 
In the existing code, there are three layers of serialization
involved in sending a task from the scheduler to an executor:
- A Task object is serialized
- The Task object is copied to a byte buffer that also
contains serialized information about any additional JARs,
files, and Properties needed for the task to execute. This
byte buffer is stored as the member variable serializedTask
in the TaskDescription class.
- The TaskDescription is serialized (in addition to the serialized
task + JARs, the TaskDescription class contains the task ID and
other metadata) and sent in a LaunchTask message.

While it is necessary to have two layers of serialization, so that
the JAR, file, and Property info can be deserialized prior to
deserializing the Task object, the third layer of deserialization is
unnecessary (this is as a result of SPARK-2521). We should
eliminate a layer of serialization by moving the JARs, files, and Properties
into the TaskDescription class.

  was:
When taskScheduler instantiates TaskDescription, it calls 
`Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, 
ser)`.  It serializes the task and its dependencies.

But after SPARK-2521 was merged into master, the ResultTask and ShuffleMapTask 
classes no longer contain rdd and closure objects. The TaskDescription class 
can be changed as below:

{noformat}
class TaskDescription[T](
val taskId: Long,
val attemptNumber: Int,
val executorId: String,
val name: String,
val index: Int, 
val task: Task[T]) extends Serializable
{noformat}
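
For comparison, a hypothetical sketch (not the actual Spark implementation) of the shape the current description proposes, with the JARs, files, and Properties carried as plain fields of TaskDescription so that only two serialization layers remain:

{noformat}
// Hypothetical shape only -- field names are illustrative, not Spark's.
class TaskDescription(
    val taskId: Long,
    val attemptNumber: Int,
    val executorId: String,
    val name: String,
    val index: Int,
    val addedFiles: Map[String, Long],
    val addedJars: Map[String, Long],
    val properties: java.util.Properties,
    val serializedTask: java.nio.ByteBuffer) extends Serializable
{noformat}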


> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12777) Dataset fields can't be Scala tuples

2016-12-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12777.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

This works in 2.1:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/408017793305293/2840265927289860/latest.html
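
For reference, a sketch of the repro from the description as it runs on 2.1 (assumes a spark-shell where spark.implicits are in scope):

{code}
case class Test(v: (Int, Int))
// Works on Spark 2.1+: tuple-typed fields are handled by the encoders there.
Seq(Test((1, 2)), Test((3, 4))).toDS().show()
{code}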

> Dataset fields can't be Scala tuples
> 
>
> Key: SPARK-12777
> URL: https://issues.apache.org/jira/browse/SPARK-12777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Chris Jansen
> Fix For: 2.1.0
>
>
> Datasets can't seem to handle Scala tuples as fields of case classes.
> {code}
> Seq((1,2), (3,4)).toDS().show() //works
> {code}
> When including a tuple as a field, the code fails:
> {code}
> case class Test(v: (Int, Int))
> Seq(Test((1,2)), Test((3,4))).toDS().show() //fails
> {code}
> {code}
>   UnresolvedException: : Invalid call to dataType on unresolved object, tree: 
> 'name  (unresolved.scala:59)
>  
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:59)
>  
> org.apache.spark.sql.catalyst.expressions.GetStructField.org$apache$spark$sql$catalyst$expressions$GetStructField$$field$lzycompute(complexTypeExtractors.scala:107)
>  
> org.apache.spark.sql.catalyst.expressions.GetStructField.org$apache$spark$sql$catalyst$expressions$GetStructField$$field(complexTypeExtractors.scala:107)
>  
> org.apache.spark.sql.catalyst.expressions.GetStructField$$anonfun$toString$1.apply(complexTypeExtractors.scala:111)
>  
> org.apache.spark.sql.catalyst.expressions.GetStructField$$anonfun$toString$1.apply(complexTypeExtractors.scala:111)
>  
> org.apache.spark.sql.catalyst.expressions.GetStructField.toString(complexTypeExtractors.scala:111)
>  
> org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:217)
>  
> org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:217)
>  
> org.apache.spark.sql.catalyst.expressions.If.toString(conditionalExpressions.scala:76)
>  
> org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:217)
>  
> org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:155)
>  
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argString$1.apply(TreeNode.scala:385)
>  
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argString$1.apply(TreeNode.scala:381)
>  org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:388)
>  org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:391)
>  
> org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:172)
>  
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:441)
>  org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:396)
>  
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$5.apply(RuleExecutor.scala:118)
>  
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$5.apply(RuleExecutor.scala:119)
>  org.apache.spark.Logging$class.logDebug(Logging.scala:62)
>  
> org.apache.spark.sql.catalyst.rules.RuleExecutor.logDebug(RuleExecutor.scala:44)
>  
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:115)
>  
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
>  
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
>  
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:253)
>  org.apache.spark.sql.Dataset.<init>(Dataset.scala:78)
>  org.apache.spark.sql.Dataset.<init>(Dataset.scala:89)
>  org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:507)
>  
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:80)
> {code}
> When providing a type alias, the code fails in a different way:
> {code}
> type TwoInt = (Int, Int)
> case class Test(v: TwoInt)
> Seq(Test((1,2)), Test((3,4))).toDS().show() //fails
> {code}
> {code}
>   NoSuchElementException: : head of empty list  (ScalaReflection.scala:504)
>  
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor$1.apply(ScalaReflection.scala:504)
>  
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor$1.apply(ScalaReflection.scala:502)
>  
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:502)
>  
> 

[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752722#comment-15752722
 ] 

Imran Rashid commented on SPARK-18886:
--

[~mridul] sorry if I am being slow here, but do you mind spelling out for me in 
more detail?  I'm *not* asking about the benefits of using locality preferences 
-- I get that part.  I'm asking about why the *delay*.  There has to be 
something happening during the delay which we want to wait for.

One possibility is that you've got multiple tasksets running concurrently, with 
different locality preferences.  You wouldn't want the first taskset to use all 
the resources, you'd rather take both tasksets into account.  This is 
accomplished with delay scheduling, but you don't actually *need* the delay.

Another possibility is that there is such a huge gap in runtime that you expect 
your preferred locations will finish *all* tasks in the taskset before that 
delay is up, by having some executors run multiple tasks.

The reason I'm trying to figure this out is to figure out if there is a 
sensible fix here (and what the smallest possible fix would be).  If this is 
it, then the fix I suggested above to Mark should handle this case, while still 
working as intended in other cases.

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6-second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that they are shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way 
> would be to set "spark.locality.wait.process" to your desired wait interval, 
> and set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
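
A hedged PySpark sketch of workaround (1); the wait values are purely illustrative and assume tasks that run for roughly 3 seconds:

{code}
# Sketch of workaround (1): shorten the process-local wait and disable the
# node/rack waits.  Values are illustrative -- tune them to your task runtime.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("locality-wait-workaround")
         .config("spark.locality.wait.process", "1s")  # shorter than the average task time
         .config("spark.locality.wait.node", "0")      # skip the node-level wait
         .config("spark.locality.wait.rack", "0")      # skip the rack-level wait
         .getOrCreate())
{code}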



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18888) partitionBy in DataStreamWriter in Python throws _to_seq not defined

2016-12-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-1.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.1.0

> partitionBy in DataStreamWriter in Python throws _to_seq not defined
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Blocker
> Fix For: 2.1.0
>
>
> {code}
> python/pyspark/sql/streaming.py in partitionBy(self, *cols)
> 716 if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
> 717 cols = cols[0]
> --> 718 self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols))
> 719 return self
> 720 
> NameError: global name '_to_seq' is not defined
> {code}
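
A hedged repro sketch: the error is raised inside {{partitionBy}} itself (before {{start()}}), and the cause appears to be that {{streaming.py}} never imports {{_to_seq}} (defined in {{pyspark.sql.column}}). Paths below are placeholders.

{code}
# In a pyspark 2.0.2 shell; /tmp paths are made up for illustration.
stream_df = spark.readStream.format("text").load("/tmp/stream-input")

# On 2.0.2 this line raises: NameError: global name '_to_seq' is not defined
writer = stream_df.writeStream.format("parquet").partitionBy("value")
{code}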



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable

2016-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18850:
-
Target Version/s: 2.1.0

> Make StreamExecution and progress classes serializable
> --
>
> Key: SPARK-18850
> URL: https://issues.apache.org/jira/browse/SPARK-18850
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Make StreamExecution and progress classes serializable because it is too easy 
> for them to get captured in normal usage.
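
To illustrate the kind of accidental capture being described, a hedged PySpark sketch with made-up dataframe names -- referencing the running query handle inside a function shipped to executors drags it into the task closure:

{code}
# Hedged sketch (illustrative names): the lambda below captures `query`,
# so the query object has to travel with the task closure.
query = input_df.writeStream.format("console").start()

tagged = other_df.rdd.map(lambda row: (query.id, row))  # accidental capture of `query`
{code}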



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable

2016-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18850:
-
Affects Version/s: 2.1.0

> Make StreamExecution and progress classes serializable
> --
>
> Key: SPARK-18850
> URL: https://issues.apache.org/jira/browse/SPARK-18850
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Make StreamExecution and progress classes serializable because it is too easy 
> for them to get captured in normal usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable

2016-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18850:
-
Summary: Make StreamExecution and progress classes serializable  (was: Make 
StreamExecution serializable)

> Make StreamExecution and progress classes serializable
> --
>
> Key: SPARK-18850
> URL: https://issues.apache.org/jira/browse/SPARK-18850
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Make StreamExecution serializable because it is too easy for it to get 
> captured with normal usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable

2016-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18850:
-
Description: Make StreamExecution and progress classes serializable because 
it is too easy for it to get captured with normal usage.  (was: Make 
StreamExecution serializable because it is too easy for it to get captured with 
normal usage.)

> Make StreamExecution and progress classes serializable
> --
>
> Key: SPARK-18850
> URL: https://issues.apache.org/jira/browse/SPARK-18850
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Make StreamExecution and progress classes serializable because it is too easy 
> for them to get captured in normal usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-12-15 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-16178.
-
Resolution: Won't Fix

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.
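
Given the Won't Fix resolution, a hedged workaround sketch: rename the partition columns to the destination's names and use a plain {{insertInto}} without {{partitionBy}}, assuming the remaining columns already line up positionally with {{dest}}:

{code}
# Hedged sketch: rename source partition columns to dest's names before insertInto.
renamed = (spark.table("src")
           .withColumnRenamed("date", "utc_dateint")
           .withColumnRenamed("hour", "utc_hour"))

# insertInto matches columns by position; partition columns must come last.
renamed.write.insertInto("dest")
{code}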



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-12-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752636#comment-15752636
 ] 

Dongjoon Hyun commented on SPARK-16178:
---

Thank you! Then, I'll close this as Won't Fix.

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-12-15 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752632#comment-15752632
 ] 

Ryan Blue commented on SPARK-16178:
---

Sure. I think the result was Won't Fix.

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-15 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752579#comment-15752579
 ] 

Mridul Muralidharan edited comment on SPARK-18886 at 12/15/16 9:35 PM:
---

[~imranr] For almost all cases, delay scheduling dramatically increases 
performance. The difference even between PROCESS and NODE is significantly high 
(between NODE and 'lower' levels, it can depend on your network config).
For both short-duration tasks and tasks processing large amounts of data, 
it has a non-trivial impact; for long tasks processing small data it is not as 
useful in comparison, IIRC, and the same goes for degenerate cases where the locality 
preference is suboptimal to begin with. [As an aside, the ability to not specify 
PROCESS-level locality preference actually is a drawback in our API.]

The job(s) I mentioned where we set it to 0 were special cases, where we knew 
the costs well enough to make the decision to lower it : but I would not 
recommend it unless users are very sure of what they are doing. While analysing 
the cost, it should also be kept in mind that transferring data across nodes 
impacts not just the Spark job, but every other job in the cluster.


was (Author: mridulm80):

[~imranr] For almost all cases, delay scheduling dramatically increases 
performance. The difference even between PROCESS and NODE is significantly high 
(between NODE and 'lower' levels, it can depend on your network config).
For both tasks with short duration and tasks processing large amounts of data, 
it has non trivial impact : long tasks processing small data, it is not so 
useful in comparison iirc, same for degenerate cases where locality preference 
is suboptimal to begin with. [As an aside, the ability to not specify PROCESS 
level locality actually is a drawback in our api]

The job(s) I mentioned where we set it to 0 were special cases, where we knew 
the costs well enough to make the decision to lower it : but I would not 
recommend it unless users are very sure of what they are doing. While analysing 
the cost, it should also be kept in mind that transferring data across nodes 
impacts not just spark job, but every other job in the cluster.

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of 

[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-15 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752579#comment-15752579
 ] 

Mridul Muralidharan commented on SPARK-18886:
-


[~imranr] For almost all cases, delay scheduling dramatically increases 
performance. The difference even between PROCESS and NODE is significantly high 
(between NODE and 'lower' levels, it can depend on your network config).
For both short-duration tasks and tasks processing large amounts of data, 
it has a non-trivial impact; for long tasks processing small data it is not as 
useful in comparison, IIRC, and the same goes for degenerate cases where the locality 
preference is suboptimal to begin with. [As an aside, the ability to not specify 
PROCESS-level locality actually is a drawback in our API.]

The job(s) I mentioned where we set it to 0 were special cases, where we knew 
the costs well enough to make the decision to lower it : but I would not 
recommend it unless users are very sure of what they are doing. While analysing 
the cost, it should also be kept in mind that transferring data across nodes 
impacts not just the Spark job, but every other job in the cluster.

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions

2016-12-15 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752548#comment-15752548
 ] 

Dongjoon Hyun commented on SPARK-16178:
---

Hi, [~rdblue].
The PR seems to be closed. I'm wondering if we can close this issue.
This issue is currently a subtask of SPARK-16032 for 2.1.0.

> SQL - Hive writer should not require partition names to match table partitions
> --
>
> Key: SPARK-16178
> URL: https://issues.apache.org/jira/browse/SPARK-16178
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Ryan Blue
>
> SPARK-14459 added a check that the {{partition}} metadata on 
> {{InsertIntoTable}} must match the table's partition column names. But if 
> {{partitionBy}} is used to set up partition columns, those columns may not be 
> named or the names may not match.
> For example:
> {code}
> // Tables:
> // CREATE TABLE src (id string, date int, hour int, timestamp bigint);
> // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int)
> //   PARTITIONED BY (utc_dateint int, utc_hour int);
> spark.table("src").write.partitionBy("date", "hour").insertInto("dest")
> {code}
> The call to partitionBy correctly places the date and hour columns at the end 
> of the logical plan, but the names don't match the "utc_" prefix and the 
> write fails. But the analyzer will verify the types and insert an {{Alias}} 
> so the query is actually valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18889) Spark incorrectly reads default columns from a Hive view

2016-12-15 Thread Salil Surendran (JIRA)
Salil Surendran created SPARK-18889:
---

 Summary: Spark incorrectly reads default columns from a Hive view
 Key: SPARK-18889
 URL: https://issues.apache.org/jira/browse/SPARK-18889
 Project: Spark
  Issue Type: Bug
Reporter: Salil Surendran


Spark fails to read a view that has columns with auto-generated default names.
To reproduce, run the following steps in Hive:

   * CREATE TABLE IF NOT EXISTS employee_details ( eid int, name String,
salary String, destination String, json String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
* insert into employee_details values(100, 'Salil', '100k', 'Mumbai', 
'{"Foo":"ABC","Bar":"2009010110","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}');
   * create view employee_25 as select eid, name, `_c4` from (select eid, name, 
destination,v1.foo, cast(v1.bar as timestamp) from employee_details LATERAL 
VIEW json_tuple(json,'Foo','Bar')v1 as foo, bar)v2;
* select * from employee_25;

You will see an output like this:
+------------------+-------------------+------------------+
| employee_25.eid  | employee_25.name  | employee_25._c4  |
+------------------+-------------------+------------------+
| 100              | Salil             | NULL             |
+------------------+-------------------+------------------+

Now go to spark-shell and try to query the view:
scala> spark.sql("select * from employee_25").show
org.apache.spark.sql.AnalysisException: cannot resolve '`v2._c4`' given input 
columns: [foo, name, eid, bar, destination]; line 1 pos 32;
'Project [*]
+- 'SubqueryAlias employee_25
   +- 'Project [eid#56, name#57, 'v2._c4]
  +- SubqueryAlias v2
 +- Project [eid#56, name#57, destination#59, foo#61, cast(bar#62 as 
timestamp) AS bar#63]
+- Generate json_tuple(json#60, Foo, Bar), true, false, v1, 
[foo#61, bar#62]
   +- MetastoreRelation default, employee_details

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 

[jira] [Resolved] (SPARK-18826) Make FileStream be able to start with most recent files

2016-12-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18826.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16251
[https://github.com/apache/spark/pull/16251]

> Make FileStream be able to start with most recent files
> ---
>
> Key: SPARK-18826
> URL: https://issues.apache.org/jira/browse/SPARK-18826
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.0
>
>
> When starting a stream with a lot of backfill and maxFilesPerTrigger, the 
> user often wants to start with the most recent files first. This would let 
> you keep latency low for recent data while slowly backfilling historical data.
> It's better to add an option to control this behavior.
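
For reference, a hedged sketch of how this would be used, assuming the option ends up being called "latestFirst" and an active SparkSession named {{spark}} (as in a shell):

{code}
# Process the newest backfill files first, a bounded number per trigger.
sdf = (spark.readStream
       .format("text")
       .option("maxFilesPerTrigger", 100)   # cap work per micro-batch
       .option("latestFirst", "true")       # start from the most recent files
       .load("/data/events/"))
{code}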



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files

2016-12-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17119.

Resolution: Duplicate

This was actually already implemented (without the need for a config option).

> Add configuration property to allow the history server to delete .inprogress 
> files
> --
>
> Key: SPARK-17119
> URL: https://issues.apache.org/jira/browse/SPARK-17119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bjorn Jonsson
>Priority: Minor
>  Labels: historyserver
>
> The History Server (HS) currently only considers completed applications when 
> deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). 
> This means that over time, .inprogress files (from failed jobs, jobs where 
> the SparkContext is not closed, spark-shell exits etc...) can accumulate and 
> impact the HS.
> Instead of having to manually delete these files, maybe users could have the 
> option of telling the HS to delete all files where (now - 
> attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete 
> .inprogress files with lastUpdated older than 7d?
> https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467
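
For reference, the cleaner that already exists in the history server is driven by settings along these lines; a hedged spark-defaults.conf sketch with illustrative values:

{code}
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    7d
{code}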



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode

2016-12-15 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752443#comment-15752443
 ] 

Anbu Cheeralan edited comment on SPARK-17493 at 12/15/16 8:55 PM:
--

[~sowen] I faced a similar error while writing to Google Storage. This issue is 
specific to object stores and occurs in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code 
causes a huge number of RPC calls when the file system is an object store (S3, 
GS). 
{quote}
  if (mode == SaveMode.Append) \{
val existingPartitionColumns = Try \{
  resolveRelation()
.asInstanceOf[HadoopFsRelation]
.location
.partitionSpec()
.partitionColumns
.fieldNames
.toSeq
\}.getOrElse(Seq.empty[String])
{quote}
There should be a flag to skip Partition Match Check in append mode. I can work 
on the patch.
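
For context, a hedged sketch of the write pattern that exercises this code path, assuming {{df}} is an existing DataFrame and the bucket/column names are made up:

{code}
# Append-mode partitioned write to an object store: resolving the existing
# partition layout for the append check is what fans out into many list/RPC calls.
(df.write
   .mode("append")
   .partitionBy("event_date")
   .parquet("gs://example-bucket/events/"))
{code}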


was (Author: alunarbeach):
[~sowen] I faced a similar error while writing to google storage. This issue is 
specific while writing to object stores. This happens in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write() following code 
causes huge number of RPC calls when the file system is on Object Stores (S3, 
GS). 
{quote}
  if (mode == SaveMode.Append) \{
val existingPartitionColumns = Try \{
  resolveRelation()
.asInstanceOf[HadoopFsRelation]
.location
.partitionSpec()
.partitionColumns
.fieldNames
.toSeq
\}.getOrElse(Seq.empty[String])
{quote}
There should be a flag to skip Partition Match Check in append mode. I can work 
on the patch.

> Spark Job hangs while DataFrame writing to HDFS path with parquet mode
> --
>
> Key: SPARK-17493
> URL: https://issues.apache.org/jira/browse/SPARK-17493
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: AWS Cluster
>Reporter: Gautam Solanki
>
> While saving an RDD to an HDFS path in parquet format with the following 
> rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:tmp//rddout_parquet_full_hdfs1//")
> , the Spark job was hanging as the two write tasks with Shuffle Read of size 
> 0 could not complete, even though the executors had notified the driver about the 
> completion of these two tasks. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode

2016-12-15 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752443#comment-15752443
 ] 

Anbu Cheeralan edited comment on SPARK-17493 at 12/15/16 8:54 PM:
--

[~sowen] I faced a similar error while writing to Google Storage. This issue is 
specific to writing to object stores and happens in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code 
causes a huge number of RPC calls when the file system is an object store (S3, 
GS). 
{quote}
  if (mode == SaveMode.Append) \{
val existingPartitionColumns = Try \{
  resolveRelation()
.asInstanceOf[HadoopFsRelation]
.location
.partitionSpec()
.partitionColumns
.fieldNames
.toSeq
\}.getOrElse(Seq.empty[String])
{quote}
There should be a flag to skip Partition Match Check in append mode. I can work 
on the patch.


was (Author: alunarbeach):
[~sowen] I faced a similar error while writing to google storage. This issue is 
specific while writing to object stores. This happens in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write() following code 
causes huge number of RPC calls when the file system is on Object Stores (S3, 
GS). 
{quote}
  if (mode == SaveMode.Append) {
val existingPartitionColumns = Try {
  resolveRelation()
.asInstanceOf[HadoopFsRelation]
.location
.partitionSpec()
.partitionColumns
.fieldNames
.toSeq
}.getOrElse(Seq.empty[String])
{quote}
There should be a flag to skip Partition Match Check in append mode. I can work 
on the patch.

> Spark Job hangs while DataFrame writing to HDFS path with parquet mode
> --
>
> Key: SPARK-17493
> URL: https://issues.apache.org/jira/browse/SPARK-17493
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: AWS Cluster
>Reporter: Gautam Solanki
>
> While saving an RDD to an HDFS path in parquet format with the following 
> rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:tmp//rddout_parquet_full_hdfs1//")
> , the Spark job was hanging as the two write tasks with Shuffle Read of size 
> 0 could not complete, even though the executors had notified the driver about the 
> completion of these two tasks. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8425) Add blacklist mechanism for task scheduling

2016-12-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-8425.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 14079
https://github.com/apache/spark/pull/14079

> Add blacklist mechanism for task scheduling
> ---
>
> Key: SPARK-8425
> URL: https://issues.apache.org/jira/browse/SPARK-8425
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Mao, Wei
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: DesignDocforBlacklistMechanism.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8425) Add blacklist mechanism for task scheduling

2016-12-15 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-8425:

Assignee: Mao, Wei  (was: Imran Rashid)

> Add blacklist mechanism for task scheduling
> ---
>
> Key: SPARK-8425
> URL: https://issues.apache.org/jira/browse/SPARK-8425
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Assignee: Mao, Wei
>Priority: Minor
> Attachments: DesignDocforBlacklistMechanism.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752462#comment-15752462
 ] 

Imran Rashid commented on SPARK-18886:
--

[~mridulm80] good point, perhaps the right answer here is just to turn off 
delay scheduling completely -- not setting {{"spark.locality.wait.process"}} to 
a small value, as I had suggested in the initial workaround, but just turning 
it off completely, to avoid having to futz with tuning that value relative to 
task runtime.

But lemme ask you more or less the same question I just asked Mark, phrased a 
little differently -- given the fragility of this, wouldn't it make more sense 
for us to turn delay scheduling *off* by default?

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18823:
--
Fix Version/s: (was: 2.0.2)

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Normally, outside of SparkR, I have always done 
> this with double brackets like this:
> # df could be the 'faithful' dataset as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" operator of assignment. 
> The problem is that I can't assign to a column name using a variable or 
> column number as I do in these examples outside of Spark. It doesn't matter if I am 
> modifying or creating a column. Same problem.
> I have also tried to use this with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752456#comment-15752456
 ] 

Joseph K. Bradley commented on SPARK-18823:
---

Note: Please don't set the Target Version or Fix Version.  Committers can use 
those fields for tracking releases.  Thanks!

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Normally, outside of SparkR, I have always done 
> this with double brackets like this:
> # df could be the 'faithful' dataset as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" operator of assignment. 
> The problem is that I can't assign to a column name using a variable or 
> column number as I do in these examples outside of Spark. It doesn't matter if I am 
> modifying or creating a column. Same problem.
> I have also tried to use this with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


