[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
[ https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753728#comment-15753728 ] Navya Krishnappa commented on SPARK-18877:
--
Precision and scale vary depending on the decimal values in the column. For example, if the source file contains an Amount column with the values

9.03E+12
1.19E+11
24335739714
1.71E+11

then Spark infers the Amount column as decimal(3,-9) and throws the exception below:

Caused by: java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:112)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:425)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:264)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)

> Unable to read given csv data. Exception: java.lang.IllegalArgumentException:
> requirement failed: Decimal precision 28 exceeds max precision 20
> --
>
> Key: SPARK-18877
> URL: https://issues.apache.org/jira/browse/SPARK-18877
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.2
> Reporter: Navya Krishnappa
>
> When reading the CSV data below, even though the maximum decimal
> precision is 38, the following exception is thrown:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 28
> exceeds max precision 20
> Decimal
> 2323366225312000
> 2433573971400
> 23233662253000
> 23233662253

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
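The reported decimal(3,-9) can be reproduced outside Spark with plain decimal arithmetic: each scientific-notation value in the column carries only three significant digits and a large negative scale, while the plain integer 24335739714 needs precision 11 and scale 0, which the narrower inferred type cannot hold. A rough sketch of the idea (the `precision_and_scale` helper is illustrative, not Spark's CSVInferSchema code):

```python
from decimal import Decimal

def precision_and_scale(text):
    """Precision = number of significant digits; scale = digits to the
    right of the decimal point (negative when the exponent implies
    trailing zeros). Illustrative only -- not Spark's inference code."""
    sign, digits, exponent = Decimal(text).as_tuple()
    return len(digits), -exponent

for v in ["9.03E+12", "1.19E+11", "24335739714", "1.71E+11"]:
    print(v, precision_and_scale(v))
# "1.19E+11" yields (3, -9) -- the decimal(3,-9) from the report --
# while the plain "24335739714" needs (11, 0), which a decimal(3,-9)
# column cannot represent.
```

A merged column type must cover both the widest precision and the smallest scale seen in the column; failing to widen it is what surfaces as "Decimal precision N exceeds max precision M" at read time.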
[jira] [Updated] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence
[ https://issues.apache.org/jira/browse/SPARK-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-18845:
---
Assignee: Andrew Ray

> PageRank has incorrect initialization value that leads to slow convergence
> --
>
> Key: SPARK-18845
> URL: https://issues.apache.org/jira/browse/SPARK-18845
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2
> Reporter: Andrew Ray
> Assignee: Andrew Ray
> Fix For: 2.2.0
>
> All variants of PageRank in GraphX have an incorrect initialization value
> that leads to slow convergence. In the current implementations, ranks are
> seeded with the reset probability when they should be 1. This appears to
> have been introduced a long time ago in
> https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
> This also hides the fact that source vertices (vertices with no incoming
> edges) are not updated, because source vertices generally* have a PageRank
> equal to the reset probability. Therefore both need to be fixed at once.
> A PR will be added shortly.
> *when there are no sinks -- but that's a separate bug
[jira] [Resolved] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence
[ https://issues.apache.org/jira/browse/SPARK-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-18845.
Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16271
[https://github.com/apache/spark/pull/16271]

> PageRank has incorrect initialization value that leads to slow convergence
> --
>
> Key: SPARK-18845
> URL: https://issues.apache.org/jira/browse/SPARK-18845
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2
> Reporter: Andrew Ray
> Fix For: 2.2.0
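The effect of the seed value can be illustrated with a toy power iteration: both seeds converge to the same fixed point, but seeding with 1 starts closer to it, so it converges in fewer iterations. A minimal sketch in plain Python, not GraphX code; the graph, tolerance, and helper names are illustrative:

```python
# Toy power-iteration PageRank in GraphX's unnormalized formulation:
# rank(v) = resetProb + (1 - resetProb) * sum of neighbor contributions.
# Illustration of the reported seeding effect, not GraphX's implementation.
RESET = 0.15

# Small sink-free digraph: edges 0->1, 0->2, 1->2, 2->0
out_deg = {0: 2, 1: 1, 2: 1}
in_nbrs = {0: [2], 1: [0], 2: [0, 1]}

def pagerank(seed, tol=1e-6, max_iter=10_000):
    """Iterate until the largest per-vertex change drops below tol;
    return the ranks and the number of iterations taken."""
    ranks = {v: seed for v in out_deg}
    for it in range(1, max_iter + 1):
        new = {v: RESET + (1 - RESET) * sum(ranks[u] / out_deg[u] for u in in_nbrs[v])
               for v in out_deg}
        delta = max(abs(new[v] - ranks[v]) for v in out_deg)
        ranks = new
        if delta < tol:
            return ranks, it
    return ranks, max_iter

ranks_one, iters_one = pagerank(seed=1.0)    # proposed fix: seed ranks with 1
ranks_rp, iters_rp = pagerank(seed=RESET)    # current code: seed with resetProb
```

On a sink-free graph the total rank at the fixed point equals the vertex count, so starting every rank at 1 preserves that total from the first iteration, while the reset-probability seed both converges slowly and, as the report notes, masks the fact that source vertices are never updated.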
[jira] [Commented] (SPARK-18877) Unable to read given csv data. Exception: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
[ https://issues.apache.org/jira/browse/SPARK-18877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753716#comment-15753716 ] Navya Krishnappa commented on SPARK-18877:
--
I'm reading through the CSV reader (.csv(sourceFile)) and I'm not setting any precision or scale; Spark automatically detects the precision and scale for the values in the source file, and they vary depending on the decimal values in the column.

Stack trace:

Caused by: java.lang.IllegalArgumentException: requirement failed: Decimal precision 28 exceeds max precision 20
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:112)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:425)
	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:264)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
	at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 common frames omitted
[jira] [Commented] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows
[ https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753671#comment-15753671 ] Apache Spark commented on SPARK-18895: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16305 > Fix resource-closing-related and path-related test failures in identified > ones on Windows > - > > Key: SPARK-18895 > URL: https://issues.apache.org/jira/browse/SPARK-18895 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Hyukjin Kwon >Priority: Minor > > There are several tests failing due to resource-closing-related and > path-related problems on Windows as below. > - {{RPackageUtilsSuite}}: > {code} > - build an R package from a jar end to end *** FAILED *** (1 second, 625 > milliseconds) > java.io.IOException: Unable to delete file: > C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) > at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) > at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) > - faulty R package shows documentation *** FAILED *** (359 milliseconds) > java.io.IOException: Unable to delete file: > C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar > at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) > at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) > at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) > - SparkR zipping works properly *** FAILED *** (47 milliseconds) > java.util.regex.PatternSyntaxException: Unknown character property name {r} > near index 4 > C:\projects\spark\target\tmp\1481729429282-0 > ^ > at java.util.regex.Pattern.error(Pattern.java:1955) > at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781) > {code} > - {{InputOutputMetricsSuite}}: > {code} > - input metrics for old hadoop with coalesce *** FAILED *** (240 
milliseconds) > java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > - input metrics with cache and coalesce *** FAILED *** (109 milliseconds) > java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > - input metrics for new Hadoop API with coalesce *** FAILED *** (0 > milliseconds) > java.lang.IllegalArgumentException: Wrong FS: > file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) > at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462) > at > org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114) > - input metrics when reading text file *** FAILED *** (110 milliseconds) > java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > - input metrics on records read - simple *** FAILED *** (125 milliseconds) > java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > - input metrics on records read - more stages *** FAILED *** 
(110 > milliseconds) > java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) > - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds) > java.lang.IllegalArgumentException: Wrong FS: > file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) > at
[jira] [Assigned] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows
[ https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18895:
Assignee: (was: Apache Spark)

> Fix resource-closing-related and path-related test failures in identified
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Reporter: Hyukjin Kwon
> Priority: Minor
[jira] [Assigned] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows
[ https://issues.apache.org/jira/browse/SPARK-18895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18895:
Assignee: Apache Spark

> Fix resource-closing-related and path-related test failures in identified
> ones on Windows
> -
>
> Key: SPARK-18895
> URL: https://issues.apache.org/jira/browse/SPARK-18895
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
> Priority: Minor
[jira] [Created] (SPARK-18895) Fix resource-closing-related and path-related test failures in identified ones on Windows
Hyukjin Kwon created SPARK-18895: Summary: Fix resource-closing-related and path-related test failures in identified ones on Windows Key: SPARK-18895 URL: https://issues.apache.org/jira/browse/SPARK-18895 Project: Spark Issue Type: Sub-task Components: Tests Reporter: Hyukjin Kwon Priority: Minor There are several tests failing due to resource-closing-related and path-related problems on Windows as below. - {{RPackageUtilsSuite}}: {code} - build an R package from a jar end to end *** FAILED *** (1 second, 625 milliseconds) java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1481729427517-0\a\dep2\d\dep2-d.jar at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) - faulty R package shows documentation *** FAILED *** (359 milliseconds) java.io.IOException: Unable to delete file: C:\projects\spark\target\tmp\1481729428970-0\dep1-c.jar at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) - SparkR zipping works properly *** FAILED *** (47 milliseconds) java.util.regex.PatternSyntaxException: Unknown character property name {r} near index 4 C:\projects\spark\target\tmp\1481729429282-0 ^ at java.util.regex.Pattern.error(Pattern.java:1955) at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2781) {code} - {{InputOutputMetricsSuite}}: {code} - input metrics for old hadoop with coalesce *** FAILED *** (240 milliseconds) java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) - input metrics with cache and 
coalesce *** FAILED *** (109 milliseconds) java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) - input metrics for new Hadoop API with coalesce *** FAILED *** (0 milliseconds) java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-9366ec94-dac7-4a5c-a74b-3e7594a692ab\test\InputOutputMetricsSuite.txt, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114) - input metrics when reading text file *** FAILED *** (110 milliseconds) java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) - input metrics on records read - simple *** FAILED *** (125 milliseconds) java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) - input metrics on records read - more stages *** FAILED *** (110 milliseconds) java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) - input metrics on records - New Hadoop API *** FAILED *** (16 milliseconds) 
java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-3f10a1a4-7820-4772-b821-25fd7523bf6f\test\InputOutputMetricsSuite.txt, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:462) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:114) - input metrics on records read with cache *** FAILED *** (93 milliseconds) java.io.IOException: Not a file: file:/C:/projects/spark/core/ignored at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:277) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at
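The `Wrong FS: file://C:\... expected: file:///` failures above are characteristic of building a URI by prefixing `file://` onto a raw Windows path, so the drive letter ends up parsed as the URI authority. A platform-aware conversion produces the `file:///C:/...` form that Hadoop's `FileSystem.checkPath` accepts. A small sketch of the difference (the path is illustrative):

```python
from pathlib import PureWindowsPath

p = PureWindowsPath(r"C:\projects\spark\target\tmp\test\InputOutputMetricsSuite.txt")

# Naive concatenation reproduces the malformed URI from the failures:
# the drive letter "C:" is parsed as the URI authority.
bad = "file://" + str(p)

# pathlib's conversion yields the file:///C:/... form Hadoop expects.
good = p.as_uri()

print(bad)
print(good)   # file:///C:/projects/spark/target/tmp/test/InputOutputMetricsSuite.txt
```

The same reasoning applies on the JVM side, where `new File(path).toURI()` rather than string concatenation yields a well-formed `file:///` URI on Windows.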
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753588#comment-15753588 ] Shivaram Venkataraman commented on SPARK-18817:
---
Just to check - is your Spark installation built with Hive support (i.e. with -Phive)?

> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Reporter: Brendan Dwyer
> Priority: Critical
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else
> on the file system apart from the R session’s temporary directory (or during
> installation in the location pointed to by TMPDIR: and such usage should be
> cleaned up). Installing into the system’s R installation (e.g., scripts to
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when
> sparkR.session() is called.
[jira] [Resolved] (SPARK-18892) Alias percentile_approx approx_percentile
[ https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18892.
-
Resolution: Fixed
Fix Version/s: 2.2.0
2.1.1

> Alias percentile_approx approx_percentile
> -
>
> Key: SPARK-18892
> URL: https://issues.apache.org/jira/browse/SPARK-18892
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 2.1.1, 2.2.0
>
> percentile_approx is the name used in Hive, and approx_percentile is the name
> used in Presto. approx_percentile is actually more consistent with our
> approx_count_distinct. Given that the cost to alias SQL functions is low
> (a one-liner), it'd be better to just alias them so they are easier to use.
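The "one-liner" cost of an alias comes from function registration: a second name is simply mapped to the same implementation. A toy sketch of the idea, with a plain dict standing in for Spark's FunctionRegistry and a crude sort-based percentile standing in for the real approximate aggregate:

```python
def percentile_approx(values, p):
    """Crude percentile by sorting -- a stand-in for Spark's approximate
    aggregate, used only to make the registry sketch runnable."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p * len(s)))]

# Toy registry: the alias is one extra entry pointing at the same function.
registry = {"percentile_approx": percentile_approx}
registry["approx_percentile"] = registry["percentile_approx"]  # the one-line alias

print(registry["approx_percentile"]([1, 2, 3, 4, 5], 0.5))  # prints 3
```

Both names resolve to the identical function object, so the alias adds no maintenance cost beyond the registration line itself.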
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753535#comment-15753535 ] Sital Kedia commented on SPARK-18838: - cc - [~kayousterhout] > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement > Affects Versions: 2.0.0 > Reporter: Sital Kedia > > Currently we are observing very high event processing delay in the driver's `ListenerBus` for large jobs with many tasks. Many critical components of the scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay might hurt job performance significantly or even fail the job. For example, a significant delay in receiving `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly remove an executor that is not idle. > The problem is that the event processor in `ListenerBus` is a single thread that loops through all the listeners for each event and processes each event synchronously: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. This single-threaded processor often becomes the bottleneck for large jobs. Also, if one of the listeners is very slow, all the listeners pay the price of the delay incurred by the slow listener. In addition, a slow listener can cause events to be dropped from the event queue, which might be fatal to the job. > To solve the above problems, we propose to get rid of the shared event queue and the single-threaded event processor. Instead, each listener will have its own dedicated single-threaded executor service. Whenever an event is posted, it will be submitted to the executor service of every listener. The single-threaded executor service guarantees in-order processing of events per listener. The queue used for each executor service will be bounded to guarantee we do not grow memory indefinitely. The downside of this approach is that a separate event queue per listener will increase the driver's memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
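The per-listener design proposed in the comment above can be sketched with plain `java.util.concurrent` primitives. This is a hypothetical illustration, not Spark's actual implementation: `Event`, `TaskStart`, `Listener`, and `PerListenerBus` below are simplified stand-ins for `SparkListenerEvent`/`SparkListener`, and the queue capacity is an arbitrary choice.

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Simplified stand-ins for Spark's listener types (illustrative only).
trait Event
final case class TaskStart(taskId: Int) extends Event

trait Listener { def onEvent(event: Event): Unit }

// One single-threaded executor with a bounded queue per listener: a slow
// listener delays only its own queue, and memory cannot grow indefinitely.
class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 1000) {
  private val executors: Seq[(Listener, ThreadPoolExecutor)] = listeners.map { l =>
    l -> new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue[Runnable](queueCapacity))
  }

  // Posting submits the event to every listener's private executor; each
  // single-threaded executor preserves per-listener event order.
  def post(event: Event): Unit =
    executors.foreach { case (l, ex) =>
      ex.execute(new Runnable { override def run(): Unit = l.onEvent(event) })
    }

  def stop(): Unit = executors.foreach { case (_, ex) =>
    ex.shutdown()
    ex.awaitTermination(10, TimeUnit.SECONDS)
  }
}
```

With a listener that records task ids, posting `TaskStart(1)` through `TaskStart(5)` and then stopping the bus yields the ids in posting order, since each listener's executor is single-threaded with a FIFO queue.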
[jira] [Updated] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-18838: Description: Currently we are observing very high event processing delay in the driver's `ListenerBus` for large jobs with many tasks. Many critical components of the scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay might hurt job performance significantly or even fail the job. For example, a significant delay in receiving `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly remove an executor that is not idle. The problem is that the event processor in `ListenerBus` is a single thread that loops through all the listeners for each event and processes each event synchronously: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. This single-threaded processor often becomes the bottleneck for large jobs. Also, if one of the listeners is very slow, all the listeners pay the price of the delay incurred by the slow listener. In addition, a slow listener can cause events to be dropped from the event queue, which might be fatal to the job. To solve the above problems, we propose to get rid of the shared event queue and the single-threaded event processor. Instead, each listener will have its own dedicated single-threaded executor service. Whenever an event is posted, it will be submitted to the executor service of every listener. The single-threaded executor service guarantees in-order processing of events per listener. The queue used for each executor service will be bounded to guarantee we do not grow memory indefinitely. The downside of this approach is that a separate event queue per listener will increase the driver's memory footprint. 
was: Currently we are observing the issue of very high event processing delay in driver's `ListenerBus` for large jobs with many tasks. Many critical component of the scheduler like `ExecutorAllocationManager`, `HeartbeatReceiver` depend on the `ListenerBus` events and these delay is causing job failure. For example, a significant delay in receiving the `SparkListenerTaskStart` might cause `ExecutorAllocationManager` manager to remove an executor which is not idle. The event processor in `ListenerBus` is a single thread which loops through all the Listeners for each event and processes each event synchronously https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. The single threaded processor often becomes the bottleneck for large jobs. In addition to that, if one of the Listener is very slow, all the listeners will pay the price of delay incurred by the slow listener. To solve the above problems, we plan to have a per listener single threaded executor service and separate event queue. That way we are not bottlenecked by the single threaded event processor and also critical listeners will not be penalized by the slow listeners. The downside of this approach is separate event queue per listener will increase the driver memory footprint. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. 
For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own >
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753462#comment-15753462 ] Felix Cheung commented on SPARK-18817: -- I ran more of this but wasn't seeing derby.log or metastore_db > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753414#comment-15753414 ] Felix Cheung commented on SPARK-18817: -- It looks like javax.jdo.option.ConnectionURL can also be set in Hive-site.xml? In that sense we should only change javax.jdo.option.ConnectionURL and spark.sql.default.warehouse.dir when they are not set in conf or hive-site, and we need to handle both for a complete fix. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753304#comment-15753304 ] lichenglin commented on SPARK-18893: Spark 2.0 has disabled "alter table". [https://issues.apache.org/jira/browse/SPARK-14118] [https://issues.apache.org/jira/browse/SPARK-14130] I think this is a very important feature for a data warehouse. Can Spark address it first? > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that alter a table using "alter table add columns" failed, but the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support SQL "alter table .. add/replace columns"? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18894: Assignee: Apache Spark (was: Tathagata Das) > Event time watermark delay threshold specified in months or years gives > incorrect results > - > > Key: SPARK-18894 > URL: https://issues.apache.org/jira/browse/SPARK-18894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Critical > > Internally we use CalendarInterval to parse the delay. Non-determinstic > intervals like "month" and "year" are handled such a way that the generated > delay in milliseconds is 0 delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753298#comment-15753298 ] Apache Spark commented on SPARK-18894: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16304 > Event time watermark delay threshold specified in months or years gives > incorrect results > - > > Key: SPARK-18894 > URL: https://issues.apache.org/jira/browse/SPARK-18894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Internally we use CalendarInterval to parse the delay. Non-determinstic > intervals like "month" and "year" are handled such a way that the generated > delay in milliseconds is 0 delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18894: Assignee: Tathagata Das (was: Apache Spark) > Event time watermark delay threshold specified in months or years gives > incorrect results > - > > Key: SPARK-18894 > URL: https://issues.apache.org/jira/browse/SPARK-18894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Internally we use CalendarInterval to parse the delay. Non-determinstic > intervals like "month" and "year" are handled such a way that the generated > delay in milliseconds is 0 delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-18894: -- Priority: Critical (was: Major) > Event time watermark delay threshold specified in months or years gives > incorrect results > - > > Key: SPARK-18894 > URL: https://issues.apache.org/jira/browse/SPARK-18894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Internally we use CalendarInterval to parse the delay. Non-determinstic > intervals like "month" and "year" are handled such a way that the generated > delay in milliseconds is 0 delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
[ https://issues.apache.org/jira/browse/SPARK-18894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-18894: -- Affects Version/s: 2.1.0 Target Version/s: 2.1.0 > Event time watermark delay threshold specified in months or years gives > incorrect results > - > > Key: SPARK-18894 > URL: https://issues.apache.org/jira/browse/SPARK-18894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Critical > > Internally we use CalendarInterval to parse the delay. Non-determinstic > intervals like "month" and "year" are handled such a way that the generated > delay in milliseconds is 0 delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18894) Event time watermark delay threshold specified in months or years gives incorrect results
Tathagata Das created SPARK-18894: - Summary: Event time watermark delay threshold specified in months or years gives incorrect results Key: SPARK-18894 URL: https://issues.apache.org/jira/browse/SPARK-18894 Project: Spark Issue Type: Bug Components: Structured Streaming Reporter: Tathagata Das Assignee: Tathagata Das Internally we use CalendarInterval to parse the delay. Non-deterministic intervals like "month" and "year" are handled in such a way that the generated delay in milliseconds is 0 when delayThreshold is in months or years. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
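The defect described above can be illustrated with a simplified, hypothetical model of the interval handling (these are not Spark's actual classes; the 31-day month length in the fixed version is an assumption for illustration): the month component is stored separately from the microseconds, so a conversion that reads only the microseconds yields a 0 ms delay for "1 month" or "1 year".

```scala
// Hypothetical, simplified model of CalendarInterval: months are stored
// separately from sub-month time, which is kept in microseconds.
final case class Interval(months: Int, microseconds: Long)

def parse(spec: String): Interval = spec match {
  case "1 month"    => Interval(1, 0L)
  case "1 year"     => Interval(12, 0L)
  case "10 seconds" => Interval(0, 10L * 1000 * 1000)
  case other        => sys.error(s"unsupported interval: $other")
}

// Buggy conversion: derives the delay from microseconds alone, so the month
// component of "month"/"year" intervals is silently dropped.
def delayMsBuggy(i: Interval): Long = i.microseconds / 1000

// A fixed conversion must account for months with some assumed month length
// (31 days here, an assumption for this sketch).
def delayMsFixed(i: Interval): Long =
  i.microseconds / 1000 + i.months * 31L * 24 * 60 * 60 * 1000
```

Under this model, `delayMsBuggy(parse("1 month"))` is 0 while `delayMsBuggy(parse("10 seconds"))` is the expected 10000 ms, matching the reported behavior.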
[jira] [Commented] (SPARK-18272) Test topic addition for subscribePattern on Kafka DStream and Structured Stream
[ https://issues.apache.org/jira/browse/SPARK-18272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753289#comment-15753289 ] Bravo Zhang commented on SPARK-18272: - Does "subscribing topic by pattern with topic deletions" in KafkaSourceSuite already cover this case? It also has topic creation. > Test topic addition for subscribePattern on Kafka DStream and Structured > Stream > --- > > Key: SPARK-18272 > URL: https://issues.apache.org/jira/browse/SPARK-18272 > Project: Spark > Issue Type: Test > Components: DStreams, Structured Streaming >Reporter: Cody Koeninger > > We've had reports of the following sequence > - create subscribePattern stream that doesn't match any existing topics at > the time stream starts > - add a topic that matches pattern > - expect that messages from that topic show up, but they don't > We don't seem to actually have tests that cover this case, so we should add > them -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18893) Not support "alter table .. add columns .."
zuotingbing created SPARK-18893: --- Summary: Not support "alter table .. add columns .." Key: SPARK-18893 URL: https://issues.apache.org/jira/browse/SPARK-18893 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: zuotingbing When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that alter a table using "alter table add columns" failed, but the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. Is there any plan to support SQL "alter table .. add/replace columns"? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753227#comment-15753227 ] Wenchen Fan commented on SPARK-18817: - The warehouse path will be created whether or not Hive support is enabled, but derby.log and metastore_db will only be created with Hive support. The simplest solution would be to disable Hive support by default. We can also change the location of metastore_db via "javax.jdo.option.ConnectionURL". I'm not sure how to do that on the R side; it may be tricky. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753146#comment-15753146 ] lichenglin edited comment on SPARK-14130 at 12/16/16 2:00 AM: -- "TOK_ALTERTABLE_ADDCOLS" is a very important command for a data warehouse. Does Spark have any plan to support it, even if it only works on HiveContext with specific file formats? was (Author: licl): "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For the alter column command, we have the following tokens: > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we should support them. *For Hive tables, we should check Hive's behavior to see if there is any file format that does not support any of the above commands*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java is a good reference for Hive's behavior. > Also, for a Hive table stored in a given format, we need to make sure that Spark can still read the table after an alter column operation. If we cannot read the table, even if Hive allows the alter column operation, we should still throw an exception. For example, if renaming a column of a Hive Parquet table makes the renamed column inaccessible (we cannot read its values), we should not allow the renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-14130: --- Comment: was deleted (was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? ) > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-14130: --- Comment: was deleted (was: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? ) > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753147#comment-15753147 ] lichenglin commented on SPARK-14130: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753148#comment-15753148 ] lichenglin commented on SPARK-14130: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14130) [Table related commands] Alter column
[ https://issues.apache.org/jira/browse/SPARK-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753146#comment-15753146 ] lichenglin commented on SPARK-14130: "TOK_ALTERTABLE_ADDCOLS" is a very important command for data warehouse. Does spark have any plan to support for it?? > [Table related commands] Alter column > - > > Key: SPARK-14130 > URL: https://issues.apache.org/jira/browse/SPARK-14130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > For alter column command, we have the following tokens. > TOK_ALTERTABLE_RENAMECOL > TOK_ALTERTABLE_ADDCOLS > TOK_ALTERTABLE_REPLACECOLS > For data source tables, we should throw exceptions. For Hive tables, we > should support them. *For Hive tables, we should check Hive's behavior to see > if there is any file format that does not any of above command*. > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java > is a good reference for Hive's behavior. > Also, for a Hive table stored in a format, we need to make sure that even if > Spark can read this tables after an alter column operation. If we cannot read > the table, even Hive allows the alter column operation, we should still throw > an exception. For example, if renaming a column of a Hive parquet table > causes the renamed column inaccessible (we cannot read values), we should not > allow this renaming operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18855) Add RDD flatten function
[ https://issues.apache.org/jira/browse/SPARK-18855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Linbo closed SPARK-18855. - Resolution: Unresolved > Add RDD flatten function > > > Key: SPARK-18855 > URL: https://issues.apache.org/jira/browse/SPARK-18855 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Linbo >Priority: Minor > Labels: flatten, rdd > > A new RDD flatten function is similar to flatten function of scala > collections: > {code:title=spark-shell|borderStyle=solid} > scala> val rdd = sc.makeRDD(List(List(1, 2, 3), List(4, 5), List(6))) > rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at > makeRDD at :24 > scala> rdd.flatten.collect > res0: Array[Int] = Array(1, 2, 3, 4, 5, 6) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18855) Add RDD flatten function
[ https://issues.apache.org/jira/browse/SPARK-18855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753120#comment-15753120 ] Linbo commented on SPARK-18855: --- I tried several ways; the more "Spark" way is to create a TraversableRDDFunctions file with an implicit def rddToTraversableRDDFunctions[U](rdd: RDD[Traversable[U]]) inside the RDD object. But it's hard to make this method generic because class RDD is invariant. I will close this issue; it would be more impactful to add this on Dataset. > Add RDD flatten function > > > Key: SPARK-18855 > URL: https://issues.apache.org/jira/browse/SPARK-18855 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Linbo >Priority: Minor > Labels: flatten, rdd > > A new RDD flatten function is similar to flatten function of scala > collections: > {code:title=spark-shell|borderStyle=solid} > scala> val rdd = sc.makeRDD(List(List(1, 2, 3), List(4, 5), List(6))) > rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at > makeRDD at :24 > scala> rdd.flatten.collect > res0: Array[Int] = Array(1, 2, 3, 4, 5, 6) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
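The extension pattern discussed in the comment above can be sketched on plain Scala collections as a stand-in for RDD (no Spark dependency here; `FlattenSyntax` and `flattenAll` are illustrative names, not a real API). On an actual `RDD[Seq[T]]` the same effect is already available as `rdd.flatMap(identity)`.

```scala
// Enrich Seq[Seq[T]] with a flatten-style method via an implicit class,
// mirroring how an RDD "flatten" could be bolted on without modifying RDD.
object FlattenSyntax {
  implicit class FlattenOps[T](private val xss: Seq[Seq[T]]) {
    def flattenAll: Seq[T] = xss.flatMap(identity)
  }
}
```

After `import FlattenSyntax._`, `Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6)).flattenAll` produces `Seq(1, 2, 3, 4, 5, 6)`, matching the spark-shell example in the issue description.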
[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753093#comment-15753093 ] Brendan Dwyer edited comment on SPARK-18817 at 12/16/16 1:45 AM: - I'm not sure the CRAN people would be okay with that. It might be enough to pass any automatic testing they have but it would still be against their policies. {quote} Limited exceptions may be allowed in interactive sessions if the package *obtains confirmation from the user*. {quote} was (Author: bdwyer): I'm not sure the CRAN people would be okay with that. It might be enough to pass any automatic testing they have but it would still be against their policies. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753093#comment-15753093 ] Brendan Dwyer edited comment on SPARK-18817 at 12/16/16 1:30 AM: - I'm not sure the CRAN people would be okay with that. It might be enough to pass any automatic testing they have but it would still be against their policies. was (Author: bdwyer): I'm not sure the CRAN people would be okay with that. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753095#comment-15753095 ] Brendan Dwyer commented on SPARK-18817: --- {code} library("SparkR") sparkR.session() df <- as.DataFrame(iris) {code} > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753093#comment-15753093 ] Brendan Dwyer commented on SPARK-18817: --- I'm not sure the CRAN people would be okay with that. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files
[ https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753089#comment-15753089 ] Yanbo Liang commented on SPARK-18862: - Great! Will send PR soon. > Split SparkR mllib.R into multiple files > > > Key: SPARK-18862 > URL: https://issues.apache.org/jira/browse/SPARK-18862 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > > SparkR mllib.R is getting bigger as we add more ML wrappers; I'd like to > split it into multiple files to make it easier to maintain: > * mllibClassification.R > * mllibRegression.R > * mllibClustering.R > * mllibFeature.R > or: > * mllib/classification.R > * mllib/regression.R > * mllib/clustering.R > * mllib/features.R > By R convention, the first way is preferred. And I'm not sure whether R > supports the second layout (will check later). Please let me know your > preference. I think the start of a new release cycle is a good opportunity to > do this, since it will involve fewer conflicts. If this proposal is > approved, I can work on it. > cc [~felixcheung] [~josephkb] [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17807: Assignee: Apache Spark > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Assignee: Apache Spark >Priority: Trivial > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17807: Assignee: (was: Apache Spark) > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Priority: Trivial > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753085#comment-15753085 ] Apache Spark commented on SPARK-17807: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/16303 > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Priority: Trivial > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753064#comment-15753064 ] Felix Cheung commented on SPARK-18817: -- Actually, I'm not seeing derby.log or metastore_db in the quick tests I ran: {code} > createOrReplaceTempView(a, "foo") > sql("SELECT * from foo") {code} [~bdwyer] do you have the steps that create these files? > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753048#comment-15753048 ] Felix Cheung commented on SPARK-18817: -- Tested this just now, I still see spark-warehouse when enableHiveSupport = FALSE > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
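Until the defaults change, the behavior this thread is after (writing nothing outside `tempdir()`) can be approximated by pointing the warehouse directory at R's session temp dir. A sketch only, assuming SparkR 2.x and that the `spark.sql.warehouse.dir` configuration key is honored via `sparkConfig`:

```r
library(SparkR)

# Redirect the SQL warehouse into R's session-scoped temporary directory so
# that "spark-warehouse" is not created in the user's working directory.
# (Assumption: sparkConfig entries are passed through as Spark conf keys.)
sparkR.session(
  enableHiveSupport = FALSE,
  sparkConfig = list(
    "spark.sql.warehouse.dir" = file.path(tempdir(), "spark-warehouse")
  )
)
```

This does not address derby.log/metastore_db, which are created by the Hive metastore path discussed below in the thread.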
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753040#comment-15753040 ] Felix Cheung commented on SPARK-18817: -- we could, but we did ship 2.0 with it enabled by default though. perhaps {code} enableHiveSupport = !interactive() {code} as default? shouldn't derby.log and metastore_db go to the warehouse.dir? > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18817: - Comment: was deleted (was: we could, but we did ship 2.0 with it enabled by default though. perhaps {code} enableHiveSupport = !interactive() {code} as default? shouldn't derby.log and metastore_db go to the warehouse.dir? ) > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753041#comment-15753041 ] Felix Cheung commented on SPARK-18817: -- we could, but we did ship 2.0 with it enabled by default though. perhaps {code} enableHiveSupport = !interactive() {code} as default? shouldn't derby.log and metastore_db go to the warehouse.dir? > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753041#comment-15753041 ] Felix Cheung edited comment on SPARK-18817 at 12/16/16 1:03 AM: we could, but we did ship 2.0 with it enabled by default though. perhaps {code} sparkR.session <- function( master = "", appName = "SparkR", sparkHome = Sys.getenv("SPARK_HOME"), sparkConfig = list(), sparkJars = "", sparkPackages = "", - enableHiveSupport = TRUE, + enableHiveSupport = !interactive() ...) { {code} as default? shouldn't derby.log and metastore_db go to the warehouse.dir? was (Author: felixcheung): we could, but we did ship 2.0 with it enabled by default though. perhaps {code} enableHiveSupport = !interactive() {code} as default? shouldn't derby.log and metastore_db go to the warehouse.dir? > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18862) Split SparkR mllib.R into multiple files
[ https://issues.apache.org/jira/browse/SPARK-18862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753032#comment-15753032 ] Felix Cheung commented on SPARK-18862: -- FYI: I reorganized the vignettes based on what's discussed here. https://github.com/apache/spark/pull/16301 > Split SparkR mllib.R into multiple files > > > Key: SPARK-18862 > URL: https://issues.apache.org/jira/browse/SPARK-18862 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > > SparkR mllib.R is getting bigger as we add more ML wrappers; I'd like to > split it into multiple files to make it easier to maintain: > * mllibClassification.R > * mllibRegression.R > * mllibClustering.R > * mllibFeature.R > or: > * mllib/classification.R > * mllib/regression.R > * mllib/clustering.R > * mllib/features.R > By R convention, the first way is preferred. And I'm not sure whether R > supports the second layout (will check later). Please let me know your > preference. I think the start of a new release cycle is a good opportunity to > do this, since it will involve fewer conflicts. If this proposal is > approved, I can work on it. > cc [~felixcheung] [~josephkb] [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18849) Vignettes final checks for Spark 2.1
[ https://issues.apache.org/jira/browse/SPARK-18849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753027#comment-15753027 ] Apache Spark commented on SPARK-18849: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/16301 > Vignettes final checks for Spark 2.1 > > > Key: SPARK-18849 > URL: https://issues.apache.org/jira/browse/SPARK-18849 > Project: Spark > Issue Type: Documentation > Components: Documentation, SparkR >Reporter: Xiangrui Meng >Assignee: Felix Cheung > Fix For: 2.1.0 > > > Make a final pass over the vignettes and ensure the content is consistent. > * remove "since version" because is not that useful for vignettes > * re-order/group the list of ML algorithms so there exists a logical ordering > * check for warning or error in output message > * anything else that seems out of place -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18892) Alias percentile_approx approx_percentile
[ https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18892: Assignee: Reynold Xin (was: Apache Spark) > Alias percentile_approx approx_percentile > - > > Key: SPARK-18892 > URL: https://issues.apache.org/jira/browse/SPARK-18892 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > percentile_approx is the name used in Hive, and approx_percentile is the name > used in Presto. approx_percentile is actually more consistent with our > approx_count_distinct. Given the cost to alias SQL functions is low > (one-liner), it'd be better to just alias them so it is easier to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18892) Alias percentile_approx approx_percentile
[ https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753011#comment-15753011 ] Apache Spark commented on SPARK-18892: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/16300 > Alias percentile_approx approx_percentile > - > > Key: SPARK-18892 > URL: https://issues.apache.org/jira/browse/SPARK-18892 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > percentile_approx is the name used in Hive, and approx_percentile is the name > used in Presto. approx_percentile is actually more consistent with our > approx_count_distinct. Given the cost to alias SQL functions is low > (one-liner), it'd be better to just alias them so it is easier to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18892) Alias percentile_approx approx_percentile
[ https://issues.apache.org/jira/browse/SPARK-18892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18892: Assignee: Apache Spark (was: Reynold Xin) > Alias percentile_approx approx_percentile > - > > Key: SPARK-18892 > URL: https://issues.apache.org/jira/browse/SPARK-18892 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > percentile_approx is the name used in Hive, and approx_percentile is the name > used in Presto. approx_percentile is actually more consistent with our > approx_count_distinct. Given the cost to alias SQL functions is low > (one-liner), it'd be better to just alias them so it is easier to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18892) Alias percentile_approx approx_percentile
Reynold Xin created SPARK-18892: --- Summary: Alias percentile_approx approx_percentile Key: SPARK-18892 URL: https://issues.apache.org/jira/browse/SPARK-18892 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin percentile_approx is the name used in Hive, and approx_percentile is the name used in Presto. approx_percentile is actually more consistent with our approx_count_distinct. Given the cost to alias SQL functions is low (one-liner), it'd be better to just alias them so it is easier to use. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752997#comment-15752997 ] Marcelo Vanzin commented on SPARK-17807: Reopening since this is a real issue (the dependency leaks when you depend on spark-core in Maven and don't have Scalatest as an explicit test dependency in your project). > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Priority: Trivial > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
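Until the scope is fixed upstream, a downstream project affected by the leak can exclude the transitive dependency explicitly. A workaround sketch only; the version and Scala-suffix coordinates below are illustrative:

```xml
<!-- Workaround sketch: keep the leaked scalatest out of compile scope
     by excluding it from the spark-core dependency. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_2.11</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

The proper fix, as the issue title says, is for the spark-tags POM to declare Scalatest with `<scope>test</scope>` so it never reaches consumers' compile classpaths.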
[jira] [Reopened] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-17807: > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Priority: Trivial > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752973#comment-15752973 ] Shivaram Venkataraman commented on SPARK-18817: --- In that case, an easier fix might be to just disable Hive support by default? cc [~felixcheung] > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752951#comment-15752951 ] Brendan Dwyer commented on SPARK-18817: --- [~shivaram] it does not happen if I disable Hive. > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752948#comment-15752948 ] William Shen commented on SPARK-5632: - Thanks [~marmbrus]. I see that the backtick works in 1.5.0 as well (with the limitation on distinct, which is fixed in SPARK-15230). Hopefully this will get sorted out together with SPARK-18084. Thanks again for your help! > not able to resolve dot('.') in field name > -- > > Key: SPARK-5632 > URL: https://issues.apache.org/jira/browse/SPARK-5632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 > Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 > Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 >Reporter: Lishu Liu >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.4.0 > > > My cassandra table task_trace has a field sm.result which contains dot in the > name. So SQL tried to look up sm instead of full name 'sm.result'. > Here is my code: > {code} > scala> import org.apache.spark.sql.cassandra.CassandraSQLContext > scala> val cc = new CassandraSQLContext(sc) > scala> val task_trace = cc.jsonFile("/task_trace.json") > scala> task_trace.registerTempTable("task_trace") > scala> cc.setKeyspace("cerberus_data_v4") > scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, > task_body.sm.result FROM task_trace WHERE task_id = > 'fff7304e-9984-4b45-b10c-0423a96745ce'") > res: org.apache.spark.sql.SchemaRDD = > SchemaRDD[57] at RDD at SchemaRDD.scala:108 > == Query Plan == > == Physical Plan == > java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, > cerberus_id, couponId, coupon_code, created, description, domain, expires, > message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, > sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, > validity > {code} > The full schema look like this: > {code} > scala> task_trace.printSchema() > root > \|-- received_datetime: long 
(nullable = true) > \|-- task_body: struct (nullable = true) > \|\|-- cerberus_batch_id: string (nullable = true) > \|\|-- cerberus_id: string (nullable = true) > \|\|-- couponId: integer (nullable = true) > \|\|-- coupon_code: string (nullable = true) > \|\|-- created: string (nullable = true) > \|\|-- description: string (nullable = true) > \|\|-- domain: string (nullable = true) > \|\|-- expires: string (nullable = true) > \|\|-- message_id: string (nullable = true) > \|\|-- neverShowAfter: string (nullable = true) > \|\|-- neverShowBefore: string (nullable = true) > \|\|-- offerTitle: string (nullable = true) > \|\|-- screenshots: array (nullable = true) > \|\|\|-- element: string (containsNull = false) > \|\|-- sm.result: struct (nullable = true) > \|\|\|-- cerberus_batch_id: string (nullable = true) > \|\|\|-- cerberus_id: string (nullable = true) > \|\|\|-- code: string (nullable = true) > \|\|\|-- couponId: integer (nullable = true) > \|\|\|-- created: string (nullable = true) > \|\|\|-- description: string (nullable = true) > \|\|\|-- domain: string (nullable = true) > \|\|\|-- expires: string (nullable = true) > \|\|\|-- message_id: string (nullable = true) > \|\|\|-- neverShowAfter: string (nullable = true) > \|\|\|-- neverShowBefore: string (nullable = true) > \|\|\|-- offerTitle: string (nullable = true) > \|\|\|-- result: struct (nullable = true) > \|\|\|\|-- post: struct (nullable = true) > \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) > \|\|\|\|\|\|-- ci: double (nullable = true) > \|\|\|\|\|\|-- value: boolean (nullable = true) > \|\|\|\|\|-- meta: struct (nullable = true) > \|\|\|\|\|\|-- None_tx_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- exceptions: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- no_input_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- 
not_mapped: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_transformed: array (nullable = true) > \|\|\|\|\|\|\|-- element: array (containsNull = > false) > \|\|\|\|\|\|\|
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752929#comment-15752929 ] Michael Armbrust commented on SPARK-5632: - Hmm, I agree that error is confusing. It does work if you use [backticks|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880245/2840265927289860/latest.html] (at least with 2.1). I think this falls into the general class of issues where we don't have consistent handling of strings that reference columns. I'm going to link this ticket to [SPARK-18084] (which I've also targeted for investigation in the 2.2 release).
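For readers landing here, the backtick workaround can be sketched as follows (illustrative spark-shell snippet with made-up column names, not the reporter's schema):
{code}
// Backticks quote the entire name, so the analyzer resolves it as one
// literal column instead of parsing the dots as struct-field access.
val df = Seq((1, 2)).toDF("column_1", "column.with.dot")
df.select("`column.with.dot`").show()
// df.select("column.with.dot") would instead be parsed as a nested
// field path and fail to resolve
{code}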
[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18084: - Target Version/s: 2.2.0 > write.partitionBy() does not recognize nested columns that select() can access > -- > > Key: SPARK-18084 > URL: https://issues.apache.org/jira/browse/SPARK-18084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a simple repro in the PySpark shell: > {code} > from pyspark.sql import Row > rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > df = spark.createDataFrame(rdd) > df.printSchema() > df.select('a.b').show() # works > df.write.partitionBy('a.b').text('/tmp/test') # doesn't work > {code} > Here's what I see when I run this: > {code} > >>> from pyspark.sql import Row > >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > >>> df = spark.createDataFrame(rdd) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: long (nullable = true) > >>> df.show() > +---+ > | a| > +---+ > |[5]| > +---+ > >>> df.select('a.b').show() > +---+ > | b| > +---+ > | 5| > +---+ > >>> df.write.partitionBy('a.b').text('/tmp/test') > Traceback (most recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. 
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in > schema > StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py", > line 656, in text > self._jwrite.text(path) > File >
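Until nested columns are supported here, a possible workaround sketch (untested; the flattened column name {{a_b}} is arbitrary) is to materialize the nested field as a top-level column before writing:
{code}
from pyspark.sql import Row

rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
df = spark.createDataFrame(rdd)

# partitionBy() only matches top-level schema fields, so copy a.b out
# into one before partitioning on it.
df.withColumn('a_b', df['a.b']).write.partitionBy('a_b').text('/tmp/test')
{code}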
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752907#comment-15752907 ] Andrew Ash commented on SPARK-18278: There are definitely challenges in building features that take longer than a release cycle (quarterly for Spark). We could maintain a long-running feature branch for spark-k8s that lasts several months and then gets merged into Spark in a big-bang merge, with that feature branch living either on apache/spark or in some other community-accessible repo. I don't think there are many practical differences between in apache/spark vs a different repo for where the source is hosted if both are not in Apache releases. Or we could merge many smaller commits for spark-k8s into the apache/spark master branch along the way and release as an experimental feature when release time comes. This enables more continuous code review but has the risk of destabilizing the master branch if code reviews miss things. Looking to past instances of large features spanning multiple release cycles (like SparkSQL and YARN integration), both of those had work happening primarily in-repo from what I can tell, and releases included large disclaimers in release notes for those experimental features. That precedent seems to suggest Kubernetes integration should follow a similar path. Personally I lean towards the approach of more smaller commits into master rather than a long-running feature branch. By code reviewing PRs into the main repo as we go the feature will be easier to code review and will also get wider feedback as an experimental feature than a side branch or side repo would get. This also serves to include Apache committers from the start in understanding the codebase, rather than foisting a foreign codebase onto the project and hope committers grok it well enough to hold the line on high quality code reviews. 
Looking to the future where Kubernetes integration is potentially included in the mainline apache release (like Mesos and YARN), it's best to work as contributor + committer together from the start for shared understanding. Making an API for third-party cluster managers sounds great and looks like the easy, clean choice from a software engineering point of view, but I wonder how much value the practical benefits of having a pluggable cluster manager actually bring to the Apache project. It seems like both Two Sigma and IBM have been able to maintain their proprietary schedulers without the benefits of the API we're considering building. Who / what workflows are we aiming to support with an API? > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods.
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752902#comment-15752902 ] Shivaram Venkataraman commented on SPARK-18817: --- [~bdwyer] Does this still happen if you disable Hive? One way to test that is to stop the sparkSession and create one with `enableHiveSupport=F`
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752896#comment-15752896 ] William Shen commented on SPARK-5632: - Thank you [~marmbrus] for the speedy response! However I ran into the following issue in 1.5.0, which seems to be the same issue with resolving dot in field name. {noformat} scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> val data = Seq((1,2)).toDF("column_1", "column.with.dot") data: org.apache.spark.sql.DataFrame = [column_1: int, column.with.dot: int] scala> data.select("column.with.dot").collect org.apache.spark.sql.AnalysisException: cannot resolve 'column.with.dot' given input columns column_1, column.with.dot; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752898#comment-15752898 ] Shivaram Venkataraman commented on SPARK-18817: --- Yeah, I don't know how to avoid creating those two -- it doesn't look like it's configurable. cc [~cloud_fan] [~rxin]
[jira] [Resolved] (SPARK-18868) Flaky Test: StreamingQueryListenerSuite
[ https://issues.apache.org/jira/browse/SPARK-18868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18868. -- Resolution: Fixed Assignee: Burak Yavuz Fix Version/s: 2.2.0 2.1.1 > Flaky Test: StreamingQueryListenerSuite > --- > > Key: SPARK-18868 > URL: https://issues.apache.org/jira/browse/SPARK-18868 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.1.1, 2.2.0 > > > Example: > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3496/consoleFull
[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752864#comment-15752864 ] Joseph K. Bradley commented on SPARK-18844: --- Note: Please don't set the Target Version or Fix Version. Committers use those to track releases. Thanks! > Add more binary classification metrics to BinaryClassificationMetrics > - > > Key: SPARK-18844 > URL: https://issues.apache.org/jira/browse/SPARK-18844 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.2 >Reporter: Zak Patterson >Priority: Minor > Labels: evaluation > Original Estimate: 5h > Remaining Estimate: 5h > > BinaryClassificationMetrics only implements Precision (positive predictive > value) and recall (true positive rate). It should implement more > comprehensive metrics. > Moreover, the instance variables storing computed counts are marked private, > and there are no accessors for them. So if one desired to add this > functionality, one would have to duplicate this calculation, which is not > trivial: > https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 > Currently Implemented Metrics > --- > * Precision (PPV): `precisionByThreshold` > * Recall (Sensitivity, true positive rate): `recallByThreshold` > Desired additional metrics > --- > * False omission rate: `forByThreshold` > * False discovery rate: `fdrByThreshold` > * Negative predictive value: `npvByThreshold` > * False negative rate: `fnrByThreshold` > * True negative rate (Specificity): `specificityByThreshold` > * False positive rate: `fprByThreshold` > Alternatives > --- > The `createCurve` method is marked private. If it were marked public, and the > trait BinaryClassificationMetricComputer were also marked public, then it > would be easy to define new computers to get whatever the user wanted. 
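For context, every thresholded metric requested above is a simple ratio over the four confusion-matrix counts that BinaryClassificationMetrics already accumulates per threshold. A plain-Python sketch of the arithmetic (the helper name and dict keys are mine, not Spark's API):

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive the requested metrics from confusion-matrix counts.

    Hypothetical helper: this is what could be exposed per threshold
    if the private counts in BinaryClassificationMetrics were accessible.
    Assumes all four denominators are non-zero.
    """
    return {
        "precision_ppv": tp / (tp + fp),    # positive predictive value
        "recall_tpr": tp / (tp + fn),       # sensitivity
        "for": fn / (fn + tn),              # false omission rate
        "fdr": fp / (fp + tp),              # false discovery rate
        "npv": tn / (tn + fn),              # negative predictive value
        "fnr": fn / (fn + tp),              # miss rate
        "specificity_tnr": tn / (tn + fp),  # true negative rate
        "fpr": fp / (fp + tn),              # fall-out
    }

# Example at one threshold: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```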
[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18844: -- Target Version/s: (was: 2.0.3)
[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18844: -- Fix Version/s: (was: 2.0.2)
[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18844: -- Issue Type: New Feature (was: Improvement) > Add more binary classification metrics to BinaryClassificationMetrics > - > > Key: SPARK-18844 > URL: https://issues.apache.org/jira/browse/SPARK-18844 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.2 >Reporter: Zak Patterson >Priority: Minor > Labels: evaluation > Fix For: 2.0.2 > > Original Estimate: 5h > Remaining Estimate: 5h > > BinaryClassificationMetrics only implements Precision (positive predictive > value) and recall (true positive rate). It should implement more > comprehensive metrics. > Moreover, the instance variables storing computed counts are marked private, > and there are no accessors for them. So if one desired to add this > functionality, one would have to duplicate this calculation, which is not > trivial: > https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144 > Currently Implemented Metrics > --- > * Precision (PPV): `precisionByThreshold` > * Recall (Sensitivity, true positive rate): `recallByThreshold` > Desired additional metrics > --- > * False omission rate: `forByThreshold` > * False discovery rate: `fdrByThreshold` > * Negative predictive value: `npvByThreshold` > * False negative rate: `fnrByThreshold` > * True negative rate (Specificity): `specificityByThreshold` > * False positive rate: `fprByThreshold` > Alternatives > --- > The `createCurve` method is marked private. If it were marked public, and the > trait BinaryClassificationMetricComputer were also marked public, then it > would be easy to define new computers to get whatever the user wanted. 
[jira] [Created] (SPARK-18891) Support for specific collection types
Michael Armbrust created SPARK-18891: Summary: Support for specific collection types Key: SPARK-18891 URL: https://issues.apache.org/jira/browse/SPARK-18891 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.3, 2.1.0 Reporter: Michael Armbrust Priority: Critical Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which forces users to define classes with only the most generic type. An [example error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: {code} case class SpecificCollection(aList: List[Int]) Seq(SpecificCollection(1 :: Nil)).toDS().collect() {code} {code} java.lang.RuntimeException: Error while decoding: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 98, Column 120: No applicable constructor/method found for actual parameters "scala.collection.Seq"; candidates are: "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" {code}
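The compile error above is the generated decoder handing the class constructor a generic container (`scala.collection.Seq`) when the class was declared with a specific one (`immutable.List`). The same shape of failure, modeled in plain Python (not Spark; the class and decoder here are illustrative stand-ins):

```python
# Plain-Python model (not Spark) of the decode failure: the decoder only
# knows one generic representation, while the target class demands a
# specific collection type.
class SpecificCollection:
    def __init__(self, a_list):
        # Stands in for a constructor that only accepts immutable.List.
        if not isinstance(a_list, list):
            raise TypeError(f"expected list, got {type(a_list).__name__}")
        self.a_list = a_list

def decode(row):
    # A decoder that always produces one generic container -- here a
    # tuple, standing in for scala.collection.Seq.
    return SpecificCollection(tuple(row))

try:
    decode([1, 2, 3])
    decode_failed = False
except TypeError:
    decode_failed = True  # same failure mode as the CompileException above
```

Supporting specific collection types would mean the decoder converting to the declared type before calling the constructor, rather than always emitting the generic one.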
[jira] [Commented] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default
[ https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752844#comment-15752844 ] Brendan Dwyer commented on SPARK-18817: --- I'm also seeing _derby.log_ and a folder named _metastore_db_ being created in my working directory when I run the following: {code} library("SparkR") sparkR.session() df <- as.DataFrame(iris) {code} > Ensure nothing is written outside R's tempdir() by default > -- > > Key: SPARK-18817 > URL: https://issues.apache.org/jira/browse/SPARK-18817 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Brendan Dwyer >Priority: Critical > > Per CRAN policies > https://cran.r-project.org/web/packages/policies.html > {quote} > - Packages should not write in the users’ home filespace, nor anywhere else > on the file system apart from the R session’s temporary directory (or during > installation in the location pointed to by TMPDIR: and such usage should be > cleaned up). Installing into the system’s R installation (e.g., scripts to > its bin directory) is not allowed. > Limited exceptions may be allowed in interactive sessions if the package > obtains confirmation from the user. > - Packages should not modify the global environment (user’s workspace). > {quote} > Currently "spark-warehouse" gets created in the working directory when > sparkR.session() is called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-18890: --- Description: As part of benchmarking this change: https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I found that moving task serialization from TaskSetManager (which happens as part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to approximately a 10% reduction in job runtime for a job that counted 10,000 partitions (that each had 1 int) using 20 machines. Similar performance improvements were reported in the pull request linked above. This would appear to be because the TaskSchedulerImpl thread is the bottleneck, so moving serialization to CGSB reduces runtime. This change may *not* improve runtime (and could potentially worsen runtime) in scenarios where the CGSB thread is the bottleneck (e.g., if tasks are very large, so calling launch to send the tasks to the executor blocks on the network). One benefit of implementing this change is that it makes it easier to parallelize the serialization of tasks (different tasks could be serialized by different threads). Another benefit is that all of the serialization occurs in the same place (currently, the Task is serialized in TaskSetManager, and the TaskDescription is serialized in CGSB). I'm not totally convinced we should fix this because it seems like there are better ways of reducing the serialization time (e.g., by re-using a single serialized object with the Task/jars/files and broadcasting it for each stage) but I wanted to open this JIRA to document the discussion. 
cc [~witgo] was: As part of benchmarking this change: https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I found that moving task serialization from TaskSetManager (which happens as part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to approximately a 10% reduction in job runtime for a job that counted 10,000 partitions (that each had 1 int) using 20 machines. Similar performance improvements were reported in the pull request linked above. This would appear to be because the TaskSchedulerImpl thread is the bottleneck, so moving serialization to CGSB reduces runtime. This change may *not* improve runtime (and could potentially worsen runtime) in scenarios where the CGSB thread is the bottleneck (e.g., if tasks are very large, so calling launch to send the tasks to the executor blocks on the network). One benefit of implementing this change is that it makes it easier to parallelize the serialization of tasks (different tasks could be serialized by different threads). Another benefit is that all of the serialization occurs in the same place (currently, the Task is serialized in TaskSetManager, and the TaskDescription is serialized in CGSB). I'm not totally convinced we should fix this because it seems like there are better ways of reducing the serialization time (e.g., by re-using the Task object within a stage) but I wanted to open this JIRA to document the discussion. 
cc [~witgo] > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). > I'm not totally convinced we should
[jira] [Created] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
Kay Ousterhout created SPARK-18890: -- Summary: Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl) Key: SPARK-18890 URL: https://issues.apache.org/jira/browse/SPARK-18890 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.0 Reporter: Kay Ousterhout Priority: Minor As part of benchmarking this change: https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I found that moving task serialization from TaskSetManager (which happens as part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to approximately a 10% reduction in job runtime for a job that counted 10,000 partitions (that each had 1 int) using 20 machines. Similar performance improvements were reported in the pull request linked above. This would appear to be because the TaskSchedulerImpl thread is the bottleneck, so moving serialization to CGSB reduces runtime. This change may *not* improve runtime (and could potentially worsen runtime) in scenarios where the CGSB thread is the bottleneck (e.g., if tasks are very large, so calling launch to send the tasks to the executor blocks on the network). One benefit of implementing this change is that it makes it easier to parallelize the serialization of tasks (different tasks could be serialized by different threads). Another benefit is that all of the serialization occurs in the same place (currently, the Task is serialized in TaskSetManager, and the TaskDescription is serialized in CGSB). I'm not totally convinced we should fix this because it seems like there are better ways of reducing the serialization time (e.g., by re-using the Task object within a stage) but I wanted to open this JIRA to document the discussion. cc [~witgo]
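The proposed split can be pictured with a producer/consumer pair: the scheduler thread only chooses tasks, and a separate backend thread does the CPU-heavy serialization, so the scheduler loop is never blocked on it. A sketch in plain Python (not Spark's internals; the thread and queue names are illustrative):

```python
import pickle
import queue
import threading

# Sketch (plain Python, not Spark) of moving serialization off the
# scheduler thread: the scheduler only enqueues chosen tasks; the
# backend thread serializes them before "launching".
task_queue = queue.Queue()
serialized_on = []  # records which thread actually ran pickle.dumps

def scheduler(tasks):
    for t in tasks:
        task_queue.put(t)   # no serialization here -- stays cheap
    task_queue.put(None)    # sentinel: no more tasks

def backend():
    while True:
        t = task_queue.get()
        if t is None:
            break
        pickle.dumps(t)     # serialization happens on this thread
        serialized_on.append(threading.current_thread().name)

b = threading.Thread(target=backend, name="backend")
b.start()
scheduler([("task", i) for i in range(5)])
b.join()
```

As the issue notes, this only helps while the scheduler thread (not the backend thread) is the bottleneck, and the same structure extends naturally to a pool of serializing threads.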
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752771#comment-15752771 ] Michael Armbrust commented on SPARK-5632: - If you expand the commit you'll see it's included in many tags. The "fix version" here is 1.4, which means it was released with 1.4 and all subsequent versions. > not able to resolve dot('.') in field name > -- > > Key: SPARK-5632 > URL: https://issues.apache.org/jira/browse/SPARK-5632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 > Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 > Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 >Reporter: Lishu Liu >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.4.0 > > > My cassandra table task_trace has a field sm.result which contains dot in the > name. So SQL tried to look up sm instead of full name 'sm.result'. > Here is my code: > {code} > scala> import org.apache.spark.sql.cassandra.CassandraSQLContext > scala> val cc = new CassandraSQLContext(sc) > scala> val task_trace = cc.jsonFile("/task_trace.json") > scala> task_trace.registerTempTable("task_trace") > scala> cc.setKeyspace("cerberus_data_v4") > scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, > task_body.sm.result FROM task_trace WHERE task_id = > 'fff7304e-9984-4b45-b10c-0423a96745ce'") > res: org.apache.spark.sql.SchemaRDD = > SchemaRDD[57] at RDD at SchemaRDD.scala:108 > == Query Plan == > == Physical Plan == > java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, > cerberus_id, couponId, coupon_code, created, description, domain, expires, > message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, > sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, > validity > {code} > The full schema looks like this: > {code} > scala> task_trace.printSchema() > root > \|-- received_datetime: long (nullable = true) > \|-- task_body: struct (nullable = 
true) > \|\|-- cerberus_batch_id: string (nullable = true) > \|\|-- cerberus_id: string (nullable = true) > \|\|-- couponId: integer (nullable = true) > \|\|-- coupon_code: string (nullable = true) > \|\|-- created: string (nullable = true) > \|\|-- description: string (nullable = true) > \|\|-- domain: string (nullable = true) > \|\|-- expires: string (nullable = true) > \|\|-- message_id: string (nullable = true) > \|\|-- neverShowAfter: string (nullable = true) > \|\|-- neverShowBefore: string (nullable = true) > \|\|-- offerTitle: string (nullable = true) > \|\|-- screenshots: array (nullable = true) > \|\|\|-- element: string (containsNull = false) > \|\|-- sm.result: struct (nullable = true) > \|\|\|-- cerberus_batch_id: string (nullable = true) > \|\|\|-- cerberus_id: string (nullable = true) > \|\|\|-- code: string (nullable = true) > \|\|\|-- couponId: integer (nullable = true) > \|\|\|-- created: string (nullable = true) > \|\|\|-- description: string (nullable = true) > \|\|\|-- domain: string (nullable = true) > \|\|\|-- expires: string (nullable = true) > \|\|\|-- message_id: string (nullable = true) > \|\|\|-- neverShowAfter: string (nullable = true) > \|\|\|-- neverShowBefore: string (nullable = true) > \|\|\|-- offerTitle: string (nullable = true) > \|\|\|-- result: struct (nullable = true) > \|\|\|\|-- post: struct (nullable = true) > \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) > \|\|\|\|\|\|-- ci: double (nullable = true) > \|\|\|\|\|\|-- value: boolean (nullable = true) > \|\|\|\|\|-- meta: struct (nullable = true) > \|\|\|\|\|\|-- None_tx_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- exceptions: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- no_input_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_mapped: array (nullable = true) > \|\|\|\|\|\|\|-- 
element: string (containsNull = > false) > \|\|\|\|\|\|-- not_transformed: array (nullable = true) > \|\|\|\|\|\|\|-- element: array (containsNull = > false) > \|\|\|\|\|\|\|\|-- element: string (containsNull > = false) > \|\|
[jira] [Commented] (SPARK-8425) Add blacklist mechanism for task scheduling
[ https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752760#comment-15752760 ] Apache Spark commented on SPARK-8425: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/16298 > Add blacklist mechanism for task scheduling > --- > > Key: SPARK-8425 > URL: https://issues.apache.org/jira/browse/SPARK-8425 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Reporter: Saisai Shao >Assignee: Mao, Wei >Priority: Minor > Fix For: 2.2.0 > > Attachments: DesignDocforBlacklistMechanism.pdf > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752747#comment-15752747 ] William Shen commented on SPARK-5632: - Is this still targeted for 1.4.0 as indicated in JIRA (or was it released with 1.4.0)? The git commit is tagged with v2.1.0-rc3; can someone confirm if it has been moved to 2.1.0? Thanks! > not able to resolve dot('.') in field name > -- > > Key: SPARK-5632 > URL: https://issues.apache.org/jira/browse/SPARK-5632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.2.0, 1.3.0 > Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 > Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 >Reporter: Lishu Liu >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.4.0 > > > My cassandra table task_trace has a field sm.result which contains dot in the > name. So SQL tried to look up sm instead of full name 'sm.result'. > Here is my code: > {code} > scala> import org.apache.spark.sql.cassandra.CassandraSQLContext > scala> val cc = new CassandraSQLContext(sc) > scala> val task_trace = cc.jsonFile("/task_trace.json") > scala> task_trace.registerTempTable("task_trace") > scala> cc.setKeyspace("cerberus_data_v4") > scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, > task_body.sm.result FROM task_trace WHERE task_id = > 'fff7304e-9984-4b45-b10c-0423a96745ce'") > res: org.apache.spark.sql.SchemaRDD = > SchemaRDD[57] at RDD at SchemaRDD.scala:108 > == Query Plan == > == Physical Plan == > java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, > cerberus_id, couponId, coupon_code, created, description, domain, expires, > message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, > sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, > validity > {code} > The full schema looks like this: > {code} > scala> task_trace.printSchema() > root > \|-- received_datetime: long (nullable = true) > \|-- task_body: 
struct (nullable = true) > \|\|-- cerberus_batch_id: string (nullable = true) > \|\|-- cerberus_id: string (nullable = true) > \|\|-- couponId: integer (nullable = true) > \|\|-- coupon_code: string (nullable = true) > \|\|-- created: string (nullable = true) > \|\|-- description: string (nullable = true) > \|\|-- domain: string (nullable = true) > \|\|-- expires: string (nullable = true) > \|\|-- message_id: string (nullable = true) > \|\|-- neverShowAfter: string (nullable = true) > \|\|-- neverShowBefore: string (nullable = true) > \|\|-- offerTitle: string (nullable = true) > \|\|-- screenshots: array (nullable = true) > \|\|\|-- element: string (containsNull = false) > \|\|-- sm.result: struct (nullable = true) > \|\|\|-- cerberus_batch_id: string (nullable = true) > \|\|\|-- cerberus_id: string (nullable = true) > \|\|\|-- code: string (nullable = true) > \|\|\|-- couponId: integer (nullable = true) > \|\|\|-- created: string (nullable = true) > \|\|\|-- description: string (nullable = true) > \|\|\|-- domain: string (nullable = true) > \|\|\|-- expires: string (nullable = true) > \|\|\|-- message_id: string (nullable = true) > \|\|\|-- neverShowAfter: string (nullable = true) > \|\|\|-- neverShowBefore: string (nullable = true) > \|\|\|-- offerTitle: string (nullable = true) > \|\|\|-- result: struct (nullable = true) > \|\|\|\|-- post: struct (nullable = true) > \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) > \|\|\|\|\|\|-- ci: double (nullable = true) > \|\|\|\|\|\|-- value: boolean (nullable = true) > \|\|\|\|\|-- meta: struct (nullable = true) > \|\|\|\|\|\|-- None_tx_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- exceptions: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- no_input_value: array (nullable = true) > \|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_mapped: array (nullable = true) > 
\|\|\|\|\|\|\|-- element: string (containsNull = > false) > \|\|\|\|\|\|-- not_transformed: array (nullable = true) > \|\|\|\|\|\|\|-- element: array (containsNull = > false) > \|\|\|\|\|\|\|\|-- element: string (containsNull >
[jira] [Updated] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-17931: --- Component/s: (was: Spark Core) Scheduler > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Guoqiang Li > > In the existing code, there are three layers of serialization > involved in sending a task from the scheduler to an executor: > - A Task object is serialized > - The Task object is copied to a byte buffer that also > contains serialized information about any additional JARs, > files, and Properties needed for the task to execute. This > byte buffer is stored as the member variable serializedTask > in the TaskDescription class. > - The TaskDescription is serialized (in addition to the serialized > task + JARs, the TaskDescription class contains the task ID and > other metadata) and sent in a LaunchTask message. > While it is necessary to have two layers of serialization, so that > the JAR, file, and Property info can be deserialized prior to > deserializing the Task object, the third layer of deserialization is > unnecessary (this is as a result of SPARK-2521). We should > eliminate a layer of serialization by moving the JARs, files, and Properties > into the TaskDescription class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-17931: --- Description: In the existing code, there are three layers of serialization involved in sending a task from the scheduler to an executor: - A Task object is serialized - The Task object is copied to a byte buffer that also contains serialized information about any additional JARs, files, and Properties needed for the task to execute. This byte buffer is stored as the member variable serializedTask in the TaskDescription class. - The TaskDescription is serialized (in addition to the serialized task + JARs, the TaskDescription class contains the task ID and other metadata) and sent in a LaunchTask message. While it is necessary to have two layers of serialization, so that the JAR, file, and Property info can be deserialized prior to deserializing the Task object, the third layer of deserialization is unnecessary (this is as a result of SPARK-2521). We should eliminate a layer of serialization by moving the JARs, files, and Properties into the TaskDescription class. was: When taskScheduler instantiates TaskDescription, it calls `Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)`. It serializes task and its dependency. But after SPARK-2521 has been merged into the master, the ResultTask class and ShuffleMapTask class no longer contain rdd and closure objects. 
TaskDescription class can be changed as below:
{noformat}
class TaskDescription[T](
    val taskId: Long,
    val attemptNumber: Int,
    val executorId: String,
    val name: String,
    val index: Int,
    val task: Task[T])
  extends Serializable
{noformat}
> taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Guoqiang Li > > In the existing code, there are three layers of serialization > involved in sending a task from the scheduler to an executor: > - A Task object is serialized > - The Task object is copied to a byte buffer that also > contains serialized information about any additional JARs, > files, and Properties needed for the task to execute. This > byte buffer is stored as the member variable serializedTask > in the TaskDescription class. > - The TaskDescription is serialized (in addition to the serialized > task + JARs, the TaskDescription class contains the task ID and > other metadata) and sent in a LaunchTask message. > While it is necessary to have two layers of serialization, so that > the JAR, file, and Property info can be deserialized prior to > deserializing the Task object, the third layer of deserialization is > unnecessary (this is as a result of SPARK-2521). We should > eliminate a layer of serialization by moving the JARs, files, and Properties > into the TaskDescription class.
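The three layers described above, and the proposed flattening to two, can be modeled with `pickle` standing in for Spark's serializers (plain Python, not Spark; field names are illustrative):

```python
import pickle

# Model (plain Python, not Spark) of the serialization layers described
# above, with pickle standing in for Spark's serializers.
task = {"stage": 1, "partition": 7}
jars, files, props = ["app.jar"], [], {"spark.user": "x"}

# Current shape: three passes.
serialized_task = pickle.dumps(task)                                # 1: Task
with_deps = pickle.dumps((jars, files, props, serialized_task))     # 2: deps + task
launch_msg_3 = pickle.dumps({"taskId": 42, "payload": with_deps})   # 3: TaskDescription

# Proposed shape: jars/files/props live directly in the TaskDescription,
# so only two passes remain -- deps are still readable before the Task
# itself is deserialized.
launch_msg_2 = pickle.dumps(
    {"taskId": 42, "jars": jars, "files": files, "props": props,
     "task": serialized_task})

desc = pickle.loads(launch_msg_2)       # deps available here...
inner_task = pickle.loads(desc["task"]) # ...Task deserialized separately
```

The two-layer form keeps the one property that matters: the JARs, files, and Properties can be deserialized before the Task object is.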
[jira] [Resolved] (SPARK-12777) Dataset fields can't be Scala tuples
[ https://issues.apache.org/jira/browse/SPARK-12777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12777. -- Resolution: Fixed Fix Version/s: 2.1.0 This works in 2.1: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/408017793305293/2840265927289860/latest.html > Dataset fields can't be Scala tuples > > > Key: SPARK-12777 > URL: https://issues.apache.org/jira/browse/SPARK-12777 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Chris Jansen > Fix For: 2.1.0 > > > Datasets can't seem to handle scala tuples as fields of case classes in > datasets. > {code} > Seq((1,2), (3,4)).toDS().show() //works > {code} > When including a tuple as a field, the code fails: > {code} > case class Test(v: (Int, Int)) > Seq(Test((1,2)), Test((3,4)).toDS().show //fails > {code} > {code} > UnresolvedException: : Invalid call to dataType on unresolved object, tree: > 'name (unresolved.scala:59) > > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:59) > > org.apache.spark.sql.catalyst.expressions.GetStructField.org$apache$spark$sql$catalyst$expressions$GetStructField$$field$lzycompute(complexTypeExtractors.scala:107) > > org.apache.spark.sql.catalyst.expressions.GetStructField.org$apache$spark$sql$catalyst$expressions$GetStructField$$field(complexTypeExtractors.scala:107) > > org.apache.spark.sql.catalyst.expressions.GetStructField$$anonfun$toString$1.apply(complexTypeExtractors.scala:111) > > org.apache.spark.sql.catalyst.expressions.GetStructField$$anonfun$toString$1.apply(complexTypeExtractors.scala:111) > > org.apache.spark.sql.catalyst.expressions.GetStructField.toString(complexTypeExtractors.scala:111) > > org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:217) > > org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:217) > > 
org.apache.spark.sql.catalyst.expressions.If.toString(conditionalExpressions.scala:76) > > org.apache.spark.sql.catalyst.expressions.Expression.toString(Expression.scala:217) > > org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:155) > > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argString$1.apply(TreeNode.scala:385) > > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argString$1.apply(TreeNode.scala:381) > org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:388) > org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:391) > > org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:172) > > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:441) > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:396) > > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$5.apply(RuleExecutor.scala:118) > > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$5.apply(RuleExecutor.scala:119) > org.apache.spark.Logging$class.logDebug(Logging.scala:62) > > org.apache.spark.sql.catalyst.rules.RuleExecutor.logDebug(RuleExecutor.scala:44) > > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:115) > > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72) > > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72) > > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:253) > org.apache.spark.sql.Dataset.(Dataset.scala:78) > org.apache.spark.sql.Dataset.(Dataset.scala:89) > org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:507) > > org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:80) > {code} > When providing a type alias, the code fails in a different way: > {code} > type TwoInt = (Int, Int) > 
case class Test(v: TwoInt) > Seq(Test((1,2)), Test((3,4)).toDS().show //fails > {code} > {code} > NoSuchElementException: : head of empty list (ScalaReflection.scala:504) > > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor$1.apply(ScalaReflection.scala:504) > > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor$1.apply(ScalaReflection.scala:502) > > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:502) > >
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752722#comment-15752722 ] Imran Rashid commented on SPARK-18886: -- [~mridul] sorry if I am being slow here, but do you mind spelling out for me in more detail? I'm *not* asking about the benefits of using locality preferences -- I get that part. I'm asking about why the *delay*. There has to be something happening during the delay which we want to wait for. One possibility is that you've got multiple tasksets running concurrently, with different locality preferences. You wouldn't want the first taskset to use all the resources, you'd rather take both tasksets into account. This is accomplished with delay scheduling, but you don't actually *need* the delay. Another possibility is that there is such a huge gap in runtime that you expect your preferred locations will finish *all* tasks in the taskset before that delay is up, by having some executors run multiple tasks. The reason I'm trying to figure this out is to figure out if there is a sensible fix here (and what the smallest possible fix would be). If this is it, then the fix I suggested above to Mark should handle this case, while still working as intended in other cases. > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources > 2. Tasks finish in less time than the delay scheduling. > Instead of having *one* delay to wait for resources with better locality, > spark waits indefinitely. 
> As an example, consider a cluster with 100 executors, and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. Given the default locality wait of 3s per level, we end up with > a 6s delay till we schedule on other hosts (process wait + host wait). > If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds. > *WORKAROUNDS*: > (1) You can change the locality wait times so that it is shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". Those each default to "3s". The simplest way would be > to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". > Note that this workaround isn't perfect -- with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their locality preferences. Users may be able to modify their job to > control the distribution of the original input data. > (2a) A shuffle may end up with very skewed locality preferences, especially > if you do a repartition starting from a small number of partitions. 
(Shuffle > locality preference is assigned if any node has more than 20% of the shuffle > input data -- by chance, you may have one node just above that threshold, and > all other nodes just below it.) In this case, you can turn off locality > preference for shuffle data by setting > {{spark.shuffle.reduceLocality.enabled=false}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
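The arithmetic behind the ~100x slowdown in the description above can be sanity-checked with a short sketch (plain Python, no Spark; the numbers are exactly the ones from the example):

```python
import math

# Worst case from the description: 500 tasks, 100 executors, every task
# prefers the single executor on one host, default locality waits of 3s
# per level, and each task takes 5s.
num_tasks = 500
num_executors = 100
task_seconds = 5.0

# Process wait + host (node) wait before falling back to other hosts.
fallback_delay = 3.0 + 3.0  # 6s

# Each 5s task finishes before the 6s fallback delay expires, so the
# delay timer keeps resetting and every task runs serially on the one
# preferred executor.
serial_seconds = num_tasks * task_seconds  # 2500s

# Without the indefinite delay, the taskset spreads across all executors.
parallel_seconds = math.ceil(num_tasks / num_executors) * task_seconds  # 25s

slowdown = serial_seconds / parallel_seconds
print(slowdown)  # 100.0 -- the ~100x slowdown claimed above
```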
[jira] [Resolved] (SPARK-18888) partitionBy in DataStreamWriter in Python throws _to_seq not defined
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-1. --- Resolution: Fixed Assignee: Burak Yavuz Fix Version/s: 2.1.0 > partitionBy in DataStreamWriter in Python throws _to_seq not defined > > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming >Affects Versions: 2.0.2 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Blocker > Fix For: 2.1.0 > > > {code} > python/pyspark/sql/streaming.py in partitionBy(self, *cols) > 716 if len(cols) == 1 and isinstance(cols[0], (list, tuple)): > 717 cols = cols[0] > --> 718 self._jwrite = > self._jwrite.partitionBy(_to_seq(self._spark._sc, cols)) > 719 return self > 720 > NameError: global name '_to_seq' is not defined > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
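The traceback above is a plain Python scoping bug rather than anything JVM-side: `partitionBy` refers to the helper `_to_seq` without it being imported into `streaming.py`. A minimal, Spark-free sketch of the same failure mode (the class and method names here are made up for illustration):

```python
# A method that references a global name that was never imported or
# defined fails only when the method is called, with the same NameError
# as in the traceback.
class Writer:
    def partition_by(self, *cols):
        # _to_seq is not defined anywhere in this module -- analogous to
        # streaming.py calling _to_seq without importing it from
        # pyspark.sql.column.
        return _to_seq(cols)  # noqa: F821

try:
    Writer().partition_by("a", "b")
except NameError as e:
    print(e)  # name '_to_seq' is not defined
```

Note that defining the class succeeds; the missing name is only resolved at call time, which is why the bug surfaced at `partitionBy()` rather than at import.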
[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable
[ https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18850: - Target Version/s: 2.1.0 > Make StreamExecution and progress classes serializable > -- > > Key: SPARK-18850 > URL: https://issues.apache.org/jira/browse/SPARK-18850 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Make StreamExecution and progress classes serializable because it is too easy > for it to get captured with normal usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable
[ https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18850: - Affects Version/s: 2.1.0 > Make StreamExecution and progress classes serializable > -- > > Key: SPARK-18850 > URL: https://issues.apache.org/jira/browse/SPARK-18850 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Make StreamExecution and progress classes serializable because it is too easy > for it to get captured with normal usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable
[ https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18850: - Summary: Make StreamExecution and progress classes serializable (was: Make StreamExecution serializable) > Make StreamExecution and progress classes serializable > -- > > Key: SPARK-18850 > URL: https://issues.apache.org/jira/browse/SPARK-18850 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Make StreamExecution serializable because it is too easy for it to get > captured with normal usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18850) Make StreamExecution and progress classes serializable
[ https://issues.apache.org/jira/browse/SPARK-18850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18850: - Description: Make StreamExecution and progress classes serializable because it is too easy for it to get captured with normal usage. (was: Make StreamExecution serializable because it is too easy for it to get captured with normal usage.) > Make StreamExecution and progress classes serializable > -- > > Key: SPARK-18850 > URL: https://issues.apache.org/jira/browse/SPARK-18850 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Make StreamExecution and progress classes serializable because it is too easy > for it to get captured with normal usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-16178. - Resolution: Won't Fix > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
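The check added by SPARK-14459 is, in essence, a name comparison between the columns named by the writer and the table's partition columns. A rough, Spark-free sketch of why the repro above fails and why renaming the columns first would pass (the function here is illustrative, not Spark's actual implementation):

```python
def partition_names_match(write_partitions, table_partitions):
    """Mimic the check: the partition columns named by the writer must
    match the table's partition column names exactly."""
    return list(write_partitions) == list(table_partitions)

table_partitions = ["utc_dateint", "utc_hour"]  # from CREATE TABLE dest

# partitionBy("date", "hour") carries the source column names, which do
# not match the "utc_"-prefixed target names, so the write fails ...
assert not partition_names_match(["date", "hour"], table_partitions)

# ... even though the types line up and the analyzer could insert an
# Alias. Renaming the columns to the target names first (e.g. with
# withColumnRenamed) makes the same check pass.
assert partition_names_match(["utc_dateint", "utc_hour"], table_partitions)
print("name check behaves as described")
```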
[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752636#comment-15752636 ] Dongjoon Hyun commented on SPARK-16178: --- Thank you! Then, I'll close this as Won't Fix. > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752632#comment-15752632 ] Ryan Blue commented on SPARK-16178: --- Sure. I think the result was Won't Fix. > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752579#comment-15752579 ] Mridul Muralidharan edited comment on SPARK-18886 at 12/15/16 9:35 PM: --- [~imranr] For almost all cases, delay scheduling dramatically increases performance. The difference even between PROCESS and NODE is significantly high (between NODE and 'lower' levels, it can depend on your network config). For both tasks with short duration and tasks processing large amounts of data, it has non trivial impact : long tasks processing small data, it is not so useful in comparison iirc, same for degenerate cases where locality preference is suboptimal to begin with. [As an aside, the ability to not specify PROCESS level locality preference actually is a drawback in our api] The job(s) I mentioned where we set it to 0 were special cases, where we knew the costs well enough to make the decision to lower it : but I would not recommend it unless users are very sure of what they are doing. While analysing the cost, it should also be kept in mind that transferring data across nodes impacts not just spark job, but every other job in the cluster. was (Author: mridulm80): [~imranr] For almost all cases, delay scheduling dramatically increases performance. The difference even between PROCESS and NODE is significantly high (between NODE and 'lower' levels, it can depend on your network config). For both tasks with short duration and tasks processing large amounts of data, it has non trivial impact : long tasks processing small data, it is not so useful in comparison iirc, same for degenerate cases where locality preference is suboptimal to begin with. 
[As an aside, the ability to not specify PROCESS level locality actually is a drawback in our api] The job(s) I mentioned where we set it to 0 were special cases, where we knew the costs well enough to make the decision to lower it : but I would not recommend it unless users are very sure of what they are doing. While analysing the cost, it should also be kept in mind that transferring data across nodes impacts not just spark job, but every other job in the cluster. > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources > 2. Tasks finish in less time than the delay scheduling. > Instead of having *one* delay to wait for resources with better locality, > spark waits indefinitely. > As an example, consider a cluster with 100 executors, and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. Given the default locality wait of 3s per level, we end up with > a 6s delay till we schedule on other hosts (process wait + host wait). > If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using a 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds. > *WORKAROUNDS*: > (1) You can change the locality wait times so that it is shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. 
For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". Those each default to "3s". The simplest way to > be to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". > Note that this workaround isn't perfect --with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their locality preferences. Users may be able to modify their job to > controlling the distribution of the original input data. > (2a) A shuffle may end up with very skewed locality preferences, especially > if you do a repartition starting from a small number of
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752579#comment-15752579 ] Mridul Muralidharan commented on SPARK-18886: - [~imranr] For almost all cases, delay scheduling dramatically increases performance. The difference even between PROCESS and NODE is significantly high (between NODE and 'lower' levels, it can depend on your network config). For both tasks with short duration and tasks processing large amounts of data, it has non trivial impact : long tasks processing small data, it is not so useful in comparison iirc, same for degenerate cases where locality preference is suboptimal to begin with. [As an aside, the ability to not specify PROCESS level locality actually is a drawback in our api] The job(s) I mentioned where we set it to 0 were special cases, where we knew the costs well enough to make the decision to lower it : but I would not recommend it unless users are very sure of what they are doing. While analysing the cost, it should also be kept in mind that transferring data across nodes impacts not just spark job, but every other job in the cluster. > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources > 2. Tasks finish in less time than the delay scheduling. > Instead of having *one* delay to wait for resources with better locality, > spark waits indefinitely. > As an example, consider a cluster with 100 executors, and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. 
Given the default locality wait of 3s per level, we end up with > a 6s delay till we schedule on other hosts (process wait + host wait). > If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using a 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds. > *WORKAROUNDS*: > (1) You can change the locality wait times so that it is shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". Those each default to "3s". The simplest way to > be to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". > Note that this workaround isn't perfect --with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their locality preferences. Users may be able to modify their job to > controlling the distribution of the original input data. > (2a) A shuffle may end up with very skewed locality preferences, especially > if you do a repartition starting from a small number of partitions. (Shuffle > locality preference is assigned if any node has more than 20% of the shuffle > input data -- by chance, you may have one node just above that threshold, and > all other nodes just below it.) 
In this case, you can turn off locality > preference for shuffle data by setting > {{spark.shuffle.reduceLocality.enabled=false}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752548#comment-15752548 ] Dongjoon Hyun commented on SPARK-16178: --- Hi, [~rdblue]. The PR seems to be closed. I'm wondering we can close this issue. This issue is currently a subtask of SPARK-16032 for 2.1.0. > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18889) Spark incorrectly reads default columns from a Hive view
Salil Surendran created SPARK-18889: --- Summary: Spark incorrectly reads default columns from a Hive view Key: SPARK-18889 URL: https://issues.apache.org/jira/browse/SPARK-18889 Project: Spark Issue Type: Bug Reporter: Salil Surendran Spark fails to read a view that has columns that are given default names. To reproduce, follow these steps in Hive: * CREATE TABLE IF NOT EXISTS employee_details ( eid int, name String, salary String, destination String, json String) COMMENT 'Employee details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE; * insert into employee_details values(100, "Salil", "100k", "Mumbai", s"""{"Foo":"ABC","Bar":"2009010110","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}""" ) * create view employee_25 as select eid, name, `_c4` from (select eid, name, destination,v1.foo, cast(v1.bar as timestamp) from employee_details LATERAL VIEW json_tuple(json,'Foo','Bar')v1 as foo, bar)v2; * select * from employee_25; You will see an output like this: +--+---+--+--+ | employee_25.eid | employee_25.name | employee_25._c4 | +--+---+--+--+ | 100 | Salil | NULL | +--+---+--+--+ Now go to spark-shell and try to query the view: scala> spark.sql("select * from employee_25").show org.apache.spark.sql.AnalysisException: cannot resolve '`v2._c4`' given input columns: [foo, name, eid, bar, destination]; line 1 pos 32; 'Project [*] +- 'SubqueryAlias employee_25 +- 'Project [eid#56, name#57, 'v2._c4] +- SubqueryAlias v2 +- Project [eid#56, name#57, destination#59, foo#61, cast(bar#62 as timestamp) AS bar#63] +- Generate json_tuple(json#60, Foo, Bar), true, false, v1, [foo#61, bar#62] +- MetastoreRelation default, employee_details at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:308) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125) at scala.collection.immutable.List.foreach(List.scala:381) at
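The `_c4` in the repro comes from Hive's convention of naming an unaliased select expression `_c<position>`: the unaliased `cast(v1.bar as timestamp)` sits at position 4 of the inner select, so Hive calls it `_c4`, while Spark's analyzed plan (shown above) names the same expression `bar#63`, which is why `v2._c4` fails to resolve. A small sketch of that naming convention (not Hive's code, just an illustration):

```python
def default_column_names(select_exprs):
    """Give each unaliased expression the positional name _c<i>,
    mirroring Hive's default-naming convention for select lists.
    select_exprs: list of (expression, alias-or-None) pairs."""
    return [alias if alias else "_c%d" % i
            for i, (expr, alias) in enumerate(select_exprs)]

# The inner select of the view: eid, name, destination, v1.foo,
# cast(v1.bar as timestamp). Only the cast lacks an alias, so it gets
# the positional name _c4.
exprs = [("eid", "eid"), ("name", "name"), ("destination", "destination"),
         ("v1.foo", "foo"), ("cast(v1.bar as timestamp)", None)]
print(default_column_names(exprs))
# ['eid', 'name', 'destination', 'foo', '_c4']
```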
[jira] [Resolved] (SPARK-18826) Make FileStream be able to start with most recent files
[ https://issues.apache.org/jira/browse/SPARK-18826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-18826. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 16251 [https://github.com/apache/spark/pull/16251] > Make FileStream be able to start with most recent files > --- > > Key: SPARK-18826 > URL: https://issues.apache.org/jira/browse/SPARK-18826 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.0 > > > When starting a stream with a lot of backfill and maxFilesPerTrigger, the > user could often want to start with most recent files first. This would let > you keep low latency for recent data and slowly backfill historical data. > It's better to add an option to control this behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
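The behavior this improvement enables can be sketched in a few lines: under maxFilesPerTrigger, pick the newest files first instead of the oldest, so recent data stays low-latency while the backfill drains. This is plain Python over (path, modification-time) pairs, illustrating the intended semantics rather than the actual FileStreamSource code:

```python
def next_batch(files, max_files_per_trigger, latest_first):
    """files: list of (path, mod_time) pairs. Return the files for the
    next trigger: oldest-first by default, newest-first when
    latest_first is set."""
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    return ordered[:max_files_per_trigger]

backlog = [("f1", 100), ("f2", 200), ("f3", 300), ("f4", 400)]

# Default: drain the backfill oldest-first.
print(next_batch(backlog, 2, latest_first=False))  # [('f1', 100), ('f2', 200)]

# With the new option: recent files jump the queue.
print(next_batch(backlog, 2, latest_first=True))   # [('f4', 400), ('f3', 300)]
```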
[jira] [Resolved] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files
[ https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17119. Resolution: Duplicate This was actually already implemented (without the need for a config option). > Add configuration property to allow the history server to delete .inprogress > files > -- > > Key: SPARK-17119 > URL: https://issues.apache.org/jira/browse/SPARK-17119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bjorn Jonsson >Priority: Minor > Labels: historyserver > > The History Server (HS) currently only considers completed applications when > deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). > This means that over time, .inprogress files (from failed jobs, jobs where > the SparkContext is not closed, spark-shell exits etc...) can accumulate and > impact the HS. > Instead of having to manually delete these files, maybe users could have the > option of telling the HS to delete all files where (now - > attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete > .inprogress files with lastUpdated older than 7d? > https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
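The policy suggested in the description is just an age filter that also covers `.inprogress` logs. A sketch of the selection rule (plain Python; the names here are illustrative, not the FsHistoryProvider API):

```python
def logs_to_delete(logs, now, max_age):
    """logs: list of (path, last_updated) pairs, timestamps in seconds.
    Select every log -- completed or .inprogress -- whose last update
    is older than max_age, mirroring (now - lastUpdated) > maxAge."""
    return [path for path, last_updated in logs
            if now - last_updated > max_age]

SEVEN_DAYS = 7 * 24 * 3600
logs = [
    ("app-1",            1_000_000),  # old, completed
    ("app-2.inprogress", 1_000_000),  # old, e.g. an abandoned spark-shell
    ("app-3.inprogress", 1_990_000),  # recent, possibly still running
]
now = 2_000_000
print(logs_to_delete(logs, now, SEVEN_DAYS))
# ['app-1', 'app-2.inprogress'] -- the recent in-progress log survives
```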
[jira] [Comment Edited] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode
[ https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752443#comment-15752443 ] Anbu Cheeralan edited comment on SPARK-17493 at 12/15/16 8:55 PM: -- [~sowen] I faced a similar error while writing to google storage. This issue is specific to object stores and in append mode. In org.apache.spark.sql.execution.datasources.DataSource.write() following code causes huge number of RPC calls when the file system is an Object Store (S3, GS). {quote} if (mode == SaveMode.Append) \{ val existingPartitionColumns = Try \{ resolveRelation() .asInstanceOf[HadoopFsRelation] .location .partitionSpec() .partitionColumns .fieldNames .toSeq \}.getOrElse(Seq.empty[String]) {quote} There should be a flag to skip Partition Match Check in append mode. I can work on the patch. was (Author: alunarbeach): [~sowen] I faced a similar error while writing to google storage. This issue is specific while writing to object stores. This happens in append mode. In org.apache.spark.sql.execution.datasources.DataSource.write() following code causes huge number of RPC calls when the file system is on Object Stores (S3, GS). {quote} if (mode == SaveMode.Append) \{ val existingPartitionColumns = Try \{ resolveRelation() .asInstanceOf[HadoopFsRelation] .location .partitionSpec() .partitionColumns .fieldNames .toSeq \}.getOrElse(Seq.empty[String]) {quote} There should be a flag to skip Partition Match Check in append mode. I can work on the patch. 
> Spark Job hangs while DataFrame writing to HDFS path with parquet mode > -- > > Key: SPARK-17493 > URL: https://issues.apache.org/jira/browse/SPARK-17493 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: AWS Cluster >Reporter: Gautam Solanki > > While saving a RDD to HDFS path in parquet format with the following > rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:tmp//rddout_parquet_full_hdfs1//") > , the spark job was hanging as the two write tasks with Shuffle Read of size > 0 could not complete. But, the executors notified the driver about the > completion of these two tasks. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode
[ https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752443#comment-15752443 ] Anbu Cheeralan edited comment on SPARK-17493 at 12/15/16 8:54 PM: -- [~sowen] I faced a similar error while writing to google storage. This issue is specific while writing to object stores. This happens in append mode. In org.apache.spark.sql.execution.datasources.DataSource.write() following code causes huge number of RPC calls when the file system is on Object Stores (S3, GS). {quote} if (mode == SaveMode.Append) \{ val existingPartitionColumns = Try \{ resolveRelation() .asInstanceOf[HadoopFsRelation] .location .partitionSpec() .partitionColumns .fieldNames .toSeq \}.getOrElse(Seq.empty[String]) {quote} There should be a flag to skip Partition Match Check in append mode. I can work on the patch. was (Author: alunarbeach): [~sowen] I faced a similar error while writing to google storage. This issue is specific while writing to object stores. This happens in append mode. In org.apache.spark.sql.execution.datasources.DataSource.write() following code causes huge number of RPC calls when the file system is on Object Stores (S3, GS). {quote} if (mode == SaveMode.Append) { val existingPartitionColumns = Try { resolveRelation() .asInstanceOf[HadoopFsRelation] .location .partitionSpec() .partitionColumns .fieldNames .toSeq }.getOrElse(Seq.empty[String]) {quote} There should be a flag to skip Partition Match Check in append mode. I can work on the patch. 
[jira] [Resolved] (SPARK-8425) Add blacklist mechanism for task scheduling
[ https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-8425. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 14079 https://github.com/apache/spark/pull/14079 > Add blacklist mechanism for task scheduling > --- > > Key: SPARK-8425 > URL: https://issues.apache.org/jira/browse/SPARK-8425 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Reporter: Saisai Shao >Assignee: Mao, Wei >Priority: Minor > Fix For: 2.2.0 > > Attachments: DesignDocforBlacklistMechanism.pdf >
[jira] [Updated] (SPARK-8425) Add blacklist mechanism for task scheduling
[ https://issues.apache.org/jira/browse/SPARK-8425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-8425: Assignee: Mao, Wei (was: Imran Rashid)
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752462#comment-15752462 ] Imran Rashid commented on SPARK-18886: -- [~mridulm80] good point, perhaps the right answer here is just to turn off delay scheduling completely -- not setting {{"spark.locality.wait.process"}} to a small value, as I had suggested in the initial workaround, but just turning it off completely, to avoid having to futz with tuning that value relative to task runtime. But lemme ask you more or less the same question I just asked Mark, phrased a little differently -- given the fragility of this, wouldn't it make more sense for us to turn delay scheduling *off* by default? > Delay scheduling should not delay some executors indefinitely if one task is > scheduled before delay timeout > --- > > Key: SPARK-18886 > URL: https://issues.apache.org/jira/browse/SPARK-18886 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid > > Delay scheduling can introduce an unbounded delay and underutilization of > cluster resources under the following circumstances: > 1. Tasks have locality preferences for a subset of available resources. > 2. Tasks finish in less time than the delay scheduling wait. > Instead of paying *one* delay to wait for resources with better locality, > Spark waits indefinitely. > As an example, consider a cluster with 100 executors and a taskset with 500 > tasks. Say all tasks have a preference for one executor, which is by itself > on one host. Given the default locality wait of 3s per level, we end up with > a 6s delay before we schedule on other hosts (process wait + host wait). > If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks > get scheduled on _only one_ executor. This means you're only using 1% of > your cluster, and you get a ~100x slowdown. You'd actually be better off if > tasks took 7 seconds.
> *WORKAROUNDS*: > (1) You can change the locality wait times so that they are shorter than the > task execution time. You need to take into account the sum of all wait times > to use all the resources on your cluster. For example, if you have resources > on different racks, this will include the sum of > "spark.locality.wait.process" + "spark.locality.wait.node" + > "spark.locality.wait.rack". These each default to "3s". The simplest way is > to set "spark.locality.wait.process" to your desired wait interval, and > set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". > For example, if your tasks take ~3 seconds on average, you might set > "spark.locality.wait.process" to "1s". > Note that this workaround isn't perfect -- with less delay scheduling, you may > not get as good resource locality. After this issue is fixed, you'd most > likely want to undo these configuration changes. > (2) The worst case here will only happen if your tasks have extreme skew in > their locality preferences. Users may be able to modify their job to > control the distribution of the original input data. > (2a) A shuffle may end up with very skewed locality preferences, especially > if you do a repartition starting from a small number of partitions. (Shuffle > locality preference is assigned if any node has more than 20% of the shuffle > input data -- by chance, you may have one node just above that threshold, and > all other nodes just below it.) In this case, you can turn off locality > preference for shuffle data by setting > {{spark.shuffle.reduceLocality.enabled=false}}
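The arithmetic in the example above can be checked with a short sketch (numbers taken from the issue description; the model is simplified and ignores scheduling overhead):

```python
import math

# 500 tasks, 100 executors, 5 s per task, and a 6 s total locality delay
# (3 s process wait + 3 s node wait), with every task preferring the same
# single executor.

def serialized_runtime(num_tasks: int, task_secs: float) -> float:
    # Each 5 s task finishes before the 6 s delay expires, so the delay
    # timer keeps resetting and all tasks run back-to-back on the one
    # preferred executor.
    return num_tasks * task_secs

def spread_runtime(num_tasks: int, task_secs: float,
                   num_executors: int, delay_secs: float) -> float:
    # If tasks outlasted the delay, scheduling would spill onto every
    # executor after a single delay period.
    return delay_secs + math.ceil(num_tasks / num_executors) * task_secs

worst = serialized_runtime(500, 5)             # 2500 s on one executor
ideal = spread_runtime(500, 5, 100, 6)         # 31 s across the cluster
slower_tasks = spread_runtime(500, 7, 100, 6)  # 41 s with 7 s tasks
```

With these numbers the slowdown is roughly 80x (2500 s vs 31 s), consistent with the ~100x figure from the 1% utilization argument, and 7-second tasks really do finish the stage sooner than 5-second ones (41 s vs 2500 s) because they outlast the delay.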
[jira] [Updated] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18823: -- Fix Version/s: (was: 2.0.2) > Assignation by column name variable not available or bug? > - > > Key: SPARK-18823 > URL: https://issues.apache.org/jira/browse/SPARK-18823 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 2.0.2 > Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr > 4. Or Databricks (community.cloud.databricks.com). >Reporter: Vicente Masip > Original Estimate: 24h > Remaining Estimate: 24h > > I really don't know if this is a bug or whether it can be done with some > function: > Sometimes it is very important to assign something to a column whose name has > to be accessed through a variable. Normally, I have always done it with > double brackets like this, outside of SparkR: > # df could be the faithful normal data frame or a data table. > # accessing by variable name: > myname = "waiting" > df[[myname]] <- c(1:nrow(df)) > # or even by column number > df[[2]] <- df$eruptions > The error is not caused by the right side of the "<-" assignment operator. > The problem is that I can't assign to a column name using a variable or > column number as I do in these examples outside Spark. It doesn't matter > whether I am modifying or creating the column. Same problem. > I have also tried this with no results: > val df2 = withColumn(df,"tmp", df$eruptions)
[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?
[ https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752456#comment-15752456 ] Joseph K. Bradley commented on SPARK-18823: --- Note: Please don't set the Target Version or Fix Version. Committers can use those fields for tracking releases. Thanks!
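The pattern the reporter is after can be shown on a plain columnar structure, as an analogy only (this is Python over a dict of lists, not SparkR; it just illustrates selecting the target column through a variable rather than a literal name):

```python
# Analogy only: dynamic column assignment on a dict-of-lists "data frame".
# In base R the equivalent is df[[myname]] <- ...; the report is that
# SparkR DataFrames do not accept this form of assignment.
faithful = {"eruptions": [3.6, 1.8, 3.3], "waiting": [79, 54, 74]}

myname = "waiting"  # column name held in a variable, chosen at run time
faithful[myname] = list(range(1, len(faithful["eruptions"]) + 1))
```

The key point is that the column is addressed through the value of `myname`, which is exactly what the reporter says works in base R but not on SparkR DataFrames.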