[jira] [Updated] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
[ https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Meng updated SPARK-14323: Description: Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When using "*" in the LIKE clause, it will not return the expected results. This is because "*" is not escaped before being passed to the regex. If we do not escape "*", a pattern such as "*f*" will cause an exception (PatternSyntaxException, "Dangling meta character") and thus return an empty result. Try this: val p = "\*f\*".r was: Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When using "*" in the LIKE clause, it will not return the expected results. This is because "*" is not escaped before being passed to the regex. If we do not escape "*", a pattern such as "*f*" will cause an exception (PatternSyntaxException, "Dangling meta character") and thus return an empty result. Try this: val p = "*f*".r > [SQL] SHOW FUNCTIONS did not work properly > -- > > Key: SPARK-14323 > URL: https://issues.apache.org/jira/browse/SPARK-14323 > Project: Spark > Issue Type: Bug >Reporter: Bo Meng >Priority: Minor > > Show Functions syntax can be found here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions > When using "*" in the LIKE clause, it will not return the expected results. > This is because "*" is not escaped before being passed to the regex. If we do > not escape "*", a pattern such as "*f*" will cause an exception > (PatternSyntaxException, "Dangling meta character") and thus return an empty > result. > Try this: > val p = "\*f\*".r -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
[ https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Meng updated SPARK-14323: Description: Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When using "*" in the LIKE clause, it will not return the expected results. This is because "*" is not escaped before being passed to the regex. If we do not escape "*", a pattern such as "*f*" will cause an exception (PatternSyntaxException, "Dangling meta character") and thus return an empty result. Try this: val p = "*f*".r was: Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When using "*" in the LIKE clause, it will not return the expected results. This is because "*" is not escaped before being passed to the regex. > [SQL] SHOW FUNCTIONS did not work properly > -- > > Key: SPARK-14323 > URL: https://issues.apache.org/jira/browse/SPARK-14323 > Project: Spark > Issue Type: Bug >Reporter: Bo Meng >Priority: Minor > > Show Functions syntax can be found here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions > When using "*" in the LIKE clause, it will not return the expected results. > This is because "*" is not escaped before being passed to the regex. If we do > not escape "*", a pattern such as "*f*" will cause an exception > (PatternSyntaxException, "Dangling meta character") and thus return an empty > result. > Try this: > val p = "*f*".r -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
[ https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Meng updated SPARK-14323: Description: Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When using "*" in the LIKE clause, it will not return the expected results. This is because "\*" is not escaped before being passed to the regex. If we do not escape "\*", a pattern such as "\*f\*" will cause an exception (PatternSyntaxException, "Dangling meta character") and thus return an empty result. Try this: val p = "\*f\*".r was: Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When using "*" in the LIKE clause, it will not return the expected results. This is because "*" is not escaped before being passed to the regex. If we do not escape "*", a pattern such as "*f*" will cause an exception (PatternSyntaxException, "Dangling meta character") and thus return an empty result. Try this: val p = "\*f\*".r > [SQL] SHOW FUNCTIONS did not work properly > -- > > Key: SPARK-14323 > URL: https://issues.apache.org/jira/browse/SPARK-14323 > Project: Spark > Issue Type: Bug >Reporter: Bo Meng >Priority: Minor > > Show Functions syntax can be found here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions > When using "*" in the LIKE clause, it will not return the expected results. > This is because "\*" is not escaped before being passed to the regex. If we > do not escape "\*", a pattern such as "\*f\*" will cause an exception > (PatternSyntaxException, "Dangling meta character") and thus return an empty > result. > Try this: > val p = "\*f\*".r -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
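The dangling-quantifier failure, and one way to escape the wildcard, can be sketched with java.util.regex (the same engine behind Scala's `.r`). The `likeToRegex` helper below is illustrative only, not the actual Spark fix: it translates the LIKE-style `*` into the regex `.*` and quotes every other character.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LikeWildcard {
    // Hypothetical helper: translate a LIKE-style pattern, where "*"
    // matches any character sequence, into a java.util.regex pattern.
    static Pattern likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '*') {
                sb.append(".*");                              // wildcard -> any sequence
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));  // escape everything else
            }
        }
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) {
        // Unescaped "*f*" is an invalid regex: the leading "*" is a
        // dangling quantifier, exactly the failure from the issue.
        try {
            Pattern.compile("*f*");
            System.out.println("compiled");
        } catch (PatternSyntaxException e) {
            System.out.println("PatternSyntaxException");
        }

        // After translation the pattern behaves like LIKE '*f*'.
        Pattern p = likeToRegex("*f*");
        System.out.println(p.matcher("date_format").matches()); // true
        System.out.println(p.matcher("count").matches());       // false
    }
}
```

With this translation, `SHOW FUNCTIONS LIKE "*f*"` would match any function name containing an `f` instead of throwing.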
[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame
[ https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221192#comment-15221192 ] Sun Rui commented on SPARK-14037: - Thanks a lot. I will try to put together another investigation PR. BTW, what cluster mode are you using? > count(df) is very slow for dataframe constructed using SparkR::createDataFrame > -- > > Key: SPARK-14037 > URL: https://issues.apache.org/jira/browse/SPARK-14037 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.1 > Environment: Ubuntu 12.04 > RAM : 6 GB > Spark 1.6.1 Standalone >Reporter: Samuel Alexander > Labels: performance, sparkR > Attachments: console.log, spark_ui.png, spark_ui_ray.png > > > Any operation on a dataframe created using SparkR::createDataFrame is very > slow. > I have a CSV of size ~ 6MB. Below is a sample of the content: > 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter > 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter > 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter > 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter > 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter > 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter > 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter > 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter > 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter > 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter > I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE, > sep=","), and then converted it into a Spark dataframe using sp_df <- > createDataFrame(sqlContext, r_df). > count(sp_df) then took more than 30 seconds. > When I load the same CSV using spark-csv, like direct_df <- > read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = > "com.databricks.spark.csv", inferSchema = "false", header="true") > count(direct_df) took less than 1 sec. > I know performance has been improved in createDataFrame in Spark 1.6. 
But > other operations, like count(), are very slow. > How can I get rid of this performance issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Attachment: threaddump-1459461915668.tdump here is the thread dump taken during the high CPU usage on the executor. > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > Attachments: threaddump-1459461915668.tdump > > > TPCDS Q14 parses successfully, and plans created successfully. Spark tries to > run (I used only 1GB text file), but "hangs". Tasks are extremely slow to > process AND all CPUs are used 100% by the executor JVMs. > It is very easy to reproduce: > 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of > 1GB text file (assuming you know how to generate the csv data). My command is > like this: > {noformat} > /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g > --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 > --executor-memory 8g --num-executors 4 --executor-cores 4 --conf > spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out > {noformat} > The Spark console output: > {noformat} > 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage > 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. 
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage > 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage > 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) > 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage > 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage > 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) > 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 > on executor id: 4 hostname: bigaperf138.svl.ibm.com. > 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage > 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage > 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) > 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 > on executor id: 2 hostname: bigaperf137.svl.ibm.com. > 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage > 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) > {noformat} > Notice that time durations between tasks are unusually long: 2~5 minutes. 
> When looking at the Linux 'perf' tool, two top CPU consumers are: > 86.48%java [unknown] > 12.41%libjvm.so > Using the Java hotspot profiling tools, I am able to show what hotspot > methods are (top 5): > {noformat} > org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() > 46.845276 9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms > 9,654,179 ms > org.apache.spark.unsafe.Platform.copyMemory() 18.631157 3,848,442 ms > (18.6%)3,848,442 ms3,848,442 ms3,848,442 ms > org.apache.spark.util.collection.CompactBuffer.$plus$eq() 6.8570185 > 1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() >4.6126328 955,495 ms (4.6%) 955,495 ms 2,153,910 ms > 2,153,910 ms > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write() > 4.581077949,930 ms (4.6%) 949,930 ms 19,967,510 ms > 19,967,510 ms > {noformat} > So as you can see, the test has been running for 1.5 hours...with 46% CPU > spent in the > org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. > The stacks for top two are: > {noformat} > Marshalling > I > java/io/DataOutputStream.writeInt() line 197 > org.apache.spark.sql > I > org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() > line 60 > org.apache.spark.storage > I > org/apache/spark/storage/DiskBlockObjectWriter.write() line 185 > org.apache.spark.shuffle > I >
[jira] [Updated] (SPARK-14303) Refactor SparkRWrappers
[ https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14303: -- Assignee: Yanbo Liang > Refactor SparkRWrappers > --- > > Key: SPARK-14303 > URL: https://issues.apache.org/jira/browse/SPARK-14303 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > We use a single object `SparkRWrappers` > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala) > to wrap method calls to glm and kmeans in SparkR. This is quite hard to > maintain. We should refactor them into separate wrappers, like > `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`. > The package name should be `spark.ml.r` instead of `spark.ml.api.r`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14303) Refactor SparkRWrappers
[ https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1522#comment-1522 ] Yanbo Liang edited comment on SPARK-14303 at 4/1/16 4:00 AM: - [~mengxr] I have done the refactoring for k-means; I will link the PR here. For glm, I think it can be handled together with SPARK-12566. was (Author: yanboliang): [~mengxr] I have done the refactoring for k-means; I will link the PR here. > Refactor SparkRWrappers > --- > > Key: SPARK-14303 > URL: https://issues.apache.org/jira/browse/SPARK-14303 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Xiangrui Meng > > We use a single object `SparkRWrappers` > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala) > to wrap method calls to glm and kmeans in SparkR. This is quite hard to > maintain. We should refactor them into separate wrappers, like > `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`. > The package name should be `spark.ml.r` instead of `spark.ml.api.r`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14303) Refactor SparkRWrappers
[ https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14303: Assignee: (was: Apache Spark) > Refactor SparkRWrappers > --- > > Key: SPARK-14303 > URL: https://issues.apache.org/jira/browse/SPARK-14303 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Xiangrui Meng > > We use a single object `SparkRWrappers` > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala) > to wrap method calls to glm and kmeans in SparkR. This is quite hard to > maintain. We should refactor them into separate wrappers, like > `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`. > The package name should be `spark.ml.r` instead of `spark.ml.api.r`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14303) Refactor SparkRWrappers
[ https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221114#comment-15221114 ] Apache Spark commented on SPARK-14303: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/12039 > Refactor SparkRWrappers > --- > > Key: SPARK-14303 > URL: https://issues.apache.org/jira/browse/SPARK-14303 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Xiangrui Meng > > We use a single object `SparkRWrappers` > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala) > to wrap method calls to glm and kmeans in SparkR. This is quite hard to > maintain. We should refactor them into separate wrappers, like > `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`. > The package name should be `spark.ml.r` instead of `spark.ml.api.r`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14303) Refactor SparkRWrappers
[ https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1522#comment-1522 ] Yanbo Liang commented on SPARK-14303: - [~mengxr] I have done the refactoring for k-means; I will link the PR here. > Refactor SparkRWrappers > --- > > Key: SPARK-14303 > URL: https://issues.apache.org/jira/browse/SPARK-14303 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Xiangrui Meng > > We use a single object `SparkRWrappers` > (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala) > to wrap method calls to glm and kmeans in SparkR. This is quite hard to > maintain. We should refactor them into separate wrappers, like > `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`. > The package name should be `spark.ml.r` instead of `spark.ml.api.r`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221103#comment-15221103 ] Yanbo Liang commented on SPARK-14313: - Sure, please assign it to me. > AFTSurvivalRegression model persistence in SparkR > - > > Key: SPARK-14313 > URL: https://issues.apache.org/jira/browse/SPARK-14313 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14322: Assignee: Apache Spark > Use treeReduce instead of reduce in OnlineLDAOptimizer > -- > > Key: SPARK-14322 > URL: https://issues.apache.org/jira/browse/SPARK-14322 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use > treeReduce. This can cause scalability issues. This should be an easy fix. > See this line: > [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452] > and a few lines below it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221088#comment-15221088 ] Apache Spark commented on SPARK-14322: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/12106 > Use treeReduce instead of reduce in OnlineLDAOptimizer > -- > > Key: SPARK-14322 > URL: https://issues.apache.org/jira/browse/SPARK-14322 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley > > OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use > treeReduce. This can cause scalability issues. This should be an easy fix. > See this line: > [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452] > and a few lines below it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer
[ https://issues.apache.org/jira/browse/SPARK-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14322: Assignee: (was: Apache Spark) > Use treeReduce instead of reduce in OnlineLDAOptimizer > -- > > Key: SPARK-14322 > URL: https://issues.apache.org/jira/browse/SPARK-14322 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley > > OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use > treeReduce. This can cause scalability issues. This should be an easy fix. > See this line: > [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452] > and a few lines below it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
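For intuition, treeReduce merges partition results pairwise over O(log P) rounds instead of pulling all P partials into one flat reduction at the driver, which is where the scalability issue comes from. A toy, Spark-free sketch of that combining pattern (plain Java lists stand in for RDD partition results; this is not Spark's implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

public class TreeReduceSketch {
    // Merge partials pairwise, level by level, mimicking the log-depth
    // aggregation of RDD.treeReduce (here on one machine, for illustration).
    static <T> T treeReduce(List<T> partials, BinaryOperator<T> op) {
        List<T> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<T> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2) {
                next.add(op.apply(level.get(i), level.get(i + 1)));
            }
            if (level.size() % 2 == 1) {                 // odd element carries over
                next.add(level.get(level.size() - 1));
            }
            level = next;                                // one "round" completed
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        // 7 partials -> 3 rounds of pairwise merging instead of 6
        // sequential merges at a single reducer.
        List<Integer> partials = List.of(1, 2, 3, 4, 5, 6, 7);
        System.out.println(treeReduce(partials, Integer::sum)); // 28
    }
}
```

In Spark the same effect is obtained by replacing `rdd.reduce(op)` with `rdd.treeReduce(op)` (optionally with a depth argument); the operator must be associative for the regrouping to be safe.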
[jira] [Resolved] (SPARK-14242) avoid too many copies in network when a network frame is large
[ https://issues.apache.org/jira/browse/SPARK-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14242. -- Resolution: Fixed Assignee: Zhang, Liye Fix Version/s: 2.0.0 > avoid too many copies in network when a network frame is large > -- > > Key: SPARK-14242 > URL: https://issues.apache.org/jira/browse/SPARK-14242 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Spark Core >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Zhang, Liye >Assignee: Zhang, Liye > Fix For: 2.0.0 > > > When a shuffle block is huge, say a large array (more than 128MB), there is a > performance issue when fetching remote blocks. This is because the network > frame size is large, and when we use a composite buffer, which consolidates > when the number of components reaches the maximum component count (default is > 16) in the underlying netty, a performance issue occurs: there are too many > memory copies inside netty's *compositeBuffer*. > How to reproduce: > {code} > sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 > * 1024 * 50)).iterator).reduce((a,b)=> a).length > {code} > In this case, the serialized result size of each task is about 400MB, and the > result is transferred to the driver as an *indirectResult*. After the data is > transferred to the driver, the driver side still needs a lot of time to > process it, and the 3 CPUs (in this case, parallelism is 3) are fully > utilized with very high system-call overhead. This processing time is > reported as result-getting time on the web UI. > Such cases are very common in ML applications, which return a large > array from each executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
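A rough way to see why repeated consolidation hurts: each consolidation re-copies every byte buffered so far, so with a fixed component limit the total bytes copied grow quadratically with the frame size. The following is a hypothetical cost model, not netty code, under the simplifying assumption that consolidation happens after every `maxComponents` additions:

```java
public class ConsolidationCost {
    // Toy accounting: bytes copied when a frame of n components of `size`
    // bytes each is consolidated every `maxComponents` additions, versus
    // one single copy at the end. Each consolidation re-copies everything
    // buffered so far, which is the source of the quadratic blow-up.
    static long copiedWithConsolidation(int n, int size, int maxComponents) {
        long copied = 0;
        long buffered = 0;                        // bytes held so far
        for (int i = 0; i < n; i++) {
            buffered += size;
            if ((i + 1) % maxComponents == 0) {
                copied += buffered;               // consolidation re-copies all of it
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 1024, size = 1 << 20;             // 1024 x 1MB = a 1GB frame
        long repeated = copiedWithConsolidation(n, size, 16);
        long once = (long) n * size;              // single copy at the end
        System.out.println(repeated / once);      // 32: ~32x more bytes copied
    }
}
```

Under this model, raising the component limit (or skipping consolidation for large frames) removes most of the copying, which matches the direction of the fix described in the issue.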
[jira] [Assigned] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions
[ https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14321: Assignee: Apache Spark > Reduce date format cost and string-to-date cost in date functions > - > > Key: SPARK-14321 > URL: https://issues.apache.org/jira/browse/SPARK-14321 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Assignee: Apache Spark >Priority: Minor > > Currently the generated code is: > {noformat} > /* 066 */ UTF8String primitive5 = null; > /* 067 */ if (!isNull4) { > /* 068 */ try { > /* 069 */ primitive5 = UTF8String.fromString(new > java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( > /* 070 */ new java.util.Date(primitive7 * 1000L))); > /* 071 */ } catch (java.lang.Throwable e) { > /* 072 */ isNull4 = true; > /* 073 */ } > /* 074 */ } > {noformat} > Instantiation of SimpleDateFormat is fairly expensive. It can be created on > an as-needed basis. > I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions
[ https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221074#comment-15221074 ] Apache Spark commented on SPARK-14321: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/12105 > Reduce date format cost and string-to-date cost in date functions > - > > Key: SPARK-14321 > URL: https://issues.apache.org/jira/browse/SPARK-14321 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Minor > > Currently the generated code is: > {noformat} > /* 066 */ UTF8String primitive5 = null; > /* 067 */ if (!isNull4) { > /* 068 */ try { > /* 069 */ primitive5 = UTF8String.fromString(new > java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( > /* 070 */ new java.util.Date(primitive7 * 1000L))); > /* 071 */ } catch (java.lang.Throwable e) { > /* 072 */ isNull4 = true; > /* 073 */ } > /* 074 */ } > {noformat} > Instantiation of SimpleDateFormat is fairly expensive. It can be created on > an as-needed basis. > I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions
[ https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14321: Assignee: (was: Apache Spark) > Reduce date format cost and string-to-date cost in date functions > - > > Key: SPARK-14321 > URL: https://issues.apache.org/jira/browse/SPARK-14321 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Minor > > Currently the generated code is: > {noformat} > /* 066 */ UTF8String primitive5 = null; > /* 067 */ if (!isNull4) { > /* 068 */ try { > /* 069 */ primitive5 = UTF8String.fromString(new > java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( > /* 070 */ new java.util.Date(primitive7 * 1000L))); > /* 071 */ } catch (java.lang.Throwable e) { > /* 072 */ isNull4 = true; > /* 073 */ } > /* 074 */ } > {noformat} > Instantiation of SimpleDateFormat is fairly expensive. It can be created on > an as-needed basis. > I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions
[ https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated SPARK-14321: - Summary: Reduce date format cost and string-to-date cost in date functions (was: Reduce DateFormat cost in datetimeExpressions) > Reduce date format cost and string-to-date cost in date functions > - > > Key: SPARK-14321 > URL: https://issues.apache.org/jira/browse/SPARK-14321 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Minor > > Currently the generated code is: > {noformat} > /* 066 */ UTF8String primitive5 = null; > /* 067 */ if (!isNull4) { > /* 068 */ try { > /* 069 */ primitive5 = UTF8String.fromString(new > java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( > /* 070 */ new java.util.Date(primitive7 * 1000L))); > /* 071 */ } catch (java.lang.Throwable e) { > /* 072 */ isNull4 = true; > /* 073 */ } > /* 074 */ } > {noformat} > Instantiation of SimpleDateFormat is fairly expensive. It can be created on > an as-needed basis. > I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
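The cost comes from constructing a new SimpleDateFormat per row inside the generated code. A common way to avoid it is to cache one formatter per thread, since SimpleDateFormat is mutable and not thread-safe and so cannot simply be shared as a static field. A minimal sketch (the ThreadLocal cache and UTC pinning are illustrative choices, not the actual Spark patch):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateFormatCache {
    // One formatter per thread: created once on first use, then reused
    // for every subsequent call on that thread.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));  // fixed zone for reproducibility
            return f;
        });

    static String format(long epochSeconds) {
        // Reuses the cached instance instead of allocating one per row.
        return FORMAT.get().format(new Date(epochSeconds * 1000L));
    }

    public static void main(String[] args) {
        System.out.println(format(0L)); // 1970-01-01 00:00:00
    }
}
```

The same pattern applies to the string-to-date direction: parse with the cached formatter rather than constructing a new one per input string.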
[jira] [Commented] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
[ https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220954#comment-15220954 ] Apache Spark commented on SPARK-14323: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/12104 > [SQL] SHOW FUNCTIONS did not work properly > -- > > Key: SPARK-14323 > URL: https://issues.apache.org/jira/browse/SPARK-14323 > Project: Spark > Issue Type: Bug >Reporter: Bo Meng >Priority: Minor > > Show Functions syntax can be found here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions > When using "*" in the LIKE clause, it will not return the expected results. > This is because "*" is not escaped before being passed to the regex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
[ https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14323: Assignee: Apache Spark > [SQL] SHOW FUNCTIONS did not work properly > -- > > Key: SPARK-14323 > URL: https://issues.apache.org/jira/browse/SPARK-14323 > Project: Spark > Issue Type: Bug >Reporter: Bo Meng >Assignee: Apache Spark >Priority: Minor > > Show Functions syntax can be found here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions > When using "*" in the LIKE clause, it will not return the expected results. > This is because "*" is not escaped before being passed to the regex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
[ https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14323: Assignee: (was: Apache Spark) > [SQL] SHOW FUNCTIONS did not work properly > -- > > Key: SPARK-14323 > URL: https://issues.apache.org/jira/browse/SPARK-14323 > Project: Spark > Issue Type: Bug >Reporter: Bo Meng >Priority: Minor > > Show Functions syntax can be found here: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions > When using "*" in the LIKE clause, it will not return the expected results. > This is because "*" is not escaped before being passed to the regex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly
Bo Meng created SPARK-14323: --- Summary: [SQL] SHOW FUNCTIONS did not work properly Key: SPARK-14323 URL: https://issues.apache.org/jira/browse/SPARK-14323 Project: Spark Issue Type: Bug Reporter: Bo Meng Priority: Minor Show Functions syntax can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions When "*" is used in the LIKE clause, it does not return the expected results. This is because "*" was not escaped before being passed to the regex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Description: TPCDS Q14 parses successfully and its plans are created successfully. Spark tries to run it (I used only a 1GB text file) but "hangs": tasks are extremely slow to process, and all CPUs are used 100% by the executor JVMs. It is very easy to reproduce: 1. Use the spark-sql CLI to run query 14 (TPCDS) against a database built from 1GB of text files (assuming you know how to generate the csv data). My command is like this: {noformat} /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 8g --num-executors 4 --executor-cores 4 --conf spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out {noformat} The Spark console output: {noformat} 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on executor id: 2 hostname: bigaperf137.svl.ibm.com. 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) {noformat} Notice that time durations between tasks are unusually long: 2~5 minutes. 
When looking at the Linux 'perf' tool, the two top CPU consumers are: 86.48% java [unknown] 12.41% libjvm.so Using the Java hotspot profiling tools, I am able to show what the hotspot methods are (top 5): {noformat} org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() 46.845276 9,654,179 ms (46.8%) 9,654,179 ms 9,654,179 ms 9,654,179 ms org.apache.spark.unsafe.Platform.copyMemory() 18.631157 3,848,442 ms (18.6%) 3,848,442 ms 3,848,442 ms 3,848,442 ms org.apache.spark.util.collection.CompactBuffer.$plus$eq() 6.8570185 1,418,411 ms (6.9%) 1,418,411 ms 1,517,960 ms 1,517,960 ms org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() 4.6126328 955,495 ms (4.6%) 955,495 ms 2,153,910 ms 2,153,910 ms org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write() 4.5810779 949,930 ms (4.6%) 949,930 ms 19,967,510 ms 19,967,510 ms {noformat} So as you can see, the test has been running for 1.5 hours... with 46% of the CPU spent in the org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. The stacks for the top two are: {noformat} Marshalling I java/io/DataOutputStream.writeInt() line 197 org.apache.spark.sql I org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60 org.apache.spark.storage I org/apache/spark/storage/DiskBlockObjectWriter.write() line 185 org.apache.spark.shuffle I org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150 org.apache.spark.scheduler I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78 I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46 I org/apache/spark/scheduler/Task.run() line 82 org.apache.spark.executor I org/apache/spark/executor/Executor$TaskRunner.run() line 231 Dispatching Overhead, Standard Library Worker Dispatching I java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142 I java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617 I java/lang/Thread.run() line 745 {noformat} and {noformat} org.apache.spark.unsafe I
[jira] [Resolved] (SPARK-14267) Execute multiple Python UDFs in single batch
[ https://issues.apache.org/jira/browse/SPARK-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-14267. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12057 [https://github.com/apache/spark/pull/12057] > Execute multiple Python UDFs in single batch > > > Key: SPARK-14267 > URL: https://issues.apache.org/jira/browse/SPARK-14267 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > {code} > select udf1(a), udf2(b), udf3(a, b) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220864#comment-15220864 ] JESSE CHEN commented on SPARK-14318: Q14 is as follows: {noformat} with cross_items as (select i_item_sk ss_item_sk from item JOIN (select brand_id, class_id, category_id from (select iss.i_brand_id brand_id ,iss.i_class_id class_id ,iss.i_category_id category_id from store_sales ,item iss ,date_dim d1 where ss_item_sk = iss.i_item_sk and ss_sold_date_sk = d1.d_date_sk and d1.d_year between 1999 AND 1999 + 2) x1 JOIN (select ics.i_brand_id ,ics.i_class_id ,ics.i_category_id from catalog_sales ,item ics ,date_dim d2 where cs_item_sk = ics.i_item_sk and cs_sold_date_sk = d2.d_date_sk and d2.d_year between 1999 AND 1999 + 2) x2 ON x1.brand_id = x2.i_brand_id and x1.class_id = x2.i_class_id and x1.category_id = x2.i_category_id JOIN (select iws.i_brand_id ,iws.i_class_id ,iws.i_category_id from web_sales ,item iws ,date_dim d3 where ws_item_sk = iws.i_item_sk and ws_sold_date_sk = d3.d_date_sk and d3.d_year between 1999 AND 1999 + 2) x3 ON x1.brand_id = x3.i_brand_id and x1.class_id = x3.i_class_id and x1.category_id = x3.i_category_id ) x4 where i_brand_id = x4.brand_id and i_class_id = x4.class_id and i_category_id = x4.category_id ), avg_sales as (select avg(quantity*list_price) average_sales from (select ss_quantity quantity ,ss_list_price list_price from store_sales ,date_dim where ss_sold_date_sk = d_date_sk and d_year between 1999 and 1999 + 2 union all select cs_quantity quantity ,cs_list_price list_price from catalog_sales ,date_dim where cs_sold_date_sk = d_date_sk and d_year between 1999 and 1999 + 2 union all select ws_quantity quantity ,ws_list_price list_price from web_sales ,date_dim where ws_sold_date_sk = d_date_sk and d_year between 1999 and 1999 + 2) x) select * from (select 'store' channel, i_brand_id,i_class_id,i_category_id ,sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) number_sales from 
store_sales ss1 JOIN item ON ss1.ss_item_sk = i_item_sk JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk JOIN avg_sales JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq where dd2.d_year = 1999 + 1 and dd2.d_moy = 12 and dd2.d_dom = 11 group by average_sales,i_brand_id,i_class_id,i_category_id having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) this_year, (select 'store' channel, i_brand_id,i_class_id ,i_category_id, sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) number_sales from store_sales ss1 JOIN item ON ss1.ss_item_sk = i_item_sk JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk JOIN avg_sales JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq where dd2.d_year = 1999 and dd2.d_moy = 12 and dd2.d_dom = 11 group by average_sales, i_brand_id,i_class_id,i_category_id having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) last_year where this_year.i_brand_id= last_year.i_brand_id and this_year.i_class_id = last_year.i_class_id and this_year.i_category_id = last_year.i_category_id order by this_year.channel, this_year.i_brand_id, this_year.i_class_id, this_year.i_category_id limit 100 {noformat} > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > > TPCDS Q14 parses successfully, and plans created successfully. Spark tries to > run (I used only 1GB text file), but "hangs". Tasks are extremely slow to > process AND all CPUs are used 100% by the executor JVMs. > It is very easy to reproduce: > 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of > 1GB text file (assuming you know how to generate the csv data). 
My command is > like this: > {noformat} > /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g > --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 > --executor-memory 8g --num-executors 4 --executor-cores 4 --conf > spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f >
[jira] [Created] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer
Joseph K. Bradley created SPARK-14322: - Summary: Use treeReduce instead of reduce in OnlineLDAOptimizer Key: SPARK-14322 URL: https://issues.apache.org/jira/browse/SPARK-14322 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Joseph K. Bradley OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use treeReduce. This can cause scalability issues. This should be an easy fix. See this line: [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452] and a few lines below it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
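The difference between the two calls: {{RDD.reduce}} ships every partition's partial result to a single point for the final fold, while {{RDD.treeReduce(op, depth)}} (a real Spark API, default depth 2) first combines them pairwise on the executors in log-depth rounds. The combining pattern can be sketched with plain Scala collections (this is an illustration, not Spark's implementation):

```scala
// Local sketch of the tree-reduction pattern: combine partial results
// pairwise in rounds, instead of folding every partition's result at a
// single point (the driver, in Spark's case).
def treeReduceLocal[A](xs: Seq[A])(op: (A, A) => A): A = {
  require(xs.nonEmpty, "empty input")
  if (xs.size == 1) xs.head
  else treeReduceLocal(xs.grouped(2).map(_.reduce(op)).toSeq)(op)
}

// Same result as a flat reduce, but in O(log n) combining rounds.
assert(treeReduceLocal(1 to 8)(_ + _) == 36)
assert(treeReduceLocal(Seq("a", "b", "c"))(_ + _) == "abc")
```

In the optimizer itself the fix should amount to replacing `rdd.reduce(op)` with `rdd.treeReduce(op)` at the lines linked above.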
[jira] [Created] (SPARK-14321) Reduce DateFormat cost in datetimeExpressions
Rajesh Balamohan created SPARK-14321: Summary: Reduce DateFormat cost in datetimeExpressions Key: SPARK-14321 URL: https://issues.apache.org/jira/browse/SPARK-14321 Project: Spark Issue Type: Bug Reporter: Rajesh Balamohan Priority: Minor Currently the code generated is {noformat} /* 066 */ UTF8String primitive5 = null; /* 067 */ if (!isNull4) { /* 068 */ try { /* 069 */ primitive5 = UTF8String.fromString(new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format( /* 070 */ new java.util.Date(primitive7 * 1000L))); /* 071 */ } catch (java.lang.Throwable e) { /* 072 */ isNull4 = true; /* 073 */ } /* 074 */ } {noformat} Instantiation of SimpleDateFormat is fairly expensive. It can instead be created once, on an as-needed basis, and reused. I will share the patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
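The proposed direction can be sketched as hoisting the formatter out of the per-row path. The class and names below are illustrative only (the actual patch changes the generated Java, not a helper class); note that SimpleDateFormat is not thread-safe, so a cached instance must stay confined to one task/thread:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Sketch: create the formatter lazily, once, and reuse it across rows
// instead of instantiating a new SimpleDateFormat per row.
class CachedTimestampFormatter {
  private lazy val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  def format(seconds: Long): String =
    formatter.format(new Date(seconds * 1000L))
}

val f = new CachedTimestampFormatter
// Output is timezone-dependent, but always "yyyy-MM-dd HH:mm:ss" shaped.
assert(f.format(0L).length == 19)
```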
[jira] [Assigned] (SPARK-14320) Make ColumnarBatch.Row mutable
[ https://issues.apache.org/jira/browse/SPARK-14320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14320: Assignee: (was: Apache Spark) > Make ColumnarBatch.Row mutable > -- > > Key: SPARK-14320 > URL: https://issues.apache.org/jira/browse/SPARK-14320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > > In order to leverage a data structure like `AggregateHashmap` > (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates > with keys, we need to make ColumnarBatch.Row mutable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14320) Make ColumnarBatch.Row mutable
[ https://issues.apache.org/jira/browse/SPARK-14320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14320: Assignee: Apache Spark > Make ColumnarBatch.Row mutable > -- > > Key: SPARK-14320 > URL: https://issues.apache.org/jira/browse/SPARK-14320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark > > In order to leverage a data structure like `AggregateHashmap` > (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates > with keys, we need to make ColumnarBatch.Row mutable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14320) Make ColumnarBatch.Row mutable
[ https://issues.apache.org/jira/browse/SPARK-14320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220834#comment-15220834 ] Apache Spark commented on SPARK-14320: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12103 > Make ColumnarBatch.Row mutable > -- > > Key: SPARK-14320 > URL: https://issues.apache.org/jira/browse/SPARK-14320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal > > In order to leverage a data structure like `AggregateHashmap` > (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates > with keys, we need to make ColumnarBatch.Row mutable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14277) Significant amount of CPU is being consumed in SnappyNative arrayCopy method
[ https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14277. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12096 [https://github.com/apache/spark/pull/12096] > Significant amount of CPU is being consumed in SnappyNative arrayCopy method > > > Key: SPARK-14277 > URL: https://issues.apache.org/jira/browse/SPARK-14277 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.0.0 > > > While running a Spark job which spills a lot of data in the reduce phase, we > see that a significant amount of CPU is consumed in the native Snappy > ArrayCopy method (please see the stack trace below). > Stack trace - > org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method) > org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > The reason is that the SpillReader does a lot of small reads from the > underlying snappy-compressed stream, and SnappyInputStream invokes the native JNI > ArrayCopy method to copy the data, which is expensive. We should fix > snappy-java to use the non-JNI based System.arraycopy method in this case. 
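The read pattern at fault is many tiny reads (e.g. one readLong per record) hitting an expensive underlying stream. The classic mitigation for this pattern in general is buffering, sketched below; note Spark's actual fix for this issue landed in snappy-java itself, so this only illustrates the small-reads-over-a-costly-stream problem:

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream, DataInputStream}

// Insert a buffer so the costly underlying stream sees a few large
// reads instead of one tiny read per readLong() call.
val raw = new ByteArrayInputStream(Array.fill[Byte](1024)(1))
val in  = new DataInputStream(new BufferedInputStream(raw, 64 * 1024))
assert(in.readLong() == 0x0101010101010101L) // eight 0x01 bytes
```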
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14320) Make ColumnarBatch.Row mutable
Sameer Agarwal created SPARK-14320: -- Summary: Make ColumnarBatch.Row mutable Key: SPARK-14320 URL: https://issues.apache.org/jira/browse/SPARK-14320 Project: Spark Issue Type: Sub-task Reporter: Sameer Agarwal In order to leverage a data structure like `AggregateHashmap` (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates with keys, we need to make ColumnarBatch.Row mutable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates
[ https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-14263: --- Issue Type: Sub-task (was: New Feature) Parent: SPARK-14319 > Benchmark Vectorized HashMap for GroupBy Aggregates > --- > > Key: SPARK-14263 > URL: https://issues.apache.org/jira/browse/SPARK-14263 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14319) Speed up group-by aggregates
Sameer Agarwal created SPARK-14319: -- Summary: Speed up group-by aggregates Key: SPARK-14319 URL: https://issues.apache.org/jira/browse/SPARK-14319 Project: Spark Issue Type: Bug Components: SQL Reporter: Sameer Agarwal Aggregates with keys in SparkSQL are almost 30x slower than aggregates without keys. This master JIRA tracks our attempts to optimize them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14137) Conflict between NullPropagation and InferFiltersFromConstraints
[ https://issues.apache.org/jira/browse/SPARK-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220814#comment-15220814 ] Apache Spark commented on SPARK-14137: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12102 > Conflict between NullPropagation and InferFiltersFromConstraints > > > Key: SPARK-14137 > URL: https://issues.apache.org/jira/browse/SPARK-14137 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > Some optimizer rules conflict with each other, fail this test: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54069/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/union20/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Labels: hangs (was: tpcds-result-mismatch) > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > 
[null,CLGA,1842,1588] > [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make. | AAHD | 2739 | 2039 | > | Bad cards must make. | ABDA | 1717 | 1782 | > | Bad cards must
[jira] [Created] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
JESSE CHEN created SPARK-14318: -- Summary: TPCDS query 14 causes Spark SQL to hang Key: SPARK-14318 URL: https://issues.apache.org/jira/browse/SPARK-14318 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: JESSE CHEN Testing Spark SQL using TPC queries. Query 21 returns wrong results compared to official result set. This is at 1GB SF (validation run). SparkSQL missing at least one row (grep for ABDA) ; I believe 2 other rows are missing as well. Actual results: {noformat} [null,AABD,2565,1922] [null,AAHD,2956,2052] [null,AALA,2042,1793] [null,ACGC,2373,1771] [null,ACKC,2321,1856] [null,ACOB,1504,1397] [null,ADKB,1820,2163] [null,AEAD,2631,1965] [null,AEOC,1659,1798] [null,AFAC,1965,1705] [null,AFAD,1769,1313] [null,AHDE,2700,1985] [null,AHHA,1578,1082] [null,AIEC,1756,1804] [null,AIMC,3603,2951] [null,AJAC,2109,1989] [null,AJKB,2573,3540] [null,ALBE,3458,2992] [null,ALCE,1720,1810] [null,ALEC,2569,1946] [null,ALNB,2552,1750] [null,ANFE,2022,2269] [null,AOIB,2982,2540] [null,APJB,2344,2593] [null,BAPD,2182,2787] [null,BDCE,2844,2069] [null,BDDD,2417,2537] [null,BDJA,1584,1666] [null,BEOD,2141,2649] [null,BFCC,2745,2020] [null,BFMB,1642,1364] [null,BHPC,1923,1780] [null,BIDB,1956,2836] [null,BIGB,2023,2344] [null,BIJB,1977,2728] [null,BJFE,1891,2390] [null,BLDE,1983,1797] [null,BNID,2485,2324] [null,BNLD,2385,2786] [null,BOMB,2291,2092] [null,CAAA,2233,2560] [null,CBCD,1540,2012] [null,CBIA,2394,2122] [null,CBPB,1790,1661] [null,CCMD,2654,2691] [null,CDBC,1804,2072] [null,CFEA,1941,1567] [null,CGFD,2123,2265] [null,CHPC,2933,2174] [null,CIGD,2618,2399] [null,CJCB,2728,2367] [null,CJLA,1350,1732] [null,CLAE,2578,2329] [null,CLGA,1842,1588] [null,CLLB,3418,2657] [null,CLOB,3115,2560] [null,CMAD,1991,2243] [null,CMJA,1261,1855] [null,CMLA,3288,2753] [null,CMPD,1320,1676] [null,CNGB,2340,2118] [null,CNHD,3519,3348] [null,CNPC,2561,1948] [null,DCPC,2664,2627] [null,DDHA,1313,1926] [null,DDND,1109,835] [null,DEAA,2141,1847] 
[null,DEJA,3142,2723] [null,DFKB,1470,1650] [null,DGCC,2113,2331] [null,DGFC,2201,2928] [null,DHPA,2467,2133] [null,DMBA,3085,2087] [null,DPAB,3494,3081] [null,EAEC,2133,2148] [null,EAPA,1560,1275] [null,ECGC,2815,3307] [null,EDPD,2731,1883] [null,EEEC,2024,1902] [null,EEMC,2624,2387] [null,EFFA,2047,1878] [null,EGJA,2403,2633] [null,EGMA,2784,2772] [null,EGOC,2389,1753] [null,EHFD,1940,1420] [null,EHLB,2320,2057] [null,EHPA,1898,1853] [null,EIPB,2930,2326] [null,EJAE,2582,1836] [null,EJIB,2257,1681] [null,EJJA,2791,1941] [null,EJJD,3410,2405] [null,EJNC,2472,2067] [null,EJPD,1219,1229] [null,EKEB,2047,1713] [null,EMEA,2502,1897] [null,EMKC,2362,2042] [null,ENAC,2011,1909] [null,ENFB,2507,2162] [null,ENOD,3371,2709] {noformat} Expected results: {noformat} +--+--++---+ | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | +--+--++---+ | Bad cards must make. | AACD | 1889 | 2168 | | Bad cards must make. | AAHD | 2739 | 2039 | | Bad cards must make. | ABDA | 1717 | 1782 | | Bad cards must make. | ACGC | 2296 | 2276 | | Bad cards must make. | ACKC | 2443 | 1878 | | Bad cards must make. | ACOB | 2705 | 2428 | | Bad cards must make. | ADGB | 2242 | 2759 | | Bad cards must make. | ADKB | 2138 | 2456 | | Bad cards must make. | AEAD | 2914 | 2237 | | Bad cards must make. | AEOC | 1797 | 2073 | | Bad
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JESSE CHEN updated SPARK-14318: --- Affects Version/s: 2.0.0 > TPCDS query 14 causes Spark SQL to hang > --- > > Key: SPARK-14318 > URL: https://issues.apache.org/jira/browse/SPARK-14318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: JESSE CHEN > Labels: hangs > > Testing Spark SQL using TPC queries. Query 21 returns wrong results compared > to official result set. This is at 1GB SF (validation run). > SparkSQL missing at least one row (grep for ABDA) ; I believe 2 > other rows are missing as well. > Actual results: > {noformat} > [null,AABD,2565,1922] > [null,AAHD,2956,2052] > [null,AALA,2042,1793] > [null,ACGC,2373,1771] > [null,ACKC,2321,1856] > [null,ACOB,1504,1397] > [null,ADKB,1820,2163] > [null,AEAD,2631,1965] > [null,AEOC,1659,1798] > [null,AFAC,1965,1705] > [null,AFAD,1769,1313] > [null,AHDE,2700,1985] > [null,AHHA,1578,1082] > [null,AIEC,1756,1804] > [null,AIMC,3603,2951] > [null,AJAC,2109,1989] > [null,AJKB,2573,3540] > [null,ALBE,3458,2992] > [null,ALCE,1720,1810] > [null,ALEC,2569,1946] > [null,ALNB,2552,1750] > [null,ANFE,2022,2269] > [null,AOIB,2982,2540] > [null,APJB,2344,2593] > [null,BAPD,2182,2787] > [null,BDCE,2844,2069] > [null,BDDD,2417,2537] > [null,BDJA,1584,1666] > [null,BEOD,2141,2649] > [null,BFCC,2745,2020] > [null,BFMB,1642,1364] > [null,BHPC,1923,1780] > [null,BIDB,1956,2836] > [null,BIGB,2023,2344] > [null,BIJB,1977,2728] > [null,BJFE,1891,2390] > [null,BLDE,1983,1797] > [null,BNID,2485,2324] > [null,BNLD,2385,2786] > [null,BOMB,2291,2092] > [null,CAAA,2233,2560] > [null,CBCD,1540,2012] > [null,CBIA,2394,2122] > [null,CBPB,1790,1661] > [null,CCMD,2654,2691] > [null,CDBC,1804,2072] > [null,CFEA,1941,1567] > [null,CGFD,2123,2265] > [null,CHPC,2933,2174] > [null,CIGD,2618,2399] > [null,CJCB,2728,2367] > [null,CJLA,1350,1732] > [null,CLAE,2578,2329] > [null,CLGA,1842,1588] 
> [null,CLLB,3418,2657] > [null,CLOB,3115,2560] > [null,CMAD,1991,2243] > [null,CMJA,1261,1855] > [null,CMLA,3288,2753] > [null,CMPD,1320,1676] > [null,CNGB,2340,2118] > [null,CNHD,3519,3348] > [null,CNPC,2561,1948] > [null,DCPC,2664,2627] > [null,DDHA,1313,1926] > [null,DDND,1109,835] > [null,DEAA,2141,1847] > [null,DEJA,3142,2723] > [null,DFKB,1470,1650] > [null,DGCC,2113,2331] > [null,DGFC,2201,2928] > [null,DHPA,2467,2133] > [null,DMBA,3085,2087] > [null,DPAB,3494,3081] > [null,EAEC,2133,2148] > [null,EAPA,1560,1275] > [null,ECGC,2815,3307] > [null,EDPD,2731,1883] > [null,EEEC,2024,1902] > [null,EEMC,2624,2387] > [null,EFFA,2047,1878] > [null,EGJA,2403,2633] > [null,EGMA,2784,2772] > [null,EGOC,2389,1753] > [null,EHFD,1940,1420] > [null,EHLB,2320,2057] > [null,EHPA,1898,1853] > [null,EIPB,2930,2326] > [null,EJAE,2582,1836] > [null,EJIB,2257,1681] > [null,EJJA,2791,1941] > [null,EJJD,3410,2405] > [null,EJNC,2472,2067] > [null,EJPD,1219,1229] > [null,EKEB,2047,1713] > [null,EMEA,2502,1897] > [null,EMKC,2362,2042] > [null,ENAC,2011,1909] > [null,ENFB,2507,2162] > [null,ENOD,3371,2709] > {noformat} > Expected results: > {noformat} > +--+--++---+ > | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER | > +--+--++---+ > | Bad cards must make. | AACD | 1889 | 2168 | > | Bad cards must make. | AAHD | 2739 | 2039 | > | Bad cards must make. | ABDA | 1717 | 1782 | > | Bad cards must make. |
[jira] [Created] (SPARK-14317) Clean up hash join
Davies Liu created SPARK-14317: -- Summary: Clean up hash join Key: SPARK-14317 URL: https://issues.apache.org/jira/browse/SPARK-14317 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14294) Support native execution of ALTER TABLE ... RENAME TO
[ https://issues.apache.org/jira/browse/SPARK-14294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-14294. - Resolution: Duplicate Assignee: Andrew Or > Support native execution of ALTER TABLE ... RENAME TO > - > > Key: SPARK-14294 > URL: https://issues.apache.org/jira/browse/SPARK-14294 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Bo Meng >Assignee: Andrew Or >Priority: Minor > > Support native execution of ALTER TABLE ... RENAME TO > The syntax for ALTER TABLE ... RENAME TO commands is described as follows: > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RenameTable
[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties
[ https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220762#comment-15220762 ] Apache Spark commented on SPARK-11327: -- User 'jayv' has created a pull request for this issue: https://github.com/apache/spark/pull/12101 > spark-dispatcher doesn't pass along some spark properties > - > > Key: SPARK-11327 > URL: https://issues.apache.org/jira/browse/SPARK-11327 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Alan Braithwaite > Fix For: 2.0.0 > > > I haven't figured out exactly what's going on yet, but there's something in > the spark-dispatcher which is failing to pass along properties to the > spark-driver when using spark-submit in a clustered mesos docker environment. > Most importantly, it's not passing along spark.mesos.executor.docker.image. > cli: > {code} > docker run -t -i --rm --net=host > --entrypoint=/usr/local/spark/bin/spark-submit > docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf > spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master > mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster > --properties-file /usr/local/spark/conf/spark-defaults.conf --class > com.example.spark.streaming.MyApp > http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 > spark-testing my-stream 40 > {code} > submit output: > {code} > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch > an application in mesos://compute1.example.com:31262. 
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server > at http://compute1.example.com:31262/v1/submissions/create: > { > "action" : "CreateSubmissionRequest", > "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ], > "appResource" : "http://jarserver.example.com:8000/sparkapp.jar", > "clientSparkVersion" : "1.5.0", > "environmentVariables" : { > "SPARK_SCALA_VERSION" : "2.10", > "SPARK_CONF_DIR" : "/usr/local/spark/conf", > "SPARK_HOME" : "/usr/local/spark", > "SPARK_ENV_LOADED" : "1" > }, > "mainClass" : "com.example.spark.streaming.MyApp", > "sparkProperties" : { > "spark.serializer" : "org.apache.spark.serializer.KryoSerializer", > "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : > "/usr/local/lib/libmesos.so", > "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs", > "spark.eventLog.enabled" : "true", > "spark.driver.maxResultSize" : "0", > "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER", > "spark.mesos.deploy.zookeeper.url" : > "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181", > "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar", > "spark.driver.supervise" : "false", > "spark.app.name" : "com.example.spark.streaming.MyApp", > "spark.driver.memory" : "8G", > "spark.logConf" : "true", > "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher", > "spark.mesos.executor.docker.image" : > "docker.example.com/spark-prod:2015.10.2", > "spark.submit.deployMode" : "cluster", > "spark.master" : "mesos://compute1.example.com:31262", > "spark.executor.memory" : "8G", > "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs", > "spark.mesos.docker.executor.network" : "HOST", > "spark.mesos.executor.home" : "/usr/local/spark" > } > } > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > 
"success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created > as driver-20151026220353-0011. Polling submission state... > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20151026220353-0011 in > mesos://compute1.example.com:31262. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server > at > http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "SubmissionStatusResponse", > "driverState" : "QUEUED", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver > driver-20151026220353-0011 is now QUEUED. > 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with > CreateSubmissionResponse: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } >
[jira] [Commented] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint
[ https://issues.apache.org/jira/browse/SPARK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220754#comment-15220754 ] Apache Spark commented on SPARK-14316: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12100 > StateStoreCoordinator should extend ThreadSafeRpcEndpoint > - > > Key: SPARK-14316 > URL: https://issues.apache.org/jira/browse/SPARK-14316 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > RpcEndpoint is not thread safe and allows multiple messages to be processed > at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
[jira] [Assigned] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint
[ https://issues.apache.org/jira/browse/SPARK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14316: Assignee: Shixiong Zhu (was: Apache Spark) > StateStoreCoordinator should extend ThreadSafeRpcEndpoint > - > > Key: SPARK-14316 > URL: https://issues.apache.org/jira/browse/SPARK-14316 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > RpcEndpoint is not thread safe and allows multiple messages to be processed > at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
[jira] [Assigned] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint
[ https://issues.apache.org/jira/browse/SPARK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14316: Assignee: Apache Spark (was: Shixiong Zhu) > StateStoreCoordinator should extend ThreadSafeRpcEndpoint > - > > Key: SPARK-14316 > URL: https://issues.apache.org/jira/browse/SPARK-14316 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > RpcEndpoint is not thread safe and allows multiple messages to be processed > at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
[jira] [Created] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint
Shixiong Zhu created SPARK-14316: Summary: StateStoreCoordinator should extend ThreadSafeRpcEndpoint Key: SPARK-14316 URL: https://issues.apache.org/jira/browse/SPARK-14316 Project: Spark Issue Type: Improvement Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu RpcEndpoint is not thread safe and allows multiple messages to be processed at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
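The thread-safety distinction behind this ticket can be illustrated generically: a plain endpoint may have its message handler invoked from several dispatcher threads at once, while a thread-safe endpoint is guaranteed to process one message at a time. The following is an illustrative Java sketch of that guarantee via a single-threaded inbox; `Endpoint`, `SerializedEndpoint`, and `CountingEndpoint` are hypothetical names for this example, not Spark's actual RpcEndpoint API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an RPC endpoint's message handler.
interface Endpoint {
    void receive(Object message) throws Exception;
}

// Delivers messages on a single thread, so the wrapped endpoint never
// processes two messages concurrently -- the guarantee a thread-safe
// endpoint provides, letting handlers use unsynchronized state.
class SerializedEndpoint implements AutoCloseable {
    private final Endpoint delegate;
    private final ExecutorService inbox = Executors.newSingleThreadExecutor();

    SerializedEndpoint(Endpoint delegate) {
        this.delegate = delegate;
    }

    void post(Object message) {
        inbox.submit(() -> {
            try {
                delegate.receive(message);
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }

    @Override
    public void close() throws InterruptedException {
        inbox.shutdown();
        inbox.awaitTermination(10, TimeUnit.SECONDS);
    }
}

public class SerializedEndpointDemo {
    static class CountingEndpoint implements Endpoint {
        // volatile only for visibility from the caller's thread; the
        // increment itself needs no lock because delivery is serialized.
        volatile int processed = 0;

        @Override
        public void receive(Object message) {
            processed++;
        }
    }

    static int run(int messages) throws Exception {
        CountingEndpoint target = new CountingEndpoint();
        try (SerializedEndpoint endpoint = new SerializedEndpoint(target)) {
            for (int i = 0; i < messages; i++) {
                endpoint.post(i);
            }
        }
        return target.processed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(1000));
    }
}
```

Because every message goes through the serialized inbox, the unsynchronized `processed++` is safe and `run(1000)` counts all messages; if handlers were invoked concurrently, the same increment could lose updates.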
[jira] [Commented] (SPARK-14251) Add SQL command for printing out generated code for debugging
[ https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220737#comment-15220737 ] Apache Spark commented on SPARK-14251: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/12099 > Add SQL command for printing out generated code for debugging > - > > Key: SPARK-14251 > URL: https://issues.apache.org/jira/browse/SPARK-14251 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL > environment this doesn't work. It would be great if we could have > {noformat} > explain codegen select * ... > {noformat} > return the generated code.
[jira] [Assigned] (SPARK-14251) Add SQL command for printing out generated code for debugging
[ https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14251: Assignee: (was: Apache Spark) > Add SQL command for printing out generated code for debugging > - > > Key: SPARK-14251 > URL: https://issues.apache.org/jira/browse/SPARK-14251 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL > environment this doesn't work. It would be great if we could have > {noformat} > explain codegen select * ... > {noformat} > return the generated code.
[jira] [Assigned] (SPARK-14251) Add SQL command for printing out generated code for debugging
[ https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14251: Assignee: Apache Spark > Add SQL command for printing out generated code for debugging > - > > Key: SPARK-14251 > URL: https://issues.apache.org/jira/browse/SPARK-14251 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL > environment this doesn't work. It would be great if we could have > {noformat} > explain codegen select * ... > {noformat} > return the generated code.
[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220734#comment-15220734 ] Xiangrui Meng commented on SPARK-14313: --- [~yanboliang] Are you interested in working on this? It should contain the basic APIs for ml.save/ml.load in SparkR and the save/load implementation of AFTWrapper. > AFTSurvivalRegression model persistence in SparkR > - > > Key: SPARK-14313 > URL: https://issues.apache.org/jira/browse/SPARK-14313 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220732#comment-15220732 ] Xiangrui Meng commented on SPARK-14314: --- Hold until SPARK-14303 is done. > K-means model persistence in SparkR > --- > > Key: SPARK-14314 > URL: https://issues.apache.org/jira/browse/SPARK-14314 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-14315) GLMs model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220730#comment-15220730 ] Xiangrui Meng commented on SPARK-14315: --- Hold until SPARK-14303 is done. > GLMs model persistence in SparkR > > > Key: SPARK-14315 > URL: https://issues.apache.org/jira/browse/SPARK-14315 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-14311) Model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14311: -- Description: In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive Bayes, and AFT survival regression. Users can fit models, get summary, and make predictions. However, they cannot save/load the models yet. ML models in SparkR are wrappers around ML pipelines. So it should be straightforward to implement model persistence. We need to think more about the API. R uses save/load for objects and datasets (also objects). It is possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure whether load can be overloaded easily. I propose the following API: {code} model <- glm(formula, data = df) ml.save(model, path, mode = "overwrite") model2 <- ml.load(path) {code} We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load is a S3 method (correct me if I'm wrong). was: In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive Bayes, and AFT survival regression. Users can fit models, get summary, and make predictions. However, they cannot save/load the models yet. ML models in SparkR are wrappers around ML pipelines. So it should be straightforward to implement model persistence. We need to think more about the API. R uses save/load for objects and datasets (also objects). It is possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure whether load can be overloaded easily. 
I propose the following API: {code} model <- glm(formula, data = df) ml.save(model, path, mode = "overwrite") model2 <- ml.load(path) {code} > Model persistence in SparkR > --- > > Key: SPARK-14311 > URL: https://issues.apache.org/jira/browse/SPARK-14311 > Project: Spark > Issue Type: Umbrella > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, > naive Bayes, and AFT survival regression. Users can fit models, get summary, > and make predictions. However, they cannot save/load the models yet. > ML models in SparkR are wrappers around ML pipelines. So it should be > straightforward to implement model persistence. We need to think more about > the API. R uses save/load for objects and datasets (also objects). It is > possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But > I'm not sure whether load can be overloaded easily. I propose the following > API: > {code} > model <- glm(formula, data = df) > ml.save(model, path, mode = "overwrite") > model2 <- ml.load(path) > {code} > We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load > is an S3 method (correct me if I'm wrong).
[jira] [Created] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
Xiangrui Meng created SPARK-14313: - Summary: AFTSurvivalRegression model persistence in SparkR Key: SPARK-14313 URL: https://issues.apache.org/jira/browse/SPARK-14313 Project: Spark Issue Type: Sub-task Components: ML, SparkR Reporter: Xiangrui Meng
[jira] [Created] (SPARK-14315) GLMs model persistence in SparkR
Xiangrui Meng created SPARK-14315: - Summary: GLMs model persistence in SparkR Key: SPARK-14315 URL: https://issues.apache.org/jira/browse/SPARK-14315 Project: Spark Issue Type: Sub-task Components: ML, SparkR Reporter: Xiangrui Meng
[jira] [Created] (SPARK-14314) K-means model persistence in SparkR
Xiangrui Meng created SPARK-14314: - Summary: K-means model persistence in SparkR Key: SPARK-14314 URL: https://issues.apache.org/jira/browse/SPARK-14314 Project: Spark Issue Type: Sub-task Components: ML, SparkR Reporter: Xiangrui Meng
[jira] [Created] (SPARK-14311) Model persistence in SparkR
Xiangrui Meng created SPARK-14311: - Summary: Model persistence in SparkR Key: SPARK-14311 URL: https://issues.apache.org/jira/browse/SPARK-14311 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng Assignee: Xiangrui Meng In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive Bayes, and AFT survival regression. Users can fit models, get summary, and make predictions. However, they cannot save/load the models yet. ML models in SparkR are wrappers around ML pipelines. So it should be straightforward to implement model persistence. We need to think more about the API. R uses save/load for objects and datasets (also objects). It is possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure whether load can be overloaded easily. I propose the following API: {code} model <- glm(formula, data = df) ml.save(model, path, mode = "overwrite") model2 <- ml.load(path) {code}
[jira] [Created] (SPARK-14312) NaiveBayes model persistence in SparkR
Xiangrui Meng created SPARK-14312: - Summary: NaiveBayes model persistence in SparkR Key: SPARK-14312 URL: https://issues.apache.org/jira/browse/SPARK-14312 Project: Spark Issue Type: Sub-task Components: ML, SparkR Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-14209) Application failure during preemption.
[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220716#comment-15220716 ] Marcelo Vanzin commented on SPARK-14209: Those logs show the same weird issues as the previous one... is there a way you can use the default log configuration from Spark? That would give us much better and non-misleading information. Also, it didn't seem like that application failed. I see some fetch failures, but that's to be expected when executors die. What's odd about your original log is that tasks failed multiple (up to 100) times and eventually failed the application, and that doesn't seem to be happening for this last set of logs. > Application failure during preemption. > -- > > Key: SPARK-14209 > URL: https://issues.apache.org/jira/browse/SPARK-14209 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: Spark on YARN >Reporter: Miles Crawford > > We have a fair-sharing cluster set up, including the external shuffle > service. When a new job arrives, existing jobs are successfully preempted > down to fit. > A spate of these messages arrives: > ExecutorLostFailure (executor 48 exited unrelated to the running tasks) > Reason: Container container_1458935819920_0019_01_000143 on host: > ip-10-12-46-235.us-west-2.compute.internal was preempted. > This seems fine - the problem is that soon thereafter, our whole application > fails because it is unable to fetch blocks from the pre-empted containers: > org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 > locations. 
Most recent failure cause: > Caused by: java.io.IOException: Failed to connect to > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Caused by: java.net.ConnectException: Connection refused: > ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 > Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf > Spark does not attempt to recreate these blocks - the tasks simply fail over > and over until the maxTaskAttempts value is reached. > It appears to me that there is some fault in the way preempted containers are > being handled - shouldn't these blocks be recreated on demand?
[jira] [Updated] (SPARK-14277) Significant amount of CPU is being consumed in SnappyNative arrayCopy method
[ https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-14277: Description: While running a Spark job which is spilling a lot of data in reduce phase, we see that significant amount of CPU is being consumed in native Snappy ArrayCopy method (Please see the stack trace below). Stack trace - org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method) org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java) org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readLong(DataInputStream.java:416) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) The reason for that is the SpillReader does a lot of small reads from the underlying snappy compressed stream and SnappyInputStream invokes the native JNI arrayCopy method to copy the data, which is expensive. We should fix snappy-java to use the non-JNI-based System.arraycopy method in this case. was: While running a Spark job which is spilling a lot of data in reduce phase, we see that significant amount of CPU is being consumed in native Snappy ArrayCopy method (Please see the stack trace below). 
Stack trace - org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method) org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java) org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readLong(DataInputStream.java:416) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) The reason for that is the SpillReader does a lot of small reads from the underlying snappy compressed stream and we pay a heavy cost of jni calls for these small reads. The SpillReader should instead do a buffered read from the underlying snappy compressed stream. > Significant amount of CPU is being consumed in SnappyNative arrayCopy method > > > Key: SPARK-14277 > URL: https://issues.apache.org/jira/browse/SPARK-14277 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Assignee: Sital Kedia > > While running a Spark job which is spilling a lot of data in reduce phase, we > see that significant amount of CPU is being consumed in native Snappy > ArrayCopy method (Please see the stack trace below). 
> Stack trace - > org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method) > org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > The reason for that is that the SpillReader does a lot of small reads from the > underlying snappy-compressed stream, and SnappyInputStream invokes the native JNI > arrayCopy method to copy the data, which is expensive. We should fix > snappy-java to use the non-JNI based System.arraycopy method in this case.
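Both the buffered-read approach and the System.arraycopy fix aim at the same thing: keeping the many tiny readLong/readFully calls from reaching the expensive JNI-backed layer. The sketch below only illustrates that principle; `CountingStream` is a hypothetical stand-in for SnappyInputStream that counts how many reads reach the "expensive" layer, and is not the actual Spark or snappy-java code.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class BufferedReadDemo {

    // Hypothetical stand-in for a JNI-backed decompression stream: it just
    // counts how many bulk read calls reach the underlying layer.
    static class CountingStream extends ByteArrayInputStream {
        int reads = 0;
        CountingStream(byte[] buf) { super(buf); }
        @Override
        public synchronized int read(byte[] b, int off, int len) {
            reads++;
            return super.read(b, off, len);
        }
    }

    // Returns {underlying reads without buffering, with buffering}
    // for 1000 DataInputStream.readLong calls.
    static int[] countReads() throws IOException {
        byte[] data = new byte[8 * 1000]; // room for 1000 longs

        CountingStream direct = new CountingStream(data);
        DataInputStream unbuffered = new DataInputStream(direct);
        for (int i = 0; i < 1000; i++) unbuffered.readLong();

        CountingStream underneath = new CountingStream(data);
        DataInputStream buffered =
            new DataInputStream(new BufferedInputStream(underneath, 64 * 1024));
        for (int i = 0; i < 1000; i++) buffered.readLong();

        return new int[] { direct.reads, underneath.reads };
    }

    public static void main(String[] args) throws IOException {
        int[] r = countReads();
        // Unbuffered: one underlying read per readLong; buffered: one bulk fill.
        System.out.println(r[0] + " vs " + r[1]);
    }
}
```

With the buffer in place, each per-record read is served from a heap array instead of crossing into native code, which is where the reported CPU cost was going.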
[jira] [Updated] (SPARK-14277) Significant amount of CPU is being consumed in SnappyNative arrayCopy method
[ https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-14277: Summary: Significant amount of CPU is being consumed in SnappyNative arrayCopy method (was: UnsafeSorterSpillReader should do buffered read from underlying compression stream) > Significant amount of CPU is being consumed in SnappyNative arrayCopy method > > > Key: SPARK-14277 > URL: https://issues.apache.org/jira/browse/SPARK-14277 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Assignee: Sital Kedia > > While running a Spark job which is spilling a lot of data in reduce phase, we > see that significant amount of CPU is being consumed in native Snappy > ArrayCopy method (Please see the stack trace below). > Stack trace - > org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method) > org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > The reason for that is the SpillReader does a lot of small reads from the > underlying snappy compressed stream and we pay a heavy cost of jni calls for > these small reads. The SpillReader should instead do a buffered read from the > underlying snappy compressed stream. 
[jira] [Assigned] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema
[ https://issues.apache.org/jira/browse/SPARK-14310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14310: Assignee: Apache Spark > Fix scan whole stage codegen to determine if batches are produced based on > schema > - > > Key: SPARK-14310 > URL: https://issues.apache.org/jira/browse/SPARK-14310 > Project: Spark > Issue Type: Bug >Reporter: Nong Li >Assignee: Apache Spark > > Currently, this is figured out at runtime by looking at the first value which > is not necessary any more. This simplifies the code and lets us measure > timings better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema
[ https://issues.apache.org/jira/browse/SPARK-14310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220690#comment-15220690 ] Apache Spark commented on SPARK-14310: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/12098 > Fix scan whole stage codegen to determine if batches are produced based on > schema > - > > Key: SPARK-14310 > URL: https://issues.apache.org/jira/browse/SPARK-14310 > Project: Spark > Issue Type: Bug >Reporter: Nong Li > > Currently, this is figured out at runtime by looking at the first value which > is not necessary any more. This simplifies the code and lets us measure > timings better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema
[ https://issues.apache.org/jira/browse/SPARK-14310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14310: Assignee: (was: Apache Spark) > Fix scan whole stage codegen to determine if batches are produced based on > schema > - > > Key: SPARK-14310 > URL: https://issues.apache.org/jira/browse/SPARK-14310 > Project: Spark > Issue Type: Bug >Reporter: Nong Li > > Currently, this is figured out at runtime by looking at the first value which > is not necessary any more. This simplifies the code and lets us measure > timings better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema
Nong Li created SPARK-14310: --- Summary: Fix scan whole stage codegen to determine if batches are produced based on schema Key: SPARK-14310 URL: https://issues.apache.org/jira/browse/SPARK-14310 Project: Spark Issue Type: Bug Reporter: Nong Li Currently, this is figured out at runtime by looking at the first value which is not necessary any more. This simplifies the code and lets us measure timings better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML
[ https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14308: Assignee: Apache Spark > Remove unused mllib tree classes and move private classes to ML > --- > > Key: SPARK-14308 > URL: https://issues.apache.org/jira/browse/SPARK-14308 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some > mllib tree internal helper classes are no longer used at all. Also, the > private helper classes internal to spark tree training can be ported very > easily to spark.ML without affecting APIs. This is the "low hanging fruit" > for porting tree internals to spark.ML, and will make the other migrations > more tractable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML
[ https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220686#comment-15220686 ] Apache Spark commented on SPARK-14308: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/12097 > Remove unused mllib tree classes and move private classes to ML > --- > > Key: SPARK-14308 > URL: https://issues.apache.org/jira/browse/SPARK-14308 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Priority: Minor > > After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some > mllib tree internal helper classes are no longer used at all. Also, the > private helper classes internal to spark tree training can be ported very > easily to spark.ML without affecting APIs. This is the "low hanging fruit" > for porting tree internals to spark.ML, and will make the other migrations > more tractable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML
[ https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14308: Assignee: (was: Apache Spark) > Remove unused mllib tree classes and move private classes to ML > --- > > Key: SPARK-14308 > URL: https://issues.apache.org/jira/browse/SPARK-14308 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Priority: Minor > > After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some > mllib tree internal helper classes are no longer used at all. Also, the > private helper classes internal to spark tree training can be ported very > easily to spark.ML without affecting APIs. This is the "low hanging fruit" > for porting tree internals to spark.ML, and will make the other migrations > more tractable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14309) Dataframe returns wrong results due to parsing incorrectly
Jerry Lam created SPARK-14309: - Summary: Dataframe returns wrong results due to parsing incorrectly Key: SPARK-14309 URL: https://issues.apache.org/jira/browse/SPARK-14309 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Jerry Lam I observed the behavior below using DataFrames. The expected answer should be 60, but there is no way to get that value unless we turn the DataFrame into an RDD and access it in the Row. I have included the SQL statement, and it returns the correct result because, I believe, it is using the Hive parser. {code} val base = sc.parallelize(( 0 to 49).zip( 0 to 49) ++ (30 to 79).zip(50 to 99)).toDF("id", "label") val d1 = base.where($"label" < 60).as("d1") val d2 = base.where($"label" === 60).as("d2") d1.join(d2, "id").show +---+-+-+ | id|label|label| +---+-+-+ | 40| 40| 60| +---+-+-+ d1.join(d2, "id").select(d1("label")).show +-+ |label| +-+ | 40| +-+ (expected answer: 40, right!) d1.join(d2, "id").map{row => row.getAs[Int](2)} d1.join(d2, "id").select(d2("label")).show +-+ |label| +-+ | 40| +-+ (expected answer: 60, wrong!) 
d1.join(d2, "id").select(d2("label")).explain(true) scala> d1.join(d2, "id").select(d2("label")).explain(true) == Parsed Logical Plan == Project [label#3] Project [id#2,label#3,label#7] Join Inner, Some((id#2 = id#6)) Subquery d1 Filter (label#3 < 60) Project [_1#0 AS id#2,_2#1 AS label#3] LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 Subquery d2 Filter (label#7 = 60) Project [_1#0 AS id#6,_2#1 AS label#7] LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 == Analyzed Logical Plan == label: int Project [label#3] Project [id#2,label#3,label#7] Join Inner, Some((id#2 = id#6)) Subquery d1 Filter (label#3 < 60) Project [_1#0 AS id#2,_2#1 AS label#3] LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 Subquery d2 Filter (label#7 = 60) Project [_1#0 AS id#6,_2#1 AS label#7] LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 == Optimized Logical Plan == Project [label#3] Join Inner, Some((id#2 = id#6)) Project [_1#0 AS id#2,_2#1 AS label#3] Filter (_2#1 < 60) LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 Project [_1#0 AS id#6] Filter (_2#1 = 60) LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 == Physical Plan == TungstenProject [label#3] SortMergeJoin [id#2], [id#6] TungstenSort [id#2 ASC], false, 0 TungstenExchange hashpartitioning(id#2) TungstenProject [_1#0 AS id#2,_2#1 AS label#3] Filter (_2#1 < 60) Scan PhysicalRDD[_1#0,_2#1] TungstenSort [id#6 ASC], false, 0 TungstenExchange hashpartitioning(id#6) TungstenProject [_1#0 AS id#6] Filter (_2#1 = 60) Scan PhysicalRDD[_1#0,_2#1] def (d1 :DataFrame, d2: DataFrame) base.registerTempTable("base") sqlContext.sql("select d2.label from (select * from base where label < 60) as d1 inner join (select * from base where label = 60) as d2 on d1.id = d2.id").explain(true) == Parsed Logical Plan == 'Project [unresolvedalias('d2.label)] 'Join Inner, Some(('d1.id = 'd2.id)) 'Subquery 
d1 'Project [unresolvedalias(*)] 'Filter ('label < 60) 'UnresolvedRelation [base], None 'Subquery d2 'Project [unresolvedalias(*)] 'Filter ('label = 60) 'UnresolvedRelation [base], None == Analyzed Logical Plan == label: int Project [label#15] Join Inner, Some((id#2 = id#14)) Subquery d1 Project [id#2,label#3] Filter (label#3 < 60) Subquery base Project [_1#0 AS id#2,_2#1 AS label#3] LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 Subquery d2 Project [id#14,label#15] Filter (label#15 = 60) Subquery base Project [_1#0 AS id#14,_2#1 AS label#15] LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 == Optimized Logical Plan == Project [label#15] Join Inner, Some((id#2 = id#14)) Project [_1#0 AS id#2] Filter (_2#1 < 60) LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 Project [_1#0 AS id#14,_2#1 AS label#15] Filter (_2#1 = 60) LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at :21 == Physical Plan == TungstenProject [label#15] SortMergeJoin [id#2], [id#14] TungstenSort [id#2 ASC], false, 0 TungstenExchange hashpartitioning(id#2) TungstenProject [_1#0 AS id#2] Filter (_2#1 < 60) Scan PhysicalRDD[_1#0,_2#1] TungstenSort [id#14 ASC], false, 0 TungstenExchange hashpartitioning(id#14) TungstenProject [_1#0 AS id#14,_2#1 AS label#15] Filter (_2#1 = 60) Scan PhysicalRDD[_1#0,_2#1] {code} -- This message was
[jira] [Updated] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private
[ https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-12381: - Summary: Copy public decision tree helper classes from spark.mllib to spark.ml and make private (was: Move decision tree helper classes from spark.mllib to spark.ml) > Copy public decision tree helper classes from spark.mllib to spark.ml and > make private > -- > > Key: SPARK-12381 > URL: https://issues.apache.org/jira/browse/SPARK-12381 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > The helper classes for decision trees and decision tree ensembles (e.g. > Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc...) > currently reside in spark.mllib, but as the algorithm implementations are > moved to spark.ml, so should these helper classes. > We should take this opportunity to make some of those helper classes private > when possible (especially if they are only needed during training) and maybe > change the APIs (especially if we can eliminate duplicate data stored in the > final model). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML
Seth Hendrickson created SPARK-14308: Summary: Remove unused mllib tree classes and move private classes to ML Key: SPARK-14308 URL: https://issues.apache.org/jira/browse/SPARK-14308 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Seth Hendrickson Priority: Minor After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some mllib tree internal helper classes are no longer used at all. Also, the private helper classes internal to spark tree training can be ported very easily to spark.ML without affecting APIs. This is the "low hanging fruit" for porting tree internals to spark.ML, and will make the other migrations more tractable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14281) Fix the java8-tests profile and run those tests in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14281. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12073 [https://github.com/apache/spark/pull/12073] > Fix the java8-tests profile and run those tests in Jenkins > -- > > Key: SPARK-14281 > URL: https://issues.apache.org/jira/browse/SPARK-14281 > Project: Spark > Issue Type: Improvement > Components: Project Infra, Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > Spark has some tests for compilation of Java 8 sources (using lambdas) > guarded behind a {{java8-tests}} maven profile, but we currently do not build > or run those tests. As a result, the tests no longer compile. > We should fix these tests and set up automated CI so that they don't break > again. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14277) UnsafeSorterSpillReader should do buffered read from underlying compression stream
[ https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220631#comment-15220631 ] Apache Spark commented on SPARK-14277: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/12096 > UnsafeSorterSpillReader should do buffered read from underlying compression > stream > -- > > Key: SPARK-14277 > URL: https://issues.apache.org/jira/browse/SPARK-14277 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Assignee: Sital Kedia > > While running a Spark job which is spilling a lot of data in reduce phase, we > see that significant amount of CPU is being consumed in native Snappy > ArrayCopy method (Please see the stack trace below). > Stack trace - > org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method) > org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > The reason for that is the SpillReader does a lot of small reads from the > underlying snappy compressed stream and we pay a heavy cost of jni calls for > these small reads. The SpillReader should instead do a buffered read from the > underlying snappy compressed stream. 
[jira] [Assigned] (SPARK-14129) [Table related commands] Alter table
[ https://issues.apache.org/jira/browse/SPARK-14129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-14129: - Assignee: Andrew Or > [Table related commands] Alter table > > > Key: SPARK-14129 > URL: https://issues.apache.org/jira/browse/SPARK-14129 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Andrew Or > > For alter table command, we have the following tokens. > TOK_ALTERTABLE_RENAME > TOK_ALTERTABLE_LOCATION > TOK_ALTERTABLE_PROPERTIES/TOK_ALTERTABLE_DROPPROPERTIES > TOK_ALTERTABLE_SERIALIZER > TOK_ALTERTABLE_SERDEPROPERTIES > TOK_ALTERTABLE_CLUSTER_SORT > TOK_ALTERTABLE_SKEWED > For a data source table, let's implement TOK_ALTERTABLE_RENAME, > TOK_ALTERTABLE_LOCATION, and TOK_ALTERTABLE_SERDEPROPERTIES. We need to > decide what we do for > TOK_ALTERTABLE_PROPERTIES/TOK_ALTERTABLE_DROPPROPERTIES. It will be use to > allow users to correct the data format (e.g. changing csv to > com.databricks.spark.csv to allow the table be accessed by the older versions > of spark). > For a Hive table, we should implement all commands supported by the data > source table and TOK_ALTERTABLE_PROPERTIES/TOK_ALTERTABLE_DROPPROPERTIES. > For TOK_ALTERTABLE_CLUSTER_SORT and TOK_ALTERTABLE_SKEWED, we should throw > exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
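For orientation, the tokens listed above correspond roughly to the following Hive-style DDL forms (table, column, and property names here are made up; exact syntax is per the Hive LanguageManual):

```sql
ALTER TABLE t RENAME TO t2;                               -- TOK_ALTERTABLE_RENAME
ALTER TABLE t SET LOCATION 'hdfs://nn/path/t';            -- TOK_ALTERTABLE_LOCATION
ALTER TABLE t SET TBLPROPERTIES ('k' = 'v');              -- TOK_ALTERTABLE_PROPERTIES
ALTER TABLE t UNSET TBLPROPERTIES ('k');                  -- TOK_ALTERTABLE_DROPPROPERTIES
ALTER TABLE t SET SERDE 'com.example.MySerDe';            -- TOK_ALTERTABLE_SERIALIZER
ALTER TABLE t SET SERDEPROPERTIES ('field.delim' = ','); -- TOK_ALTERTABLE_SERDEPROPERTIES
ALTER TABLE t CLUSTERED BY (c) INTO 8 BUCKETS;            -- TOK_ALTERTABLE_CLUSTER_SORT
ALTER TABLE t SKEWED BY (c) ON ('x');                     -- TOK_ALTERTABLE_SKEWED
```

Per the plan above, the last two forms would throw exceptions rather than be supported.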
[jira] [Comment Edited] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220609#comment-15220609 ] Haohai Ma edited comment on SPARK-4906 at 3/31/16 8:27 PM: --- We just hit the similar OOM issue recently by Spark Master v1.6.0. A detailed retained memory report is attached. was (Author: cloneman): We just hit the similar issue recently by Spark Master OOM. A detailed retained memory report is attached. > Spark master OOMs with exception stack trace stored in JobProgressListener > -- > > Key: SPARK-4906 > URL: https://issues.apache.org/jira/browse/SPARK-4906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1 >Reporter: Mingyu Kim > Attachments: LeakingJobProgressListener2OOM.docx > > > Spark master was OOMing with a lot of stack traces retained in > JobProgressListener. The object dependency goes like the following. > JobProgressListener.stageIdToData => StageUIData.taskData => > TaskUIData.errorMessage > Each error message is ~10kb since it has the entire stack trace. As we have a > lot of tasks, when all of the tasks across multiple stages go bad, these > error messages accounted for 0.5GB of heap at some point. > Please correct me if I'm wrong, but it looks like all the task info for > running applications are kept in memory, which means it's almost always bound > to OOM for long-running applications. Would it make sense to fix this, for > example, by spilling some UI states to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220609#comment-15220609 ] Haohai Ma commented on SPARK-4906: -- We just hit the similar issue recently by Spark Master OOM. A detailed retained memory report is attached. > Spark master OOMs with exception stack trace stored in JobProgressListener > -- > > Key: SPARK-4906 > URL: https://issues.apache.org/jira/browse/SPARK-4906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1 >Reporter: Mingyu Kim > Attachments: LeakingJobProgressListener2OOM.docx > > > Spark master was OOMing with a lot of stack traces retained in > JobProgressListener. The object dependency goes like the following. > JobProgressListener.stageIdToData => StageUIData.taskData => > TaskUIData.errorMessage > Each error message is ~10kb since it has the entire stack trace. As we have a > lot of tasks, when all of the tasks across multiple stages go bad, these > error messages accounted for 0.5GB of heap at some point. > Please correct me if I'm wrong, but it looks like all the task info for > running applications are kept in memory, which means it's almost always bound > to OOM for long-running applications. Would it make sense to fix this, for > example, by spilling some UI states to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haohai Ma updated SPARK-4906: - Attachment: LeakingJobProgressListener2OOM.docx > Spark master OOMs with exception stack trace stored in JobProgressListener > -- > > Key: SPARK-4906 > URL: https://issues.apache.org/jira/browse/SPARK-4906 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1 >Reporter: Mingyu Kim > Attachments: LeakingJobProgressListener2OOM.docx > > > Spark master was OOMing with a lot of stack traces retained in > JobProgressListener. The object dependency goes like the following. > JobProgressListener.stageIdToData => StageUIData.taskData => > TaskUIData.errorMessage > Each error message is ~10kb since it has the entire stack trace. As we have a > lot of tasks, when all of the tasks across multiple stages go bad, these > error messages accounted for 0.5GB of heap at some point. > Please correct me if I'm wrong, but it looks like all the task info for > running applications are kept in memory, which means it's almost always bound > to OOM for long-running applications. Would it make sense to fix this, for > example, by spilling some UI states to disk? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
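As a stop-gap for the unbounded listener state described in this issue, the UI retention settings that already exist in Spark 1.x can be lowered. This only bounds the number of retained jobs and stages, not the size of the per-task error messages within a retained stage, so it mitigates rather than fixes the leak (values below are illustrative; the defaults are 1000 each):

```properties
# spark-defaults.conf
spark.ui.retainedJobs    200
spark.ui.retainedStages  200
```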
[jira] [Commented] (SPARK-14259) Add config to control maximum number of files when coalescing partitions
[ https://issues.apache.org/jira/browse/SPARK-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220602#comment-15220602 ] Apache Spark commented on SPARK-14259: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12095 > Add config to control maximum number of files when coalescing partitions > > > Key: SPARK-14259 > URL: https://issues.apache.org/jira/browse/SPARK-14259 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.0.0 > > > The FileSourceStrategy currently has a config to control the maximum byte > size of coalesced partitions. It is helpful to also have a config to control > the maximum number of files as even small files have a non-trivial fixed > cost. The current packing can put a lot of small files together which cases > straggler tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14264) Add feature importances for GBTs in Pyspark
[ https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14264. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12056 [https://github.com/apache/spark/pull/12056] > Add feature importances for GBTs in Pyspark > --- > > Key: SPARK-14264 > URL: https://issues.apache.org/jira/browse/SPARK-14264 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > GBT feature importances are now implemented in scala. We should expose them > in the pyspark API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14306) PySpark ml.classification OneVsRest support export/import
[ https://issues.apache.org/jira/browse/SPARK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220519#comment-15220519 ] Xusen Yin commented on SPARK-14306: --- start work on it now. > PySpark ml.classification OneVsRest support export/import > - > > Key: SPARK-14306 > URL: https://issues.apache.org/jira/browse/SPARK-14306 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14260) Increase default value for maxCharsPerColumn
[ https://issues.apache.org/jira/browse/SPARK-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14260. --- Resolution: Won't Fix Yeah I think that would be a very rare case. I also suggest we not increase the default limit. This was motivated I think by SPARK-14103 but I'm not sure the cause is a long line, not yet. (Or if it is, the solution is to raise the limit.) > Increase default value for maxCharsPerColumn > > > Key: SPARK-14260 > URL: https://issues.apache.org/jira/browse/SPARK-14260 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > I guess the default value of the option {{maxCharsPerColumn}} looks > relatively small,100 characters meaning 976KB. > It looks some of guys have a problem with this ending up setting the value > manually. > https://github.com/databricks/spark-csv/issues/295 > https://issues.apache.org/jira/browse/SPARK-14103 > According to [univocity > API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)], > this exists to avoid {{OutOfMemoryErrors}}. > If this does not harm performance, then I think it would be better to make > the default value much bigger (eg. 10MB or 100MB) so that users do not take > care of the lengths of each field in CSV file. > Apparently Apache CSV Parser does not have such limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties
[ https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220508#comment-15220508 ] Jo Voordeckers commented on SPARK-11327: So who should I nudge to get it backported into 1.x ? > spark-dispatcher doesn't pass along some spark properties > - > > Key: SPARK-11327 > URL: https://issues.apache.org/jira/browse/SPARK-11327 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Alan Braithwaite > Fix For: 2.0.0 > > > I haven't figured out exactly what's going on yet, but there's something in > the spark-dispatcher which is failing to pass along properties to the > spark-driver when using spark-submit in a clustered mesos docker environment. > Most importantly, it's not passing along spark.mesos.executor.docker.image. > cli: > {code} > docker run -t -i --rm --net=host > --entrypoint=/usr/local/spark/bin/spark-submit > docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf > spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master > mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster > --properties-file /usr/local/spark/conf/spark-defaults.conf --class > com.example.spark.streaming.MyApp > http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 > spark-testing my-stream 40 > {code} > submit output: > {code} > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch > an application in mesos://compute1.example.com:31262. 
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server > at http://compute1.example.com:31262/v1/submissions/create: > { > "action" : "CreateSubmissionRequest", > "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ], > "appResource" : "http://jarserver.example.com:8000/sparkapp.jar", > "clientSparkVersion" : "1.5.0", > "environmentVariables" : { > "SPARK_SCALA_VERSION" : "2.10", > "SPARK_CONF_DIR" : "/usr/local/spark/conf", > "SPARK_HOME" : "/usr/local/spark", > "SPARK_ENV_LOADED" : "1" > }, > "mainClass" : "com.example.spark.streaming.MyApp", > "sparkProperties" : { > "spark.serializer" : "org.apache.spark.serializer.KryoSerializer", > "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : > "/usr/local/lib/libmesos.so", > "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs", > "spark.eventLog.enabled" : "true", > "spark.driver.maxResultSize" : "0", > "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER", > "spark.mesos.deploy.zookeeper.url" : > "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181", > "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar", > "spark.driver.supervise" : "false", > "spark.app.name" : "com.example.spark.streaming.MyApp", > "spark.driver.memory" : "8G", > "spark.logConf" : "true", > "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher", > "spark.mesos.executor.docker.image" : > "docker.example.com/spark-prod:2015.10.2", > "spark.submit.deployMode" : "cluster", > "spark.master" : "mesos://compute1.example.com:31262", > "spark.executor.memory" : "8G", > "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs", > "spark.mesos.docker.executor.network" : "HOST", > "spark.mesos.executor.home" : "/usr/local/spark" > } > } > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > 
"success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created > as driver-20151026220353-0011. Polling submission state... > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20151026220353-0011 in > mesos://compute1.example.com:31262. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server > at > http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "SubmissionStatusResponse", > "driverState" : "QUEUED", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver > driver-20151026220353-0011 is now QUEUED. > 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with > CreateSubmissionResponse: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > {code} > driver log: > {code} > 15/10/26
[jira] [Resolved] (SPARK-14304) Fix tests that don't create temp files in the `java.io.tmpdir` folder
[ https://issues.apache.org/jira/browse/SPARK-14304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14304. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Fix tests that don't create temp files in the `java.io.tmpdir` folder > - > > Key: SPARK-14304 > URL: https://issues.apache.org/jira/browse/SPARK-14304 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.0 > > > If I press `CTRL-C` when running these tests, the temp files will be left in > `sql/core` folder and I need to delete them manually. It's annoying.
[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala
[ https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14279: -- Issue Type: Improvement (was: Story) > Improve the spark build to pick the version information from the pom file > instead of package.scala > -- > > Key: SPARK-14279 > URL: https://issues.apache.org/jira/browse/SPARK-14279 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Sanket Reddy >Assignee: Sanket Reddy >Priority: Minor > > Right now spark-submit --version and other parts of the code pick up > version information from a static SPARK_VERSION. We would want to pick the > version from pom.version, probably stored inside a properties file. Also, > it might be nice to include other details like branch and build information > when running spark-submit --version
[jira] [Resolved] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13710. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12043 [https://github.com/apache/spark/pull/12043] > Spark shell shows ERROR when launching on Windows > - > > Key: SPARK-13710 > URL: https://issues.apache.org/jira/browse/SPARK-13710 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Reporter: Masayoshi TSUZUKI >Priority: Minor > Fix For: 2.0.0 > > > On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message > and stacktrace. > {noformat} > C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell > [ERROR] Terminal initialization failed; falling back to unsupported > java.lang.NoClassDefFoundError: Could not initialize class > scala.tools.fusesource_embedded.jansi.internal.Kernel32 > at > scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) > at > scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204) > at > scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82) > at > scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101) > at > scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158) > at > scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229) > at > scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221) > at > scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209) > at > scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61) > at > scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862) > at > scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at scala.util.Try$.apply(Try.scala:192) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) > at > scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) > at > scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) > at scala.collection.immutable.Stream.collect(Stream.scala:435) > at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) > at > scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97) > at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911) > at org.apache.spark.repl.Main$.doMain(Main.scala:64) > at org.apache.spark.repl.Main$.main(Main.scala:47) > at 
org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737) > at >
[jira] [Updated] (SPARK-13710) Spark shell shows ERROR when launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13710: -- Assignee: Michel Lemay > Spark shell shows ERROR when launching on Windows > - > > Key: SPARK-13710 > URL: https://issues.apache.org/jira/browse/SPARK-13710 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Windows >Reporter: Masayoshi TSUZUKI >Assignee: Michel Lemay >Priority: Minor > Fix For: 2.0.0 > > > On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message > and stacktrace. > {noformat} > C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell > [ERROR] Terminal initialization failed; falling back to unsupported > java.lang.NoClassDefFoundError: Could not initialize class > scala.tools.fusesource_embedded.jansi.internal.Kernel32 > at > scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50) > at > scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204) > at > scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82) > at > scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101) > at > scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158) > at > scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229) > at > scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221) > at > scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209) > at > scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61) > at > scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at 
java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862) > at > scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875) > at scala.util.Try$.apply(Try.scala:192) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875) > at > scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) > at > scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) > at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) > at scala.collection.immutable.Stream.collect(Stream.scala:435) > at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) > at > scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911) > at > scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97) > at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911) > at org.apache.spark.repl.Main$.doMain(Main.scala:64) > at org.apache.spark.repl.Main$.main(Main.scala:47) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183) > at
[jira] [Resolved] (SPARK-14278) Initialize columnar batch with proper memory mode
[ https://issues.apache.org/jira/browse/SPARK-14278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14278. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12070 [https://github.com/apache/spark/pull/12070] > Initialize columnar batch with proper memory mode > - > > Key: SPARK-14278 > URL: https://issues.apache.org/jira/browse/SPARK-14278 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal > Fix For: 2.0.0 > 
[jira] [Updated] (SPARK-14278) Initialize columnar batch with proper memory mode
[ https://issues.apache.org/jira/browse/SPARK-14278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14278: - Assignee: Sameer Agarwal > Initialize columnar batch with proper memory mode > - > > Key: SPARK-14278 > URL: https://issues.apache.org/jira/browse/SPARK-14278 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.0 > 
[jira] [Commented] (SPARK-13363) Aggregator not working with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220492#comment-15220492 ] koert kuipers commented on SPARK-13363: --- Just doing some digging. The issue seems to be that when the TypedAggregateExpression is created from the Aggregator, aEncoder is set to None, and it stays None. Then, when the check that calls resolved on TypedAggregateExpression is done, it returns false because aEncoder is None. > Aggregator not working with DataFrame > - > > Key: SPARK-13363 > URL: https://issues.apache.org/jira/browse/SPARK-13363 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: koert kuipers >Priority: Blocker > > The org.apache.spark.sql.expressions.Aggregator doc/comments say: A base class > for user-defined aggregations, which can be used in [[DataFrame]] and > [[Dataset]]. > It works well with Dataset/GroupedDataset, but I am having no luck using it > with DataFrame/GroupedData. Does anyone have an example of how to use it with a > DataFrame? > In particular, I would like to use it with this method in GroupedData: > {noformat} > def agg(expr: Column, exprs: Column*): DataFrame > {noformat} > Clearly it should be possible, since GroupedDataset uses that very same > method to do the work: > {noformat} > private def agg(exprs: Column*): DataFrame = > groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*) > {noformat} > The trick seems to be the wrapping in withEncoder, which is private. I tried > to do something like it myself, but I had no luck since it uses more private > stuff in TypedColumn. > Anyhow, my attempt at using it in a DataFrame: > {noformat} > val simpleSum = new Aggregator[Int, Int, Int] { > def zero: Int = 0 // The initial value. > def reduce(b: Int, a: Int) = b + a // Add an element to the running total. > def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values. > def finish(b: Int) = b // Return the final result. 
> }.toColumn > val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v") > df.groupBy("k").agg(simpleSum).show > {noformat} > and the resulting error: > {noformat} > org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate > [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) > at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49) > {noformat}
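As an editorial aside, the zero/reduce/merge/finish contract that the simpleSum snippet above relies on can be sketched in plain Python, independent of Spark's encoder machinery. Names here (SimpleSum, aggregate) are illustrative, not part of any Spark API.

```python
class SimpleSum:
    """Plain-Python sketch of Spark's Aggregator contract."""
    def zero(self):          # the initial buffer value
        return 0
    def reduce(self, b, a):  # fold one input element into the buffer
        return b + a
    def merge(self, b1, b2): # combine two per-partition buffers
        return b1 + b2
    def finish(self, b):     # map the final buffer to the output
        return b

def aggregate(partitions, agg):
    """Fold each partition with zero/reduce, then merge the partial buffers."""
    buffers = []
    for part in partitions:
        b = agg.zero()
        for a in part:
            b = agg.reduce(b, a)
        buffers.append(b)
    merged = agg.zero()
    for b in buffers:
        merged = agg.merge(merged, b)
    return agg.finish(merged)

print(aggregate([[1, 2], [3]], SimpleSum()))  # prints 6
```

The split between reduce (within a partition) and merge (across partitions) is what lets the aggregation run in parallel; the encoder-resolution failure described in the comment above is orthogonal to this contract.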
[jira] [Resolved] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties
[ https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11327. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > spark-dispatcher doesn't pass along some spark properties > - > > Key: SPARK-11327 > URL: https://issues.apache.org/jira/browse/SPARK-11327 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Alan Braithwaite > Fix For: 2.0.0 > > > I haven't figured out exactly what's going on yet, but there's something in > the spark-dispatcher which is failing to pass along properties to the > spark-driver when using spark-submit in a clustered mesos docker environment. > Most importantly, it's not passing along spark.mesos.executor.docker.image. > cli: > {code} > docker run -t -i --rm --net=host > --entrypoint=/usr/local/spark/bin/spark-submit > docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf > spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master > mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster > --properties-file /usr/local/spark/conf/spark-defaults.conf --class > com.example.spark.streaming.MyApp > http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 > spark-testing my-stream 40 > {code} > submit output: > {code} > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch > an application in mesos://compute1.example.com:31262. 
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server > at http://compute1.example.com:31262/v1/submissions/create: > { > "action" : "CreateSubmissionRequest", > "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ], > "appResource" : "http://jarserver.example.com:8000/sparkapp.jar", > "clientSparkVersion" : "1.5.0", > "environmentVariables" : { > "SPARK_SCALA_VERSION" : "2.10", > "SPARK_CONF_DIR" : "/usr/local/spark/conf", > "SPARK_HOME" : "/usr/local/spark", > "SPARK_ENV_LOADED" : "1" > }, > "mainClass" : "com.example.spark.streaming.MyApp", > "sparkProperties" : { > "spark.serializer" : "org.apache.spark.serializer.KryoSerializer", > "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : > "/usr/local/lib/libmesos.so", > "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs", > "spark.eventLog.enabled" : "true", > "spark.driver.maxResultSize" : "0", > "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER", > "spark.mesos.deploy.zookeeper.url" : > "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181", > "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar", > "spark.driver.supervise" : "false", > "spark.app.name" : "com.example.spark.streaming.MyApp", > "spark.driver.memory" : "8G", > "spark.logConf" : "true", > "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher", > "spark.mesos.executor.docker.image" : > "docker.example.com/spark-prod:2015.10.2", > "spark.submit.deployMode" : "cluster", > "spark.master" : "mesos://compute1.example.com:31262", > "spark.executor.memory" : "8G", > "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs", > "spark.mesos.docker.executor.network" : "HOST", > "spark.mesos.executor.home" : "/usr/local/spark" > } > } > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > 
"success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created > as driver-20151026220353-0011. Polling submission state... > 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20151026220353-0011 in > mesos://compute1.example.com:31262. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server > at > http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011. > 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server: > { > "action" : "SubmissionStatusResponse", > "driverState" : "QUEUED", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver > driver-20151026220353-0011 is now QUEUED. > 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with > CreateSubmissionResponse: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151026220353-0011", > "success" : true > } > {code} > driver log: > {code} > 15/10/26 22:08:08 INFO
[jira] [Resolved] (SPARK-14069) Improve SparkStatusTracker to also track executor information
[ https://issues.apache.org/jira/browse/SPARK-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14069. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Improve SparkStatusTracker to also track executor information > - > > Key: SPARK-14069 > URL: https://issues.apache.org/jira/browse/SPARK-14069 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > 
[jira] [Assigned] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks
[ https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14243: Assignee: Apache Spark (was: jeanlyn) > updatedBlockStatuses does not update correctly when removing blocks > --- > > Key: SPARK-14243 > URL: https://issues.apache.org/jira/browse/SPARK-14243 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1 >Reporter: jeanlyn >Assignee: Apache Spark > Fix For: 2.0.0 > > > Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly > when removing blocks in *BlockManager.removeBlock* and the methods that invoke > *removeBlock*. See: > branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108 > branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101 > We should make sure *updatedBlockStatuses* updates correctly when: > * Block removed from BlockManager > * Block dropped from memory to disk > * Block added to BlockManager
[jira] [Resolved] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning
[ https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14168. --- Resolution: Duplicate > Managed Memory Leak Msg Should Only Be a Warning > > > Key: SPARK-14168 > URL: https://issues.apache.org/jira/browse/SPARK-14168 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > When a task is completed, executors check to see if all managed memory for > the task was correctly released, and log an error when it wasn't. However, > it turns out it's OK for there to be memory that wasn't released when an > Iterator isn't read to completion, e.g., with {{rdd.take()}}. This results in > a scary error msg in the executor logs: > {noformat} > 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = > 16259594 bytes, TID = 24 > {noformat} > Furthermore, if a task fails for any reason, this msg is also triggered. This > can lead users to believe that the failure was from the memory leak, when the > root cause could be entirely different. E.g., the same error msg appears in > executor logs with this clearly broken user code run with {{spark-shell > --master 'local-cluster[2,2,1024]'}} > {code} > sc.parallelize(0 to 1000, 2).map(x => x % 1 -> > x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") > }.collect > {code} > We should downgrade the msg to a warning and link to a more detailed > explanation. > See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from > users (and perhaps a true fix)
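The {{rdd.take()}} case described above has a close analogue in plain Python, which may help clarify why a partially consumed iterator can leave memory "unreleased" without being a real leak: cleanup attached to an iterator only runs when the iterator is exhausted or closed, so stopping early leaves it pending.

```python
# Sketch of the partially-consumed-iterator situation. The list `released`
# stands in for the executor's task-memory bookkeeping; names are illustrative.
released = []

def records():
    try:
        for i in range(1_000_000):
            yield i
    finally:
        released.append("freed")  # analogous to releasing managed memory

it = records()
first = next(it)       # consume only one element, like rdd.take(1)
print(released)        # [] -- cleanup has not run yet, but nothing is lost
it.close()             # closing the generator triggers the finally block
print(released)        # ['freed']
```

At the moment the "leak" check runs, the memory really is still held, yet it will be released as soon as the iterator is closed or garbage-collected, which is why the issue argues the message should be a warning rather than an error.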
[jira] [Assigned] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks
[ https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14243: Assignee: jeanlyn (was: Apache Spark) > updatedBlockStatuses does not update correctly when removing blocks > --- > > Key: SPARK-14243 > URL: https://issues.apache.org/jira/browse/SPARK-14243 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1 >Reporter: jeanlyn >Assignee: jeanlyn > Fix For: 2.0.0 > > > Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly > when removing blocks in *BlockManager.removeBlock* and the methods that invoke > *removeBlock*. See: > branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108 > branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101 > We should make sure *updatedBlockStatuses* updates correctly when: > * Block removed from BlockManager > * Block dropped from memory to disk > * Block added to BlockManager
[jira] [Updated] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks
[ https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-14243: -- Fix Version/s: 2.0.0 > updatedBlockStatuses does not update correctly when removing blocks > --- > > Key: SPARK-14243 > URL: https://issues.apache.org/jira/browse/SPARK-14243 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.1 >Reporter: jeanlyn >Assignee: jeanlyn > Fix For: 2.0.0 > > > Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly > when removing blocks in *BlockManager.removeBlock* and the methods that invoke > *removeBlock*. See: > branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108 > branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101 > We should make sure *updatedBlockStatuses* updates correctly when: > * Block removed from BlockManager > * Block dropped from memory to disk > * Block added to BlockManager
[jira] [Reopened] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks
[ https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or reopened SPARK-14243:
-------------------------------

> updatedBlockStatuses does not update correctly when removing blocks
> -------------------------------------------------------------------
[jira] [Resolved] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks
[ https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-14243.
-------------------------------
    Resolution: Fixed
    Target Version/s: 1.6.2, 2.0.0

> updatedBlockStatuses does not update correctly when removing blocks
> -------------------------------------------------------------------
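The bookkeeping this fix calls for can be sketched without Spark itself. Below is a minimal, self-contained Scala model of the idea (all names are illustrative, not Spark's actual internals): each of the three operations listed in the issue reports an updated BlockStatus, whereas the buggy removeBlock path deleted the block without reporting anything.

```scala
// Simplified model of SPARK-14243: every operation that changes a block's
// state must also record an updated status, so the metrics analogue of
// updatedBlockStatuses reflects removals as well as puts and drops.
object BlockStatusModel {
  sealed trait StorageLevel
  case object Memory    extends StorageLevel
  case object Disk      extends StorageLevel
  case object NotStored extends StorageLevel // block no longer stored anywhere

  case class BlockStatus(level: StorageLevel, size: Long)

  class BlockManager {
    private val blocks =
      scala.collection.mutable.Map.empty[String, BlockStatus]
    // stand-in for TaskMetrics.updatedBlockStatuses
    val updatedStatuses =
      scala.collection.mutable.ListBuffer.empty[(String, BlockStatus)]

    private def report(id: String, status: BlockStatus): Unit =
      updatedStatuses += (id -> status)

    // block added to the BlockManager
    def putBlock(id: String, size: Long): Unit = {
      val st = BlockStatus(Memory, size)
      blocks(id) = st
      report(id, st)
    }

    // block dropped from memory to disk
    def dropFromMemory(id: String): Unit =
      blocks.get(id).foreach { old =>
        val st = BlockStatus(Disk, old.size)
        blocks(id) = st
        report(id, st)
      }

    // block removed: the original bug was deleting the entry here without
    // calling report; the fix reports a "not stored" status as well.
    def removeBlock(id: String): Unit =
      blocks.remove(id).foreach(_ => report(id, BlockStatus(NotStored, 0L)))
  }
}
```

With this in place, a put/drop/remove sequence leaves three entries in `updatedStatuses`, ending with a `NotStored` status, which is the behavior the issue asks BlockManager to guarantee.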
[jira] [Resolved] (SPARK-14182) Parse DDL command: Alter View
[ https://issues.apache.org/jira/browse/SPARK-14182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or resolved SPARK-14182.
-------------------------------
    Resolution: Fixed
    Assignee: Xiao Li
    Fix Version/s: 2.0.0
    Target Version/s: 2.0.0

> Parse DDL command: Alter View
> -----------------------------
>
>                 Key: SPARK-14182
>                 URL: https://issues.apache.org/jira/browse/SPARK-14182
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>            Assignee: Xiao Li
>             Fix For: 2.0.0
>
> Based on the Hive DDL documents
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
> and
> https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
> the syntax of the {{ALTER VIEW}} DDL command includes:
> {code}
> ALTER VIEW view_name AS select_statement
> ALTER VIEW view_name RENAME TO new_view_name
> ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment)
> ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
> ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
> ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
> {code}
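To illustrate the shape of one of these parse rules: Spark's actual implementation is grammar-based (none of the names below are Spark's), but the RENAME TO form above can be recognized with a simple Scala regex extractor, as a hedged sketch:

```scala
// Illustrative sketch only: recognize "ALTER VIEW old RENAME TO new".
// A real SQL parser handles quoting, qualified names, and all six forms;
// this just shows what the rule extracts: the old and new view names.
object AlterViewSketch {
  // (?i) makes the keywords case-insensitive; identifiers are
  // simplified to \w+ for the sketch.
  private val Rename =
    """(?i)\s*ALTER\s+VIEW\s+(\w+)\s+RENAME\s+TO\s+(\w+)\s*""".r

  // Returns Some((oldName, newName)) on a full match, None otherwise.
  def parseRename(sql: String): Option[(String, String)] = sql match {
    case Rename(oldName, newName) => Some((oldName, newName))
    case _                        => None
  }
}
```

A non-matching statement (for example one of the TBLPROPERTIES forms) simply falls through to `None`, which is how a dispatcher over all six forms would chain such rules.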