[jira] [Updated] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Bo Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Meng updated SPARK-14323:

Description: 
Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results. 

This is because "*" did not get escaped before passing to the regex. If we do 
not escape "*", for example, pattern "*f*", it will cause exception 
(PatternSyntaxException, Dangling meta character) and thus return empty result.

try this: 
val p = "\*f\*".r
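
A minimal sketch of the idea (an illustration only, not the actual Spark patch): treat "*" as the LIKE wildcard and quote every other character before compiling the regex, so a pattern such as "*f*" no longer hits the dangling-meta-character error.

{code}
import java.util.regex.Pattern

// Hypothetical helper: translate a SHOW FUNCTIONS LIKE pattern into a safe
// regex by quoting the literal chunks and mapping the "*" wildcard to ".*".
def likePatternToRegex(pattern: String): scala.util.matching.Regex =
  pattern.split("\\*", -1)      // keep empty chunks for leading/trailing "*"
    .map(Pattern.quote)         // escape any regex metacharacters in literals
    .mkString(".*")             // the "*" wildcard becomes ".*"
    .r

// "*f*".r alone throws PatternSyntaxException (dangling meta character),
// while the translated pattern matches function names containing "f".
assert(likePatternToRegex("*f*").pattern.matcher("date_format").matches())
{code}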

  was:
Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results. 

This is because "*" did not get escaped before passing to the regex. If we do 
not escape "*", for example, pattern "*f*", it will cause exception 
(PatternSyntaxException, Dangling meta character) and thus return empty result.

try this: 
val p = "*f*".r


> [SQL] SHOW FUNCTIONS did not work properly
> --
>
> Key: SPARK-14323
> URL: https://issues.apache.org/jira/browse/SPARK-14323
> Project: Spark
>  Issue Type: Bug
>Reporter: Bo Meng
>Priority: Minor
>
> Show Functions syntax can be found here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions
> When use "*" in the LIKE clause, it will not return the expected results. 
> This is because "*" did not get escaped before passing to the regex. If we do 
> not escape "*", for example, pattern "*f*", it will cause exception 
> (PatternSyntaxException, Dangling meta character) and thus return empty 
> result.
> try this: 
> val p = "\*f\*".r






[jira] [Updated] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Bo Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Meng updated SPARK-14323:

Description: 
Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results. 

This is because "*" did not get escaped before passing to the regex. If we do 
not escape "*", for example, pattern "*f*", it will cause exception 
(PatternSyntaxException, Dangling meta character) and thus return empty result.

try this: 
val p = "*f*".r

  was:
Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results.

This is because "*" did not get escaped before passing to the regex.


> [SQL] SHOW FUNCTIONS did not work properly
> --
>
> Key: SPARK-14323
> URL: https://issues.apache.org/jira/browse/SPARK-14323
> Project: Spark
>  Issue Type: Bug
>Reporter: Bo Meng
>Priority: Minor
>
> Show Functions syntax can be found here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions
> When use "*" in the LIKE clause, it will not return the expected results. 
> This is because "*" did not get escaped before passing to the regex. If we do 
> not escape "*", for example, pattern "*f*", it will cause exception 
> (PatternSyntaxException, Dangling meta character) and thus return empty 
> result.
> try this: 
> val p = "*f*".r






[jira] [Updated] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Bo Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Meng updated SPARK-14323:

Description: 
Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results. 

This is because "\*" did not get escaped before passing to the regex. If we do 
not escape "\*", for example, pattern "\*f\*", it will cause exception 
(PatternSyntaxException, Dangling meta character) and thus return empty result.

try this: 
val p = "\*f\*".r

  was:
Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results. 

This is because "*" did not get escaped before passing to the regex. If we do 
not escape "*", for example, pattern "*f*", it will cause exception 
(PatternSyntaxException, Dangling meta character) and thus return empty result.

try this: 
val p = "\*f\*".r


> [SQL] SHOW FUNCTIONS did not work properly
> --
>
> Key: SPARK-14323
> URL: https://issues.apache.org/jira/browse/SPARK-14323
> Project: Spark
>  Issue Type: Bug
>Reporter: Bo Meng
>Priority: Minor
>
> Show Functions syntax can be found here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions
> When use "*" in the LIKE clause, it will not return the expected results. 
> This is because "\*" did not get escaped before passing to the regex. If we 
> do not escape "\*", for example, pattern "\*f\*", it will cause exception 
> (PatternSyntaxException, Dangling meta character) and thus return empty 
> result.
> try this: 
> val p = "\*f\*".r






[jira] [Commented] (SPARK-14037) count(df) is very slow for dataframe constructed using SparkR::createDataFrame

2016-03-31 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221192#comment-15221192
 ] 

Sun Rui commented on SPARK-14037:
-

Thanks a lot. I will try to figure out another investigation PR. 
BTW, what cluster mode are you using?

> count(df) is very slow for dataframe constructed using SparkR::createDataFrame
> --
>
> Key: SPARK-14037
> URL: https://issues.apache.org/jira/browse/SPARK-14037
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
> Environment: Ubuntu 12.04
> RAM : 6 GB
> Spark 1.6.1 Standalone
>Reporter: Samuel Alexander
>  Labels: performance, sparkR
> Attachments: console.log, spark_ui.png, spark_ui_ray.png
>
>
> Any operation on a dataframe created using SparkR::createDataFrame is very
> slow.
> I have a CSV of size ~ 6MB. Below is the sample content
> 12121212Juej1XC,A_String,5460.8,2016-03-14,7,Quarter
> 12121212K6sZ1XS,A_String,0.0,2016-03-14,7,Quarter
> 12121212K9Xc1XK,A_String,7803.0,2016-03-14,7,Quarter
> 12121212ljXE1XY,A_String,226944.25,2016-03-14,7,Quarter
> 12121212lr8p1XA,A_String,368022.26,2016-03-14,7,Quarter
> 12121212lwip1XA,A_String,84091.0,2016-03-14,7,Quarter
> 12121212lwkn1XA,A_String,54154.0,2016-03-14,7,Quarter
> 12121212lwlv1XA,A_String,11219.09,2016-03-14,7,Quarter
> 12121212lwmL1XQ,A_String,23808.0,2016-03-14,7,Quarter
> 12121212lwnj1XA,A_String,32029.3,2016-03-14,7,Quarter
> I created an R data.frame using r_df <- read.csv(file="r_df.csv", head=TRUE,
> sep=","), and then converted it into a Spark dataframe using sp_df <-
> createDataFrame(sqlContext, r_df).
> Now count(sp_df) took more than 30 seconds.
> When I load the same CSV directly using spark-csv, i.e. direct_df <-
> read.df(sqlContext, "/home/sam/tmp/csv/orig_content.csv", source = 
> "com.databricks.spark.csv", inferSchema = "false", header="true"),
> count(direct_df) took less than 1 second.
> I know createDataFrame performance was improved in Spark 1.6, but other
> operations, like count(), are still very slow.
> How can I get rid of this performance issue? 
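
For reference, a rough Scala analogue of the two code paths being compared above (a sketch against Spark 1.6-era APIs; the case class, path, and sample values are illustrative, taken from the report):

{code}
import org.apache.spark.sql.SQLContext

// Illustrative record matching the sample CSV rows in the report.
case class Record(id: String, tag: String, amount: Double, day: String, n: Int, period: String)

def compare(sqlContext: SQLContext): Unit = {
  // Path 1: build the DataFrame from data already sitting on the driver,
  // analogous to SparkR::createDataFrame(sqlContext, r_df).
  val local = Seq(Record("12121212Juej1XC", "A_String", 5460.8, "2016-03-14", 7, "Quarter"))
  val fromLocal = sqlContext.createDataFrame(local)
  fromLocal.count()   // the reportedly slow path

  // Path 2: read the CSV directly with the spark-csv data source,
  // analogous to read.df(sqlContext, ..., source = "com.databricks.spark.csv").
  val direct = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "false")
    .load("/home/sam/tmp/csv/orig_content.csv")
  direct.count()      // reportedly takes under a second
}
{code}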






[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Attachment: threaddump-1459461915668.tdump

Here is the thread dump taken during the high CPU usage on the executor.

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
> Attachments: threaddump-1459461915668.tdump
>
>
> TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
> to run it (I used only a 1GB text file), but "hangs": tasks are extremely slow
> to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 
> 1GB text file (assuming you know how to generate the csv data). My command is 
> like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
> --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
> --executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
> spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
> {noformat}
> The Spark console output:
> {noformat}
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
> 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
> 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
> 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
> 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
> 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
> 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 
> on executor id: 4 hostname: bigaperf138.svl.ibm.com.
> 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
> 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
> 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
> 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 
> on executor id: 2 hostname: bigaperf137.svl.ibm.com.
> 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
> 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
> {noformat}
> Notice that time durations between tasks are unusually long: 2~5 minutes.
> When looking at the Linux 'perf' tool, two top CPU consumers are:
> 86.48%java  [unknown]   
> 12.41%libjvm.so
> Using the Java hotspot profiling tools, I am able to show what hotspot 
> methods are (top 5):
> {noformat}
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten()   
> 46.845276   9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms
> 9,654,179 ms
> org.apache.spark.unsafe.Platform.copyMemory() 18.631157   3,848,442 ms 
> (18.6%)3,848,442 ms3,848,442 ms3,848,442 ms
> org.apache.spark.util.collection.CompactBuffer.$plus$eq() 6.8570185   
> 1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue()
>4.6126328   955,495 ms (4.6%)   955,495 ms  2,153,910 ms   
>  2,153,910 ms
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()
> 4.581077949,930 ms (4.6%)   949,930 ms  19,967,510 ms   
> 19,967,510 ms
> {noformat}
> So as you can see, the test has been running for 1.5 hours...with 46% CPU 
> spent in the 
> org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 
> The stacks for top two are:
> {noformat}
> Marshalling   
> I
> java/io/DataOutputStream.writeInt() line 197
> org.​apache.​spark.​sql   
> I
> org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue()
>  line 60
> org.​apache.​spark.​storage   
> I
> org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
> org.​apache.​spark.​shuffle   
> I
> 

[jira] [Updated] (SPARK-14303) Refactor SparkRWrappers

2016-03-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14303:
--
Assignee: Yanbo Liang

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.
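
As a purely illustrative sketch of the per-algorithm wrapper style proposed here (names and parameters are assumptions; the real classes under org.apache.spark.ml.r may differ):

{code}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.sql.DataFrame

// One small wrapper per algorithm, exposing only what the R side needs.
class KMeansWrapper(val pipeline: PipelineModel) {
  def clusterCenters: Array[String] =
    pipeline.stages.collectFirst {
      case m: KMeansModel => m.clusterCenters.map(_.toString)
    }.getOrElse(Array.empty)
}

object KMeansWrapper {
  def fit(df: DataFrame, k: Int, maxIter: Int, featuresCol: String): KMeansWrapper = {
    val kmeans = new KMeans().setK(k).setMaxIter(maxIter).setFeaturesCol(featuresCol)
    val model = new Pipeline().setStages(Array(kmeans)).fit(df)
    new KMeansWrapper(model)
  }
}
{code}

Keeping each algorithm in its own small wrapper like this is what would make the shared SparkRWrappers object unnecessary.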






[jira] [Comment Edited] (SPARK-14303) Refactor SparkRWrappers

2016-03-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1522#comment-1522
 ] 

Yanbo Liang edited comment on SPARK-14303 at 4/1/16 4:00 AM:
-

[~mengxr] I have made the refactoring for k-means; I will link the PR here.
For glm, I think it can be handled together with SPARK-12566.


was (Author: yanboliang):
[~mengxr] I have made the refactoring for k-means; I will link the PR here.

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Assigned] (SPARK-14303) Refactor SparkRWrappers

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14303:


Assignee: (was: Apache Spark)

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Commented] (SPARK-14303) Refactor SparkRWrappers

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221114#comment-15221114
 ] 

Apache Spark commented on SPARK-14303:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12039

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Commented] (SPARK-14303) Refactor SparkRWrappers

2016-03-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1522#comment-1522
 ] 

Yanbo Liang commented on SPARK-14303:
-

[~mengxr] I have made the refactoring for k-means; I will link the PR here.

> Refactor SparkRWrappers
> ---
>
> Key: SPARK-14303
> URL: https://issues.apache.org/jira/browse/SPARK-14303
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>
> We use a single object `SparkRWrappers` 
> (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/SparkRWrappers.scala)
>  to wrap method calls to glm and kmeans in SparkR. This is quite hard to 
> maintain. We should refactor them into separate wrappers, like 
> `AFTSurvivalRegressionWrapper` and `NaiveBayesWrapper`.
> The package name should be `spark.ml.r` instead of `spark.ml.api.r`.






[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-03-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221103#comment-15221103
 ] 

Yanbo Liang commented on SPARK-14313:
-

Sure, please assign it to me.

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Assigned] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14322:


Assignee: Apache Spark

> Use treeReduce instead of reduce in OnlineLDAOptimizer
> --
>
> Key: SPARK-14322
> URL: https://issues.apache.org/jira/browse/SPARK-14322
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use 
> treeReduce.  This can cause scalability issues.  This should be an easy fix.
> See this line: 
> [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452]
> and a few lines below it.






[jira] [Commented] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221088#comment-15221088
 ] 

Apache Spark commented on SPARK-14322:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/12106

> Use treeReduce instead of reduce in OnlineLDAOptimizer
> --
>
> Key: SPARK-14322
> URL: https://issues.apache.org/jira/browse/SPARK-14322
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use 
> treeReduce.  This can cause scalability issues.  This should be an easy fix.
> See this line: 
> [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452]
> and a few lines below it.






[jira] [Assigned] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14322:


Assignee: (was: Apache Spark)

> Use treeReduce instead of reduce in OnlineLDAOptimizer
> --
>
> Key: SPARK-14322
> URL: https://issues.apache.org/jira/browse/SPARK-14322
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use 
> treeReduce.  This can cause scalability issues.  This should be an easy fix.
> See this line: 
> [https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452]
> and a few lines below it.






[jira] [Resolved] (SPARK-14242) avoid too many copies in network when a network frame is large

2016-03-31 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14242.
--
   Resolution: Fixed
 Assignee: Zhang, Liye
Fix Version/s: 2.0.0

> avoid too many copies in network when a network frame is large
> --
>
> Key: SPARK-14242
> URL: https://issues.apache.org/jira/browse/SPARK-14242
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Spark Core
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
> Fix For: 2.0.0
>
>
> When a shuffle block is huge, say a large array (more than 128MB), there is a
> performance problem when fetching remote blocks. This is because the network
> frame is large, and the composite buffer used by the underlying Netty layer
> consolidates its components once their number reaches the maximum (default is
> 16), which results in too many memory copies inside Netty's *compositeBuffer*.
> How to reproduce:
> {code}
> sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 
> * 1024 * 50)).iterator).reduce((a,b)=> a).length
> {code}
> In this case, the serialized result of each task is about 400MB and is
> transferred to the driver as an *indirectResult*. After the data arrives, the
> driver still needs a lot of time to process it, and the 3 CPUs (parallelism is
> 3 in this case) are fully utilized with very high system-call overhead. This
> processing time is counted as result-fetching time on the web UI.
> Such cases are very common in ML applications, which often return a large
> array from each executor.






[jira] [Assigned] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14321:


Assignee: Apache Spark

> Reduce date format cost and string-to-date cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>Priority: Minor
>
> Currently the generated code is:
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an
> as-needed basis.
> I will share the patch soon.






[jira] [Commented] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15221074#comment-15221074
 ] 

Apache Spark commented on SPARK-14321:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12105

> Reduce date format cost and string-to-date cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Currently the generated code is:
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an
> as-needed basis.
> I will share the patch soon.






[jira] [Assigned] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14321:


Assignee: (was: Apache Spark)

> Reduce date format cost and string-to-date cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Currently the generated code is:
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an
> as-needed basis.
> I will share the patch soon.






[jira] [Updated] (SPARK-14321) Reduce date format cost and string-to-date cost in date functions

2016-03-31 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-14321:
-
Summary: Reduce date format cost and string-to-date cost in date functions  
(was: Reduce DateFormat cost in datetimeExpressions)

> Reduce date format cost and string-to-date cost in date functions
> -
>
> Key: SPARK-14321
> URL: https://issues.apache.org/jira/browse/SPARK-14321
> Project: Spark
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> Currently the generated code is:
> {noformat}
> /* 066 */ UTF8String primitive5 = null;
> /* 067 */ if (!isNull4) {
> /* 068 */   try {
> /* 069 */ primitive5 = UTF8String.fromString(new 
> java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(
> /* 070 */ new java.util.Date(primitive7 * 1000L)));
> /* 071 */   } catch (java.lang.Throwable e) {
> /* 072 */ isNull4 = true;
> /* 073 */   }
> /* 074 */ }
> {noformat}
> Instantiation of SimpleDateFormat is fairly expensive. It can be created on an
> as-needed basis.
> I will share the patch soon.






[jira] [Commented] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220954#comment-15220954
 ] 

Apache Spark commented on SPARK-14323:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12104

> [SQL] SHOW FUNCTIONS did not work properly
> --
>
> Key: SPARK-14323
> URL: https://issues.apache.org/jira/browse/SPARK-14323
> Project: Spark
>  Issue Type: Bug
>Reporter: Bo Meng
>Priority: Minor
>
> Show Functions syntax can be found here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions
> When use "*" in the LIKE clause, it will not return the expected results.
> This is because "*" did not get escaped before passing to the regex.






[jira] [Assigned] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14323:


Assignee: Apache Spark

> [SQL] SHOW FUNCTIONS did not work properly
> --
>
> Key: SPARK-14323
> URL: https://issues.apache.org/jira/browse/SPARK-14323
> Project: Spark
>  Issue Type: Bug
>Reporter: Bo Meng
>Assignee: Apache Spark
>Priority: Minor
>
> Show Functions syntax can be found here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions
> When use "*" in the LIKE clause, it will not return the expected results.
> This is because "*" did not get escaped before passing to the regex.






[jira] [Assigned] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14323:


Assignee: (was: Apache Spark)

> [SQL] SHOW FUNCTIONS did not work properly
> --
>
> Key: SPARK-14323
> URL: https://issues.apache.org/jira/browse/SPARK-14323
> Project: Spark
>  Issue Type: Bug
>Reporter: Bo Meng
>Priority: Minor
>
> Show Functions syntax can be found here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions
> When use "*" in the LIKE clause, it will not return the expected results.
> This is because "*" did not get escaped before passing to the regex.






[jira] [Created] (SPARK-14323) [SQL] SHOW FUNCTIONS did not work properly

2016-03-31 Thread Bo Meng (JIRA)
Bo Meng created SPARK-14323:
---

 Summary: [SQL] SHOW FUNCTIONS did not work properly
 Key: SPARK-14323
 URL: https://issues.apache.org/jira/browse/SPARK-14323
 Project: Spark
  Issue Type: Bug
Reporter: Bo Meng
Priority: Minor


Show Functions syntax can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowFunctions

When use "*" in the LIKE clause, it will not return the expected results.

This is because "*" did not get escaped before passing to the regex.






[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Description: 
TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
to run it (I used only a 1GB text file), but "hangs": tasks are extremely slow
to process AND all CPUs are used 100% by the executor JVMs.

It is very easy to reproduce:
1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB 
text file (assuming you know how to generate the csv data). My command is like 
this:

{noformat}
/TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
--verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
--executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
{noformat}

The Spark console output:
{noformat}
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on 
executor id: 2 hostname: bigaperf137.svl.ibm.com.
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
{noformat}

Notice that time durations between tasks are unusually long: 2~5 minutes.

When looking at the Linux 'perf' tool, two top CPU consumers are:
86.48%java  [unknown]   
12.41%libjvm.so

Using the Java hotspot profiling tools, I am able to show what hotspot methods 
are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() 
46.845276   9,654,179 ms **(46.8%)**9,654,179 ms9,654,179 ms
9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()   18.631157   3,848,442 ms 
(18.6%)3,848,442 ms3,848,442 ms3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()   6.8570185   
1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() 
4.6126328   955,495 ms (4.6%)   955,495 ms  2,153,910 ms
2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  
4.581077949,930 ms (4.6%)   949,930 ms  19,967,510 ms   
19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent 
in the 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 

The stacks for top two are:
{noformat}
Marshalling 
I
java/io/DataOutputStream.writeInt() line 197
org.​apache.​spark.​sql 
I
org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() 
line 60
org.​apache.​spark.​storage 
I
org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.​apache.​spark.​shuffle 
I
org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.​apache.​spark.​scheduler   
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I
org/apache/spark/scheduler/Task.run() line 82
org.​apache.​spark.​executor
I
org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead,​ Standard Library Worker Dispatching  
I
java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I
java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I
java/lang/Thread.run() line 745
{noformat}

and 

{noformat}
org.​apache.​spark.​unsafe  
I

[jira] [Resolved] (SPARK-14267) Execute multiple Python UDFs in single batch

2016-03-31 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14267.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12057
[https://github.com/apache/spark/pull/12057]

> Execute multiple Python UDFs in single batch
> 
>
> Key: SPARK-14267
> URL: https://issues.apache.org/jira/browse/SPARK-14267
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> {code}
> select udf1(a), udf2(b), udf3(a, b)
> {code}






[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Description: 
TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
to run it (I used only a 1GB text file), but "hangs": tasks are extremely slow
to process AND all CPUs are used 100% by the executor JVMs.

It is very easy to reproduce:
1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB 
text file (assuming you know how to generate the csv data). My command is like 
this:

{noformat}
/TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
--verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
--executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
{noformat}

The Spark console output:
{noformat}
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on 
executor id: 2 hostname: bigaperf137.svl.ibm.com.
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
{noformat}

Notice that time durations between tasks are unusually long: 2~5 minutes.

When looking at the Linux 'perf' tool, two top CPU consumers are:
86.48%java  [unknown]   
12.41%libjvm.so

Using the Java hotspot profiling tools, I am able to show what hotspot methods 
are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() 
46.845276   9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms
9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()   18.631157   3,848,442 ms 
(18.6%)3,848,442 ms3,848,442 ms3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()   6.8570185   
1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() 
4.6126328   955,495 ms (4.6%)   955,495 ms  2,153,910 ms
2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  
4.581077949,930 ms (4.6%)   949,930 ms  19,967,510 ms   
19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent 
in the 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 

The stacks for top two are:
{noformat}
Marshalling 
I
java/io/DataOutputStream.writeInt() line 197
org.​apache.​spark.​sql 
I
org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() 
line 60
org.​apache.​spark.​storage 
I
org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.​apache.​spark.​shuffle 
I
org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.​apache.​spark.​scheduler   
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I
org/apache/spark/scheduler/Task.run() line 82
org.​apache.​spark.​executor
I
org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead,​ Standard Library Worker Dispatching  
I
java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I
java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I
java/lang/Thread.run() line 745
{noformat}

and 

{noformat}
org.​apache.​spark.​unsafe  
I

[jira] [Commented] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220864#comment-15220864
 ] 

JESSE CHEN commented on SPARK-14318:


Q14 is as follows:
{noformat}
with  cross_items as
 (select i_item_sk ss_item_sk
 from item
 JOIN
 (select brand_id, class_id, category_id from
 (select iss.i_brand_id brand_id
 ,iss.i_class_id class_id
 ,iss.i_category_id category_id
 from store_sales
 ,item iss
 ,date_dim d1
 where ss_item_sk = iss.i_item_sk
   and ss_sold_date_sk = d1.d_date_sk
   and d1.d_year between 1999 AND 1999 + 2) x1
 JOIN
 (select ics.i_brand_id
 ,ics.i_class_id
 ,ics.i_category_id
 from catalog_sales
 ,item ics
 ,date_dim d2
 where cs_item_sk = ics.i_item_sk
   and cs_sold_date_sk = d2.d_date_sk
   and d2.d_year between 1999 AND 1999 + 2) x2
   ON x1.brand_id = x2.i_brand_id and
  x1.class_id = x2.i_class_id and
  x1.category_id = x2.i_category_id
 JOIN
 (select iws.i_brand_id
 ,iws.i_class_id
 ,iws.i_category_id
 from web_sales
 ,item iws
 ,date_dim d3
 where ws_item_sk = iws.i_item_sk
   and ws_sold_date_sk = d3.d_date_sk
   and d3.d_year between 1999 AND 1999 + 2) x3
   ON x1.brand_id = x3.i_brand_id and
  x1.class_id = x3.i_class_id and
  x1.category_id = x3.i_category_id
 ) x4
 where i_brand_id = x4.brand_id
  and i_class_id = x4.class_id
  and i_category_id = x4.category_id
),
 avg_sales as
 (select avg(quantity*list_price) average_sales
  from (select ss_quantity quantity
 ,ss_list_price list_price
   from store_sales
   ,date_dim
   where ss_sold_date_sk = d_date_sk
 and d_year between 1999 and 1999 + 2
   union all
   select cs_quantity quantity
 ,cs_list_price list_price
   from catalog_sales
   ,date_dim
   where cs_sold_date_sk = d_date_sk
 and d_year between 1999 and 1999 + 2
   union all
   select ws_quantity quantity
 ,ws_list_price list_price
   from web_sales
   ,date_dim
   where ws_sold_date_sk = d_date_sk
 and d_year between 1999 and 1999 + 2) x)
  select  * from
 (select 'store' channel, i_brand_id,i_class_id,i_category_id
,sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) number_sales
 from store_sales ss1
 JOIN item ON ss1.ss_item_sk = i_item_sk
 JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk
 JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk
 JOIN avg_sales
 JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq
 where dd2.d_year = 1999 + 1
   and dd2.d_moy = 12
   and dd2.d_dom = 11
 group by average_sales,i_brand_id,i_class_id,i_category_id
 having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) 
this_year,
 (select 'store' channel, i_brand_id,i_class_id
,i_category_id, sum(ss1.ss_quantity*ss1.ss_list_price) sales, count(*) 
number_sales
 from store_sales ss1
 JOIN item ON ss1.ss_item_sk = i_item_sk
 JOIN date_dim dd1 ON ss1.ss_sold_date_sk = dd1.d_date_sk
 JOIN cross_items ON ss1.ss_item_sk = cross_items.ss_item_sk
 JOIN avg_sales
 JOIN date_dim dd2 ON dd1.d_week_seq = dd2.d_week_seq
 where dd2.d_year = 1999
   and dd2.d_moy = 12
   and dd2.d_dom = 11
 group by average_sales, i_brand_id,i_class_id,i_category_id
 having sum(ss1.ss_quantity*ss1.ss_list_price) > avg_sales.average_sales) 
last_year
 where this_year.i_brand_id= last_year.i_brand_id
   and this_year.i_class_id = last_year.i_class_id
   and this_year.i_category_id = last_year.i_category_id
 order by this_year.channel, this_year.i_brand_id, this_year.i_class_id, 
this_year.i_category_id
   limit 100
{noformat}



> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
>
> TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
> to run it (I used only a 1GB text file), but "hangs": tasks are extremely slow
> to process AND all CPUs are used 100% by the executor JVMs.
> It is very easy to reproduce:
> 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 
> 1GB text file (assuming you know how to generate the csv data). My command is 
> like this:
> {noformat}
> /TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
> --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
> --executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
> spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > 

[jira] [Created] (SPARK-14322) Use treeReduce instead of reduce in OnlineLDAOptimizer

2016-03-31 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14322:
-

 Summary: Use treeReduce instead of reduce in OnlineLDAOptimizer
 Key: SPARK-14322
 URL: https://issues.apache.org/jira/browse/SPARK-14322
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Joseph K. Bradley


OnlineLDAOptimizer uses {{RDD.reduce}} in two places where it could use 
treeReduce.  This can cause scalability issues.  This should be an easy fix.

See this line: 
[https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452]
and a few lines below it.
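
For illustration, a generic sketch of the substitution (this is not the LDAOptimizer code itself, just the RDD API difference between the two calls):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object TreeReduceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("treeReduce-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1L to 1000000L, numSlices = 100)

    // reduce sends every partition's partial result straight to the driver.
    val flat = rdd.reduce(_ + _)

    // treeReduce combines partial results in a multi-level tree on the
    // executors first, which scales better with many partitions.
    val tree = rdd.treeReduce(_ + _, depth = 2)

    assert(flat == tree)
    sc.stop()
  }
}
{code}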






[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Description: 
TPCDS Q14 parses successfully, and plans are created successfully. Spark tries
to run it (I used only a 1GB text file), but "hangs": tasks are extremely slow
to process AND all CPUs are used 100% by the executor JVMs.

It is very easy to reproduce:
1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB 
text file (assuming you know how to generate the csv data). My command is like 
this:

{noformat}
/TestAutomation/downloads/spark-master/bin/spark-sql  --driver-memory 10g 
--verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 
--executor-memory 8g --num-executors 4 --executor-cores 4 --conf 
spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.out
{noformat}

The Spark console output:
{noformat}
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 
17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 
17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200)
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 
17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes)
16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 
17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200)
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 
17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes)
16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on 
executor id: 4 hostname: bigaperf138.svl.ibm.com.
16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 
17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200)
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 
17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes)
16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on 
executor id: 2 hostname: bigaperf137.svl.ibm.com.
16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 
17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200)
{noformat}

Notice that time durations between tasks are unusually long: 2~5 minutes.

When looking at the Linux 'perf' tool, two top CPU consumers are:
86.48%java  [unknown]   
12.41%libjvm.so

Using the Java hotspot profiling tools, I am able to show what hotspot methods 
are (top 5):
{noformat}
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() 
46.845276   9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms
9,654,179 ms
org.apache.spark.unsafe.Platform.copyMemory()   18.631157   3,848,442 ms 
(18.6%)3,848,442 ms3,848,442 ms3,848,442 ms
org.apache.spark.util.collection.CompactBuffer.$plus$eq()   6.8570185   
1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() 
4.6126328   955,495 ms (4.6%)   955,495 ms  2,153,910 ms
2,153,910 ms
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write()  
4.581077949,930 ms (4.6%)   949,930 ms  19,967,510 ms   
19,967,510 ms
{noformat}
So as you can see, the test has been running for 1.5 hours...with 46% CPU spent 
in the 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. 

The stacks for top two are:
{noformat}
Marshalling 
I
java/io/DataOutputStream.writeInt() line 197
org.​apache.​spark.​sql 
I
org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() 
line 60
org.​apache.​spark.​storage 
I
org/apache/spark/storage/DiskBlockObjectWriter.write() line 185
org.​apache.​spark.​shuffle 
I
org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150
org.​apache.​spark.​scheduler   
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78
I
org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46
I
org/apache/spark/scheduler/Task.run() line 82
org.​apache.​spark.​executor
I
org/apache/spark/executor/Executor$TaskRunner.run() line 231
Dispatching Overhead,​ Standard Library Worker Dispatching  
I
java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142
I
java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617
I
java/lang/Thread.run() line 745
{noformat}

and 

{noformat}
org.​apache.​spark.​unsafe  
I

[jira] [Created] (SPARK-14321) Reduce DateFormat cost in datetimeExpressions

2016-03-31 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14321:


 Summary: Reduce DateFormat cost in datetimeExpressions
 Key: SPARK-14321
 URL: https://issues.apache.org/jira/browse/SPARK-14321
 Project: Spark
  Issue Type: Bug
Reporter: Rajesh Balamohan
Priority: Minor


Currently the code generated is

{noformat}
/* 066 */ UTF8String primitive5 = null;
/* 067 */ if (!isNull4) {
/* 068 */   try {
/* 069 */ primitive5 = UTF8String.fromString(new 
java.text.SimpleDateFormat("-MM-dd HH:mm:ss").format(
/* 070 */ new java.util.Date(primitive7 * 1000L)));
/* 071 */   } catch (java.lang.Throwable e) {
/* 072 */ isNull4 = true;
/* 073 */   }
/* 074 */ }
{noformat}

Instantiating SimpleDateFormat is fairly expensive. It can be created once, on an 
as-needed basis, and reused instead of being instantiated for every row. 

I will share the patch soon.
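
As a rough illustration (a minimal sketch, not the actual patch or the real 
generated code), the formatter can be held in a lazily initialized field so it is 
created at most once per task instead of once per row:

{code}
import java.text.SimpleDateFormat
import java.util.Date

// Illustrative sketch only: cache the formatter instead of creating it per row.
// SimpleDateFormat is not thread-safe, so one instance per task/partition (not a
// shared global) is the safe granularity.
class TimestampFormatter {
  private lazy val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

  def format(seconds: Long): String = formatter.format(new Date(seconds * 1000L))
}
{code}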



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14320) Make ColumnarBatch.Row mutable

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14320:


Assignee: (was: Apache Spark)

> Make ColumnarBatch.Row mutable
> --
>
> Key: SPARK-14320
> URL: https://issues.apache.org/jira/browse/SPARK-14320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>
> In order to leverage a data structure like `AggregateHashmap` 
> (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates 
> with keys, we need to make ColumnarBatch.Row mutable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14320) Make ColumnarBatch.Row mutable

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14320:


Assignee: Apache Spark

> Make ColumnarBatch.Row mutable
> --
>
> Key: SPARK-14320
> URL: https://issues.apache.org/jira/browse/SPARK-14320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>
> In order to leverage a data structure like `AggregateHashmap` 
> (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates 
> with keys, we need to make ColumnarBatch.Row mutable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14320) Make ColumnarBatch.Row mutable

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220834#comment-15220834
 ] 

Apache Spark commented on SPARK-14320:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12103

> Make ColumnarBatch.Row mutable
> --
>
> Key: SPARK-14320
> URL: https://issues.apache.org/jira/browse/SPARK-14320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>
> In order to leverage a data structure like `AggregateHashmap` 
> (https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates 
> with keys, we need to make ColumnarBatch.Row mutable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14277) Significant amount of CPU is being consumed in SnappyNative arrayCopy method

2016-03-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14277.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12096
[https://github.com/apache/spark/pull/12096]

> Significant amount of CPU is being consumed in SnappyNative arrayCopy method
> 
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
> Fix For: 2.0.0
>
>
> While running a Spark job which is spilling a lot of data in reduce phase, we 
> see that significant amount of CPU is being consumed in native Snappy 
> ArrayCopy method (Please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason for that is the SpillReader does a lot of small reads from the 
> underlying snappy compressed stream and SnappyInputStream invokes native jni 
> ArrayCopy method to copy the data, which is expensive. We should fix 
> snappy-java to use the non-JNI based System.arraycopy method in this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14320) Make ColumnarBatch.Row mutable

2016-03-31 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-14320:
--

 Summary: Make ColumnarBatch.Row mutable
 Key: SPARK-14320
 URL: https://issues.apache.org/jira/browse/SPARK-14320
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal


In order to leverage a data structure like `AggregateHashmap` 
(https://issues.apache.org/jira/browse/SPARK-14263) to speed up aggregates with 
keys, we need to make ColumnarBatch.Row mutable.
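
For illustration only (a hypothetical sketch, not Spark's actual ColumnarBatch.Row 
API): making the row mutable essentially means adding setters that write back into 
the underlying column vectors, so an aggregate hash map can update buffer values in 
place:

{code}
// Hypothetical sketch, not the real API: a row view over column vectors that
// supports in-place updates, so aggregation buffers can be mutated directly
// instead of being copied out and re-inserted.
trait MutableRowView {
  def getInt(ordinal: Int): Int
  def getLong(ordinal: Int): Long
  def setInt(ordinal: Int, value: Int): Unit    // write back into the column vector
  def setLong(ordinal: Int, value: Long): Unit
}
{code}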



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14263) Benchmark Vectorized HashMap for GroupBy Aggregates

2016-03-31 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-14263:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-14319

> Benchmark Vectorized HashMap for GroupBy Aggregates
> ---
>
> Key: SPARK-14263
> URL: https://issues.apache.org/jira/browse/SPARK-14263
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14319) Speed up group-by aggregates

2016-03-31 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-14319:
--

 Summary: Speed up group-by aggregates
 Key: SPARK-14319
 URL: https://issues.apache.org/jira/browse/SPARK-14319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Sameer Agarwal


Aggregates with keys in SparkSQL are almost 30x slower than aggregates without keys. 
This master JIRA tracks our attempts to optimize them.
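
For context, a small illustration of the two cases being compared (illustrative 
only; it assumes a spark-shell style environment with sc and sqlContext and a toy 
DataFrame, none of which come from the actual benchmark):

{code}
import org.apache.spark.sql.functions.sum
import sqlContext.implicits._  // assumes a SQLContext named sqlContext, as in spark-shell

// Toy DataFrame with a grouping column k and a value column v.
val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("k", "v")

// Aggregate without a key vs. the group-by (keyed) aggregate this JIRA targets.
val globalAgg = df.agg(sum("v"))
val keyedAgg  = df.groupBy("k").agg(sum("v"))
{code}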



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14137) Conflict between NullPropagation and InferFiltersFromConstraints

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220814#comment-15220814
 ] 

Apache Spark commented on SPARK-14137:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12102

> Conflict between NullPropagation and InferFiltersFromConstraints
> 
>
> Key: SPARK-14137
> URL: https://issues.apache.org/jira/browse/SPARK-14137
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> Some optimizer rules conflict with each other, fail this test: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54069/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/union20/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Labels: hangs  (was: tpcds-result-mismatch)

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL missing at least one row (grep for ABDA) ; I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. | AAHD |   2739 |  2039 |
> | Bad cards must make. | ABDA |   1717 |  1782 |
> | Bad cards must 

[jira] [Created] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)
JESSE CHEN created SPARK-14318:
--

 Summary: TPCDS query 14 causes Spark SQL to hang
 Key: SPARK-14318
 URL: https://issues.apache.org/jira/browse/SPARK-14318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: JESSE CHEN


Testing Spark SQL using TPC queries. Query 21 returns wrong results compared to 
official result set. This is at 1GB SF (validation run).

SparkSQL missing at least one row (grep for ABDA) ; I believe 2 
other rows are missing as well.

Actual results:
{noformat}
[null,AABD,2565,1922]
[null,AAHD,2956,2052]
[null,AALA,2042,1793]
[null,ACGC,2373,1771]
[null,ACKC,2321,1856]
[null,ACOB,1504,1397]
[null,ADKB,1820,2163]
[null,AEAD,2631,1965]
[null,AEOC,1659,1798]
[null,AFAC,1965,1705]
[null,AFAD,1769,1313]
[null,AHDE,2700,1985]
[null,AHHA,1578,1082]
[null,AIEC,1756,1804]
[null,AIMC,3603,2951]
[null,AJAC,2109,1989]
[null,AJKB,2573,3540]
[null,ALBE,3458,2992]
[null,ALCE,1720,1810]
[null,ALEC,2569,1946]
[null,ALNB,2552,1750]
[null,ANFE,2022,2269]
[null,AOIB,2982,2540]
[null,APJB,2344,2593]
[null,BAPD,2182,2787]
[null,BDCE,2844,2069]
[null,BDDD,2417,2537]
[null,BDJA,1584,1666]
[null,BEOD,2141,2649]
[null,BFCC,2745,2020]
[null,BFMB,1642,1364]
[null,BHPC,1923,1780]
[null,BIDB,1956,2836]
[null,BIGB,2023,2344]
[null,BIJB,1977,2728]
[null,BJFE,1891,2390]
[null,BLDE,1983,1797]
[null,BNID,2485,2324]
[null,BNLD,2385,2786]
[null,BOMB,2291,2092]
[null,CAAA,2233,2560]
[null,CBCD,1540,2012]
[null,CBIA,2394,2122]
[null,CBPB,1790,1661]
[null,CCMD,2654,2691]
[null,CDBC,1804,2072]
[null,CFEA,1941,1567]
[null,CGFD,2123,2265]
[null,CHPC,2933,2174]
[null,CIGD,2618,2399]
[null,CJCB,2728,2367]
[null,CJLA,1350,1732]
[null,CLAE,2578,2329]
[null,CLGA,1842,1588]
[null,CLLB,3418,2657]
[null,CLOB,3115,2560]
[null,CMAD,1991,2243]
[null,CMJA,1261,1855]
[null,CMLA,3288,2753]
[null,CMPD,1320,1676]
[null,CNGB,2340,2118]
[null,CNHD,3519,3348]
[null,CNPC,2561,1948]
[null,DCPC,2664,2627]
[null,DDHA,1313,1926]
[null,DDND,1109,835]
[null,DEAA,2141,1847]
[null,DEJA,3142,2723]
[null,DFKB,1470,1650]
[null,DGCC,2113,2331]
[null,DGFC,2201,2928]
[null,DHPA,2467,2133]
[null,DMBA,3085,2087]
[null,DPAB,3494,3081]
[null,EAEC,2133,2148]
[null,EAPA,1560,1275]
[null,ECGC,2815,3307]
[null,EDPD,2731,1883]
[null,EEEC,2024,1902]
[null,EEMC,2624,2387]
[null,EFFA,2047,1878]
[null,EGJA,2403,2633]
[null,EGMA,2784,2772]
[null,EGOC,2389,1753]
[null,EHFD,1940,1420]
[null,EHLB,2320,2057]
[null,EHPA,1898,1853]
[null,EIPB,2930,2326]
[null,EJAE,2582,1836]
[null,EJIB,2257,1681]
[null,EJJA,2791,1941]
[null,EJJD,3410,2405]
[null,EJNC,2472,2067]
[null,EJPD,1219,1229]
[null,EKEB,2047,1713]
[null,EMEA,2502,1897]
[null,EMKC,2362,2042]
[null,ENAC,2011,1909]
[null,ENFB,2507,2162]
[null,ENOD,3371,2709]
{noformat}


Expected results:
{noformat}
+--+--++---+
| W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
+--+--++---+
| Bad cards must make. | AACD |   1889 |  2168 |
| Bad cards must make. | AAHD |   2739 |  2039 |
| Bad cards must make. | ABDA |   1717 |  1782 |
| Bad cards must make. | ACGC |   2296 |  2276 |
| Bad cards must make. | ACKC |   2443 |  1878 |
| Bad cards must make. | ACOB |   2705 |  2428 |
| Bad cards must make. | ADGB |   2242 |  2759 |
| Bad cards must make. | ADKB |   2138 |  2456 |
| Bad cards must make. | AEAD |   2914 |  2237 |
| Bad cards must make. | AEOC |   1797 |  2073 |
| Bad 

[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang

2016-03-31 Thread JESSE CHEN (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JESSE CHEN updated SPARK-14318:
---
Affects Version/s: 2.0.0

> TPCDS query 14 causes Spark SQL to hang
> ---
>
> Key: SPARK-14318
> URL: https://issues.apache.org/jira/browse/SPARK-14318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: JESSE CHEN
>  Labels: hangs
>
> Testing Spark SQL using TPC queries. Query 21 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL missing at least one row (grep for ABDA) ; I believe 2 
> other rows are missing as well.
> Actual results:
> {noformat}
> [null,AABD,2565,1922]
> [null,AAHD,2956,2052]
> [null,AALA,2042,1793]
> [null,ACGC,2373,1771]
> [null,ACKC,2321,1856]
> [null,ACOB,1504,1397]
> [null,ADKB,1820,2163]
> [null,AEAD,2631,1965]
> [null,AEOC,1659,1798]
> [null,AFAC,1965,1705]
> [null,AFAD,1769,1313]
> [null,AHDE,2700,1985]
> [null,AHHA,1578,1082]
> [null,AIEC,1756,1804]
> [null,AIMC,3603,2951]
> [null,AJAC,2109,1989]
> [null,AJKB,2573,3540]
> [null,ALBE,3458,2992]
> [null,ALCE,1720,1810]
> [null,ALEC,2569,1946]
> [null,ALNB,2552,1750]
> [null,ANFE,2022,2269]
> [null,AOIB,2982,2540]
> [null,APJB,2344,2593]
> [null,BAPD,2182,2787]
> [null,BDCE,2844,2069]
> [null,BDDD,2417,2537]
> [null,BDJA,1584,1666]
> [null,BEOD,2141,2649]
> [null,BFCC,2745,2020]
> [null,BFMB,1642,1364]
> [null,BHPC,1923,1780]
> [null,BIDB,1956,2836]
> [null,BIGB,2023,2344]
> [null,BIJB,1977,2728]
> [null,BJFE,1891,2390]
> [null,BLDE,1983,1797]
> [null,BNID,2485,2324]
> [null,BNLD,2385,2786]
> [null,BOMB,2291,2092]
> [null,CAAA,2233,2560]
> [null,CBCD,1540,2012]
> [null,CBIA,2394,2122]
> [null,CBPB,1790,1661]
> [null,CCMD,2654,2691]
> [null,CDBC,1804,2072]
> [null,CFEA,1941,1567]
> [null,CGFD,2123,2265]
> [null,CHPC,2933,2174]
> [null,CIGD,2618,2399]
> [null,CJCB,2728,2367]
> [null,CJLA,1350,1732]
> [null,CLAE,2578,2329]
> [null,CLGA,1842,1588]
> [null,CLLB,3418,2657]
> [null,CLOB,3115,2560]
> [null,CMAD,1991,2243]
> [null,CMJA,1261,1855]
> [null,CMLA,3288,2753]
> [null,CMPD,1320,1676]
> [null,CNGB,2340,2118]
> [null,CNHD,3519,3348]
> [null,CNPC,2561,1948]
> [null,DCPC,2664,2627]
> [null,DDHA,1313,1926]
> [null,DDND,1109,835]
> [null,DEAA,2141,1847]
> [null,DEJA,3142,2723]
> [null,DFKB,1470,1650]
> [null,DGCC,2113,2331]
> [null,DGFC,2201,2928]
> [null,DHPA,2467,2133]
> [null,DMBA,3085,2087]
> [null,DPAB,3494,3081]
> [null,EAEC,2133,2148]
> [null,EAPA,1560,1275]
> [null,ECGC,2815,3307]
> [null,EDPD,2731,1883]
> [null,EEEC,2024,1902]
> [null,EEMC,2624,2387]
> [null,EFFA,2047,1878]
> [null,EGJA,2403,2633]
> [null,EGMA,2784,2772]
> [null,EGOC,2389,1753]
> [null,EHFD,1940,1420]
> [null,EHLB,2320,2057]
> [null,EHPA,1898,1853]
> [null,EIPB,2930,2326]
> [null,EJAE,2582,1836]
> [null,EJIB,2257,1681]
> [null,EJJA,2791,1941]
> [null,EJJD,3410,2405]
> [null,EJNC,2472,2067]
> [null,EJPD,1219,1229]
> [null,EKEB,2047,1713]
> [null,EMEA,2502,1897]
> [null,EMKC,2362,2042]
> [null,ENAC,2011,1909]
> [null,ENFB,2507,2162]
> [null,ENOD,3371,2709]
> {noformat}
> Expected results:
> {noformat}
> +--+--++---+
> | W_WAREHOUSE_NAME | I_ITEM_ID| INV_BEFORE | INV_AFTER |
> +--+--++---+
> | Bad cards must make. | AACD |   1889 |  2168 |
> | Bad cards must make. | AAHD |   2739 |  2039 |
> | Bad cards must make. | ABDA |   1717 |  1782 |
> | Bad cards must make. | 

[jira] [Created] (SPARK-14317) Clean up hash join

2016-03-31 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14317:
--

 Summary: Clean up hash join
 Key: SPARK-14317
 URL: https://issues.apache.org/jira/browse/SPARK-14317
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14294) Support native execution of ALTER TABLE ... RENAME TO

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-14294.
-
Resolution: Duplicate
  Assignee: Andrew Or

> Support native execution of ALTER TABLE ... RENAME TO
> -
>
> Key: SPARK-14294
> URL: https://issues.apache.org/jira/browse/SPARK-14294
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Bo Meng
>Assignee: Andrew Or
>Priority: Minor
>
> Support native execution of ALTER TABLE ... RENAME TO
> The syntax for ALTER TABLE ... RENAME TO commands is described as following:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RenameTable
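
For illustration, the command shape in question, issued through a SQLContext as in 
spark-shell (the table names are placeholders):

{code}
// Illustrative only: the DDL statement this issue asks to execute natively.
sqlContext.sql("ALTER TABLE old_table RENAME TO new_table")
{code}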



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220762#comment-15220762
 ] 

Apache Spark commented on SPARK-11327:
--

User 'jayv' has created a pull request for this issue:
https://github.com/apache/spark/pull/12101

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
> Fix For: 2.0.0
>
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image.
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar",
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
>     "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar",
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 

[jira] [Commented] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220754#comment-15220754
 ] 

Apache Spark commented on SPARK-14316:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/12100

> StateStoreCoordinator should extend ThreadSafeRpcEndpoint
> -
>
> Key: SPARK-14316
> URL: https://issues.apache.org/jira/browse/SPARK-14316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> RpcEndpoint is not thread safe and allows multiple messages to be processed 
> at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14316:


Assignee: Shixiong Zhu  (was: Apache Spark)

> StateStoreCoordinator should extend ThreadSafeRpcEndpoint
> -
>
> Key: SPARK-14316
> URL: https://issues.apache.org/jira/browse/SPARK-14316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> RpcEndpoint is not thread safe and allows multiple messages to be processed 
> at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14316:


Assignee: Apache Spark  (was: Shixiong Zhu)

> StateStoreCoordinator should extend ThreadSafeRpcEndpoint
> -
>
> Key: SPARK-14316
> URL: https://issues.apache.org/jira/browse/SPARK-14316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> RpcEndpoint is not thread safe and allows multiple messages to be processed 
> at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14316) StateStoreCoordinator should extend ThreadSafeRpcEndpoint

2016-03-31 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-14316:


 Summary: StateStoreCoordinator should extend ThreadSafeRpcEndpoint
 Key: SPARK-14316
 URL: https://issues.apache.org/jira/browse/SPARK-14316
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


RpcEndpoint is not thread safe and allows multiple messages to be processed at 
the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
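
To illustrate the underlying concern generically (a sketch only, not Spark's 
RpcEndpoint API): if messages can be handled concurrently, any mutable state kept 
by the endpoint needs synchronization, whereas a thread-safe endpoint contract 
guarantees one message at a time:

{code}
import scala.collection.mutable

// Generic sketch, not Spark's API: a coordinator that tracks state store locations.
// If handle() can be called from multiple threads at once, this unsynchronized map
// is unsafe; a ThreadSafeRpcEndpoint-style contract serializes message handling so
// no extra locking is needed inside the endpoint.
class CoordinatorSketch {
  private val locations = mutable.Map.empty[String, String]

  def handle(storeId: String, executorHost: String): Unit =
    locations(storeId) = executorHost
}
{code}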



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14251) Add SQL command for printing out generated code for debugging

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220737#comment-15220737
 ] 

Apache Spark commented on SPARK-14251:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12099

> Add SQL command for printing out generated code for debugging
> -
>
> Key: SPARK-14251
> URL: https://issues.apache.org/jira/browse/SPARK-14251
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL 
> environment this doesn't work. It would be great if we could have 
> {noformat}
> explain codegen select * ...
> {noformat}
> return the generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14251) Add SQL command for printing out generated code for debugging

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14251:


Assignee: (was: Apache Spark)

> Add SQL command for printing out generated code for debugging
> -
>
> Key: SPARK-14251
> URL: https://issues.apache.org/jira/browse/SPARK-14251
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL 
> environment this doesn't work. It would be great if we could have 
> {noformat}
> explain codegen select * ...
> {noformat}
> return the generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14251) Add SQL command for printing out generated code for debugging

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14251:


Assignee: Apache Spark

> Add SQL command for printing out generated code for debugging
> -
>
> Key: SPARK-14251
> URL: https://issues.apache.org/jira/browse/SPARK-14251
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> SPARK-14227 adds a programmatic way to dump generated code. In a pure SQL 
> environment this doesn't work. It would be great if we could have 
> {noformat}
> explain codegen select * ...
> {noformat}
> return the generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220734#comment-15220734
 ] 

Xiangrui Meng commented on SPARK-14313:
---

[~yanboliang] Are you interested in working on this? It should contain the basic 
APIs for ml.save/ml.load in SparkR and the save/load implementation of AFTWrapper.

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220732#comment-15220732
 ] 

Xiangrui Meng commented on SPARK-14314:
---

Hold until SPARK-14303 is done.

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14315) GLMs model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220730#comment-15220730
 ] 

Xiangrui Meng commented on SPARK-14315:
---

Hold until SPARK-14303 is done.

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14311) Model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14311:
--
Description: 
In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive 
Bayes, and AFT survival regression. Users can fit models, get summary, and make 
predictions. However, they cannot save/load the models yet.

ML models in SparkR are wrappers around ML pipelines. So it should be 
straightforward to implement model persistence. We need to think more about the 
API. R uses save/load for objects and datasets (also objects). It is possible 
to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure 
whether load can be overloaded easily. I propose the following API:

{code}
model <- glm(formula, data = df)
ml.save(model, path, mode = "overwrite")
model2 <- ml.load(path)
{code}

We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load is 
an S3 method (correct me if I'm wrong).

  was:
In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive 
Bayes, and AFT survival regression. Users can fit models, get summary, and make 
predictions. However, they cannot save/load the models yet.

ML models in SparkR are wrappers around ML pipelines. So it should be 
straightforward to implement model persistence. We need to think more about the 
API. R uses save/load for objects and datasets (also objects). It is possible 
to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure 
whether load can be overloaded easily. I propose the following API:

{code}
model <- glm(formula, data = df)
ml.save(model, path, mode = "overwrite")
model2 <- ml.load(path)
{code}


> Model persistence in SparkR
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load 
> is an S3 method (correct me if I'm wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14313:
-

 Summary: AFTSurvivalRegression model persistence in SparkR
 Key: SPARK-14313
 URL: https://issues.apache.org/jira/browse/SPARK-14313
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14315) GLMs model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14315:
-

 Summary: GLMs model persistence in SparkR
 Key: SPARK-14315
 URL: https://issues.apache.org/jira/browse/SPARK-14315
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14314) K-means model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14314:
-

 Summary: K-means model persistence in SparkR
 Key: SPARK-14314
 URL: https://issues.apache.org/jira/browse/SPARK-14314
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14311) Model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14311:
-

 Summary: Model persistence in SparkR
 Key: SPARK-14311
 URL: https://issues.apache.org/jira/browse/SPARK-14311
 Project: Spark
  Issue Type: Umbrella
  Components: ML, SparkR
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive 
Bayes, and AFT survival regression. Users can fit models, get summary, and make 
predictions. However, they cannot save/load the models yet.

ML models in SparkR are wrappers around ML pipelines. So it should be 
straightforward to implement model persistence. We need to think more about the 
API. R uses save/load for objects and datasets (also objects). It is possible 
to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure 
whether load can be overloaded easily. I propose the following API:

{code}
model <- glm(formula, data = df)
ml.save(model, path, mode = "overwrite")
model2 <- ml.load(path)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14312) NaiveBayes model persistence in SparkR

2016-03-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-14312:
-

 Summary: NaiveBayes model persistence in SparkR
 Key: SPARK-14312
 URL: https://issues.apache.org/jira/browse/SPARK-14312
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SparkR
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14209) Application failure during preemption.

2016-03-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220716#comment-15220716
 ] 

Marcelo Vanzin commented on SPARK-14209:


Those logs show the same weird issues as the previous one... is there a way you 
can use the default log configuration from Spark? That would give us much 
better and non-misleading information.

Also, it didn't seem like that application failed. I see some fetch failures, 
but that's to be expected when executors die. What's odd about your original 
log is that tasks failed multiple (up to 100) times and eventually failed the 
application, and that doesn't seem to be happening for this last set of logs.

> Application failure during preemption.
> --
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: Spark on YARN
>Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>   ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14277) Significant amount of CPU is being consumed in SnappyNative arrayCopy method

2016-03-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-14277:

Description: 
While running a Spark job which is spilling a lot of data in reduce phase, we 
see that significant amount of CPU is being consumed in native Snappy ArrayCopy 
method (Please see the stack trace below). 

Stack trace - 
org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readLong(DataInputStream.java:416)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)

The reason for that is the SpillReader does a lot of small reads from the 
underlying snappy compressed stream and SnappyInputStream invokes native jni 
ArrayCopy method to copy the data, which is expensive. We should fix 
snappy-java to use the non-JNI based System.arraycopy method in this case.

  was:
While running a Spark job which is spilling a lot of data in reduce phase, we 
see that significant amount of CPU is being consumed in native Snappy ArrayCopy 
method (Please see the stack trace below). 

Stack trace - 
org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readLong(DataInputStream.java:416)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)

The reason for that is the SpillReader does a lot of small reads from the 
underlying snappy compressed stream and we pay a heavy cost of jni calls for 
these small reads. The SpillReader should instead do a buffered read from the 
underlying snappy compressed stream.
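
A minimal sketch of that originally proposed buffered-read approach (illustrative 
only; the actual spill reader plumbing is simplified away here):

{code}
import java.io.{BufferedInputStream, DataInputStream, InputStream}
import org.xerial.snappy.SnappyInputStream

// Illustrative only: wrap the decompressing stream in a BufferedInputStream so the
// many small readLong()/readFully() calls hit an in-memory buffer instead of
// issuing a JNI-backed copy for every few bytes.
def openSpillStream(raw: InputStream, bufferSizeBytes: Int = 1 << 20): DataInputStream =
  new DataInputStream(new BufferedInputStream(new SnappyInputStream(raw), bufferSizeBytes))
{code}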


> Significant amount of CPU is being consumed in SnappyNative arrayCopy method
> 
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>
> While running a Spark job which is spilling a lot of data in reduce phase, we 
> see that significant amount of CPU is being consumed in native Snappy 
> ArrayCopy method (Please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason for that is the SpillReader does a lot of small reads from the 
> underlying snappy compressed stream and SnappyInputStream invokes native jni 
> ArrayCopy method to copy the data, which is expensive. We should fix 
> snappy-java to use the non-JNI based System.arraycopy method in this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-14277) Significant amount of CPU is being consumed in SnappyNative arrayCopy method

2016-03-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-14277:

Summary: Significant amount of CPU is being consumed in SnappyNative 
arrayCopy method  (was: UnsafeSorterSpillReader should do buffered read from 
underlying compression stream)

> Significant amount of CPU is being consumed in SnappyNative arrayCopy method
> 
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>
> While running a Spark job which is spilling a lot of data in reduce phase, we 
> see that significant amount of CPU is being consumed in native Snappy 
> ArrayCopy method (Please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason for that is the SpillReader does a lot of small reads from the 
> underlying snappy compressed stream and we pay a heavy cost of jni calls for 
> these small reads. The SpillReader should instead do a buffered read from the 
> underlying snappy compressed stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14310:


Assignee: Apache Spark

> Fix scan whole stage codegen to determine if batches are produced based on 
> schema
> -
>
> Key: SPARK-14310
> URL: https://issues.apache.org/jira/browse/SPARK-14310
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>Assignee: Apache Spark
>
> Currently, this is figured out at runtime by looking at the first value which 
> is not necessary any more. This simplifies the code and lets us measure 
> timings better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220690#comment-15220690
 ] 

Apache Spark commented on SPARK-14310:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/12098

> Fix scan whole stage codegen to determine if batches are produced based on 
> schema
> -
>
> Key: SPARK-14310
> URL: https://issues.apache.org/jira/browse/SPARK-14310
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>
> Currently, this is figured out at runtime by looking at the first value which 
> is not necessary any more. This simplifies the code and lets us measure 
> timings better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14310:


Assignee: (was: Apache Spark)

> Fix scan whole stage codegen to determine if batches are produced based on 
> schema
> -
>
> Key: SPARK-14310
> URL: https://issues.apache.org/jira/browse/SPARK-14310
> Project: Spark
>  Issue Type: Bug
>Reporter: Nong Li
>
> Currently, this is figured out at runtime by looking at the first value which 
> is not necessary any more. This simplifies the code and lets us measure 
> timings better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14310) Fix scan whole stage codegen to determine if batches are produced based on schema

2016-03-31 Thread Nong Li (JIRA)
Nong Li created SPARK-14310:
---

 Summary: Fix scan whole stage codegen to determine if batches are 
produced based on schema
 Key: SPARK-14310
 URL: https://issues.apache.org/jira/browse/SPARK-14310
 Project: Spark
  Issue Type: Bug
Reporter: Nong Li


Currently, this is figured out at runtime by looking at the first value which 
is not necessary any more. This simplifies the code and lets us measure timings 
better.
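
A hedged sketch of what a schema-driven check could look like (hypothetical helper, 
not the actual Spark change): decide up front from the column types whether columnar 
batches can be produced, rather than inspecting the first value at runtime.

{code}
import org.apache.spark.sql.types._

// Hypothetical sketch only: produce columnar batches iff every column has a type
// the vectorized path can handle, decided from the schema instead of from the
// first value.
def supportsBatch(schema: StructType): Boolean =
  schema.fields.forall { f =>
    f.dataType match {
      case BooleanType | ByteType | ShortType | IntegerType | LongType |
           FloatType | DoubleType | StringType => true   // simple column types
      case _ => false                                    // nested/complex: fall back to rows
    }
  }
{code}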



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14308:


Assignee: Apache Spark

> Remove unused mllib tree classes and move private classes to ML
> ---
>
> Key: SPARK-14308
> URL: https://issues.apache.org/jira/browse/SPARK-14308
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some 
> mllib tree internal helper classes are no longer used at all. Also, the 
> private helper classes internal to spark tree training can be ported very 
> easily to spark.ML without affecting APIs. This is the "low hanging fruit" 
> for porting tree internals to spark.ML, and will make the other migrations 
> more tractable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220686#comment-15220686
 ] 

Apache Spark commented on SPARK-14308:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/12097

> Remove unused mllib tree classes and move private classes to ML
> ---
>
> Key: SPARK-14308
> URL: https://issues.apache.org/jira/browse/SPARK-14308
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some 
> mllib tree internal helper classes are no longer used at all. Also, the 
> private helper classes internal to spark tree training can be ported very 
> easily to spark.ML without affecting APIs. This is the "low hanging fruit" 
> for porting tree internals to spark.ML, and will make the other migrations 
> more tractable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14308:


Assignee: (was: Apache Spark)

> Remove unused mllib tree classes and move private classes to ML
> ---
>
> Key: SPARK-14308
> URL: https://issues.apache.org/jira/browse/SPARK-14308
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some 
> mllib tree internal helper classes are no longer used at all. Also, the 
> private helper classes internal to spark tree training can be ported very 
> easily to spark.ML without affecting APIs. This is the "low hanging fruit" 
> for porting tree internals to spark.ML, and will make the other migrations 
> more tractable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14309) Dataframe returns wrong results due to parsing incorrectly

2016-03-31 Thread Jerry Lam (JIRA)
Jerry Lam created SPARK-14309:
-

 Summary: Dataframe returns wrong results due to parsing incorrectly
 Key: SPARK-14309
 URL: https://issues.apache.org/jira/browse/SPARK-14309
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Jerry Lam


I observed the behavior below using DataFrame. The expected answer should be 60, 
but there is no way to get that value unless I turn the DataFrame into an RDD and 
access it in the Row.

I have included the SQL statement as well, and it returns the correct result 
because, I believe, it is using the Hive parser.

{code}
val base = sc.parallelize(( 0 to 49).zip( 0 to 49) ++ (30 to 79).zip(50 to 
99)).toDF("id", "label")
val d1 = base.where($"label" < 60).as("d1")
val d2 = base.where($"label" === 60).as("d2")
d1.join(d2, "id").show
+---+-----+-----+
| id|label|label|
+---+-----+-----+
| 40|   40|   60|
+---+-----+-----+
d1.join(d2, "id").select(d1("label")).show
+-----+
|label|
+-----+
|   40|
+-----+
(expected answer: 40, right!)

d1.join(d2, "id").map{row => row.getAs[Int](2)}
d1.join(d2, "id").select(d2("label")).show
+-----+
|label|
+-----+
|   40|
+-----+
(expected answer: 60, wrong!)
d1.join(d2, "id").select(d2("label")).explain(true)

scala> d1.join(d2, "id").select(d2("label")).explain(true)
== Parsed Logical Plan ==
Project [label#3]
 Project [id#2,label#3,label#7]
  Join Inner, Some((id#2 = id#6))
   Subquery d1
Filter (label#3 < 60)
 Project [_1#0 AS id#2,_2#1 AS label#3]
  LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21
   Subquery d2
Filter (label#7 = 60)
 Project [_1#0 AS id#6,_2#1 AS label#7]
  LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21

== Analyzed Logical Plan ==
label: int
Project [label#3]
 Project [id#2,label#3,label#7]
  Join Inner, Some((id#2 = id#6))
   Subquery d1
Filter (label#3 < 60)
 Project [_1#0 AS id#2,_2#1 AS label#3]
  LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21
   Subquery d2
Filter (label#7 = 60)
 Project [_1#0 AS id#6,_2#1 AS label#7]
  LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21

== Optimized Logical Plan ==
Project [label#3]
 Join Inner, Some((id#2 = id#6))
  Project [_1#0 AS id#2,_2#1 AS label#3]
   Filter (_2#1 < 60)
LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21
  Project [_1#0 AS id#6]
   Filter (_2#1 = 60)
LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21

== Physical Plan ==
TungstenProject [label#3]
 SortMergeJoin [id#2], [id#6]
  TungstenSort [id#2 ASC], false, 0
   TungstenExchange hashpartitioning(id#2)
TungstenProject [_1#0 AS id#2,_2#1 AS label#3]
 Filter (_2#1 < 60)
  Scan PhysicalRDD[_1#0,_2#1]
  TungstenSort [id#6 ASC], false, 0
   TungstenExchange hashpartitioning(id#6)
TungstenProject [_1#0 AS id#6]
 Filter (_2#1 = 60)
  Scan PhysicalRDD[_1#0,_2#1]

def (d1 :DataFrame, d2: DataFrame)

base.registerTempTable("base")
sqlContext.sql("select d2.label from (select * from base where label < 60) as 
d1 inner join (select * from base where label = 60) as d2 on d1.id = 
d2.id").explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias('d2.label)]
 'Join Inner, Some(('d1.id = 'd2.id))
  'Subquery d1
   'Project [unresolvedalias(*)]
'Filter ('label < 60)
 'UnresolvedRelation [base], None
  'Subquery d2
   'Project [unresolvedalias(*)]
'Filter ('label = 60)
 'UnresolvedRelation [base], None

== Analyzed Logical Plan ==
label: int
Project [label#15]
 Join Inner, Some((id#2 = id#14))
  Subquery d1
   Project [id#2,label#3]
Filter (label#3 < 60)
 Subquery base
  Project [_1#0 AS id#2,_2#1 AS label#3]
   LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21
  Subquery d2
   Project [id#14,label#15]
Filter (label#15 = 60)
 Subquery base
  Project [_1#0 AS id#14,_2#1 AS label#15]
   LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21

== Optimized Logical Plan ==
Project [label#15]
 Join Inner, Some((id#2 = id#14))
  Project [_1#0 AS id#2]
   Filter (_2#1 < 60)
LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21
  Project [_1#0 AS id#14,_2#1 AS label#15]
   Filter (_2#1 = 60)
LogicalRDD [_1#0,_2#1], MapPartitionsRDD[1] at rddToDataFrameHolder at 
<console>:21

== Physical Plan ==
TungstenProject [label#15]
 SortMergeJoin [id#2], [id#14]
  TungstenSort [id#2 ASC], false, 0
   TungstenExchange hashpartitioning(id#2)
TungstenProject [_1#0 AS id#2]
 Filter (_2#1 < 60)
  Scan PhysicalRDD[_1#0,_2#1]
  TungstenSort [id#14 ASC], false, 0
   TungstenExchange hashpartitioning(id#14)
TungstenProject [_1#0 AS id#14,_2#1 AS label#15]
 Filter (_2#1 = 60)
  Scan PhysicalRDD[_1#0,_2#1]

{code}
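One possible workaround (a sketch only, not a fix for the resolution bug itself) is to
rename the column on one side before the self-join, so the two label columns can no
longer be confused during analysis:

{code}
// Reuses base and d1 from the snippet above; "label2" is a made-up name.
val d2r = base.where($"label" === 60).withColumnRenamed("label", "label2")
d1.join(d2r, "id").select("label2").show()
// should print a single row with label2 = 60
{code}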



--
This message was 

[jira] [Updated] (SPARK-12381) Copy public decision tree helper classes from spark.mllib to spark.ml and make private

2016-03-31 Thread Seth Hendrickson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seth Hendrickson updated SPARK-12381:
-
Summary: Copy public decision tree helper classes from spark.mllib to 
spark.ml and make private  (was: Move decision tree helper classes from 
spark.mllib to spark.ml)

> Copy public decision tree helper classes from spark.mllib to spark.ml and 
> make private
> --
>
> Key: SPARK-12381
> URL: https://issues.apache.org/jira/browse/SPARK-12381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> The helper classes for decision trees and decision tree ensembles (e.g. 
> Impurity, InformationGainStats, ImpurityStats, DTStatsAggregator, etc.) 
> currently reside in spark.mllib, but as the algorithm implementations are 
> moved to spark.ml, these helper classes should move as well.
> We should take this opportunity to make some of those helper classes private 
> when possible (especially if they are only needed during training) and maybe 
> change the APIs (especially if we can eliminate duplicate data stored in the 
> final model).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14308) Remove unused mllib tree classes and move private classes to ML

2016-03-31 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-14308:


 Summary: Remove unused mllib tree classes and move private classes 
to ML
 Key: SPARK-14308
 URL: https://issues.apache.org/jira/browse/SPARK-14308
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Seth Hendrickson
Priority: Minor


After [SPARK-12183|https://issues.apache.org/jira/browse/SPARK-12183], some 
mllib tree internal helper classes are no longer used at all. Also, the private 
helper classes internal to spark tree training can be ported very easily to 
spark.ML without affecting APIs. This is the "low hanging fruit" for porting 
tree internals to spark.ML, and will make the other migrations more tractable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14281) Fix the java8-tests profile and run those tests in Jenkins

2016-03-31 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-14281.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12073
[https://github.com/apache/spark/pull/12073]

> Fix the java8-tests profile and run those tests in Jenkins
> --
>
> Key: SPARK-14281
> URL: https://issues.apache.org/jira/browse/SPARK-14281
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> Spark has some tests for compilation of Java 8 sources (using lambdas) 
> guarded behind a {{java8-tests}} maven profile, but we currently do not build 
> or run those tests. As a result, the tests no longer compile.
> We should fix these tests and set up automated CI so that they don't break 
> again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14277) UnsafeSorterSpillReader should do buffered read from underlying compression stream

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220631#comment-15220631
 ] 

Apache Spark commented on SPARK-14277:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/12096

> UnsafeSorterSpillReader should do buffered read from underlying compression 
> stream
> --
>
> Key: SPARK-14277
> URL: https://issues.apache.org/jira/browse/SPARK-14277
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>
> While running a Spark job that spills a lot of data in the reduce phase, we 
> see that a significant amount of CPU is being consumed in the native Snappy 
> arrayCopy method (please see the stack trace below). 
> Stack trace - 
> org.xerial.snappy.SnappyNative.$$YJP$$arrayCopy(Native Method)
> org.xerial.snappy.SnappyNative.arrayCopy(SnappyNative.java)
> org.xerial.snappy.Snappy.arrayCopy(Snappy.java:85)
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readLong(DataInputStream.java:416)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71)
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
> org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
> The reason is that the SpillReader does many small reads from the 
> underlying Snappy-compressed stream, and we pay a heavy JNI-call cost for 
> these small reads. The SpillReader should instead do a buffered read from the 
> underlying Snappy-compressed stream.
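The proposed change boils down to putting a buffer between the decompressing stream and
the reader, so each JNI-backed read fetches a large chunk instead of a few bytes. A minimal
sketch using only standard java.io classes (the method name and buffer size are illustrative,
not Spark's actual code):

{code}
import java.io.{BufferedInputStream, DataInputStream, InputStream}

// Sketch: serve the many small readLong()/readFully() calls from an in-memory
// buffer, so the underlying Snappy stream (and its JNI calls) is hit far less often.
def bufferedSpillReader(compressedIn: InputStream, bufferSize: Int = 1 << 20): DataInputStream =
  new DataInputStream(new BufferedInputStream(compressedIn, bufferSize))
{code}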



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14129) [Table related commands] Alter table

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-14129:
-

Assignee: Andrew Or

> [Table related commands] Alter table
> 
>
> Key: SPARK-14129
> URL: https://issues.apache.org/jira/browse/SPARK-14129
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
>
> For the ALTER TABLE command, we have the following tokens. 
> TOK_ALTERTABLE_RENAME
> TOK_ALTERTABLE_LOCATION
> TOK_ALTERTABLE_PROPERTIES/TOK_ALTERTABLE_DROPPROPERTIES
> TOK_ALTERTABLE_SERIALIZER
> TOK_ALTERTABLE_SERDEPROPERTIES
> TOK_ALTERTABLE_CLUSTER_SORT
> TOK_ALTERTABLE_SKEWED
> For a data source table, let's implement TOK_ALTERTABLE_RENAME, 
> TOK_ALTERTABLE_LOCATION, and TOK_ALTERTABLE_SERDEPROPERTIES. We need to 
> decide what we do for 
> TOK_ALTERTABLE_PROPERTIES/TOK_ALTERTABLE_DROPPROPERTIES. It will be used to 
> allow users to correct the data format (e.g. changing csv to 
> com.databricks.spark.csv so that the table can be accessed by older versions 
> of Spark).
> For a Hive table, we should implement all commands supported by the data 
> source table and TOK_ALTERTABLE_PROPERTIES/TOK_ALTERTABLE_DROPPROPERTIES.
> For TOK_ALTERTABLE_CLUSTER_SORT and TOK_ALTERTABLE_SKEWED, we should throw 
> exceptions.
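For reference, the SQL forms behind these tokens look roughly like the following (a sketch
for a spark-shell session; table names, paths, and property keys are made up):

{code}
sqlContext.sql("ALTER TABLE t1 RENAME TO t2")                               // TOK_ALTERTABLE_RENAME
sqlContext.sql("ALTER TABLE t2 SET LOCATION 'hdfs:///new/path'")            // TOK_ALTERTABLE_LOCATION
sqlContext.sql("ALTER TABLE t2 SET TBLPROPERTIES ('comment' = 'demo')")     // TOK_ALTERTABLE_PROPERTIES
sqlContext.sql("ALTER TABLE t2 UNSET TBLPROPERTIES ('comment')")            // TOK_ALTERTABLE_DROPPROPERTIES
sqlContext.sql("ALTER TABLE t2 SET SERDEPROPERTIES ('field.delim' = ',')")  // TOK_ALTERTABLE_SERDEPROPERTIES
{code}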



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener

2016-03-31 Thread Haohai Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220609#comment-15220609
 ] 

Haohai Ma edited comment on SPARK-4906 at 3/31/16 8:27 PM:
---

We just hit a similar OOM issue recently with Spark Master v1.6.0. A detailed 
retained-memory report is attached. 


was (Author: cloneman):
We just hit the similar issue recently by Spark Master OOM. A detailed retained 
memory report is attached. 

> Spark master OOMs with exception stack trace stored in JobProgressListener
> --
>
> Key: SPARK-4906
> URL: https://issues.apache.org/jira/browse/SPARK-4906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1
>Reporter: Mingyu Kim
> Attachments: LeakingJobProgressListener2OOM.docx
>
>
> Spark master was OOMing with a lot of stack traces retained in 
> JobProgressListener. The object dependency goes like the following.
> JobProgressListener.stageIdToData => StageUIData.taskData => 
> TaskUIData.errorMessage
> Each error message is ~10kb since it has the entire stack trace. As we have a 
> lot of tasks, when all of the tasks across multiple stages go bad, these 
> error messages accounted for 0.5GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for 
> running applications are kept in memory, which means it's almost always bound 
> to OOM for long-running applications. Would it make sense to fix this, for 
> example, by spilling some UI states to disk?
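Until the retention strategy changes, the amount of listener state held in memory can at
least be bounded with the existing UI retention settings (a partial mitigation only; it does
not shrink the per-task error strings themselves):

{code}
import org.apache.spark.SparkConf

// Keep fewer completed jobs/stages in the UI listeners, which bounds how many
// TaskUIData entries (and their error messages) stay on the heap.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "100")    // default 1000
  .set("spark.ui.retainedStages", "100")  // default 1000
{code}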



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener

2016-03-31 Thread Haohai Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220609#comment-15220609
 ] 

Haohai Ma commented on SPARK-4906:
--

We just hit a similar issue recently: a Spark Master OOM. A detailed 
retained-memory report is attached. 

> Spark master OOMs with exception stack trace stored in JobProgressListener
> --
>
> Key: SPARK-4906
> URL: https://issues.apache.org/jira/browse/SPARK-4906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1
>Reporter: Mingyu Kim
> Attachments: LeakingJobProgressListener2OOM.docx
>
>
> Spark master was OOMing with a lot of stack traces retained in 
> JobProgressListener. The object dependency goes like the following.
> JobProgressListener.stageIdToData => StageUIData.taskData => 
> TaskUIData.errorMessage
> Each error message is ~10kb since it has the entire stack trace. As we have a 
> lot of tasks, when all of the tasks across multiple stages go bad, these 
> error messages accounted for 0.5GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for 
> running applications are kept in memory, which means it's almost always bound 
> to OOM for long-running applications. Would it make sense to fix this, for 
> example, by spilling some UI states to disk?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener

2016-03-31 Thread Haohai Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohai Ma updated SPARK-4906:
-
Attachment: LeakingJobProgressListener2OOM.docx

> Spark master OOMs with exception stack trace stored in JobProgressListener
> --
>
> Key: SPARK-4906
> URL: https://issues.apache.org/jira/browse/SPARK-4906
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1
>Reporter: Mingyu Kim
> Attachments: LeakingJobProgressListener2OOM.docx
>
>
> Spark master was OOMing with a lot of stack traces retained in 
> JobProgressListener. The object dependency goes like the following.
> JobProgressListener.stageIdToData => StageUIData.taskData => 
> TaskUIData.errorMessage
> Each error message is ~10kb since it has the entire stack trace. As we have a 
> lot of tasks, when all of the tasks across multiple stages go bad, these 
> error messages accounted for 0.5GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for 
> running applications are kept in memory, which means it's almost always bound 
> to OOM for long-running applications. Would it make sense to fix this, for 
> example, by spilling some UI states to disk?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14259) Add config to control maximum number of files when coalescing partitions

2016-03-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220602#comment-15220602
 ] 

Apache Spark commented on SPARK-14259:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12095

> Add config to control maximum number of files when coalescing partitions
> 
>
> Key: SPARK-14259
> URL: https://issues.apache.org/jira/browse/SPARK-14259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.0.0
>
>
> The FileSourceStrategy currently has a config to control the maximum byte 
> size of coalesced partitions. It is helpful to also have a config to control 
> the maximum number of files as even small files have a non-trivial fixed 
> cost. The current packing can put a lot of small files together, which causes 
> straggler tasks.
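Conceptually the packing becomes a two-constraint bin-packing rule: close the current
partition when either the byte budget or the file-count budget would be exceeded. A
simplified, self-contained sketch (names are illustrative, not the actual FileSourceStrategy
code):

{code}
// Simplified sketch of coalescing files into partitions under two limits.
case class FileSlice(path: String, sizeInBytes: Long)

def packFiles(files: Seq[FileSlice], maxBytes: Long, maxFiles: Int): Seq[Seq[FileSlice]] = {
  val partitions = Seq.newBuilder[Seq[FileSlice]]
  var current = Vector.empty[FileSlice]
  var currentBytes = 0L

  for (f <- files) {
    // Close the current partition if adding this file would exceed either budget.
    val wouldOverflow =
      current.nonEmpty && (currentBytes + f.sizeInBytes > maxBytes || current.size >= maxFiles)
    if (wouldOverflow) {
      partitions += current
      current = Vector.empty
      currentBytes = 0L
    }
    current :+= f
    currentBytes += f.sizeInBytes
  }
  if (current.nonEmpty) partitions += current
  partitions.result()
}
{code}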



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14264) Add feature importances for GBTs in Pyspark

2016-03-31 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14264.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12056
[https://github.com/apache/spark/pull/12056]

> Add feature importances for GBTs in Pyspark
> ---
>
> Key: SPARK-14264
> URL: https://issues.apache.org/jira/browse/SPARK-14264
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
> Fix For: 2.0.0
>
>
> GBT feature importances are now implemented in Scala. We should expose them 
> in the PySpark API.
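For context, a sketch of the existing Scala-side API that the PySpark wrapper would mirror
(it assumes a DataFrame named training with "label" and "features" columns, which is not
part of this issue):

{code}
import org.apache.spark.ml.classification.GBTClassifier

// `training` is assumed to be a DataFrame with "label" and "features" columns.
val model = new GBTClassifier().setMaxIter(10).fit(training)
println(model.featureImportances)  // a Vector of per-feature importances
{code}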



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14306) PySpark ml.classification OneVsRest support export/import

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220519#comment-15220519
 ] 

Xusen Yin commented on SPARK-14306:
---

Starting work on it now.

> PySpark ml.classification OneVsRest support export/import
> -
>
> Key: SPARK-14306
> URL: https://issues.apache.org/jira/browse/SPARK-14306
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14260) Increase default value for maxCharsPerColumn

2016-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14260.
---
Resolution: Won't Fix

Yeah, I think that would be a very rare case. I also suggest we not increase the 
default limit. This was motivated, I think, by SPARK-14103, but I'm not yet sure 
the cause there is a long line. (Or if it is, the solution is to raise the 
limit.)

> Increase default value for maxCharsPerColumn
> 
>
> Key: SPARK-14260
> URL: https://issues.apache.org/jira/browse/SPARK-14260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> I guess the default value of the option {{maxCharsPerColumn}} looks 
> relatively small, 1,000,000 characters meaning 976KB.
> It looks like some users have run into this and ended up setting the value 
> manually.
> https://github.com/databricks/spark-csv/issues/295
> https://issues.apache.org/jira/browse/SPARK-14103
> According to [univocity 
> API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)],
>  this exists to avoid {{OutOfMemoryErrors}}.
> If this does not harm performance, then I think it would be better to make 
> the default value much bigger (e.g. 10MB or 100MB) so that users do not have to 
> worry about the length of each field in the CSV file.
> Apparently Apache CSV Parser does not have such limits.
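For anyone hitting the limit in the meantime, the option can be raised per read; a sketch
with the spark-csv data source referenced above (the chosen value is arbitrary, and whether
it needs to be set at all depends on the parser version in use):

{code}
// Raise the per-column character limit when a CSV has very long fields.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("maxCharsPerColumn", "10000000")  // ~10M characters instead of the default
  .load("/path/to/data.csv")
{code}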



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2016-03-31 Thread Jo Voordeckers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220508#comment-15220508
 ] 

Jo Voordeckers commented on SPARK-11327:


So who should I nudge to get it backported into 1.x?

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
> Fix For: 2.0.0
>
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image.
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar;,
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar;,
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> {code}
> 15/10/26 

[jira] [Resolved] (SPARK-14304) Fix tests that don't create temp files in the `java.io.tmpdir` folder

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14304.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Fix tests that don't create temp files in the `java.io.tmpdir` folder
> -
>
> Key: SPARK-14304
> URL: https://issues.apache.org/jira/browse/SPARK-14304
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> If I press `CTRL-C` while running these tests, the temp files are left in the 
> `sql/core` folder and I need to delete them manually. It's annoying.
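The fix direction is simply to root test scratch files under the JVM's temp directory
instead of the module directory; a minimal sketch with standard APIs:

{code}
import java.nio.file.Files

// Create scratch directories under java.io.tmpdir rather than the current
// working directory (e.g. sql/core), so an interrupted run leaves no litter there.
val tmpDir = Files.createTempDirectory("spark-test").toFile
tmpDir.deleteOnExit()  // best-effort cleanup on normal JVM exit
println(s"writing test output under ${tmpDir.getAbsolutePath}")
{code}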



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14279:
--
Issue Type: Improvement  (was: Story)

> Improve the spark build to pick the version information from the pom file 
> instead of package.scala
> --
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now, spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from pom.version, probably stored inside a properties file. Also, 
> it might be nice to include other details such as branch and build 
> information when running spark-submit --version.
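One common way to do this (a sketch only, not the actual build change; the resource name
is hypothetical) is to have the build write the pom values into a properties file on the
classpath and read it at runtime:

{code}
import java.util.Properties

// Load build metadata that the build is assumed to have written to a classpath
// resource; "spark-version-info.properties" is a hypothetical name.
def loadBuildInfo(resource: String = "spark-version-info.properties"): Properties = {
  val props = new Properties()
  val in = Thread.currentThread().getContextClassLoader.getResourceAsStream(resource)
  if (in != null) {
    try props.load(in) finally in.close()
  }
  props
}

// e.g. println(loadBuildInfo().getProperty("version", "<unknown>"))
{code}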



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13710.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12043
[https://github.com/apache/spark/pull/12043]

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
> Fix For: 2.0.0
>
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
> at org.apache.spark.repl.Main$.doMain(Main.scala:64)
> at org.apache.spark.repl.Main$.main(Main.scala:47)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737)
> at 
> 

[jira] [Updated] (SPARK-13710) Spark shell shows ERROR when launching on Windows

2016-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13710:
--
Assignee: Michel Lemay

> Spark shell shows ERROR when launching on Windows
> -
>
> Key: SPARK-13710
> URL: https://issues.apache.org/jira/browse/SPARK-13710
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Michel Lemay
>Priority: Minor
> Fix For: 2.0.0
>
>
> On Windows, when we launch {{bin\spark-shell.cmd}}, it shows ERROR message 
> and stacktrace.
> {noformat}
> C:\Users\tsudukim\Documents\workspace\spark-dev3>bin\spark-shell
> [ERROR] Terminal initialization failed; falling back to unsupported
> java.lang.NoClassDefFoundError: Could not initialize class 
> scala.tools.fusesource_embedded.jansi.internal.Kernel32
> at 
> scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)
> at 
> scala.tools.jline_embedded.WindowsTerminal.getConsoleMode(WindowsTerminal.java:204)
> at 
> scala.tools.jline_embedded.WindowsTerminal.init(WindowsTerminal.java:82)
> at 
> scala.tools.jline_embedded.TerminalFactory.create(TerminalFactory.java:101)
> at 
> scala.tools.jline_embedded.TerminalFactory.get(TerminalFactory.java:158)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:229)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:221)
> at 
> scala.tools.jline_embedded.console.ConsoleReader.<init>(ConsoleReader.java:209)
> at 
> scala.tools.nsc.interpreter.jline_embedded.JLineConsoleReader.<init>(JLineReader.scala:61)
> at 
> scala.tools.nsc.interpreter.jline_embedded.InteractiveReader.<init>(JLineReader.scala:33)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:865)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$scala$tools$nsc$interpreter$ILoop$$instantiate$1$1.apply(ILoop.scala:862)
> at 
> scala.tools.nsc.interpreter.ILoop.scala$tools$nsc$interpreter$ILoop$$mkReader$1(ILoop.scala:871)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15$$anonfun$apply$8.apply(ILoop.scala:875)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$15.apply(ILoop.scala:875)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at 
> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223)
> at scala.collection.immutable.Stream.collect(Stream.scala:435)
> at scala.tools.nsc.interpreter.ILoop.chooseReader(ILoop.scala:877)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$2.apply(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:916)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
> at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
> at org.apache.spark.repl.Main$.doMain(Main.scala:64)
> at org.apache.spark.repl.Main$.main(Main.scala:47)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:737)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
> at 

[jira] [Resolved] (SPARK-14278) Initialize columnar batch with proper memory mode

2016-03-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14278.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12070
[https://github.com/apache/spark/pull/12070]

> Initialize columnar batch with proper memory mode
> -
>
> Key: SPARK-14278
> URL: https://issues.apache.org/jira/browse/SPARK-14278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14278) Initialize columnar batch with proper memory mode

2016-03-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14278:
-
Assignee: Sameer Agarwal

> Initialize columnar batch with proper memory mode
> -
>
> Key: SPARK-14278
> URL: https://issues.apache.org/jira/browse/SPARK-14278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13363) Aggregator not working with DataFrame

2016-03-31 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220492#comment-15220492
 ] 

koert kuipers commented on SPARK-13363:
---

Just doing some digging. The issue seems to be that when the 
TypedAggregateExpression is created from the Aggregator, aEncoder is set to 
None, and it stays None. Then, when the check that calls resolved on 
TypedAggregateExpression runs, it returns false because aEncoder is None. 

> Aggregator not working with DataFrame
> -
>
> Key: SPARK-13363
> URL: https://issues.apache.org/jira/browse/SPARK-13363
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: koert kuipers
>Priority: Blocker
>
> org.apache.spark.sql.expressions.Aggregator doc/comments says: A base class 
> for user-defined aggregations, which can be used in [[DataFrame]] and 
> [[Dataset]]
> It works well with Dataset/GroupedDataset, but I am having no luck using it 
> with DataFrame/GroupedData. Does anyone have an example of how to use it with a 
> DataFrame?
> in particular i would like to use it with this method in GroupedData:
> {noformat}
>   def agg(expr: Column, exprs: Column*): DataFrame
> {noformat}
> clearly it should be possible, since GroupedDataset uses that very same 
> method to do the work:
> {noformat}
>   private def agg(exprs: Column*): DataFrame =
> groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*)
> {noformat}
> The trick seems to be the wrapping in withEncoder, which is private. I tried 
> to do something like it myself, but I had no luck since it uses more private 
> stuff in TypedColumn.
> Anyhow, my attempt at using it in a DataFrame:
> {noformat}
> val simpleSum = new Aggregator[Int, Int, Int] {
>   def zero: Int = 0 // The initial value.
>   def reduce(b: Int, a: Int) = b + a// Add an element to the running total
>   def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values.
>   def finish(b: Int) = b// Return the final result.
> }.toColumn
> val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v")
> df.groupBy("k").agg(simpleSum).show
> {noformat}
> and the resulting error:
> {noformat}
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49)
> {noformat}
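For contrast, a sketch of the Dataset/GroupedDataset route that does work in 1.6 (the same
spark-shell imports are assumed; the Aggregator here consumes the whole (k, v) pair so it
can be used as a TypedColumn on the grouped Dataset):

{code}
import org.apache.spark.sql.expressions.Aggregator

// Sums the second element of each (k, v) pair.
val sumValues = new Aggregator[(Int, Int), Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: (Int, Int)): Int = b + a._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(b: Int): Int = b
}.toColumn

val ds = sc.makeRDD(1 to 3).map(i => (i, i)).toDS()
ds.groupBy(_._1).agg(sumValues).show()
{code}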



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-11327.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
> Fix For: 2.0.0
>
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image.
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar;,
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar;,
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> {code}
> 15/10/26 22:08:08 INFO 

[jira] [Resolved] (SPARK-14069) Improve SparkStatusTracker to also track executor information

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14069.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Improve SparkStatusTracker to also track executor information
> -
>
> Key: SPARK-14069
> URL: https://issues.apache.org/jira/browse/SPARK-14069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14243:


Assignee: Apache Spark  (was: jeanlyn)

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly 
> when removing blocks in *BlockManager.removeBlock* and in the methods that invoke 
> *removeBlock*. See:
> branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* update correctly when:
> * Block removed from BlockManager
> * Block dropped from memory to disk
> * Block added to BlockManager



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14168) Managed Memory Leak Msg Should Only Be a Warning

2016-03-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14168.
---
Resolution: Duplicate

> Managed Memory Leak Msg Should Only Be a Warning
> 
>
> Key: SPARK-14168
> URL: https://issues.apache.org/jira/browse/SPARK-14168
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
>
> When a task is completed, executors check to see if all managed memory for 
> the task was correctly released, and log an error when it wasn't. However, 
> it turns out it's OK for there to be memory that wasn't released when an 
> Iterator isn't read to completion, e.g., with {{rdd.take()}}. This results in 
> a scary error msg in the executor logs:
> {noformat}
> 16/01/05 17:02:49 ERROR Executor: Managed memory leak detected; size = 
> 16259594 bytes, TID = 24
> {noformat}
> Furthermore, if a task fails for any reason, this msg is also triggered. This 
> can lead users to believe that the failure was caused by the memory leak, when the 
> root cause could be entirely different. E.g., the same error msg appears in 
> executor logs with this clearly broken user code run with {{spark-shell 
> --master 'local-cluster[2,2,1024]'}}
> {code}
> sc.parallelize(0 to 1000, 2).map(x => x % 1 -> 
> x).groupByKey.mapPartitions { it => throw new RuntimeException("user error!") 
> }.collect
> {code}
> We should downgrade the msg to a warning and link to a more detailed 
> explanation.
> See https://issues.apache.org/jira/browse/SPARK-11293 for more reports from 
> users (and perhaps a true fix)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14243:


Assignee: jeanlyn  (was: Apache Spark)

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>Assignee: jeanlyn
> Fix For: 2.0.0
>
>
> Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly 
> when removing blocks in *BlockManager.removeBlock* and in the methods that invoke 
> *removeBlock*. See:
> branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* update correctly when:
> * Block removed from BlockManager
> * Block dropped from memory to disk
> * Block added to BlockManager



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14243:
--
Fix Version/s: 2.0.0

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>Assignee: jeanlyn
> Fix For: 2.0.0
>
>
> Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly 
> when removing blocks in *BlockManager.removeBlock* and in the methods that invoke 
> *removeBlock*. See:
> branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* update correctly when:
> * Block removed from BlockManager
> * Block dropped from memory to disk
> * Block added to BlockManager



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-14243:
---

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>Assignee: jeanlyn
> Fix For: 2.0.0
>
>
> Currently, *updatedBlockStatuses* in *TaskMetrics* is not updated correctly 
> when blocks are removed in *BlockManager.removeBlock* or in the methods that 
> invoke *removeBlock*. See:
> branch-1.6: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* is updated correctly when:
> * a block is removed from the BlockManager
> * a block is dropped from memory to disk
> * a block is added to the BlockManager



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14243.
---
  Resolution: Fixed
Target Version/s: 1.6.2, 2.0.0

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>Assignee: jeanlyn
>
> Currently, *updatedBlockStatuses* in *TaskMetrics* is not updated correctly 
> when blocks are removed in *BlockManager.removeBlock* or in the methods that 
> invoke *removeBlock*. See:
> branch-1.6: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5: https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* is updated correctly when:
> * a block is removed from the BlockManager
> * a block is dropped from memory to disk
> * a block is added to the BlockManager



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14182) Parse DDL command: Alter View

2016-03-31 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14182.
---
  Resolution: Fixed
Assignee: Xiao Li
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

> Parse DDL command: Alter View
> -
>
> Key: SPARK-14182
> URL: https://issues.apache.org/jira/browse/SPARK-14182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> Based on the Hive DDL documentation 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
> and 
> https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
> the syntax of the DDL commands for {{ALTER VIEW}} includes:
> {code}
> ALTER VIEW view_name AS select_statement
> ALTER VIEW view_name RENAME TO new_view_name
> ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment)
> ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
> ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
> ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
> {code}
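> For illustration only (not part of this ticket's patch): once the parser accepts 
> these commands they can be issued through {{sql(...)}}; the local Hive-enabled 
> setup below is an assumption, and runtime support for each statement depends on 
> the Spark version.
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.hive.HiveContext
>
> // Assumed local setup with Hive support; adjust to your environment.
> val sc = new SparkContext(new SparkConf().setAppName("alter-view-demo").setMaster("local[2]"))
> val sqlContext = new HiveContext(sc)
>
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (id INT)")
> sqlContext.sql("CREATE VIEW IF NOT EXISTS v AS SELECT id FROM src")
> sqlContext.sql("ALTER VIEW v SET TBLPROPERTIES ('comment' = 'demo view')")
> sqlContext.sql("ALTER VIEW v RENAME TO v_renamed")
> {code}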



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


