[jira] [Comment Edited] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2015-10-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973829#comment-14973829
 ] 

Yanbo Liang edited comment on SPARK-9265 at 10/26/15 6:59 AM:
--

[~tdas] [~andrewor14] [~rxin] Could you tell me how you generated the table? 
Is it a Spark SQL temporary table or a Hive table? I used an external data source 
to load a test table but cannot reproduce this bug.
{code}
val df = sqlContext.read.json("examples/src/main/resources/failed_suites.json")
val recentFailures = df.cache()
val topRecentFailures = 
recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
val mot = topRecentFailures.as("a").join(recentFailures.as("b"), $"a.suiteName" 
=== $"b.suiteName")
(1 to 10).foreach { i => 
  println(s"$i: " + mot.count())
}
1: 1107 
2: 1107
3: 1107
4: 1107
5: 1107
6: 1107
7: 1107
8: 1107
9: 1107
10: 1107
{code}


was (Author: yanboliang):
@Tathagata Das @Andrew Or [~rxin] Could you tell me how did you generate the 
table? It's a Spark SQL temporary table or Hive table? I use external 
datasource to load a test table but can not reproduce this bug.
{code:scala}
val df = sqlContext.read.json("examples/src/main/resources/failed_suites.json")
{code}

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> ++-+
> |   suiteName|failCount|
> ++-+
> |org.apache.spark|   85|
> |org.apache.spark|   26|
> |org.apache.spark|   26|
> |org.apache.spark|   17|
> |org.apache.spark|   17|
> |org.apache.spark|   15|
> |org.apache.spark|   13|
> |org.apache.spark|   13|
> |org.apache.spark|   11|
> |org.apache.spark|9|
> ++-+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}






[jira] [Assigned] (SPARK-11311) spark cannot describe temporary functions

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11311:


Assignee: Apache Spark

> spark cannot describe temporary functions
> -
>
> Key: SPARK-11311
> URL: https://issues.apache.org/jira/browse/SPARK-11311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Apache Spark
>
> create temporary function aa as ;
> describe function aa;
> Will return 'Unable to find function aa', which is not right.
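
For readers trying to reproduce this, a minimal sketch of the reported sequence, assuming a 1.5-era HiveContext available as {{sqlContext}} (the UDF class name is a placeholder; the original report elides it):
{code}
// Register a temporary Hive UDF, then ask Spark SQL to describe it.
sqlContext.sql("CREATE TEMPORARY FUNCTION aa AS 'com.example.SomeUDF'")
sqlContext.sql("DESCRIBE FUNCTION aa").show()  // reportedly answers 'Unable to find function aa'
{code}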






[jira] [Assigned] (SPARK-11311) spark cannot describe temporary functions

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11311:


Assignee: (was: Apache Spark)

> spark cannot describe temporary functions
> -
>
> Key: SPARK-11311
> URL: https://issues.apache.org/jira/browse/SPARK-11311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> create temporary function aa as ;
> describe function aa;
> Will return 'Unable to find function aa', which is not right.






[jira] [Commented] (SPARK-11311) spark cannot describe temporary functions

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973831#comment-14973831
 ] 

Apache Spark commented on SPARK-11311:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9277

> spark cannot describe temporary functions
> -
>
> Key: SPARK-11311
> URL: https://issues.apache.org/jira/browse/SPARK-11311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> create temporary function aa as ;
> describe function aa;
> Will return 'Unable to find function aa', which is not right.






[jira] [Created] (SPARK-11312) Cannot drop temporary function

2015-10-26 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-11312:
---

 Summary: Cannot drop temporary function
 Key: SPARK-11312
 URL: https://issues.apache.org/jira/browse/SPARK-11312
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


create temporary function is done by executionHive, while DROP TEMPORARY 
FUNCTION is done by metadataHive
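
A minimal reproduction sketch of the mismatch described above, assuming a 1.5-era HiveContext available as {{sqlContext}} (the UDF class name is a placeholder):
{code}
// CREATE goes through executionHive ...
sqlContext.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF'")
// ... while DROP goes through metadataHive, which per this report cannot find the function.
sqlContext.sql("DROP TEMPORARY FUNCTION my_udf")
{code}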






[jira] [Commented] (SPARK-11312) Cannot drop temporary function

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973845#comment-14973845
 ] 

Apache Spark commented on SPARK-11312:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9278

> Cannot drop temporary function
> --
>
> Key: SPARK-11312
> URL: https://issues.apache.org/jira/browse/SPARK-11312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> create temporary function is done by executionHive, while DROP TEMPORARY 
> FUNCTION is done by metadataHive






[jira] [Assigned] (SPARK-11312) Cannot drop temporary function

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11312:


Assignee: (was: Apache Spark)

> Cannot drop temporary function
> --
>
> Key: SPARK-11312
> URL: https://issues.apache.org/jira/browse/SPARK-11312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> create temporary function is done by executionHive, while DROP TEMPORARY 
> FUNCTION is done by metadataHive






[jira] [Assigned] (SPARK-11312) Cannot drop temporary function

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11312:


Assignee: Apache Spark

> Cannot drop temporary function
> --
>
> Key: SPARK-11312
> URL: https://issues.apache.org/jira/browse/SPARK-11312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Apache Spark
>
> create temporary function is done by executionHive, while DROP TEMPORARY 
> FUNCTION is done by metadataHive






[jira] [Resolved] (SPARK-11312) Cannot drop temporary function

2015-10-26 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang resolved SPARK-11312.
-
Resolution: Duplicate

> Cannot drop temporary function
> --
>
> Key: SPARK-11312
> URL: https://issues.apache.org/jira/browse/SPARK-11312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> create temporary function is done by executionHive, while DROP TEMPORARY 
> FUNCTION is done by metadataHive






[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973896#comment-14973896
 ] 

Yanbo Liang commented on SPARK-11303:
-

I think the cause of this bug is the same as SPARK-4963; I will send a PR to 
resolve it.

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering DataFrame from python, we get inconsistent 
> result when not caching the sampled DataFrame. This bug  doesn't appear in 
> spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!






[jira] [Resolved] (SPARK-11310) only build spark core,Modify spark pom file:delete graphx

2015-10-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11310.
---
Resolution: Invalid

It's not clear what you're trying to ask, but this is not the place anyway. Ask 
at u...@spark.apache.org with a fuller description of what you are trying to 
achieve and what the problem is.

> only build  spark core,Modify spark pom file:delete graphx
> ---
>
> Key: SPARK-11310
> URL: https://issues.apache.org/jira/browse/SPARK-11310
> Project: Spark
>  Issue Type: Question
>Reporter: yindu_asan
>
> I only want to build Spark core, so I modified the Spark pom file to delete 
> graphx, bagel, ... 
> but the built jar still contains graphx's Scala files.
> Why?






[jira] [Commented] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973908#comment-14973908
 ] 

Sean Owen commented on SPARK-11305:
---

I support this and would tack on a few more reasons:

- the Hadoop distributions listed here are quite old at this stage anyhow
- it could be perceived as subtly favoring the listed distributions
- I am not clear that, for example, the CDH4 build continues to work with CDH; 
for all distributions, this might be implying a level of guarantee of 
compatibility that isn't reflected in testing

Related: what about continuing to package and distribute the cdh4 build? For 
similar reasons I think this could go in 1.6.

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are three sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 






[jira] [Resolved] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5966.
--
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.3

Issue resolved by pull request 9220
[https://github.com/apache/spark/pull/9220]

> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1

2015-10-26 Thread Yuri Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973943#comment-14973943
 ] 

Yuri Saito commented on SPARK-11227:


Resolved it myself.
I changed from SQLContext to HiveContext, and it now works well.
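
For reference, a sketch of the change described, using the Spark 1.5 API (app and class names are illustrative, and this only restates the reporter's workaround, not an established root cause):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("com.example.Job"))
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)  // previous setup, hit UnknownHostException
val sqlContext = new HiveContext(sc)                          // reporter's workaround
{code}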

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
> 
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: OS: CentOS 6.6
> Memory: 28G
> CPU: 8
> Mesos: 0.22.0
> HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager)
>Reporter: Yuri Saito
>
> When running a jar containing a Spark job on an HDFS HA cluster with Mesos and 
> Spark 1.5.1, the job throws "java.net.UnknownHostException: 
> nameservice1" and fails.
> I run the following in a terminal:
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> The job then throws the following message:
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
> (TID 0, spark003.example.com): java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: nameservice1
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at 
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.UnknownHostException: nameservice1
> ... 41 more
> {code}
> But when I changed the Spark cluster from 1.5.1 to 1.4.0 and ran the job, it 
> completed successfully.
> In addition, I disabled High Availability on

[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973974#comment-14973974
 ] 

Sean Owen commented on SPARK-11302:
---

In R, it notes that your covariance matrix isn't positive definite. It isn't -- 
it has negative eigenvalues. That doesn't necessarily mean the input is wrong, but 
it could be. Are you sure this is the right input and not a victim of precision or 
rounding problems? In any event, I don't think MLlib is the problem here, since R 
won't compute this either. (I glanced at the implementation and it looked like what 
I'd expect to see, too.)
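
For anyone who wants to check the eigenvalue claim, a small sketch using Breeze (already an MLlib dependency) on the rounded matrix from the description; it only prints the eigenvalues, it does not assert anything about them:
{code}
import breeze.linalg.{DenseMatrix, eigSym}

// Rounded covariance matrix from the issue description.
val cov = DenseMatrix(
  (165496.0, 167996.0, 11.0, 163037.0),
  (167996.0, 170631.0, 19.0, 165405.0),
  (    11.0,     19.0,  0.0,      2.0),
  (163037.0, 165405.0,  2.0, 160707.0))

// eigSym expects a symmetric matrix; a negative eigenvalue (beyond numerical
// tolerance) means the matrix is not positive semi-definite.
println(eigSym(cov).eigenvalues)
{code}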

>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model  using Spark MLib. 
> As an input, I feed the model with a mean vector and a Covariance matrix. 
> ,assuming my features contain Co-variance.
> Here are my input for the  model ,and the model returns zero for each data 
> point for this input.
> MU vector - 
> 1054.8, 1069.8, 1.3 ,1040.1
> Cov' matrix - 
> 165496.0 , 167996.0,  11.0 , 163037.0  
> 167996.0,  170631.0,  19.0,  165405.0  
> 11.0,   19.0 , 0.0,   2.0   
> 163037.0,   165405.0 2.0 ,  160707.0 
> Conversely,  for the  non covariance case, represented by  this matrix ,the 
> model is working and returns results as expected 
> 165496.0,  0.0 ,   0.0,   0.0 
> 0.0,   170631.0,   0.0,   0.0 
> 0.0 ,   0.0 ,   0.8,   0.0 
> 0.0 ,   0.0,0.0,  160594.2






[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973993#comment-14973993
 ] 

Yanbo Liang commented on SPARK-11303:
-

It looks like this bug is caused by a mutable-row-copy problem similar to 
SPARK-4963, but adding *copy* to *sample* still does not resolve the issue. I found 
that *map(_copy())* was removed by 
https://github.com/apache/spark/pull/8040/files. [~rxin] Could you tell us the 
motivation for removing *map(_copy())* in that PR?
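
As a side note for readers, a tiny standalone sketch (plain Scala, not Spark internals) of the failure mode being discussed: when one mutable buffer is reused across an iterator, every retained element ends up aliasing the last value unless a per-element copy, analogous to the *map(_copy())* above, is taken:
{code}
// Reused mutable buffer: all retained elements point at the same array.
val buffer = new Array[Int](1)
val reused = (1 to 5).iterator.map { i => buffer(0) = i; buffer }.toList
println(reused.map(_.toSeq))   // every element shows 5

// Defensive copy per element restores the expected values.
val copied = (1 to 5).iterator.map { i => buffer(0) = i; buffer.clone() }.toList
println(copied.map(_.toSeq))   // 1, 2, 3, 4, 5
{code}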

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering DataFrame from python, we get inconsistent 
> result when not caching the sampled DataFrame. This bug  doesn't appear in 
> spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!






[jira] [Comment Edited] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-26 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973993#comment-14973993
 ] 

Yanbo Liang edited comment on SPARK-11303 at 10/26/15 10:29 AM:


It looks like this bug is caused by a mutable-row-copy problem similar to 
SPARK-4963, but adding *copy* to *sample* still does not resolve the issue. I found 
that *map(_copy())* was removed by 
https://github.com/apache/spark/pull/8040/files. [~rxin] Could you tell us the 
motivation for removing *map(_copy())* for withReplacement = false in that PR?


was (Author: yanboliang):
It looks like this bug caused by mutable row copy related problem similar with 
SPARK-4963. But after adding *copy* to *sample*, it still can not resolve this 
issue. I found *map(_copy())* was removed by 
https://github.com/apache/spark/pull/8040/files, [~rxin] Could you tell us the 
motivation of removing *map(_copy())* in that PR?

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering DataFrame from python, we get inconsistent 
> result when not caching the sampled DataFrame. This bug  doesn't appear in 
> spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!






[jira] [Issue Comment Deleted] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-26 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11303:

Comment: was deleted

(was: I think the reason of this bug is the same as SPARK-4963, I will send a 
PR to resolve it.)

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering DataFrame from python, we get inconsistent 
> result when not caching the sampled DataFrame. This bug  doesn't appear in 
> spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!






[jira] [Created] (SPARK-11313) Implement cogroup

2015-10-26 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11313:
---

 Summary: Implement cogroup
 Key: SPARK-11313
 URL: https://issues.apache.org/jira/browse/SPARK-11313
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-11313) Implement cogroup

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974097#comment-14974097
 ] 

Apache Spark commented on SPARK-11313:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9279

> Implement cogroup
> -
>
> Key: SPARK-11313
> URL: https://issues.apache.org/jira/browse/SPARK-11313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-11313) Implement cogroup

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11313:


Assignee: Apache Spark

> Implement cogroup
> -
>
> Key: SPARK-11313
> URL: https://issues.apache.org/jira/browse/SPARK-11313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-11313) Implement cogroup

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11313:


Assignee: (was: Apache Spark)

> Implement cogroup
> -
>
> Key: SPARK-11313
> URL: https://issues.apache.org/jira/browse/SPARK-11313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>







[jira] [Commented] (SPARK-4751) Support dynamic allocation for standalone mode

2015-10-26 Thread Matthias Niehoff (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974119#comment-14974119
 ] 

Matthias Niehoff commented on SPARK-4751:
-

The PR is merged, but the documentation at 
https://spark.apache.org/docs/1.5.1/job-scheduling.html still says:
"This feature is currently disabled by default and available only on YARN."

Is the documentation just outdated, or is the feature not yet available in 1.5.x?
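
For reference, a sketch of the settings the feature uses (property names as in the Spark configuration docs; whether standalone mode honours them in 1.5.x is exactly the question above):
{code}
// Illustrative only: enable dynamic allocation plus the external shuffle
// service it depends on.
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
{code}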

> Support dynamic allocation for standalone mode
> --
>
> Key: SPARK-4751
> URL: https://issues.apache.org/jira/browse/SPARK-4751
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> This is equivalent to SPARK-3822 but for standalone mode.
> This is actually a very tricky issue because the scheduling mechanism in the 
> standalone Master uses different semantics. In standalone mode we allocate 
> resources based on cores. By default, an application will grab all the cores 
> in the cluster unless "spark.cores.max" is specified. Unfortunately, this 
> means an application could get executors of different sizes (in terms of 
> cores) if:
> 1) App 1 kills an executor
> 2) App 2, with "spark.cores.max" set, grabs a subset of cores on a worker
> 3) App 1 requests an executor
> In this case, the new executor that App 1 gets back will be smaller than the 
> rest and can execute fewer tasks in parallel. Further, standalone mode is 
> subject to the constraint that only one executor can be allocated on each 
> worker per application. As a result, it is rather meaningless to request new 
> executors if the existing ones are already spread out across all nodes.






[jira] [Updated] (SPARK-9883) Distance to each cluster given a point (KMeansModel)

2015-10-26 Thread Bertrand Dechoux (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bertrand Dechoux updated SPARK-9883:

Summary: Distance to each cluster given a point (KMeansModel)  (was: 
Distance to each cluster given a point)

> Distance to each cluster given a point (KMeansModel)
> 
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict 'method which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.
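
Until such a method exists, a workaround sketch using only the existing public MLlib API (the helper name here is made up):
{code}
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Squared Euclidean distance from one point to every cluster center.
def distancesToCenters(model: KMeansModel, point: Vector): Array[Double] =
  model.clusterCenters.map(center => Vectors.sqdist(center, point))
{code}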






[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread eyal sharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974191#comment-14974191
 ] 

eyal sharon commented on SPARK-11302:
-

Hi Sean,

Thanks for reaching out.

For convenience, when I added the covariance matrix to the ticket I rounded the 
numbers.

Below are the real values, organized as a 4x4 matrix. A covariance matrix is, by 
definition, always *positive semi-definite* (though not necessarily positive 
definite). I checked these values in R with the function 
*is.positive.semi.definite* (with tolerance levels of 1e-11, 1e-15, and 1e-20) and 
it returns true in all cases.

401139.3599484815,  387621.07664008765, 73902.67897058972,  314299.39550677023
387621.07664008765, 408594.15705509897, 94234.19718534013,  351268.39070671634
73902.67897058972,  94234.19718534013,  969566.5912689088,  125849.1446871119
314299.39550677023, 351268.39070671634, 125849.1446871119,  393043.68462620175


Best, Eyal






>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model  using Spark MLib. 
> As an input, I feed the model with a mean vector and a Covariance matrix. 
> ,assuming my features contain Co-variance.
> Here are my input for the  model ,and the model returns zero for each data 
> point for this input.
> MU vector - 
> 1054.8, 1069.8, 1.3 ,1040.1
> Cov' matrix - 
> 165496.0 , 167996.0,  11.0 , 163037.0  
> 167996.0,  170631.0,  19.0,  165405.0  
> 11.0,   19.0 , 0.0,   2.0   
> 163037.0,   165405.0 2.0 ,  160707.0 
> Conversely,  for the  non covariance case, represented by  this matrix ,the 
> model is working and returns results as expected 
> 165496.0,  0.0 ,   0.0,   0.0 
> 0.0,   170631.0,   0.0,   0.0 
> 0.0 ,   0.0 ,   0.8,   0.0 
> 0.0 ,   0.0,0.0,  160594.2






[jira] [Created] (SPARK-11314) Add service API and test service for Yarn Cluster schedulers

2015-10-26 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11314:
--

 Summary: Add service API and test service for Yarn Cluster 
schedulers 
 Key: SPARK-11314
 URL: https://issues.apache.org/jira/browse/SPARK-11314
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.5.1
 Environment: Hadoop 2.2+ cluster
Reporter: Steve Loughran


Provide an extension model to load and run implementations of 
{{SchedulerExtensionService}} in the YARN cluster scheduler process, and to 
stop them afterwards.
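
Purely as a hypothetical sketch of the shape such an extension point could take (the trait name comes from this issue; the method names and signatures below are assumptions, not the actual Spark API):
{code}
trait SchedulerExtensionService {
  def start(): Unit   // called when the YARN cluster scheduler starts
  def stop(): Unit    // called when the scheduler shuts down
}
{code}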







[jira] [Assigned] (SPARK-11314) Add service API and test service for Yarn Cluster schedulers

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11314:


Assignee: Apache Spark

> Add service API and test service for Yarn Cluster schedulers 
> -
>
> Key: SPARK-11314
> URL: https://issues.apache.org/jira/browse/SPARK-11314
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.2+ cluster
>Reporter: Steve Loughran
>Assignee: Apache Spark
>
> Provide an extension model to load and run implementations of 
> {{SchedulerExtensionService}} in the YARN cluster scheduler process, and to 
> stop them afterwards.






[jira] [Commented] (SPARK-11314) Add service API and test service for Yarn Cluster schedulers

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974208#comment-14974208
 ] 

Apache Spark commented on SPARK-11314:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/9182

> Add service API and test service for Yarn Cluster schedulers 
> -
>
> Key: SPARK-11314
> URL: https://issues.apache.org/jira/browse/SPARK-11314
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.2+ cluster
>Reporter: Steve Loughran
>
> Provide an extension model to load and run implementations of 
> {{SchedulerExtensionService}} in the YARN cluster scheduler process, and to 
> stop them afterwards.






[jira] [Created] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2015-10-26 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11315:
--

 Summary: Add YARN extension service to publish Spark events to 
YARN timeline service
 Key: SPARK-11315
 URL: https://issues.apache.org/jira/browse/SPARK-11315
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.5.1
 Environment: Hadoop 2.6+
Reporter: Steve Loughran


Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
events, batch them and forward them to the YARN Application Timeline Service. 
This data can then be retrieved by a new back end for the Spark History 
Service, and by other analytics tools.






[jira] [Assigned] (SPARK-11314) Add service API and test service for Yarn Cluster schedulers

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11314:


Assignee: (was: Apache Spark)

> Add service API and test service for Yarn Cluster schedulers 
> -
>
> Key: SPARK-11314
> URL: https://issues.apache.org/jira/browse/SPARK-11314
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.2+ cluster
>Reporter: Steve Loughran
>
> Provide an extension model to load and run implementations of 
> {{SchedulerExtensionService}} in the YARN cluster scheduler process, and to 
> stop them afterwards.






[jira] [Updated] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster

2015-10-26 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11265:
---
Summary: YarnClient can't get tokens to talk to Hive 1.2.1 in a secure 
cluster  (was: YarnClient can't get tokens to talk to Hive in a secure cluster)

> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
> -
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.






[jira] [Created] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2015-10-26 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-11316:
-

 Summary: isEmpty before coalesce seems to cause huge performance 
issue in setupGroups
 Key: SPARK-11316
 URL: https://issues.apache.org/jira/browse/SPARK-11316
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Thomas Graves


I haven't fully debugged this yet, but I'm reporting what I'm seeing and what I 
think might be going on.

I have a graph-processing job that sees a huge slowdown in setupGroups in the 
location iterator, where it gets the preferred locations for the coalesce. It is 
coalescing from 2400 partitions down to 1200, and the calculation was taking 17+ 
hours; I killed it at that point, so I don't know the total time.

The job does an isEmpty call, a bunch of other transformations, then a coalesce 
(which is where it takes so long), more transformations, and finally a count to 
trigger execution.

It appears that setupGroups finds only one node, and to reach that node it first 
has to go through the while loop:

while (numCreated < targetLen && tries < expectedCoupons2) {
where expectedCoupons2 is around 19000. It finds very few partitions, or none, in 
this loop.

Then it does the second loop:

while (numCreated < targetLen) {  // if we don't have enough partition groups, 
create duplicates
  var (nxt_replica, nxt_part) = rotIt.next()
  val pgroup = PartitionGroup(nxt_replica)
  groupArr += pgroup
  groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
  var tries = 0
  while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
ensure at least one part
nxt_part = rotIt.next()._2
tries += 1
  }
  numCreated += 1
}

This has an inner while loop, and both loops run 1200 times, i.e. 1200*1200 
iterations, which takes a very long time.

The user can work around the issue by adding a count() call shortly after the 
isEmpty call and before the coalesce is called (see the sketch below). I also tried 
putting a take(1) right before the isEmpty call, and that also seems to work around 
the issue.
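
A sketch of the workaround mentioned above, where {{rdd}} is a placeholder for the dataset in the pipeline (this only restates the reported workaround, it is not a fix):
{code}
val empty = rdd.isEmpty()            // the isEmpty call that seems to trigger the problem
rdd.count()                          // workaround: force a full count before coalescing
val coalesced = rdd.coalesce(1200)   // setupGroups is then reported to behave normally
{code}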






[jira] [Updated] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2015-10-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-11316:
--
Priority: Critical  (was: Major)

> isEmpty before coalesce seems to cause huge performance issue in setupGroups
> 
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Critical
>
> So I haven't fully debugged this yet but reporting what I'm seeing and think 
> might be going on.
> I have a graph processing job that is seeing huge slow down in setupGroups in 
> the location iterator where its getting the preferred locations for the 
> coalesce.  They are coalescing from 2400 down to 1200 and its taking 17+ 
> hours to do the calculation.  Killed it at this point so don't know total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformation, then a coalesce (where it takes so long), other 
> transformations, then finally a count to trigger it.   
> It appears that there is only one node that its finding in the setupGroup 
> call and to get to that node it has to first to through the while loop:
> while (numCreated < targetLen && tries < expectedCoupons2) {
> where expectedCoupons2 is around 19000.  It finds very few or none in this 
> loop.  
> Then it does the second loop:
> while (numCreated < targetLen) {  // if we don't have enough partition 
> groups, create duplicates
>   var (nxt_replica, nxt_part) = rotIt.next()
>   val pgroup = PartitionGroup(nxt_replica)
>   groupArr += pgroup
>   groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
>   var tries = 0
>   while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
> ensure at least one part
> nxt_part = rotIt.next()._2
> tries += 1
>   }
>   numCreated += 1
> }
> Where it has an inner while loop and both of those are going 1200 times.  
> 1200*1200 loops.  This is taking a very long time.
> The user can work around the issue by adding in a count() call very close to 
> after the isEmpty call before the coalesce is called.  I also tried putting 
> in a take(1)  right before the isEmpty call and it seems to work around 
> the issue.






[jira] [Updated] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster

2015-10-26 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11265:
---
Description: As reported on the dev list, trying to run a YARN client which 
wants to talk to Hive in a Kerberized hadoop cluster fails.  (was: As reported 
on the dev list, trying to run a YARN client which wants to talk to Hive in a 
Kerberized hadoop cluster fails. This appears to be because the constructor of 
the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was made private and 
replaced with a factory method. The YARN client uses reflection to get the 
tokens, so the signature changes weren't picked up in SPARK-8064.)

> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
> -
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails.






[jira] [Commented] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2015-10-26 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974236#comment-14974236
 ] 

Thomas Graves commented on SPARK-11316:
---

Note: I'm wondering whether, since the isEmpty call does a take(1), it is only 
finding one location and thus throwing off the setupGroups call.

> isEmpty before coalesce seems to cause huge performance issue in setupGroups
> 
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Critical
>
> So I haven't fully debugged this yet but reporting what I'm seeing and think 
> might be going on.
> I have a graph processing job that is seeing huge slow down in setupGroups in 
> the location iterator where its getting the preferred locations for the 
> coalesce.  They are coalescing from 2400 down to 1200 and its taking 17+ 
> hours to do the calculation.  Killed it at this point so don't know total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformation, then a coalesce (where it takes so long), other 
> transformations, then finally a count to trigger it.   
> It appears that there is only one node that its finding in the setupGroup 
> call and to get to that node it has to first to through the while loop:
> while (numCreated < targetLen && tries < expectedCoupons2) {
> where expectedCoupons2 is around 19000.  It finds very few or none in this 
> loop.  
> Then it does the second loop:
> while (numCreated < targetLen) {  // if we don't have enough partition 
> groups, create duplicates
>   var (nxt_replica, nxt_part) = rotIt.next()
>   val pgroup = PartitionGroup(nxt_replica)
>   groupArr += pgroup
>   groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
>   var tries = 0
>   while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
> ensure at least one part
> nxt_part = rotIt.next()._2
> tries += 1
>   }
>   numCreated += 1
> }
> Where it has an inner while loop and both of those are going 1200 times.  
> 1200*1200 loops.  This is taking a very long time.
> The user can work around the issue by adding in a count() call very close to 
> after the isEmpty call before the coalesce is called.  I also tried putting 
> in a take(1)  right before the isEmpty call and it seems to work around 
> the issue.






[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster

2015-10-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974239#comment-14974239
 ] 

Steve Loughran commented on SPARK-11265:


What's changed?

The Spark code uses reflection to get the method 
{{("org.apache.hadoop.hive.ql.metadata.Hive#get")}}, then invokes it with a 
single argument: {{hive = hiveClass.getMethod("get").invoke(null, 
hiveConf.asInstanceOf[Object])}}

Hive 0.13 has more than one method with this name, even in Hive 0.13.1; it has, in 
order, {{get(HiveConf)}}, {{get(HiveConf, boolean)}}, and {{get()}}.

Hive 1.2.1 adds one new method, {{get(Configuration c, Class clazz)}}, 
*before* the others, and now the invoke fails because the returned method doesn't 
take a HiveConf.

What could have been happening here is that the method lookup was returning the 
{{get(HiveConf)}} method because it was first in the file, and on 1.2.1 it 
returned the new method, which doesn't take a single {{HiveConf}}, hence the 
stack trace.

The fix, underneath it all, is simply getting the method {{get(HiveConf.class)}} 
and invoking it with the configuration created by reflection. That's all: 
explicitly asking for a method that has always been there. The code probably 
worked before just because nobody was looking at it.
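
A hedged sketch of the fix as described, not the actual patch (the HiveConf is instantiated reflectively here only to keep the snippet self-contained, assuming its no-arg constructor):
{code}
// Ask for the overload explicitly by parameter type instead of relying on
// whichever "get" an unqualified lookup happens to resolve.
val hiveClass     = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive")
val hiveConfClass = Class.forName("org.apache.hadoop.hive.conf.HiveConf")
val hiveConf      = hiveConfClass.getDeclaredConstructor().newInstance()
val getMethod     = hiveClass.getMethod("get", hiveConfClass)            // get(HiveConf)
val hive          = getMethod.invoke(null, hiveConf.asInstanceOf[Object])
{code}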

> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
> -
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails.






[jira] [Comment Edited] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster

2015-10-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974239#comment-14974239
 ] 

Steve Loughran edited comment on SPARK-11265 at 10/26/15 2:08 PM:
--

What's changed?

The Spark code uses reflection to get the method 
{{("org.apache.hadoop.hive.ql.metadata.Hive#get")}}, then invokes it with a 
single argument: {{hive = hiveClass.getMethod("get").invoke(null, 
hiveConf.asInstanceOf[Object])}}

Hive 0.13 has more than one {{get}} method; it has, in order, {{get(HiveConf)}}, 
{{get(HiveConf, boolean)}}, and {{get()}}.

Hive 1.2.1 adds one new method, {{get(Configuration c, Class clazz)}}, 
*before* the others, and now the invoke fails because the returned method doesn't 
take a HiveConf.

What could have been happening here is that the method lookup was returning the 
{{get(HiveConf)}} method because it was first in the file, and on 1.2.1 it 
returned the new method, which doesn't take a single {{HiveConf}}, hence the 
stack trace.

The fix, underneath it all, is simply getting the method {{get(HiveConf.class)}} 
and invoking it with the configuration created by reflection. That's all: 
explicitly asking for a method that has always been there. The code probably 
worked before just because nobody was looking at it.


was (Author: ste...@apache.org):
What's changed?

The spark code uses reflection to get the method 
{{("org.apache.hadoop.hive.ql.metadata.Hive#get")}}, then invokes it with a 
single argument: {{hive = hiveClass.getMethod("get").invoke(null, 
hiveConf.asInstanceOf[Object])}}

Hive 0.13 has >1 method with this name, even in Hive 0.31.1; it has, in order, 
{{get(HiveConf}}, {{get(HiveConf, boolean)}}, and {{get()}}.

Hive 1.2.1 adds one new method {{get(Configuration c, Class clazz)}} 
*before* the others, and now invoke is failing as the returned method doesn't 
take a HiveConf.

What could have been happening here is that the {{Class.get()}} method was 
returning the {{get(HiveConf}} method because it was first in the file, and on 
1.2.1 the new method returned the new one, which didn't take a single 
{{HiveConf}}, hence the stack trace

The fix, under all of it, is simply getting the method {{get(HiveConf.class)}}, 
and invoking it with the configuration created by reflection. That's all: 
explicitly asking for a method that's always been there. The code probably 
worked before just because nobody was looking at it.

> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
> -
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11265) YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster

2015-10-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974239#comment-14974239
 ] 

Steve Loughran edited comment on SPARK-11265 at 10/26/15 2:08 PM:
--

What's changed?

The Spark code uses reflection to get the method 
{{("org.apache.hadoop.hive.ql.metadata.Hive#get")}}, then invokes it with a 
single argument: {{hive = hiveClass.getMethod("get").invoke(null, 
hiveConf.asInstanceOf[Object])}}.

Hive 0.13 has >1 method with this name, even in Hive 0.13.1; it has, in order, 
{{get(HiveConf)}}, {{get(HiveConf, boolean)}}, and {{get()}}.

Hive 1.2.1 adds one new method, {{get(Configuration c, Class clazz)}}, 
*before* the others, and now the invocation fails because the returned method 
doesn't take a {{HiveConf}}.

What could have been happening here is that {{Class.getMethod("get")}} was 
returning the {{get(HiveConf)}} method because it was first in the file, and on 
1.2.1 the lookup returned the new method, which didn't take a single 
{{HiveConf}}, hence the stack trace.

The fix, underneath it all, is simply getting the method {{get(HiveConf.class)}} 
and invoking it with the configuration created by reflection. That's all: 
explicitly asking for a method that's always been there. The code probably 
worked before just because nobody was looking at it.


was (Author: ste...@apache.org):
What's changed?

The spark code uses reflection to get the method 
{{("org.apache.hadoop.hive.ql.metadata.Hive#get"), then invokes it with a 
single argument: {{hive = hiveClass.getMethod("get").invoke(null, 
hiveConf.asInstanceOf[Object])}}

Hive 0.13 has >1 method with this name, even in Hive 0.31.1; it has, in order, 
{{get(HiveConf}}, {{get(HiveConf, boolean)}}, and {{get()}}.

Hive 1.2.1 adds one new method {{get(Configuration c, Class clazz)}} 
*before* the others, and now invoke is failing as the returned method doesn't 
take a HiveConf.

What could have been happening here is that the {{Class.get()}} method was 
returning the {{get(HiveConf}} method because it was first in the file, and on 
1.2.1 the new method returned the new one, which didn't take a single 
{{HiveConf}}, hence the stack trace

The fix, under all of it, is simply getting the method {{get(HiveConf.class)}}, 
and invoking it with the configuration created by reflection. That's all: 
explicitly asking for a method that's always been there. The code probably 
worked before just because nobody was looking at it.

> YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
> -
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974275#comment-14974275
 ] 

Sean Owen commented on SPARK-11302:
---

Yeah, I recognize that, but is this not the answer then? The covariance matrix 
is invalid. The covariance matrix you have here is very different.
What's mu? Where are you computing the pdf? What is the log(pdf) -- that is, is 
the pdf perhaps just very, very small? What does R say? I think there are still 
a lot of missing pieces here.
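
(A quick way to see the "invalid" part from the numbers alone -- the principal 
minor over the 1st and 3rd variables has a negative determinant, so the matrix 
cannot be positive semidefinite. A minimal check, using only the values quoted 
below:)
{code}
// 2x2 principal minor over variables 1 and 3 of the reported covariance matrix
val varX1   = 165496.0  // sigma(1,1)
val covX1X3 = 11.0      // sigma(1,3)
val varX3   = 0.0       // sigma(3,3)
println(varX1 * varX3 - covX1X3 * covX1X3)  // -121.0 < 0, not a valid covariance matrix
{code}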

>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model  using Spark MLib. 
> As an input, I feed the model with a mean vector and a Covariance matrix. 
> ,assuming my features contain Co-variance.
> Here are my input for the  model ,and the model returns zero for each data 
> point for this input.
> MU vector - 
> 1054.8, 1069.8, 1.3 ,1040.1
> Cov' matrix - 
> 165496.0 , 167996.0,  11.0 , 163037.0  
> 167996.0,  170631.0,  19.0,  165405.0  
> 11.0,   19.0 , 0.0,   2.0   
> 163037.0,   165405.0 2.0 ,  160707.0 
> Conversely,  for the  non covariance case, represented by  this matrix ,the 
> model is working and returns results as expected 
> 165496.0,  0.0 ,   0.0,   0.0 
> 0.0,   170631.0,   0.0,   0.0 
> 0.0 ,   0.0 ,   0.8,   0.0 
> 0.0 ,   0.0,0.0,  160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9241:
---

Assignee: Apache Spark

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> Right now the new aggregation code path only support a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without change aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974378#comment-14974378
 ] 

Apache Spark commented on SPARK-9241:
-

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/9280

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only support a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without change aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9241) Supporting multiple DISTINCT columns

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9241:
---

Assignee: (was: Apache Spark)

> Supporting multiple DISTINCT columns
> 
>
> Key: SPARK-9241
> URL: https://issues.apache.org/jira/browse/SPARK-9241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now the new aggregation code path only support a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without change aggregate functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread eyal sharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974382#comment-14974382
 ] 

eyal sharon commented on SPARK-11302:
-

Sure, I will try to elaborate more.

MU is the mean vector of my data set.

Here is the basic flow of my code, with the functions I used. Each function
runs over the data set, which is arranged in a matrix.


*1 - Create a mu vector*

import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix, Vector, Vectors}

def createMU(mat: DenseMatrix): Vector = {
  // mean of each column of the data matrix
  val columnsInArray = toArrays(mat, byRow = false)
  Vectors.dense(columnsInArray.map(vector => vector.sum / vector.length))
}


*2 - Create a cov matrix*

def createCovSigma(mat: DenseMatrix, mu: Vector): DenseMatrix = {
  // subtract mu from every row of the data matrix
  val rowsInArray = toArrays(mat, byRow = true)
  val sigmaSubMU = rowsInArray.map { row =>
    (row.toList zip mu.toArray).map(elem => elem._1 - elem._2).toArray
  }
  val checkArray = sigmaSubMU.flatMap(row => row)

  println("Matrix dimensions - rows: " + mat.numRows + ", cols: " + mat.numCols)

  // sigma = (X - mu)^T * (X - mu) / numRows
  val mat2 = new DenseMatrix(mat.numRows, mat.numCols, checkArray, true)
  val sigmaTmp: DenseMatrix = mat2.transpose.multiply(mat2)
  val sigmaTmpArray = sigmaTmp.toArray
  val sigmaMatrix: DenseMatrix = new DenseMatrix(mat.numCols, mat.numCols,
    sigmaTmpArray.flatMap(x => List(x / mat.numRows)), true)

  sigmaMatrix
}

* Note that I am using an auxiliary function, toArrays; here is the definition:

def toArrays(mat: Matrix, byRow: Boolean): Array[Array[Double]] = {
  // group the flattened matrix values either into rows or into columns
  val direction = if (byRow) mat.numCols else mat.numRows
  mat.toArray.grouped(direction).toArray
}


*3 - After having the mu and the sigma, I can now create an instance of the
Gaussian*

import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

val mg = new MultivariateGaussian(mu, sigma)


4 - Now I can evaluate the PDF for a data point, e.g.

val d3 = mg.pdf(Vectors.dense(629, 640, 1.7188, 618.19))

The model returns zero for every data point.



5 - For validation, I ran a Gaussian implementation in Matlab, and the results
are:

- For the case of the *non covariance* matrix, the two models yield exactly the
same result
- For the case of *covariance*, Matlab yields a good result but MLlib doesn't
(note that I feed the two models with the same input, concretely the same
MU and covariance matrix).


Best, Eyal







>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model  using Spark MLib. 
> As an input, I feed the model with a mean vector and a Covariance matrix. 
> ,assuming my features contain Co-variance.
> Here are my input for the  model ,and the model returns zero for each data 
> point for this input.
> MU vector - 
> 1054.8, 1069.8, 1.3 ,1040.1
> Cov' matrix - 
> 165496.0 , 167996.0,  11.0 , 163037.0  
> 167996.0,  170631.0,  19.0,  165405.0  
> 11.0,   19.0 , 0.0,   2.0   
> 163037.0,   165405.0 2.0 ,  160707.0 
> Conversely,  for the  non covariance case, represented by  this matrix ,the 
> model is working and returns results as expected 
> 165496.0,  0.0 ,   0.0,   0.0 
> 0.0,   170631.0,   0.0,   0.0 
> 0.0 ,   0.0 ,   0.8,   0.0 
> 0.0 ,   0.0,0.0,  160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-10-26 Thread Tamas Szuromi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974381#comment-14974381
 ] 

Tamas Szuromi commented on SPARK-10309:
---

I guess it's the same issue here as well.
{code} 
15/10/26 15:11:33 INFO UnsafeExternalSorter: Thread 4524 spilling sort data of 
64.0 KB to disk (0  time so far)
15/10/26 15:11:33 INFO Executor: Executor is trying to kill task 135.0 in stage 
394.0 (TID 11069)
15/10/26 15:11:33 INFO UnsafeExternalSorter: Thread 4607 spilling sort data of 
64.0 KB to disk (0  time so far)
15/10/26 15:11:33 ERROR Executor: Managed memory leak detected; size = 67108864 
bytes, TID = 11149
15/10/26 15:11:33 ERROR Executor: Exception in task 92.3 in stage 394.0 (TID 
11149)
java.io.IOException: Unable to acquire 67108864 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
at 
org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:83)
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD$$anonfun$tryPrepareParents$1.applyOrElse(ZippedPartitionsRDD.scala:82)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.tryPrepareParents(ZippedPartitionsRDD.scala:82)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code} 

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> *=== Update ===*
> This is caused by a mismatch between 
> `Runtime.getRuntime.availableProcessors()` and the number of active tasks in 
> `ShuffleMemoryManager`. A quick reproduction is the following:
> {code}
> // My machine only has 8 cores
> $ bin/spark-shell --master local[32]
> scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
> scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()
> Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:3

[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974396#comment-14974396
 ] 

Sean Owen commented on SPARK-11302:
---

I understand mu is the mean vector, but what is the vector? I'm trying to 
quickly reproduce this, or rule it out. It's good to share code here, but I 
think even better would be just code that starts with your mu / sigma and shows 
it computing something you believe to be non-zero but isn't. Right now this 
isn't a reproducible test case, but it nearly is.

You show one data point, but what else? What's the correct answer -- is it very 
small (like smaller than the smallest positive 64-bit float)? What about the 
result of logpdf?
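
(For reference, a self-contained check along those lines would look roughly 
like this -- a sketch using {{org.apache.spark.mllib.stat.distribution.MultivariateGaussian}} 
with the mu / sigma and the data point quoted in this ticket:)
{code}
import org.apache.spark.mllib.linalg.{DenseMatrix, Vectors}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

val mu = Vectors.dense(1054.8, 1069.8, 1.3, 1040.1)
// the reported covariance matrix; the 0.0 variance in the third slot next to
// non-zero covariances (11, 19, 2) means it cannot be positive semidefinite
val sigma = new DenseMatrix(4, 4, Array(
  165496.0, 167996.0, 11.0, 163037.0,
  167996.0, 170631.0, 19.0, 165405.0,
  11.0,     19.0,     0.0,  2.0,
  163037.0, 165405.0, 2.0,  160707.0))

val mg = new MultivariateGaussian(mu, sigma)
val x  = Vectors.dense(629.0, 640.0, 1.7188, 618.19)

println(mg.pdf(x))     // reported in this ticket to come out as 0.0
println(mg.logpdf(x))  // worth checking: a pdf underflow would still leave a finite logpdf
{code}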

>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model  using Spark MLib. 
> As an input, I feed the model with a mean vector and a Covariance matrix. 
> ,assuming my features contain Co-variance.
> Here are my input for the  model ,and the model returns zero for each data 
> point for this input.
> MU vector - 
> 1054.8, 1069.8, 1.3 ,1040.1
> Cov' matrix - 
> 165496.0 , 167996.0,  11.0 , 163037.0  
> 167996.0,  170631.0,  19.0,  165405.0  
> 11.0,   19.0 , 0.0,   2.0   
> 163037.0,   165405.0 2.0 ,  160707.0 
> Conversely,  for the  non covariance case, represented by  this matrix ,the 
> model is working and returns results as expected 
> 165496.0,  0.0 ,   0.0,   0.0 
> 0.0,   170631.0,   0.0,   0.0 
> 0.0 ,   0.0 ,   0.8,   0.0 
> 0.0 ,   0.0,0.0,  160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions

2015-10-26 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11317:
--

 Summary: YARN HBase token code shouldn't swallow invocation target 
exceptions
 Key: SPARK-11317
 URL: https://issues.apache.org/jira/browse/SPARK-11317
 Project: Spark
  Issue Type: Bug
Reporter: Steve Loughran


As with SPARK-11265, the HBase token retrieval code of SPARK-6918

1. swallows exceptions it should be rethrowing as serious problems (e.g. 
NoSuchMethodException)
2. swallows any exception raised by the HBase client, without even logging the 
details (it logs that an `InvocationTargetException` was caught, but not its 
contents)

As such it is potentially brittle to changes in the HBase client code, and 
absolutely not going to provide any assistance if HBase won't actually issue 
tokens to the caller.

The code in SPARK-11265 can be re-used to provide consistent and better 
exception processing.
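
A sketch of the kind of handling being asked for (the helper and names here are 
illustrative placeholders, not the actual Spark YARN code):
{code}
import java.lang.reflect.{InvocationTargetException, Method}

// Invoke a reflected method and surface the real failure, rather than logging
// only the fact that an InvocationTargetException was caught.
def invokeOrRethrowCause(m: Method, target: AnyRef, args: AnyRef*): AnyRef = {
  try {
    m.invoke(target, args: _*)
  } catch {
    case e: InvocationTargetException =>
      // the interesting exception is the one thrown by the HBase client, not the wrapper
      throw Option(e.getCause).getOrElse(e)
  }
}
{code}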



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11300) Support for string length when writing to JDBC

2015-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974443#comment-14974443
 ] 

Maciej Bryński commented on SPARK-11300:


Not really. SPARK-10101 is about a situation where there is no TEXT type in the 
DBMS.

I'm talking about a situation where I want to independently specify the size of 
every string column in a DataFrame.
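
(For illustration only -- the sized string type below is the proposal from the 
description, not an existing API in 1.5.x:)
{code}
import org.apache.spark.sql.types._

// Today: every StringType column is written to JDBC as TEXT.
val schema = StructType(Seq(
  StructField("id",   LongType, nullable = false),
  StructField("name", StringType)          // -> TEXT, regardless of actual length
))

// The ask: a per-column length, analogous to DecimalType(precision, scale),
// e.g. a hypothetical StringType(64) that would map to VARCHAR(64).
{code}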




> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType fields are written to JDBC as TEXT.
> I'd like to have option to write it as VARCHAR(size).
> Maybe we could use StringType(size) ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11300) Support for string length when writing to JDBC

2015-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974443#comment-14974443
 ] 

Maciej Bryński edited comment on SPARK-11300 at 10/26/15 3:58 PM:
--

Not really. SPARK-10101 is about a situation where there is no TEXT type in the 
DBMS.

I'm talking about a situation where I want to independently specify the size of 
every string column in a DataFrame.




was (Author: maver1ck):
Not really. SPARK-10101 is about situation where there is no TEXT type in DBMS.

I'm talking about situation where I want to specify independently size of every 
string column in DataFrame.
Something similar to Decimal.



> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType fields are written to JDBC as TEXT.
> I'd like to have option to write it as VARCHAR(size).
> Maybe we could use StringType(size) ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11300) Support for string length when writing to JDBC

2015-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974443#comment-14974443
 ] 

Maciej Bryński edited comment on SPARK-11300 at 10/26/15 3:57 PM:
--

Not really. SPARK-10101 is about a situation where there is no TEXT type in the 
DBMS.

I'm talking about a situation where I want to independently specify the size of 
every string column in a DataFrame.
Something similar to Decimal.




was (Author: maver1ck):
Not really. SPARK-10101 is about situation where there is no TEXT type in DBMS.

I'm talking about situation where I want to specify independently size of every 
string column in DataFrame.




> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType fields are written to JDBC as TEXT.
> I'd like to have option to write it as VARCHAR(size).
> Maybe we could use StringType(size) ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974480#comment-14974480
 ] 

Iulian Dragos commented on SPARK-10986:
---

Digging a bit deeper. The problem is that the context class loader is not set 
when running in fine-grained mode. When the Java serializer is created, it uses 
a {{null}} classloader, leading to {{ClassNotFoundException}} 
({{Class.forName}} with a null classloader uses the primordial class loader, 
meaning only the JDK is in there).

In coarse-grained, the context class loader is set by some hadoop classes 
dealing with {{UserGroupInformation}}, via {{runAsSparkUser}}. It's probably 
totally accidental that it works.

I'm not sure where the right moment is to set the context class loader, or who 
else relies on it. This part is totally undocumented.
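
To illustrate the failure mode (a standalone sketch, not Spark code):
{code}
object ContextLoaderDemo {
  def main(args: Array[String]): Unit = {
    // Class.forName with a null loader falls back to the bootstrap class loader,
    // which only sees JDK classes -- so any org.apache.spark class "disappears".
    val nullLoader: ClassLoader = null
    try {
      Class.forName("org.apache.spark.rpc.netty.AskResponse", true, nullLoader)
    } catch {
      case e: ClassNotFoundException => println(s"bootstrap loader cannot see it: $e")
    }
    // Wherever the right place turns out to be, the fix amounts to something like:
    Thread.currentThread.setContextClassLoader(getClass.getClassLoader)
  }
}
{code}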

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.Abs

[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-26 Thread Fabien COMTE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974481#comment-14974481
 ] 

Fabien COMTE commented on SPARK-11193:
--

Same problem with EMR 4.1.0 and Java 8.
I am using spark-streaming-kinesis-asl_2.10 in version 1.5.0.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974480#comment-14974480
 ] 

Iulian Dragos edited comment on SPARK-10986 at 10/26/15 4:17 PM:
-

Digging a bit deeper. The problem is that the context class loader is not set 
when running in fine-grained mode. When the Java serializer is created, it uses 
a {{null}} classloader, leading to {{ClassNotFoundException}} 
({{Class.forName}} with a null classloader uses the primordial class loader, 
meaning only the JDK is in there).

In coarse-grained, the context class loader is set by some hadoop classes 
dealing with {{UserGroupInformation}}, via {{runAsSparkUser}}. It's probably 
totally accidental that it works.

I'm not sure where the right place is to set the context class loader, or who 
else relies on it. This part is totally undocumented.


was (Author: dragos):
Digging a bit deeper. The problem is that the context class loader is not set 
when running in fine-grained mode. When the Java serializer is created, it uses 
a {{null}} classloader, leading to {{ClassNotFoundException}} 
({{Class.forName}} with a null classloader uses the primordial class loader, 
meaning only the JDK is in there).

In coarse-grained, the context class loader is set by some hadoop classes 
dealing with {{UserGroupInformation}}, via {{runAsSparkUser}}. It's probably 
totally accidental that it works.

I'm not sure where is the right moment to set the context class loader, and who 
else relies on it. This part is totally undocumented.

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandl

[jira] [Commented] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974488#comment-14974488
 ] 

Joseph Wu commented on SPARK-10986:
---

Definitely accidental. I've tried running in coarse-grained mode, and it has 
the same error for me. [~gabriel.hartm...@gmail.com] also tried.

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.Ni

[jira] [Created] (SPARK-11318) [DOC] Include hive profile in make-distribution.sh command

2015-10-26 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11318:
--

 Summary: [DOC] Include hive profile in make-distribution.sh command
 Key: SPARK-11318
 URL: https://issues.apache.org/jira/browse/SPARK-11318
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor


The tgz I built using the current command shown in building-spark.html does not 
contain the datanucleus jars which are included in the "boxed" Spark 
distributions.

The hive profile should be included so that the tarball matches the Spark 
distribution.

See 'Problem with make-distribution.sh' thread on user@ for background.
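
For example, something like the following (a sketch; the exact profiles depend 
on the target Hadoop version):
{code}
./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
{code}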



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11318) [DOC] Include hive profile in make-distribution.sh command

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11318:


Assignee: (was: Apache Spark)

> [DOC] Include hive profile in make-distribution.sh command
> --
>
> Key: SPARK-11318
> URL: https://issues.apache.org/jira/browse/SPARK-11318
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
>
> The tgz I built using the current command shown in building-spark.html does 
> not produce the datanucleus jars which are included in the "boxed" spark 
> distributions.
> hive profile should be included so that the tar ball matches spark 
> distribution.
> See 'Problem with make-distribution.sh' thread on user@ for background.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11318) [DOC] Include hive profile in make-distribution.sh command

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974499#comment-14974499
 ] 

Apache Spark commented on SPARK-11318:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9281

> [DOC] Include hive profile in make-distribution.sh command
> --
>
> Key: SPARK-11318
> URL: https://issues.apache.org/jira/browse/SPARK-11318
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
>
> The tgz I built using the current command shown in building-spark.html does 
> not produce the datanucleus jars which are included in the "boxed" spark 
> distributions.
> hive profile should be included so that the tar ball matches spark 
> distribution.
> See 'Problem with make-distribution.sh' thread on user@ for background.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11318) [DOC] Include hive profile in make-distribution.sh command

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11318:


Assignee: Apache Spark

> [DOC] Include hive profile in make-distribution.sh command
> --
>
> Key: SPARK-11318
> URL: https://issues.apache.org/jira/browse/SPARK-11318
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> The tgz I built using the current command shown in building-spark.html does 
> not produce the datanucleus jars which are included in the "boxed" spark 
> distributions.
> hive profile should be included so that the tar ball matches spark 
> distribution.
> See 'Problem with make-distribution.sh' thread on user@ for background.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-26 Thread Fabien Comte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974481#comment-14974481
 ] 

Fabien Comte edited comment on SPARK-11193 at 10/26/15 4:29 PM:


Same problem with EMR 4.1.0 and Java 8.
I am using spark-streaming-kinesis-asl_2.10 in version 1.5.0.

The ser/de of the receiver fails with the Kryo serializer; everything works 
fine with the default Java serializer.
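
(For reference, a minimal sketch of that workaround -- forcing the default Java 
serializer through configuration:)
{code}
import org.apache.spark.SparkConf

// Workaround sketch: use the default Java serializer instead of Kryo.
val conf = new SparkConf()
  .setAppName("KinesisWordCountASL")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
{code}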


was (Author: comtef):
Same problem with EMR 4.1.0 and Java 8.
I am using spark-streaming-kinesis-asl_2.10 in version 1.5.0.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11283) List column gets additional level of nesting when converted to Spark DataFrame

2015-10-26 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974507#comment-14974507
 ] 

Shivaram Venkataraman commented on SPARK-11283:
---

cc [~sunrui]

> List column gets additional level of nesting when converted to Spark DataFrame
> --
>
> Key: SPARK-11283
> URL: https://issues.apache.org/jira/browse/SPARK-11283
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
> Environment: R 3.2.2, Spark build from master 
> 487d409e71767c76399217a07af8de1bb0da7aa8
>Reporter: Maciej Szymkiewicz
>
> When input data frame contains list column there is an additional level of 
> nesting in a Spark DataFrame and as a result collected data is no longer 
> identical to input:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$x <- list(list(1), list(2))
> sdf <- createDataFrame(sqlContext, ldf)
> printSchema(sdf)
> ## root
> ##  |-- x: array (nullable = true)
> ##  ||-- element: array (containsNull = true)
> ##  |||-- element: double (containsNull = true)
> identical(ldf, collect(sdf))
> ## [1] FALSE
> {code}
> Comparing structure:
> Local df
> {code}
> unclass(ldf)
> ## $x
> ## $x[[1]]
> ## $x[[1]][[1]]
> ## [1] 1
> ##
> ## $x[[2]]
> ## $x[[2]][[1]]
> ## [1] 2
> ##
> ## attr(,"row.names")
> ## [1] 1 2
> {code}
> Collected
> {code}
> unclass(collect(sdf))
> ## $x
> ## $x[[1]]
> ## $x[[1]][[1]]
> ## $x[[1]][[1]][[1]]
> ## [1] 1
> ## 
> ## $x[[2]]
> ## $x[[2]][[1]]
> ## $x[[2]][[1]][[1]]
> ## [1] 2
> ##
> ## attr(,"row.names")
> ## [1] 1 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11300) Support for string length when writing to JDBC

2015-10-26 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974510#comment-14974510
 ] 

Josh Rosen commented on SPARK-11300:


If not an exact duplicate, there's probably some overlap in internal mechanism 
at least. I'll switch the link to "relates to".

> Support for string length when writing to JDBC
> --
>
> Key: SPARK-11300
> URL: https://issues.apache.org/jira/browse/SPARK-11300
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>
> Right now every StringType fields are written to JDBC as TEXT.
> I'd like to have option to write it as VARCHAR(size).
> Maybe we could use StringType(size) ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-26 Thread Rick Hillegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974509#comment-14974509
 ] 

Rick Hillegas commented on SPARK-5966:
--

Should this issue be assigned to Kevin now, since he submitted the pull 
request? Thanks.

> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-10-26 Thread Anna Kepler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974515#comment-14974515
 ] 

Anna Kepler commented on SPARK-5206:


In our case we have a broadcast variable that needs to be accessed in the 
updateStateByKey() method. 
How can we resolve that?
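
(Not a fix for the underlying issue, but the workaround usually suggested for 
broadcasts plus checkpoint recovery is a lazily re-created singleton, sketched 
here with placeholder names; the broadcast is looked up on the driver rather 
than captured in the checkpointed closure:)
{code}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Re-create the broadcast lazily so that it exists again after the streaming
// context is recovered from a checkpoint.
object BlacklistBroadcast {
  @volatile private var instance: Broadcast[Seq[String]] = _

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c"))  // placeholder contents
        }
      }
    }
    instance
  }
}
{code}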

> Accumulators are not re-registered during recovering from checkpoint
> 
>
> Key: SPARK-5206
> URL: https://issues.apache.org/jira/browse/SPARK-5206
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: vincent ye
>
> I got exception as following while my streaming application restarts from 
> crash from checkpoit:
> 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
> scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 
> 4)
> java.util.NoSuchElementException: key not found: 1
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> I guess that an Accumulator is registered to the singleton Accumulators in line 
> 58 of org.apache.spark.Accumulable:
> Accumulators.register(this, true)
> This code needs to be executed once in the driver. But when the application is 
> recovered from a checkpoint, it is not executed in the driver again. So when the 
> driver processes the task result at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938),
> it cannot find the Accumulator, because it was never re-registered during the 
> recovery.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11319) PySpark silently Accepts null values in non-nullable DataFrame fields.

2015-10-26 Thread Kevin Cox (JIRA)
Kevin Cox created SPARK-11319:
-

 Summary: PySpark silently Accepts null values in non-nullable 
DataFrame fields.
 Key: SPARK-11319
 URL: https://issues.apache.org/jira/browse/SPARK-11319
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Kevin Cox


Running the following code with a null value in a non-nullable column silently 
works. This makes the code incredibly hard to trust.

{code}
In [2]: from pyspark.sql.types import *
In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
TimestampType(), False)])).collect()
Out[3]: [Row(a=None)]
{code}
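
Until createDataFrame enforces the declared schema, a driver-side check along these 
lines at least makes the mismatch loud instead of silent (a sketch; the helper name 
is made up):
{code}
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("a", TimestampType(), False)])


def assert_matches_nullability(rows, schema):
    # Reject None in any field the schema declares non-nullable, since
    # createDataFrame currently accepts it without complaint.
    for row in rows:
        for value, field in zip(row, schema.fields):
            if value is None and not field.nullable:
                raise ValueError("null value for non-nullable field %r" % field.name)
    return rows


rows = [(None,)]
# For the rows above this raises ValueError instead of silently yielding Row(a=None).
df = sqlContext.createDataFrame(assert_matches_nullability(rows, schema), schema)
{code}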



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10986:


Assignee: Apache Spark

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>

[jira] [Assigned] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10986:


Assignee: (was: Apache Spark)

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.uti

[jira] [Commented] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974548#comment-14974548
 ] 

Apache Spark commented on SPARK-10986:
--

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/9282

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventL

[jira] [Created] (SPARK-11320) DROP TABLE IF EXISTS throws exception if the table does not exist.

2015-10-26 Thread Alex Liu (JIRA)
Alex Liu created SPARK-11320:


 Summary: DROP TABLE IF EXISTS throws exception if the table does 
not exist.
 Key: SPARK-11320
 URL: https://issues.apache.org/jira/browse/SPARK-11320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Alex Liu
Priority: Minor


DROP TABLE IF EXISTS throws exception if the table does not exist.

{code}
scala> val a = hc.sql("use default")
a: org.apache.spark.sql.DataFrame = [result: string]

scala> val b = hc.sql("drop table if exists nope")
ERROR 2015-10-22 09:25:35 hive.ql.metadata.Hive: 
NoSuchObjectException(message:default.nope table not fo
{code}

It's fixed in https://issues.apache.org/jira/browse/HIVE-8564.

We may want to patch 0.13.x as well.
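
Until a Hive version with that fix is in use, a possible workaround is to consult 
the catalog first instead of relying on IF EXISTS (a sketch in PySpark, assuming 
sqlContext is a HiveContext):
{code}
# Look the table up before dropping it, so the Hive-side NoSuchObjectException
# is never triggered for a missing table.
if "nope" in sqlContext.tableNames("default"):
    sqlContext.sql("DROP TABLE nope")
{code}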



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11321) Allow addition of non-nullable UDFs

2015-10-26 Thread Kevin Cox (JIRA)
Kevin Cox created SPARK-11321:
-

 Summary: Allow addition of non-nullable UDFs
 Key: SPARK-11321
 URL: https://issues.apache.org/jira/browse/SPARK-11321
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Reporter: Kevin Cox
Priority: Minor


It would be really nice if you could create UDFs that have a non-nullable return 
type. This way the schema could continue to match the logical contents of the 
DataFrame.
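
For illustration, this is the limitation as it stands (a sketch; df and the column 
name are made up): the output field of a UDF is always reported as nullable, even 
when the function can never return None.
{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# str_len never returns None, yet the resulting field is still marked nullable.
str_len = udf(lambda s: 0 if s is None else len(s), IntegerType())

df2 = df.withColumn("a_len", str_len(df["a"]))  # df: any DataFrame with a string column "a"
print([f.nullable for f in df2.schema.fields if f.name == "a_len"])
# expected: [True] today; the request is a way to declare it False
{code}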



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11320) DROP TABLE IF EXISTS throws exception if the table does not exist.

2015-10-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974560#comment-14974560
 ] 

Sean Owen commented on SPARK-11320:
---

Does that mean it's a Hive problem then? Spark can already use >= 0.13.0

> DROP TABLE IF EXISTS throws exception if the table does not exist.
> --
>
> Key: SPARK-11320
> URL: https://issues.apache.org/jira/browse/SPARK-11320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
>Priority: Minor
>
> DROP TABLE IF EXISTS throws exception if the table does not exist.
> {code}
> scala> val a = hc.sql("use default")
> a: org.apache.spark.sql.DataFrame = [result: string]
> scala> val b = hc.sql("drop table if exists nope")
> ERROR 2015-10-22 09:25:35 hive.ql.metadata.Hive: 
> NoSuchObjectException(message:default.nope table not fo
> {code}
> It's fixed in https://issues.apache.org/jira/browse/HIVE-8564,
> We may want to patch 0.13.x as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11318) Include hive profile in make-distribution.sh command

2015-10-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11318:
--
   Priority: Trivial  (was: Minor)
Component/s: Documentation
Summary: Include hive profile in make-distribution.sh command  (was: 
[DOC] Include hive profile in make-distribution.sh command)

[~tedyu] I think you've been around long enough to read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark -- 
please pay attention to how you create JIRAs vs PRs

> Include hive profile in make-distribution.sh command
> 
>
> Key: SPARK-11318
> URL: https://issues.apache.org/jira/browse/SPARK-11318
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Ted Yu
>Priority: Trivial
>
> The tgz I built using the current command shown in building-spark.html does 
> not produce the datanucleus jars which are included in the "boxed" spark 
> distributions.
> The hive profile should be included so that the tar ball matches the Spark 
> distribution.
> See 'Problem with make-distribution.sh' thread on user@ for background.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11319) PySpark silently Accepts null values in non-nullable DataFrame fields.

2015-10-26 Thread Kevin Cox (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974579#comment-14974579
 ] 

Kevin Cox commented on SPARK-11319:
---

Furthermore, it appears that some functions are "optimized" based on the 
nullability of the column. For example, it makes the following expression 
incredibly confusing:

{code}
In [29]: df.withColumn('b', df.a.isNull()).collect()
Out[29]: [Row(a=None, b=False)]
{code}
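
In other words, the declared (not the actual) nullability drives the rewrite: since 
"a" is declared non-nullable, IsNull(a) can be folded to a literal false before any 
data is looked at. The plan makes this visible (a quick check with the same df):
{code}
# The optimized logical plan should show the IsNull expression replaced by a
# constant rather than a per-row null check.
df.withColumn('b', df.a.isNull()).explain(True)
{code}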

> PySpark silently Accepts null values in non-nullable DataFrame fields.
> --
>
> Key: SPARK-11319
> URL: https://issues.apache.org/jira/browse/SPARK-11319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Kevin Cox
>
> Running the following code with a null value in a non-nullable column 
> silently works. This makes the code incredibly hard to trust.
> {code}
> In [2]: from pyspark.sql.types import *
> In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", 
> TimestampType(), False)])).collect()
> Out[3]: [Row(a=None)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread eyal sharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974589#comment-14974589
 ] 

eyal sharon commented on SPARK-11302:
-

Cool, I will try, although I hope I fully captured all the questions.

1 - logpdf also returns an unreasonable value: 
mg.logpdf(Vectors.dense(629,640,1.7188,618.19)) = -3891330.078277891 (so the exp is 
zero; see the note after the printed values below).
2 - my MU vector values:
[1055.3910505836575,1070.489299610895,1.39020554474708,1040.5907503867697]
3 - the correct answer, as returned from Matlab for this data point, 
mg.pdf(Vectors.dense(629,640,1.7188,618.19)), is around e-05.
4 - when running the model with the non-covariance (diagonal) matrix, the model 
yields pdf = 7.293362507983666E-11 and logpdf = -23.341471333876257. Again, these 
results match the Matlab model.

5 - these are the printed values from my script:

mu:
[1055.3910505836575, 1070.489299610895, 1.39020554474708, 1040.5907503867697]

sigma (diagonal case):
166769.00466698944  0.0                 0.0                 0.0
0.0                 172041.5670061245   0.0                 0.0
0.0                 0.0                 0.872524191943962   0.0
0.0                 0.0                 0.0                 161848.9196719207

sigmaCov (full covariance):
166769.00466698944  169336.6705268059   12.820670788921873  164243.93314092053
169336.6705268059   172041.5670061245   21.62590020524533   166678.01075856484
12.820670788921873  21.62590020524533   0.872524191943962   4.283255814732373
164243.93314092053  166678.01075856484  4.283255814732373   161848.9196719207
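
One note on point 1: a zero pdf alongside a finite logpdf is exactly what double 
precision gives, since pdf = exp(logpdf) and exp already underflows to 0.0 far above 
-3.9e6 in magnitude. A quick check:
{code}
import math

math.exp(-700)   # ~9.9e-305, still representable as a double
math.exp(-800)   # 0.0: below the smallest positive double (~4.9e-324)
# so exp(-3891330.078...) can only come back as exactly 0.0
{code}
So the number to investigate is the logpdf obtained with the full covariance matrix, 
not the printed zero itself.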


I hope it helps.

Best, Eyal









>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an anomaly detection model using Spark MLlib. 
> As input, I feed the model a mean vector and a covariance matrix, assuming my 
> features are correlated (non-zero covariances).
> Here is my input for the model; the model returns zero for every data point 
> with this input.
> MU vector:
> 1054.8, 1069.8, 1.3, 1040.1
> Covariance matrix:
> 165496.0, 167996.0,  11.0, 163037.0
> 167996.0, 170631.0,  19.0, 165405.0
>     11.0,     19.0,   0.0,      2.0
> 163037.0, 165405.0,   2.0, 160707.0
> Conversely, for the no-covariance (diagonal) case, represented by this matrix, 
> the model works and returns results as expected:
> 165496.0,      0.0,  0.0,      0.0
>      0.0, 170631.0,  0.0,      0.0
>      0.0,      0.0,  0.8,      0.0
>      0.0,      0.0,  0.0, 160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5966:
-
Assignee: kevin yu  (was: Andrew Or)

> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: kevin yu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11320) DROP TABLE IF EXISTS throws exception if the table does not exist.

2015-10-26 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974594#comment-14974594
 ] 

Alex Liu commented on SPARK-11320:
--

It's a Hive problem.

> DROP TABLE IF EXISTS throws exception if the table does not exist.
> --
>
> Key: SPARK-11320
> URL: https://issues.apache.org/jira/browse/SPARK-11320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
>Priority: Minor
>
> DROP TABLE IF EXISTS throws exception if the table does not exist.
> {code}
> scala> val a = hc.sql("use default")
> a: org.apache.spark.sql.DataFrame = [result: string]
> scala> val b = hc.sql("drop table if exists nope")
> ERROR 2015-10-22 09:25:35 hive.ql.metadata.Hive: 
> NoSuchObjectException(message:default.nope table not fo
> {code}
> It's fixed in https://issues.apache.org/jira/browse/HIVE-8564,
> We may want to patch 0.13.x as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11320) DROP TABLE IF EXISTS throws exception if the table does not exist.

2015-10-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11320.
---
Resolution: Not A Problem

OK, then this shouldn't be logged as a Spark issue, since it can already work 
with the fixed version of Hive.

> DROP TABLE IF EXISTS throws exception if the table does not exist.
> --
>
> Key: SPARK-11320
> URL: https://issues.apache.org/jira/browse/SPARK-11320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Alex Liu
>Priority: Minor
>
> DROP TABLE IF EXISTS throws exception if the table does not exist.
> {code}
> scala> val a = hc.sql("use default")
> a: org.apache.spark.sql.DataFrame = [result: string]
> scala> val b = hc.sql("drop table if exists nope")
> ERROR 2015-10-22 09:25:35 hive.ql.metadata.Hive: 
> NoSuchObjectException(message:default.nope table not fo
> {code}
> It's fixed in https://issues.apache.org/jira/browse/HIVE-8564,
> We may want to patch 0.13.x as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11322) Keep full stack track in captured exception in PySpark

2015-10-26 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-11322:
---

 Summary: Keep full stack track in captured exception in PySpark
 Key: SPARK-11322
 URL: https://issues.apache.org/jira/browse/SPARK-11322
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Liang-Chi Hsieh


We should keep full stack trace in captured exception in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11322) Keep full stack track in captured exception in PySpark

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11322:


Assignee: Apache Spark

> Keep full stack track in captured exception in PySpark
> --
>
> Key: SPARK-11322
> URL: https://issues.apache.org/jira/browse/SPARK-11322
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We should keep full stack trace in captured exception in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11323) Add History Service Provider to service application histories from YARN timeline server

2015-10-26 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11323:
--

 Summary: Add History Service Provider to service application 
histories from YARN timeline server
 Key: SPARK-11323
 URL: https://issues.apache.org/jira/browse/SPARK-11323
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.5.1
Reporter: Steve Loughran


Add a {{ApplicationHistoryProvider}} provider for enumerating and viewing 
application histories from the YARN timeline server.

As the provider will only run in a YARN cluster, it can take advantage of the 
Yarn Client API to identify those applications which have terminated without 
explicitly declaring this in their event histories.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11322) Keep full stack track in captured exception in PySpark

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11322:


Assignee: (was: Apache Spark)

> Keep full stack track in captured exception in PySpark
> --
>
> Key: SPARK-11322
> URL: https://issues.apache.org/jira/browse/SPARK-11322
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> We should keep full stack trace in captured exception in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11322) Keep full stack track in captured exception in PySpark

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974616#comment-14974616
 ] 

Apache Spark commented on SPARK-11322:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9283

> Keep full stack track in captured exception in PySpark
> --
>
> Key: SPARK-11322
> URL: https://issues.apache.org/jira/browse/SPARK-11322
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> We should keep full stack trace in captured exception in PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974653#comment-14974653
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

Thanks for the update. I'm on vacation up to Wednesday; I will resume my 
investigation/fix on this next Thursday.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3342) m3 instances don't get local SSDs

2015-10-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974660#comment-14974660
 ] 

Nicholas Chammas commented on SPARK-3342:
-

FWIW, that statement on M3 instances is [no longer 
there|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html], 
so we should be able to drop [this 
logic|https://github.com/apache/spark/blob/07ced43424447699e47106de9ca2fa714756bdeb/ec2/spark_ec2.py#L588-L595]
 in spark-ec2. cc [~shivaram]
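
For context, the logic in question is roughly the following shape (a sketch of the 
boto 2.x block-device mapping that spark-ec2 builds for m3 instances because the 
AMI's own mapping was ignored; the disk count is illustrative):
{code}
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType

num_ephemeral_disks = 2  # e.g. an m3.xlarge has two instance-store SSDs

block_map = BlockDeviceMapping()
for i in range(num_ephemeral_disks):
    dev = BlockDeviceType()
    dev.ephemeral_name = "ephemeral%d" % i           # attach instance-store volume i
    block_map["/dev/sd" + chr(ord("b") + i)] = dev   # /dev/sdb, /dev/sdc, ...

# block_map is then passed to run_instances(..., block_device_map=block_map)
{code}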

> m3 instances don't get local SSDs
> -
>
> Key: SPARK-3342
> URL: https://issues.apache.org/jira/browse/SPARK-3342
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2
>Reporter: Matei Zaharia
>Assignee: Daniel Darabos
> Fix For: 1.1.0
>
>
> As discussed on https://github.com/apache/spark/pull/2081, these instances 
> ignore the block device mapping on the AMI and require ephemeral drives to be 
> added programmatically when launching them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974673#comment-14974673
 ] 

Iulian Dragos commented on SPARK-10986:
---

[~kaysoky] can you try my PR and see if it solves it for you?

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.c

[jira] [Commented] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974763#comment-14974763
 ] 

Joseph Wu commented on SPARK-10986:
---

It works!  I'll also notify Gabriel to try his more complex/interesting test 
case.  (He's currently OOO until next week.)

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSele

[jira] [Commented] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-26 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974777#comment-14974777
 ] 

Iulian Dragos commented on SPARK-10986:
---

Great! Please leave a note on the PR as well. It's been particularly difficult 
to get attention from committers lately.

> ClassNotFoundException when running on Client mode, with a Mesos master.
> 
>
> Key: SPARK-10986
> URL: https://issues.apache.org/jira/browse/SPARK-10986
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
> Environment: OSX, Java 8, Mesos 0.25.0
> HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
> Built from source:
> build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>Reporter: Joseph Wu
>Priority: Blocker
>  Labels: mesos, spark
>
> When running an example task on a Mesos cluster (local master, local agent), 
> any Spark tasks will stall with the following error (in the executor's 
> stderr):
> Works fine in coarse-grained mode, only fails in *fine-grained mode*.
> {code}
> 15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 53689.
> 15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
> /10.0.79.8:53673
> java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
>   at 
> org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.proc

[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-10-26 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974793#comment-14974793
 ] 

shane knapp commented on SPARK-11255:
-

[~shivaram] and i are looking into this right now.  backporting the jenkins 
workers will be a project, with (potentially) a couple of days of downtime as 
my staging instance is gone and i'll need to test on the live jenkins (don't 
ask -- datacenter fire a month and a half ago).

if we do go w/3.1.1, here's what we'll need to do:
* wipe all vestiges of R and friends from the jenkins workers
* compile 3.1.1 from source, create an rpm and dist to all workers
* reinstall all R packages (to ensure they work w/3.1.1)
* test
* fix
* repeat N times
* profit



> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Test should run on R 3.1.1 which is the version listed as supported.
> Apparently there are few R changes that can go undetected since Jenkins Test 
> build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-10-26 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974815#comment-14974815
 ] 

shane knapp commented on SPARK-11255:
-

also, IFF we roll back to 3.1.1, then during rollback/deployment/testing we will 
need to do something about all of the rest of the spark builds, as they'll fail 
the sparkR section.

[~joshrosen] for a heads up.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Test should run on R 3.1.1 which is the version listed as supported.
> Apparently there are few R changes that can go undetected since Jenkins Test 
> build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7106) Support model save/load in Python's FPGrowth

2015-10-26 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973525#comment-14973525
 ] 

Kai Jiang edited comment on SPARK-7106 at 10/26/15 7:36 PM:


I would like to take this one once spark-6724 is done.


was (Author: vectorijk):
I would like to do this one after spark-6724 is done.

> Support model save/load in Python's FPGrowth
> 
>
> Key: SPARK-7106
> URL: https://issues.apache.org/jira/browse/SPARK-7106
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions

2015-10-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974905#comment-14974905
 ] 

Steve Loughran commented on SPARK-11317:


I'll do this as soon as SPARK-11265 is in; I've factored the code there to make 
it straightforward.

This is actually a very important patch, because the current code *will not log 
any authentication problems*. All you get is an "invocation target exception" 
message in the log, which isn't enough to fix things.
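
For illustration only, a minimal sketch of the kind of handling meant here 
(the helper name is hypothetical; the real fix reuses the SPARK-11265 code): 
rethrow the cause of the InvocationTargetException instead of swallowing it.

{code}
// Illustrative sketch only, not the SPARK-11265 patch: surface the real
// HBase/authentication failure wrapped inside the InvocationTargetException.
import java.lang.reflect.{InvocationTargetException, Method}

def invokeAndUnwrap(method: Method, target: AnyRef, args: AnyRef*): AnyRef = {
  try {
    method.invoke(target, args: _*)
  } catch {
    case e: InvocationTargetException =>
      // rethrow the underlying exception so its message and stack are logged,
      // rather than only the reflective wrapper
      throw Option(e.getCause).getOrElse(e)
  }
}
{code}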

> YARN HBase token code shouldn't swallow invocation target exceptions
> 
>
> Key: SPARK-11317
> URL: https://issues.apache.org/jira/browse/SPARK-11317
> Project: Spark
>  Issue Type: Bug
>Reporter: Steve Loughran
>
> As with SPARK-11265, the HBase token retrieval code of SPARK-6918:
> 1. swallows exceptions it should be rethrowing as serious problems (e.g. 
> NoSuchMethodException)
> 1. swallows any exception raised by the HBase client, without even logging 
> the details (it logs that an `InvocationTargetException` was caught, but not 
> its contents)
> As such it is potentially brittle to changes in the HDFS client code, and 
> absolutely not going to provide any assistance if HBase won't actually issue 
> tokens to the caller.
> The code in SPARK-11265 can be re-used to provide consistent and better 
> exception processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4751) Support dynamic allocation for standalone mode

2015-10-26 Thread Matthias Niehoff (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974119#comment-14974119
 ] 

Matthias Niehoff edited comment on SPARK-4751 at 10/26/15 8:05 PM:
---

The PR is merged, but the documentation at 
https://spark.apache.org/docs/1.5.1/job-scheduling.html still says:
"This feature is currently disabled by default and available only on YARN."

Is the documentation just outdated or is it not yet available in 1.5.x?


was (Author: j4nu5):
The PR is merged, but the documentation at 
https://spark.apache.org/docs/1.5.1/job-scheduling.html still says:
"This feature is currently disabled by default and available only on YARN."

Is the documentation just outdated or is not yet available in 1.5.x?

> Support dynamic allocation for standalone mode
> --
>
> Key: SPARK-4751
> URL: https://issues.apache.org/jira/browse/SPARK-4751
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> This is equivalent to SPARK-3822 but for standalone mode.
> This is actually a very tricky issue because the scheduling mechanism in the 
> standalone Master uses different semantics. In standalone mode we allocate 
> resources based on cores. By default, an application will grab all the cores 
> in the cluster unless "spark.cores.max" is specified. Unfortunately, this 
> means an application could get executors of different sizes (in terms of 
> cores) if:
> 1) App 1 kills an executor
> 2) App 2, with "spark.cores.max" set, grabs a subset of cores on a worker
> 3) App 1 requests an executor
> In this case, the new executor that App 1 gets back will be smaller than the 
> rest and can execute fewer tasks in parallel. Further, standalone mode is 
> subject to the constraint that only one executor can be allocated on each 
> worker per application. As a result, it is rather meaningless to request new 
> executors if the existing ones are already spread out across all nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4751) Support dynamic allocation for standalone mode

2015-10-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974961#comment-14974961
 ] 

Andrew Or commented on SPARK-4751:
--

It is available; sorry, we will update the documentation soon to reflect this.
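
Until the docs are updated, a hedged sketch of enabling this on a standalone 
cluster in 1.5.x (the master URL, app name, and core cap below are placeholders, 
and it assumes the external shuffle service is running on each worker):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration only; adjust the placeholders for your cluster.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true") // turn the feature on
  .set("spark.shuffle.service.enabled", "true")   // keep shuffle files usable after executor removal
  .set("spark.cores.max", "48")                   // cap total cores for this app
val sc = new SparkContext(conf)
{code}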

> Support dynamic allocation for standalone mode
> --
>
> Key: SPARK-4751
> URL: https://issues.apache.org/jira/browse/SPARK-4751
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.5.0
>
>
> This is equivalent to SPARK-3822 but for standalone mode.
> This is actually a very tricky issue because the scheduling mechanism in the 
> standalone Master uses different semantics. In standalone mode we allocate 
> resources based on cores. By default, an application will grab all the cores 
> in the cluster unless "spark.cores.max" is specified. Unfortunately, this 
> means an application could get executors of different sizes (in terms of 
> cores) if:
> 1) App 1 kills an executor
> 2) App 2, with "spark.cores.max" set, grabs a subset of cores on a worker
> 3) App 1 requests an executor
> In this case, the new executor that App 1 gets back will be smaller than the 
> rest and can execute fewer tasks in parallel. Further, standalone mode is 
> subject to the constraint that only one executor can be allocated on each 
> worker per application. As a result, it is rather meaningless to request new 
> executors if the existing ones are already spread out across all nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix return zero always

2015-10-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974982#comment-14974982
 ] 

Sean Owen commented on SPARK-11302:
---

OK I reproduced all of this, thank you. This is roughly the code you can use to 
see the very large logpdf for this value:

{code}
import breeze.linalg.{DenseMatrix => BDM, Matrix => BM, DenseVector => BDV, SparseVector => BSV, Vector => BV, diag, max, eigSym}

val breezeMu = new BDV(Array(1055.3910505836575, 1070.489299610895, 1.39020554474708, 1040.5907503867697))

val breezeSigma = new BDM(4, 4, Array(
  166769.00466698944, 169336.6705268059, 12.820670788921873, 164243.93314092053,
  169336.6705268059, 172041.5670061245, 21.62590020524533, 166678.01075856484,
  12.820670788921873, 21.62590020524533, 0.872524191943962, 4.283255814732373,
  164243.93314092053, 166678.01075856484, 4.283255814732373, 161848.9196719207))

val EPSILON = {
  var eps = 1.0
  while ((1.0 + (eps / 2.0)) != 1.0) {
    eps /= 2.0
  }
  eps
}

val eigSym.EigSym(d, u2) = eigSym(breezeSigma)
val tol = EPSILON * max(d) * d.length
val logPseudoDetSigma = d.activeValuesIterator.filter(_ > tol).map(math.log).sum
val pinvS = diag(new BDV(d.map(v => if (v > tol) math.sqrt(1.0 / v) else 0.0).toArray))

val (rootSigmaInv: BDM[Double], u: Double) =
  (pinvS * u2, -0.5 * (breezeMu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma))

val x = new BDV(Array(629, 640, 1.7188, 618.19))

val delta = x - breezeMu
val v = rootSigmaInv * delta
u + v.t * v * -0.5
{code}

The problem is the clever trick used to compute delta' * inv(sigma) * delta: it 
is evaluated as v' * v where v = rootSigmaInv * delta, i.e. sigma^-1/2 is applied 
to delta and the result is squared. That square-root factor loses too much 
precision in a case like this.

I think it's pretty easy to avoid entirely. There's no great reason not to 
return u and inv(sigma) directly and compute this in the straightforward way.
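
For reference, a rough sketch of that more direct computation, reusing the 
values defined in the snippet above (an illustration only, not the eventual fix):

{code}
// Sketch only, reusing d, u2, tol, logPseudoDetSigma, breezeMu and x from above:
// form the pseudo-inverse of sigma and compute delta' * pinv(sigma) * delta
// directly, without the square-root factorization.
val pinvD = new BDV(d.map(ev => if (ev > tol) 1.0 / ev else 0.0).toArray)
val pinvSigma = u2 * diag(pinvD) * u2.t          // pseudo-inverse of sigma
val uConst = -0.5 * (breezeMu.size * math.log(2.0 * math.Pi) + logPseudoDetSigma)
val delta2 = x - breezeMu
val logpdf = uConst - 0.5 * (delta2.t * (pinvSigma * delta2))
{code}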

>  Multivariate Gaussian Model with Covariance  matrix return zero always 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an Anomaly Detection model using Spark MLlib. 
> As an input, I feed the model with a mean vector and a covariance matrix, 
> assuming my features contain covariance.
> Here are my inputs for the model; the model returns zero for every data 
> point with this input.
> MU vector: 
> 1054.8, 1069.8, 1.3, 1040.1
> Covariance matrix: 
> 165496.0, 167996.0,   11.0, 163037.0  
> 167996.0, 170631.0,   19.0, 165405.0  
>     11.0,     19.0,    0.0,      2.0   
> 163037.0, 165405.0,    2.0, 160707.0 
> Conversely, for the non-covariance case, represented by this diagonal matrix, 
> the model works and returns results as expected: 
> 165496.0,      0.0,    0.0,      0.0 
>      0.0, 170631.0,    0.0,      0.0 
>      0.0,      0.0,    0.8,      0.0 
>      0.0,      0.0,    0.0, 160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11324) Flag to close Write Ahead Log after writing

2015-10-26 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-11324:
---

 Summary: Flag to close Write Ahead Log after writing
 Key: SPARK-11324
 URL: https://issues.apache.org/jira/browse/SPARK-11324
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Burak Yavuz


Currently the Write Ahead Log in Spark Streaming flushes data as writes need to 
be made. S3 does not support flushing of data; data is only written once the 
stream is actually closed. 

In case of failure, the data for the last minute (the default rolling interval) 
will not be properly written. Therefore we need a flag to close the stream 
after the write, so that we achieve read-after-write consistency.
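
A hypothetical sketch of the idea (not Spark's WriteAheadLog API; the helper 
name and file layout are illustrative only):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper, for illustration only. With closeAfterWrite = true the
// segment is closed right after the write, which object stores like S3 need
// before the data becomes readable; on HDFS an hflush() already suffices.
def writeSegment(dir: String, name: String, data: Array[Byte],
                 closeAfterWrite: Boolean): Unit = {
  val path = new Path(dir, name)
  val out = path.getFileSystem(new Configuration()).create(path)
  out.write(data)
  if (closeAfterWrite) {
    out.close()    // object stores: data is persisted only on close
  } else {
    out.hflush()   // HDFS: flushed bytes are visible without closing
    out.close()    // closed here only to keep this sketch self-contained
  }
}
{code}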



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11324) Flag to close Write Ahead Log after writing

2015-10-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974993#comment-14974993
 ] 

Apache Spark commented on SPARK-11324:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/9285

> Flag to close Write Ahead Log after writing
> ---
>
> Key: SPARK-11324
> URL: https://issues.apache.org/jira/browse/SPARK-11324
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> Currently the Write Ahead Log in Spark Streaming flushes data as writes need 
> to be made. S3 does not support flushing of data; data is only written once 
> the stream is actually closed. 
> In case of failure, the data for the last minute (the default rolling interval) 
> will not be properly written. Therefore we need a flag to close the stream 
> after the write, so that we achieve read-after-write consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11324) Flag to close Write Ahead Log after writing

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11324:


Assignee: Apache Spark

> Flag to close Write Ahead Log after writing
> ---
>
> Key: SPARK-11324
> URL: https://issues.apache.org/jira/browse/SPARK-11324
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Currently the Write Ahead Log in Spark Streaming flushes data as writes need 
> to be made. S3 does not support flushing of data; data is only written once 
> the stream is actually closed. 
> In case of failure, the data for the last minute (the default rolling interval) 
> will not be properly written. Therefore we need a flag to close the stream 
> after the write, so that we achieve read-after-write consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11324) Flag to close Write Ahead Log after writing

2015-10-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11324:


Assignee: (was: Apache Spark)

> Flag to close Write Ahead Log after writing
> ---
>
> Key: SPARK-11324
> URL: https://issues.apache.org/jira/browse/SPARK-11324
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Burak Yavuz
>
> Currently the Write Ahead Log in Spark Streaming flushes data as writes need 
> to be made. S3 does not support flushing of data; data is only written once 
> the stream is actually closed. 
> In case of failure, the data for the last minute (the default rolling interval) 
> will not be properly written. Therefore we need a flag to close the stream 
> after the write, so that we achieve read-after-write consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11316) isEmpty before coalesce seems to cause huge performance issue in setupGroups

2015-10-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-11316:
--
Description: 
So I haven't fully debugged this yet, but I'm reporting what I'm seeing and what 
I think might be going on.

I have a graph processing job that is seeing a huge slowdown in setupGroups in 
the location iterator, where it's getting the preferred locations for the 
coalesce. It is coalescing from 2400 partitions down to 1200 and taking 17+ 
hours to do the calculation. I killed it at that point, so I don't know the 
total time.

It appears that the job is doing an isEmpty call, a bunch of other 
transformations, then a coalesce (which is where it takes so long), more 
transformations, then finally a count to trigger it.

It appears that there is only one node that it's finding in the setupGroups 
call, and to get to that node it first has to go through the while loop:

while (numCreated < targetLen && tries < expectedCoupons2) {

where expectedCoupons2 is around 19000. It finds very few or none in this 
loop.

Then it does the second loop:

while (numCreated < targetLen) {  // if we don't have enough partition groups, create duplicates
  var (nxt_replica, nxt_part) = rotIt.next()
  val pgroup = PartitionGroup(nxt_replica)
  groupArr += pgroup
  groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
  var tries = 0
  while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // ensure at least one part
    nxt_part = rotIt.next()._2
    tries += 1
  }
  numCreated += 1
}

It has an inner while loop, and both of those loops run 1200 times: 1200*1200 
iterations. This takes a very long time.

The user can work around the issue by adding a count() call right after the 
isEmpty call, before the coalesce is called. I also tried putting a take(1) 
right before the isEmpty call and it seems to work around the issue, but it 
took 1 hour with the take vs. a few minutes with the count().

  was:
So I haven't fully debugged this yet but reporting what I'm seeing and think 
might be going on.

I have a graph processing job that is seeing huge slow down in setupGroups in 
the location iterator where its getting the preferred locations for the 
coalesce.  They are coalescing from 2400 down to 1200 and its taking 17+ hours 
to do the calculation.  Killed it at this point so don't know total time.

It appears that the job is doing an isEmpty call, a bunch of other 
transformation, then a coalesce (where it takes so long), other 
transformations, then finally a count to trigger it.   

It appears that there is only one node that its finding in the setupGroup call 
and to get to that node it has to first to through the while loop:

while (numCreated < targetLen && tries < expectedCoupons2) {
where expectedCoupons2 is around 19000.  It finds very few or none in this 
loop.  

Then it does the second loop:

while (numCreated < targetLen) {  // if we don't have enough partition groups, 
create duplicates
  var (nxt_replica, nxt_part) = rotIt.next()
  val pgroup = PartitionGroup(nxt_replica)
  groupArr += pgroup
  groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
  var tries = 0
  while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
ensure at least one part
nxt_part = rotIt.next()._2
tries += 1
  }
  numCreated += 1
}

Where it has an inner while loop and both of those are going 1200 times.  
1200*1200 loops.  This is taking a very long time.

The user can work around the issue by adding in a count() call very close to 
after the isEmpty call before the coalesce is called.  I also tried putting in 
a take(1)  right before the isEmpty call and it seems to work around the 
issue.
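
A hedged sketch of the count() workaround mentioned in the description above 
(toy data standing in for the actual graph job; assumes a spark-shell session 
where sc is already defined):

{code}
// Workaround sketch only: add a count() between the isEmpty call and the
// coalesce, which the reporter found avoids the multi-hour setupGroups slowdown.
val processed = sc.parallelize(1 to 1000000, 2400).map(_ * 2).cache()
if (!processed.isEmpty()) {
  processed.count()                          // the cheap extra action
  val coalesced = processed.coalesce(1200)
  // ... further transformations would go here ...
  println(coalesced.count())                 // the real trigger
}
{code}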


> isEmpty before coalesce seems to cause huge performance issue in setupGroups
> 
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Critical
>
> So I haven't fully debugged this yet but reporting what I'm seeing and think 
> might be going on.
> I have a graph processing job that is seeing huge slow down in setupGroups in 
> the location iterator where its getting the preferred locations for the 
> coalesce.  They are coalescing from 2400 down to 1200 and its taking 17+ 
> hours to do the calculation.  Killed it at this point so don't know total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformation, then a coalesce (where it takes so long), other 
> transformations, then finally a count to trigger it.   
> It appears that there is only one node that its

[jira] [Created] (SPARK-11325) Alias alias in Scala's DataFrame to as to match python

2015-10-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11325:


 Summary: Alias alias in Scala's DataFrame to as to match python
 Key: SPARK-11325
 URL: https://issues.apache.org/jira/browse/SPARK-11325
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11325) Alias alias in Scala's DataFrame to as to match python

2015-10-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975145#comment-14975145
 ] 

Yin Huai edited comment on SPARK-11325 at 10/26/15 9:38 PM:


[~nongli]


was (Author: yhuai):
@nongli

> Alias alias in Scala's DataFrame to as to match python
> --
>
> Key: SPARK-11325
> URL: https://issues.apache.org/jira/browse/SPARK-11325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11325) Alias alias in Scala's DataFrame to as to match python

2015-10-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975145#comment-14975145
 ] 

Yin Huai commented on SPARK-11325:
--

@nongli

> Alias alias in Scala's DataFrame to as to match python
> --
>
> Key: SPARK-11325
> URL: https://issues.apache.org/jira/browse/SPARK-11325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


