[jira] [Updated] (SPARK-3997) scalastyle does not output the error location

2014-10-17 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3997:
---
Description: 
{{./dev/scalastyle}} =>
{noformat}
Scalastyle checks failed at following occurrences:
java.lang.RuntimeException: exists error
at scala.sys.package$.error(package.scala:27)
at scala.Predef$.error(Predef.scala:142)
[error] (mllib/*:scalastyle) exists error
{noformat}



  was:
{{./dev/scalastyle }} =>
{noformat}
Scalastyle checks failed at following occurrences:
java.lang.RuntimeException: exists error
at scala.sys.package$.error(package.scala:27)
at scala.Predef$.error(Predef.scala:142)
[error] (mllib/*:scalastyle) exists error
{noformat}




> scalastyle does not output the error location
> -
>
> Key: SPARK-3997
> URL: https://issues.apache.org/jira/browse/SPARK-3997
> Project: Spark
>  Issue Type: Bug
>Reporter: Guoqiang Li
>
> {{./dev/scalastyle}} =>
> {noformat}
> Scalastyle checks failed at following occurrences:
> java.lang.RuntimeException: exists error
>   at scala.sys.package$.error(package.scala:27)
>   at scala.Predef$.error(Predef.scala:142)
> [error] (mllib/*:scalastyle) exists error
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3997) scalastyle does not output the error location

2014-10-17 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-3997:
--

 Summary: scalastyle does not output the error location
 Key: SPARK-3997
 URL: https://issues.apache.org/jira/browse/SPARK-3997
 Project: Spark
  Issue Type: Bug
Reporter: Guoqiang Li


{noformat}
Scalastyle checks failed at following occurrences:
java.lang.RuntimeException: exists error
at scala.sys.package$.error(package.scala:27)
at scala.Predef$.error(Predef.scala:142)
[error] (mllib/*:scalastyle) exists error
{noformat}








[jira] [Updated] (SPARK-3997) scalastyle does not output the error location

2014-10-17 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3997:
---
Description: 
{{./dev/scalastyle }} =>
{noformat}
Scalastyle checks failed at following occurrences:
java.lang.RuntimeException: exists error
at scala.sys.package$.error(package.scala:27)
at scala.Predef$.error(Predef.scala:142)
[error] (mllib/*:scalastyle) exists error
{noformat}



  was:
{noformat}
Scalastyle checks failed at following occurrences:
java.lang.RuntimeException: exists error
at scala.sys.package$.error(package.scala:27)
at scala.Predef$.error(Predef.scala:142)
[error] (mllib/*:scalastyle) exists error
{noformat}




> scalastyle does not output the error location
> -
>
> Key: SPARK-3997
> URL: https://issues.apache.org/jira/browse/SPARK-3997
> Project: Spark
>  Issue Type: Bug
>Reporter: Guoqiang Li
>
> {{./dev/scalastyle }} =>
> {noformat}
> Scalastyle checks failed at following occurrences:
> java.lang.RuntimeException: exists error
>   at scala.sys.package$.error(package.scala:27)
>   at scala.Predef$.error(Predef.scala:142)
> [error] (mllib/*:scalastyle) exists error
> {noformat}






[jira] [Commented] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization

2014-10-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175838#comment-14175838
 ] 

Reynold Xin commented on SPARK-3135:


Is the memory usage lower on the master branch vs 1.1? If yes, we can certainly 
backport it. This is a medium-sized fix.

> Avoid memory copy in TorrentBroadcast serialization
> ---
>
> Key: SPARK-3135
> URL: https://issues.apache.org/jira/browse/SPARK-3135
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: starter
> Fix For: 1.2.0
>
>
> TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize 
> broadcast object into a single giant byte array, and then separates it into 
> smaller chunks.  We should implement a new OutputStream that writes 
> serialized bytes directly into chunks of byte arrays so we don't need the 
> extra memory copy.
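The chunked-stream idea above can be sketched as follows. This is a Python illustration only (Spark's actual fix is in Scala, and the class and method names here are invented for the sketch): bytes are written directly into fixed-size chunks, so no single giant buffer is ever allocated.

```python
class ChunkedOutputStream:
    """Accumulate written bytes into fixed-size chunks instead of one
    giant byte array (illustrative sketch, not Spark's actual class)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        pos = 0
        while pos < len(data):
            cur = self.chunks[-1]
            room = self.chunk_size - len(cur)
            if room == 0:
                # Current chunk is full: start a new one.
                self.chunks.append(bytearray())
                continue
            take = min(room, len(data) - pos)
            cur.extend(data[pos:pos + take])
            pos += take

    def to_byte_chunks(self):
        return [bytes(c) for c in self.chunks]
```

Writing 11 bytes with a chunk size of 4 yields three chunks that concatenate back to the original payload, with no intermediate full-size copy.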






[jira] [Commented] (SPARK-3822) Expose a mechanism for SparkContext to ask for / remove Yarn containers

2014-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175811#comment-14175811
 ] 

Apache Spark commented on SPARK-3822:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2840

> Expose a mechanism for SparkContext to ask for / remove Yarn containers
> ---
>
> Key: SPARK-3822
> URL: https://issues.apache.org/jira/browse/SPARK-3822
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is one of the core components for the umbrella issue SPARK-3174. 
> Currently, the only agent in Spark that communicates directly with the RM is 
> the AM. This means the only way for the SparkContext to ask for / remove 
> containers from the RM is through the AM. The communication link between the 
> SparkContext and the AM needs to be added.






[jira] [Updated] (SPARK-3822) Expose a mechanism for SparkContext to ask for / remove Yarn containers

2014-10-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3822:
-
Affects Version/s: 1.1.0

> Expose a mechanism for SparkContext to ask for / remove Yarn containers
> ---
>
> Key: SPARK-3822
> URL: https://issues.apache.org/jira/browse/SPARK-3822
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> This is one of the core components for the umbrella issue SPARK-3174. 
> Currently, the only agent in Spark that communicates directly with the RM is 
> the AM. This means the only way for the SparkContext to ask for / remove 
> containers from the RM is through the AM. The communication link between the 
> SparkContext and the AM needs to be added.






[jira] [Updated] (SPARK-3888) Limit the memory used by python worker

2014-10-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3888:
-
Affects Version/s: 1.1.0

> Limit the memory used by python worker
> --
>
> Key: SPARK-3888
> URL: https://issues.apache.org/jira/browse/SPARK-3888
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we do not limit the memory used by Python workers, so they may run 
> out of memory and freeze the OS. It would be safe to have a configurable hard 
> limit, which should be larger than spark.executor.python.memory.
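One plausible enforcement mechanism for such a hard limit, on Unix, is an address-space rlimit applied inside the worker process. The helper below is hypothetical (the ticket proposes a Spark configuration option, not this exact call), but it shows the pattern:

```python
import resource

def limit_worker_memory(limit_mb):
    """Cap this process's address space at limit_mb megabytes.
    Hypothetical helper: RLIMIT_AS is one possible way a Python worker
    could enforce the configurable hard limit the ticket asks for."""
    soft = limit_mb * 1024 * 1024
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # Cannot raise the soft limit above the hard limit.
        soft = min(soft, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    return soft
```

Once set, allocations beyond the limit fail with MemoryError inside the worker rather than exhausting the machine.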






[jira] [Reopened] (SPARK-3935) Unused variable in PairRDDFunctions.scala

2014-10-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-3935:
--
  Assignee: wangfei

> Unused variable in PairRDDFunctions.scala
> -
>
> Key: SPARK-3935
> URL: https://issues.apache.org/jira/browse/SPARK-3935
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
>Assignee: wangfei
>Priority: Minor
> Fix For: 1.2.0
>
>
> There is an unused variable (count) in the saveAsHadoopDataset function in 
> PairRDDFunctions.scala. 
> It would be better to add a log statement recording the number of output records. 






[jira] [Closed] (SPARK-3935) Unused variable in PairRDDFunctions.scala

2014-10-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3935.

Resolution: Fixed

> Unused variable in PairRDDFunctions.scala
> -
>
> Key: SPARK-3935
> URL: https://issues.apache.org/jira/browse/SPARK-3935
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
>Assignee: wangfei
>Priority: Minor
> Fix For: 1.2.0
>
>
> There is an unused variable (count) in the saveAsHadoopDataset function in 
> PairRDDFunctions.scala. 
> It would be better to add a log statement recording the number of output records. 






[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark

2014-10-17 Thread Sung Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175752#comment-14175752
 ] 

Sung Chung commented on SPARK-1673:
---

This is eventually going to be taken care of by the OWLQN addition. 
https://issues.apache.org/jira/browse/SPARK-1892

> GLMNET implementation in Spark
> --
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.






[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training

2014-10-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3161:
-
Assignee: Sung Chung

> Cache example-node map for DecisionTree training
> 
>
> Key: SPARK-3161
> URL: https://issues.apache.org/jira/browse/SPARK-3161
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Sung Chung
>
> Improvement: worker computation
> When training each level of a DecisionTree, each example needs to be mapped 
> to a node in the current level (or to none if it does not reach that level).  
> This is currently done via the function predictNodeIndex(), which traces from 
> the current tree’s root node to the given level.
> Proposal: Cache this mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: Extra RDD which must share the same partitioning as the training data.
> Design:
> * (option 1) This could be done as in [Sequoia Forests | 
> https://github.com/AlpineNow/SparkML2] where each instance is stored with an 
> array of node indices (1 node per tree).
> * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, 
> Array\[TreePoint\]\]\]\], where each partition stores an array of maps from 
> node indices to an array of instances.  This has more overhead in data 
> structures but could be more efficient: not all nodes are split on each 
> iteration, and this would allow each executor to ignore instances which are 
> not used for the current node set.
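The O(1)-vs-O(level) lookup described above can be sketched as follows, in Python for illustration (names and the binary heap-style node numbering are invented for the sketch, not Spark's actual API): each example carries a cached node index that is advanced one level per iteration, instead of being re-traced from the root.

```python
def update_node_indices(node_index, features, splits):
    """Advance each example's cached node index by one level.

    node_index: cached node index per example (option 1 in the design above)
    features:   one feature vector per example
    splits:     {node_index: (feature_id, threshold)} for nodes split this
                iteration; nodes absent from the map are leaves
    Children use heap numbering: left = 2*i + 1, right = 2*i + 2.
    """
    new_indices = []
    for idx, x in zip(node_index, features):
        split = splits.get(idx)
        if split is None:
            new_indices.append(idx)          # leaf: example stays put
        else:
            feat, thresh = split             # continuous binary split
            new_indices.append(2 * idx + 1 if x[feat] <= thresh
                               else 2 * idx + 2)
    return new_indices
```

Per iteration this is a single O(1) update per example; the cost is the extra cached array, which must stay co-partitioned with the training data, as the con above notes.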






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2014-10-17 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175750#comment-14175750
 ] 

Andrew Ash commented on SPARK-2243:
---

Found a couple potentially-related tickets; I'm going to link these as blocking 
this one:

SPARK-534
SPARK-3148

> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction on using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another for Spark Streaming jobs. The following error arises when we first 
> execute a Spark calculation and, once that execution finishes, launch a 
> Spark Streaming job:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInp

[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175748#comment-14175748
 ] 

Sean Owen commented on SPARK-3967:
--

You guys should make PRs for these. I am also not sure it's necessary to 
download the file into a temp directory and then move it: the move may cause a 
copy instead of a rename (and in fact does here), so the file does not appear 
in the target dir atomically anyway. I'm also not sure the code here cleans up 
the partially downloaded file on error, which could leave a broken file in the 
target dir instead of just a temp dir.

The change to skip copying the file when it is identical looks sound; I bet 
you can avoid checking whether it exists twice.
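The atomic-download pattern under discussion can be sketched as follows, in Python for illustration (the function and parameter names are invented, not Spark's Utils API): write to a temporary file in the same directory as the destination, so the final rename stays on one filesystem and is atomic, and remove the partial file on error.

```python
import os
import tempfile

def fetch_atomically(write_fn, dest_path):
    """Download via write_fn(file_obj) and publish dest_path atomically.

    The temp file lives in the SAME directory as dest_path, so os.replace
    is a cheap same-filesystem rename (atomic on POSIX) rather than a
    cross-partition copy. On any error the partial file is cleaned up,
    so no broken file ever appears in the target dir.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
        os.replace(tmp_path, dest_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of dest_path either see the old state or the complete new file, never a partial download.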

> Spark applications fail in yarn-cluster mode when the directories configured 
> in yarn.nodemanager.local-dirs are located on different disks/partitions
> -
>
> Key: SPARK-3967
> URL: https://issues.apache.org/jira/browse/SPARK-3967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Christophe PRÉAUD
> Attachments: spark-1.1.0-utils-fetch.patch, 
> spark-1.1.0-yarn_cluster_tmpdir.patch
>
>
> Spark applications fail from time to time in yarn-cluster mode (but not in 
> yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
> set to a comma-separated list of directories which are located on different 
> disks/partitions.
> Steps to reproduce:
> 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
> directories located on different partitions (the more you set, the more 
> likely it will be to reproduce the bug):
> (...)
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
> </property>
> (...)
> 2. Launch an application in yarn-cluster mode several times; it will fail 
> (apparently randomly) from time to time






[jira] [Created] (SPARK-3996) Shade Jetty in Spark deliverables

2014-10-17 Thread Mingyu Kim (JIRA)
Mingyu Kim created SPARK-3996:
-

 Summary: Shade Jetty in Spark deliverables
 Key: SPARK-3996
 URL: https://issues.apache.org/jira/browse/SPARK-3996
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Reporter: Mingyu Kim


We'd like to use Spark in a Jetty 9 server, and it's causing a version 
conflict. Given that Spark's dependency on Jetty is light, it'd be a good idea 
to shade this dependency.
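Shading typically means relocating the dependency's packages inside the assembly jar so they cannot conflict with the application's own Jetty. As an illustration only (the relocated package name below is hypothetical, not something the ticket specifies), a maven-shade-plugin relocation stanza might look like:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <!-- Rewrite Jetty classes and all references to them -->
        <pattern>org.eclipse.jetty</pattern>
        <shadedPattern>org.spark_project.jetty</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With the relocation in place, Spark's bundled Jetty classes live under the shaded package and a user application can depend on Jetty 9 directly.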






[jira] [Updated] (SPARK-534) Make SparkContext thread-safe

2014-10-17 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-534:
-
Description: 
SparkEnv (used by SparkContext) is not thread-safe, which causes issues with 
Scala's Futures and parallel collections.
For example, this will not work:

{code}
val f = Futures.future({
  sc.textFile("hdfs://")
})
f.apply()
{code}

Workaround for now:

{code}
val f = Futures.future({
  SparkEnv.set(sc.env)
  sc.textFile("hdfs://")
})
f.apply()
{code}
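The failure mode behind this workaround can be sketched as follows, in Python for illustration (the helper names are invented, not Spark's API): state held in a per-thread slot is invisible to a freshly created thread unless that thread sets it first, which is exactly what SparkEnv.set(sc.env) does inside the Future body.

```python
import threading

# Per-thread storage, analogous to SparkEnv's thread-local slot.
_env = threading.local()

def set_env(env):
    _env.value = env

def get_env():
    return getattr(_env, "value", None)

def run_in_thread(fn):
    """Run fn in a new thread and return its result."""
    box = {}
    t = threading.Thread(target=lambda: box.setdefault("v", fn()))
    t.start()
    t.join()
    return box.get("v")
```

A new thread sees no env until it calls set_env itself, mirroring why the Future body must re-set SparkEnv before using the SparkContext.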

  was:
SparkEnv (used by SparkContext) is not thread-safe, which causes issues with 
Scala's Futures and parallel collections.
For example, this will not work:

val f = Futures.future({
  sc.textFile("hdfs://")
})
f.apply()

Workaround for now:

val f = Futures.future({
  SparkEnv.set(sc.env)
  sc.textFile("hdfs://")
})
f.apply()


> Make SparkContext thread-safe
> -
>
> Key: SPARK-534
> URL: https://issues.apache.org/jira/browse/SPARK-534
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.5.0, 0.5.1, 0.6.0, 0.6.1, 0.7.0, 0.6.2, 0.5.2, 0.7.1, 
> 0.7.2, 0.7.3
>Reporter: tjhunter
>Priority: Blocker
>
> SparkEnv (used by SparkContext) is not thread-safe, which causes issues with 
> Scala's Futures and parallel collections.
> For example, this will not work:
> {code}
> val f = Futures.future({
>   sc.textFile("hdfs://")
> })
> f.apply()
> {code}
> Workaround for now:
> {code}
> val f = Futures.future({
>   SparkEnv.set(sc.env)
>   sc.textFile("hdfs://")
> })
> f.apply()
> {code}






[jira] [Updated] (SPARK-3164) Store DecisionTree Split.categories as Set

2014-10-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3164:
-
Description: 
Improvement: computation

For categorical features with many categories, it could be more efficient to 
store Split.categories as a Set, not a List.  (It is currently a List.)  A Set 
might be more scalable (for log n lookups), though tests would need to be done 
to ensure that Sets do not incur too much more overhead than Lists.
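The lookup tradeoff described above can be sketched as follows, in Python for illustration (Python sets are hash-based, so membership is average O(1); the JIRA's Scala Set would give the stated O(log n) for a tree-based implementation, versus an O(n) scan for a List; the helper name is invented):

```python
# A categorical split keeps a collection of category values; membership in
# that collection decides which branch an example takes.
categories_list = [3, 7, 42, 99]
categories_set = set(categories_list)

def goes_left(feature_value, categories):
    """An example goes left iff its category is in Split.categories.
    O(n) when categories is a list; average O(1) for a Python set."""
    return feature_value in categories
```

Both containers give the same answers; only the lookup cost differs, which is exactly what the proposed change trades against the Set's extra per-structure overhead.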


  was:
Improvement: computation

For categorical features with many categories, it could be more efficient to 
store Split.categories as a Set, not a List.  (It is currently a List.)  A Set 
might be more scalable (for log(n) lookups), though tests would need to be done 
to ensure that Sets do not incur too much more overhead than Lists.



> Store DecisionTree Split.categories as Set
> --
>
> Key: SPARK-3164
> URL: https://issues.apache.org/jira/browse/SPARK-3164
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> Improvement: computation
> For categorical features with many categories, it could be more efficient to 
> store Split.categories as a Set, not a List.  (It is currently a List.)  A 
> Set might be more scalable (for log n lookups), though tests would need to be 
> done to ensure that Sets do not incur too much more overhead than Lists.






[jira] [Commented] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark

2014-10-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175671#comment-14175671
 ] 

Xiangrui Meng commented on SPARK-3990:
--

In PySpark 1.1, we switched to Kryo serialization by default. However, ALS code 
requires special registration with Kryo in order to work. The error happens 
when there is not enough memory and ALS needs to store ratings or in/out blocks 
to disk. I will work on this issue. For now, the workaround is to use a cluster 
with enough memory.

> kryo.KryoException caused by ALS.trainImplicit in pyspark
> -
>
> Key: SPARK-3990
> URL: https://issues.apache.org/jira/browse/SPARK-3990
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.1.0
> Environment: 5 slaves cluster(m3.large) in AWS launched by spark-ec2
> Linux
> Python 2.6.8
>Reporter: Gen TANG
>  Labels: test
>
> When we tried ALS.trainImplicit() in the PySpark environment, it only works 
> for iterations = 1. Stranger still, if we try the same code in Scala, it 
> works very well (I ran several tests; so far, ALS.trainImplicit works in 
> Scala).
> For example, the following code:
> {code:title=test.py|borderStyle=solid}
>   r1 = (1, 1, 1.0) 
>   r2 = (1, 2, 2.0) 
>   r3 = (2, 1, 2.0) 
>   ratings = sc.parallelize([r1, r2, r3]) 
>   model = ALS.trainImplicit(ratings, 1) 
> '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)'''
> {code}
> It will cause the failed stage at count at ALS.scala:314 Info as:
> {code:title=error information provided by ganglia}
> Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 90.0 (TID 484, 
> ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: 
> java.lang.ArrayStoreException: scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecut

[jira] [Updated] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark

2014-10-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3990:
-
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

> kryo.KryoException caused by ALS.trainImplicit in pyspark
> -
>
> Key: SPARK-3990
> URL: https://issues.apache.org/jira/browse/SPARK-3990
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.1.0
> Environment: 5 slaves cluster(m3.large) in AWS launched by spark-ec2
> Linux
> Python 2.6.8
>Reporter: Gen TANG
>  Labels: test
>
> When we tried ALS.trainImplicit() in the PySpark environment, it only works 
> for iterations = 1. Stranger still, if we try the same code in Scala, it 
> works very well (I ran several tests; so far, ALS.trainImplicit works in 
> Scala).
> For example, the following code:
> {code:title=test.py|borderStyle=solid}
>   r1 = (1, 1, 1.0) 
>   r2 = (1, 2, 2.0) 
>   r3 = (2, 1, 2.0) 
>   ratings = sc.parallelize([r1, r2, r3]) 
>   model = ALS.trainImplicit(ratings, 1) 
> '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)'''
> {code}
> It will cause the failed stage at count at ALS.scala:314 Info as:
> {code:title=error information provided by ganglia}
> Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 90.0 (TID 484, 
> ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: 
> java.lang.ArrayStoreException: scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> {code}
> The log of the slave that failed the task contains:
> {code:title=error information in the log of slave}
> 14/10/17 13:20:54 ERROR exe

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark (unless the seed is set 
manually).

I am putting a PR together now (the fix is very simple!).
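As a hypothetical sketch of one possible fix (helper name {{make_seed}} is illustrative, not necessarily what the PR does): draw the default seed from NumPy's accepted range instead of {{(0, sys.maxint)}}.

```python
import random

NUMPY_SEED_MAX = 2 ** 32 - 1  # numpy.random.RandomState rejects seeds outside [0, 2**32 - 1]

def make_seed(seed=None):
    # Keep an explicit seed as-is; otherwise draw one that NumPy accepts,
    # instead of random.randint(0, sys.maxint), which on 64-bit platforms
    # can exceed 2**32 - 1 and trip the ValueError shown above.
    return seed if seed is not None else random.randint(0, NUMPY_SEED_MAX)

assert make_seed(42) == 42
assert all(0 <= make_seed() <= NUMPY_SEED_MAX for _ in range(1000))
```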




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPARK] Py

[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-17 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams updated SPARK-3967:
-
Attachment: spark-1.1.0-utils-fetch.patch

Don't redundantly copy executor dependency files in {{Utils.fetchFile}}.

> Spark applications fail in yarn-cluster mode when the directories configured 
> in yarn.nodemanager.local-dirs are located on different disks/partitions
> -
>
> Key: SPARK-3967
> URL: https://issues.apache.org/jira/browse/SPARK-3967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Christophe PRÉAUD
> Attachments: spark-1.1.0-utils-fetch.patch, 
> spark-1.1.0-yarn_cluster_tmpdir.patch
>
>
> Spark applications fail from time to time in yarn-cluster mode (but not in 
> yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
> set to a comma-separated list of directories which are located on different 
> disks/partitions.
> Steps to reproduce:
> 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
> directories located on different partitions (the more you set, the more 
> likely it will be to reproduce the bug):
> (...)
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
> </property>
> (...)
> 2. Launch an application in yarn-cluster mode several times; it will fail 
> (apparently at random) from time to time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-17 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175647#comment-14175647
 ] 

Ryan Williams commented on SPARK-3967:
--

I've been debugging this issue as well and I think I've found an issue in 
{{org.apache.spark.util.Utils}} that is contributing to / causing the problem:

{{Files.move}} on [line 
390|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L390]
 is called even if {{targetFile}} exists and {{tempFile}} and {{targetFile}} 
are equal.

The check on [line 
379|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L379]
 seems to imply the desire to skip a redundant overwrite if the file is already 
there and has the contents that it should have.

Gating the {{Files.move}} call on a further {{if (!targetFile.exists)}} fixes 
the issue for me; attached is a patch of the change.

In practice, every executor of mine that hits this code path finds each 
dependency JAR already present and exactly equal to what it needs, meaning 
they were all needlessly overwriting their dependency JARs, and with the patch 
they essentially no-op in {{Utils.fetchFile}}. I've not determined who/what is 
putting the JARs there, or why the issue only crops up in {{yarn-cluster}} mode 
(or {{--master yarn --deploy-mode cluster}}), but either way this patch seems 
desirable.
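The gating logic described above can be sketched in plain Python (an illustrative stand-in, with a hypothetical {{fetch_file}} helper, not Spark's actual Scala code): only move the freshly fetched temp file into place when the target is absent, and no-op when an identical copy already exists.

```python
import filecmp
import os
import shutil
import tempfile

def fetch_file(temp_file, target_file):
    # Sketch of the gated move: if the target already exists and matches the
    # freshly fetched temp file byte-for-byte, discard the temp copy and no-op;
    # only move into place when the target is absent.
    if os.path.exists(target_file):
        if filecmp.cmp(temp_file, target_file, shallow=False):
            os.remove(temp_file)
            return "skipped"
        raise RuntimeError("target exists with different contents")
    shutil.move(temp_file, target_file)
    return "moved"

# Demo: the second fetch of an identical file is a no-op instead of an overwrite.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "dep.jar")
for n in (1, 2):
    tmp = os.path.join(workdir, "dep.jar.tmp%d" % n)
    with open(tmp, "w") as f:
        f.write("payload")
    result = fetch_file(tmp, target)  # "moved" the first time, "skipped" after
```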


> Spark applications fail in yarn-cluster mode when the directories configured 
> in yarn.nodemanager.local-dirs are located on different disks/partitions
> -
>
> Key: SPARK-3967
> URL: https://issues.apache.org/jira/browse/SPARK-3967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Christophe PRÉAUD
> Attachments: spark-1.1.0-yarn_cluster_tmpdir.patch
>
>
> Spark applications fail from time to time in yarn-cluster mode (but not in 
> yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
> set to a comma-separated list of directories which are located on different 
> disks/partitions.
> Steps to reproduce:
> 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
> directories located on different partitions (the more you set, the more 
> likely it will be to reproduce the bug):
> (...)
> <property>
>   <name>yarn.nodemanager.local-dirs</name>
>   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
> </property>
> (...)
> 2. Launch an application in yarn-cluster mode several times; it will fail 
> (apparently at random) from time to time






[jira] [Commented] (SPARK-3994) countByKey / countByValue do not go through Aggregator

2014-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175635#comment-14175635
 ] 

Apache Spark commented on SPARK-3994:
-

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/2839

> countByKey / countByValue do not go through Aggregator
> --
>
> Key: SPARK-3994
> URL: https://issues.apache.org/jira/browse/SPARK-3994
> Project: Spark
>  Issue Type: Bug
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>
> The implementations of these methods are historical remnants of Spark from a 
> time when the shuffle may have been nonexistent. Now, they can be simplified 
> by plugging into reduceByKey(), potentially seeing performance and stability 
> improvements.
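As a plain-Python sketch of the simplification proposed above (helper names are hypothetical, not Spark's implementation): countByValue reduces to mapping each element to a (value, 1) pair and summing per key, which is exactly what reduceByKey() does.

```python
def reduce_by_key(pairs, func):
    # Minimal local stand-in for RDD.reduceByKey: merge all values sharing a key.
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

def count_by_value(xs):
    # countByValue == map each element to (value, 1), then reduceByKey with addition.
    return reduce_by_key(((x, 1) for x in xs), lambda a, b: a + b)

print(count_by_value(["a", "b", "a", "a"]))  # {'a': 3, 'b': 1}
```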






[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPARK] PySpark's sampl

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSP

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPARK]

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).






[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Summary: [PYSPARK] PySpark's sample methods do not work with NumPy 1.9  
(was: pyspark's sample methods do not work with NumPy 1.9)

> [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
> -
>
> Key: SPARK-3995
> URL: https://issues.apache.org/jira/browse/SPARK-3995
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Jeremy Freeman
>Priority: Critical
>
> There is a breaking bug in PySpark's sampling methods when run with NumPy 
> v1.9. This is the version of NumPy included with the current Anaconda 
> distribution (v2.1); this is a popular distribution, and is likely to affect 
> many users.
> Steps to reproduce are:
> {code}
> foo = sc.parallelize(range(1000),5)
> foo.takeSample(False, 10)
> {code}
> Returns:
> {code}
> PythonException: Traceback (most recent call last):
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", 
> line 79, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 196, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 127, in dump_stream
> for obj in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 185, in _batched
> for item in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 116, in func
> if self.getUniformSample(split) <= self._fraction:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 58, in getUniformSample
> self.initRandomGenerator(split)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 44, in initRandomGenerator
> self._random = numpy.random.RandomState(self._seed)
>   File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
> (numpy/random/mtrand/mtrand.c:7397)
>   File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
> (numpy/random/mtrand/mtrand.c:7697)
> ValueError: Seed must be between 0 and 4294967295
> {code}
> In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:
> {code}
> self._seed = seed if seed is not None else random.randint(0, sys.maxint)
> {code}
> In previous versions of NumPy a random seed larger than 2 ** 32 would 
> silently get truncated to 2 ** 32. This was fixed in a recent patch 
> (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
>  But it means that PySpark’s code now causes an error. That line often yields 
> ints larger than 2 ** 32, which will reliably break any sampling operation in 
> PySpark.
> I am putting a PR together now (the fix is very simple!).






[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently get 
truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error, because in the 
RDDSamplerBase class from pyspark.rddsampler, we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

And this often yields ints larger than 2 ** 32. Effectively, this reliably 
breaks any sampling operation in PySpark with this NumPy version.

I am putting a PR together now (the fix is very simple!).

[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently get 
truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error, because in the 
RDDSamplerBase class from pyspark.rddsampler, we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

And this often yields ints larger than 2 ** 32. Effectively, this reliably 
breaks any sampling operation in PySpark with this NumPy version.

I am putting a PR together now (the fix is very simple!).




> pyspark's sample methods do not work with NumPy 1.9
> ---
>
> Key: SPARK-3995
> URL: https://issues.apache.org/jira/browse/SPARK-3995
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Jeremy Freeman
>Priority: Critical
>
> There is a breaking bug in PySpark's sampling methods when run with NumPy 
> v1.9. This is the version of NumPy included with the current Anaconda 
> distribution (v2.1); this is a popular distribution, and is likely to affect 
> many users.
> Steps to reproduce are:
> {code}
> foo = sc.parallelize(range(1000),5)
> foo.takeSample(False, 10)
> {code}
> Returns:
> {code}
> PythonException: Traceback (most recent call last):
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", 
> line 79, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 196, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 127, in dump_stream
> for obj in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 185, in _batched
> for item in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 116, in func
> if self.getUniformSample(split) <= self._fraction:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 58, in getUniformSample
> self.initRandomGenerator(split)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 44, in initRandomGenerator
> self._random = numpy.random.RandomState(self._seed)
>   File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
> (numpy/random/mtrand/mtrand.c:7397)
>   File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
> (numpy/random/mtrand/mtrand.c:7697)
> ValueError: Seed must be between 0 and 4294967295
> {code}
> In previous versions of NumPy a random seed larger than 2 ** 32 would silently get 
> truncated to 2 ** 32

[jira] [Resolved] (SPARK-3934) RandomForest bug in sanity check in DTStatsAggregator

2014-10-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3934.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2785
[https://github.com/apache/spark/pull/2785]

> RandomForest bug in sanity check in DTStatsAggregator
> -
>
> Key: SPARK-3934
> URL: https://issues.apache.org/jira/browse/SPARK-3934
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.2.0
>
>
> When run with a mix of unordered categorical and continuous features, on 
> multiclass classification, RandomForest fails.  The bug is in the sanity 
> checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the 
> wrong indices for checking whether features are unordered.
> Proposal: Remove the sanity checks since they are not really needed, and 
> since they would require DTStatsAggregator to keep track of an extra set of 
> indices (for the feature subset).






[jira] [Created] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-3995:
-

 Summary: pyspark's sample methods do not work with NumPy 1.9
 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Priority: Critical









[jira] [Commented] (SPARK-2546) Configuration object thread safety issue

2014-10-17 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175608#comment-14175608
 ] 

Andrew Ash commented on SPARK-2546:
---

We tested Josh's patch, confirming the fix and measuring the perf regression at 
~8%.

> Configuration object thread safety issue
> 
>
> Key: SPARK-2546
> URL: https://issues.apache.org/jira/browse/SPARK-2546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
>
> // observed in 0.9.1 but expected to exist in 1.0.1 as well
> This ticket is copy-pasted from a thread on the dev@ list:
> {quote}
> We discovered a very interesting bug in Spark at work last week in Spark 
> 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to 
> thread safety issues.  I believe it still applies in Spark 1.0.1 as well.  
> Let me explain:
> Observations
>  - Was running a relatively simple job (read from Avro files, do a map, do 
> another map, write back to Avro files)
>  - 412 of 413 tasks completed, but the last task was hung in RUNNING state
>  - The 412 successful tasks completed in median time 3.4s
>  - The last hung task didn't finish even in 20 hours
>  - The executor with the hung task was responsible for 100% of one core of 
> CPU usage
>  - Jstack of the executor attached (relevant thread pasted below)
> Diagnosis
> After doing some code spelunking, we determined the issue was concurrent use 
> of a Configuration object for each task on an executor.  In Hadoop each task 
> runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so 
> the single-threaded access assumptions of the Configuration object no longer 
> hold in Spark.
> The specific issue is that the AvroRecordReader actually _modifies_ the 
> JobConf it's given when it's instantiated!  It adds a key for the RPC 
> protocol engine in the process of connecting to the Hadoop FileSystem.  When 
> many tasks start at the same time (like at the start of a job), many tasks 
> are adding this configuration item to the one Configuration object at once.  
> Internally Configuration uses a java.lang.HashMap, which isn't threadsafe… 
> The below post is an excellent explanation of what happens in the situation 
> where multiple threads insert into a HashMap at the same time.
> http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html
> The gist is that you have a thread following a cycle of linked list nodes 
> indefinitely.  This exactly matches our observations of the 100% CPU core and 
> also the final location in the stack trace.
> So it seems the way Spark shares a Configuration object between task threads 
> in an executor is incorrect.  We need some way to prevent concurrent access 
> to a single Configuration object.
> Proposed fix
> We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets 
> its own JobConf object (and thus Configuration object).  The optimization of 
> broadcasting the Configuration object across the cluster can remain, but on 
> the other side I think it needs to be cloned for each task to allow for 
> concurrent access.  I'm not sure the performance implications, but the 
> comments suggest that the Configuration object is ~10KB so I would expect a 
> clone on the object to be relatively speedy.
> Has this been observed before?  Does my suggested fix make sense?  I'd be 
> happy to file a Jira ticket and continue discussion there for the right way 
> to fix.
> Thanks!
> Andrew
> P.S.  For others seeing this issue, our temporary workaround is to enable 
> spark.speculation, which retries failed (or hung) tasks on other machines.
> {noformat}
> "Executor task launch worker-6" daemon prio=10 tid=0x7f91f01fe000 
> nid=0x54b1 runnable [0x7f92d74f1000]
>java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.transfer(HashMap.java:601)
> at java.util.HashMap.resize(HashMap.java:581)
> at java.util.HashMap.addEntry(HashMap.java:879)
> at java.util.HashMap.put(HashMap.java:505)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
> at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)
> at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:436)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:403)

[jira] [Created] (SPARK-3994) countByKey / countByValue do not go through Aggregator

2014-10-17 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3994:
-

 Summary: countByKey / countByValue do not go through Aggregator
 Key: SPARK-3994
 URL: https://issues.apache.org/jira/browse/SPARK-3994
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson


The implementations of these methods are historical remnants from a time when 
Spark may not have had a shuffle. Now they can be simplified by plugging into 
reduceByKey(), potentially yielding performance and stability improvements.
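As a sketch of the proposed simplification (local Python stand-ins, not the actual Spark API): countByKey reduces to mapping each pair to a count of one and combining the counts with reduceByKey(add):

```python
from operator import add

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey(): merge values per key."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

def count_by_key(pairs):
    """countByKey expressed as map-to-ones followed by reduceByKey(add),
    as the ticket proposes, instead of a bespoke counting path."""
    return reduce_by_key(((k, 1) for k, _ in pairs), add)

data = [("a", 10), ("b", 20), ("a", 30)]
print(count_by_key(data))  # → {'a': 2, 'b': 1}
```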






[jira] [Resolved] (SPARK-3985) json file path is not right

2014-10-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3985.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2834
[https://github.com/apache/spark/pull/2834]

> json file path is not right
> ---
>
> Key: SPARK-3985
> URL: https://issues.apache.org/jira/browse/SPARK-3985
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.2.0
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.2.0
>
>
> in examples/src/main/python/sql.py, we just add SPARK_HOME and "examples/..." 
> together instead of using "os.path.join", which would cause a problem.
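The difference is easy to demonstrate with illustrative paths (these are not the actual values used in sql.py):

```python
import os

spark_home = "/opt/spark"  # hypothetical value of SPARK_HOME
rel = "examples/src/main/resources/people.json"  # hypothetical example path

# Naive concatenation silently drops the separator when SPARK_HOME
# has no trailing slash:
broken = spark_home + rel
# os.path.join inserts the separator correctly regardless:
fixed = os.path.join(spark_home, rel)

print(broken)  # → /opt/sparkexamples/src/main/resources/people.json
print(fixed)   # → /opt/spark/examples/src/main/resources/people.json
```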






[jira] [Assigned] (SPARK-3994) countByKey / countByValue do not go through Aggregator

2014-10-17 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson reassigned SPARK-3994:
-

Assignee: Aaron Davidson

> countByKey / countByValue do not go through Aggregator
> --
>
> Key: SPARK-3994
> URL: https://issues.apache.org/jira/browse/SPARK-3994
> Project: Spark
>  Issue Type: Bug
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>
> The implementations of these methods are historical remnants of Spark from a 
> time when the shuffle may have been nonexistent. Now, they can be simplified 
> by plugging into reduceByKey(), potentially seeing performance and stability 
> improvements.






[jira] [Updated] (SPARK-3985) json file path is not right

2014-10-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3985:
--
Affects Version/s: 1.2.0

> json file path is not right
> ---
>
> Key: SPARK-3985
> URL: https://issues.apache.org/jira/browse/SPARK-3985
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.2.0
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.2.0
>
>
> in examples/src/main/python/sql.py, we just add SPARK_HOME and "examples/..." 
> together instead of using "os.path.join", which would cause a problem.






[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-17 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175557#comment-14175557
 ] 

Reza Zadeh commented on SPARK-3434:
---

Thanks Shivaram! As discussed over the phone, we will use your design and build 
upon it, so that you can focus on the linear algebraic operations such as TSQR.

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Shivaram Venkataraman
>
> This JIRA is for discussing distributed matrices stored in block 
> sub-matrices. The main challenge is the partitioning scheme to allow adding 
> linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices 
> or many RDDs with each contains only one sub-matrix?






[jira] [Resolved] (SPARK-3855) Binding Exception when running PythonUDFs

2014-10-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3855.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Binding Exception when running PythonUDFs
> -
>
> Key: SPARK-3855
> URL: https://issues.apache.org/jira/browse/SPARK-3855
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.2.0
>
>
> {code}
> from pyspark import *
> from pyspark.sql import *
> sc = SparkContext()
> sqlContext = SQLContext(sc)
> sqlContext.registerFunction("strlen", lambda string: len(string))
> sqlContext.inferSchema(sc.parallelize([Row(a="test")])).registerTempTable("test")
> srdd = sqlContext.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1")
> print srdd._jschema_rdd.baseSchemaRDD().queryExecution().toString()
> print srdd.collect()
> {code}
> output:
> {code}
> == Parsed Logical Plan ==
> Project ['strlen('a) AS c0#1]
>  Filter ('strlen('a) > 1)
>   UnresolvedRelation None, test, None
> == Analyzed Logical Plan ==
> Project [c0#1]
>  Project [pythonUDF#2 AS c0#1]
>   EvaluatePython PythonUDF#strlen(a#0)
>Project [a#0]
> Filter (CAST(pythonUDF#3, DoubleType) > CAST(1, DoubleType))
>  EvaluatePython PythonUDF#strlen(a#0)
>   SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at 
> mapPartitions at SQLContext.scala:525)
> == Optimized Logical Plan ==
> Project [pythonUDF#2 AS c0#1]
>  EvaluatePython PythonUDF#strlen(a#0)
>   Project [a#0]
>Filter (CAST(pythonUDF#3, DoubleType) > 1.0)
> EvaluatePython PythonUDF#strlen(a#0)
>  SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at 
> mapPartitions at SQLContext.scala:525)
> == Physical Plan ==
> Project [pythonUDF#2 AS c0#1]
>  BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#5]
>   Project [a#0]
>Filter (CAST(pythonUDF#3, DoubleType) > 1.0)
> BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#3]
>  ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at 
> SQLContext.scala:525
> Code Generation: false
> == RDD ==
> 14/10/08 15:03:00 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 9)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF#2
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLi

[jira] [Created] (SPARK-3993) python worker may hang after reused from take()

2014-10-17 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3993:
-

 Summary: python worker may hang after reused from take()
 Key: SPARK-3993
 URL: https://issues.apache.org/jira/browse/SPARK-3993
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Priority: Blocker


After take(), there may be some garbage left in the socket, so the next task 
assigned to this worker can hang because of the corrupted data.

We should make sure the socket is clean before reusing it.
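One way to clean the socket is to drain any unread bytes before handing the connection to the next task. A minimal sketch of that idea, using a plain TCP socket; the helper name is hypothetical and this is not the actual PySpark change:

```python
import socket

def drain_socket(sock):
    """Discard any unread bytes left on a worker socket before it is
    reused, so stale data from a previous task cannot corrupt the next
    task's stream. Hypothetical helper, not the actual PySpark fix."""
    sock.setblocking(False)
    drained = 0
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:          # peer closed; nothing more to clean
                break
            drained += len(chunk)
    except BlockingIOError:
        pass                       # receive buffer is now empty
    finally:
        sock.setblocking(True)
    return drained

# Demo with a local socket pair standing in for the worker connection:
a, b = socket.socketpair()
a.sendall(b"stale leftover bytes")
print(drain_socket(b))  # 20 bytes of stale data discarded
```

After draining, the socket is back in blocking mode and subsequent reads see only fresh data.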



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3993) python worker may hang after reused from take()

2014-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175536#comment-14175536
 ] 

Apache Spark commented on SPARK-3993:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2838

> python worker may hang after reused from take()
> ---
>
> Key: SPARK-3993
> URL: https://issues.apache.org/jira/browse/SPARK-3993
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> After take(), there may be some garbage left in the socket, so the next task 
> assigned to this worker can hang because of the corrupted data.
> We should make sure the socket is clean before reusing it.






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175377#comment-14175377
 ] 

Sean Owen commented on SPARK-2426:
--

Regarding licensing, if the code is BSD licensed then it does not require an 
entry in NOTICE file (it's a "Category A" license), and entries shouldn't be 
added to NOTICE unless required. I believe that in this case we will need to 
reproduce the text of the license in LICENSE since it will not be included 
otherwise from a Maven artifact. So I suggest: don't change NOTICE, and move 
the license in LICENSE up to the section where other licenses are reproduced in 
full. It's a complex issue but this is my best understanding of the right thing 
to do.


> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Resolved] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one

2014-10-17 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3979.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> Yarn backend's default file replication should match HDFS's default one
> ---
>
> Key: SPARK-3979
> URL: https://issues.apache.org/jira/browse/SPARK-3979
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.2.0
>
>
> This code in ClientBase.scala sets the replication used for files uploaded to 
> HDFS:
> {code}
> val replication = sparkConf.getInt("spark.yarn.submit.file.replication", 
> 3).toShort
> {code}
> Instead of a hardcoded "3" (which is the default value for HDFS), it should 
> be using the default value from the HDFS conf ("dfs.replication").
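The intended lookup order — an explicit Spark setting first, then the cluster's own HDFS default — can be sketched with plain dicts standing in for SparkConf and the Hadoop Configuration. The function name is illustrative, not Spark's API:

```python
def resolve_replication(spark_conf, hadoop_conf):
    """Prefer an explicit spark.yarn.submit.file.replication setting;
    otherwise fall back to the cluster's dfs.replication default
    instead of a hardcoded 3. Illustrative sketch only."""
    explicit = spark_conf.get("spark.yarn.submit.file.replication")
    if explicit is not None:
        return int(explicit)
    # HDFS itself defaults dfs.replication to 3 when it is unset.
    return int(hadoop_conf.get("dfs.replication", 3))

print(resolve_replication({}, {"dfs.replication": "2"}))  # 2
print(resolve_replication({"spark.yarn.submit.file.replication": "5"}, {}))  # 5
```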






[jira] [Updated] (SPARK-3992) Spark 1.1.0 python binding cannot use any collections but list as Accumulators

2014-10-17 Thread Walter Bogorad (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walter Bogorad updated SPARK-3992:
--
Description: 
A dictionary accumulator defined as a global variable is not visible inside a 
function called by "foreach()".
Here is the minimal code snippet:

from collections import defaultdict
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictAccumParam(AccumulatorParam):
    def zero(self, value):
        value.clear()
    def addInPlace(self, val1, val2):
        return val1

sc = SparkContext("local", "Dict Accumulator Bug")
va = sc.accumulator(defaultdict(int), DictAccumParam())

def foo(x):
    global va
    print "va is:", va

rdd = sc.parallelize([1,2,3]).foreach(foo)

When run, the code snippet produced the following results:
...
va is: None
va is: None
va is: None
...
I have verified that global variables are visible inside foo() called by 
foreach only when they are scalars or lists, as in the API doc at 
http://spark.apache.org/docs/latest/api/python/
The problem exists with standard dictionaries and collections.

I also verified that if foo() is called directly, i.e. outside foreach, the 
global variables are visible.

  was:
A dictionary accumulator defined as a global variable is not visible inside a 
function called by "foreach()".
Here is the minimal code snippet:

from collections import defaultdict
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictAccumParam(AccumulatorParam):
    def zero(self, value):
        value.clear()
    def addInPlace(self, val1, val2):
        return val1

sc = SparkContext("local", "Dict Accumulator Bug")
va = sc.accumulator(defaultdict(int), DictAccumParam())

def foo(x):
    global va
    print "va is:", va

rdd = sc.parallelize([1,2,3]).foreach(foo)

When run, the code snippet produced the following results:
...
va is: None
va is: None
va is: None
...
I have verified that global variables are visible inside foo() called by 
foreach only when they are scalars or lists, as in the API doc at 
http://spark.apache.org/docs/latest/api/python/
The problem exists with standard dictionaries and collections.


> Spark 1.1.0 python binding cannot use any collections but list as Accumulators
> --
>
> Key: SPARK-3992
> URL: https://issues.apache.org/jira/browse/SPARK-3992
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Walter Bogorad
>
> A dictionary accumulator defined as a global variable is not visible inside a 
> function called by "foreach()".
> Here is the minimal code snippet:
> from collections import defaultdict
> from pyspark import SparkContext
> from pyspark.accumulators import AccumulatorParam
> class DictAccumParam(AccumulatorParam):
>     def zero(self, value):
>         value.clear()
>     def addInPlace(self, val1, val2):
>         return val1
> sc = SparkContext("local", "Dict Accumulator Bug")
> va = sc.accumulator(defaultdict(int), DictAccumParam())
> def foo(x):
>     global va
>     print "va is:", va
> rdd = sc.parallelize([1,2,3]).foreach(foo)
> When run, the code snippet produced the following results:
> ...
> va is: None
> va is: None
> va is: None
> ...
> I have verified that global variables are visible inside foo() called by 
> foreach only when they are scalars or lists, as in the API doc at 
> http://spark.apache.org/docs/latest/api/python/
> The problem exists with standard dictionaries and collections.
> I also verified that if foo() is called directly, i.e. outside foreach, the 
> global variables are visible.
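Two things are likely at work in the snippet: PySpark initializes the worker-side copy of an accumulator from zero(), and zero() here implicitly returns None (dict.clear() returns None), which is exactly what prints; and addInPlace() returns val1 unchanged, so worker contributions would be dropped anyway. (Accumulators are also add-only on workers; only the driver reads the merged value.) A merging AccumulatorParam can be sketched locally with a pure-Python stand-in, no cluster required:

```python
class DictAccumParam(object):
    """Pure-Python stand-in for a dict AccumulatorParam whose zero()
    returns a real empty dict and whose addInPlace() actually merges
    per-worker updates."""
    def zero(self, value):
        return {}                      # fresh empty dict per copy
    def addInPlace(self, val1, val2):
        for key, count in val2.items():
            val1[key] = val1.get(key, 0) + count
        return val1

# Simulate the driver merging updates sent back by two workers:
param = DictAccumParam()
merged = param.zero({})
for worker_update in [{"a": 1}, {"a": 2, "b": 1}]:
    merged = param.addInPlace(merged, worker_update)
print(merged)  # {'a': 3, 'b': 1}
```

With this shape of zero/addInPlace, each worker starts from an empty dict and the driver sees the merged counts in .value after the action completes.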






[jira] [Created] (SPARK-3992) Spark 1.1.0 python binding cannot use any collections but list as Accumulators

2014-10-17 Thread Walter Bogorad (JIRA)
Walter Bogorad created SPARK-3992:
-

 Summary: Spark 1.1.0 python binding cannot use any collections but 
list as Accumulators
 Key: SPARK-3992
 URL: https://issues.apache.org/jira/browse/SPARK-3992
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 
2014 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Walter Bogorad


A dictionary accumulator defined as a global variable is not visible inside a 
function called by "foreach()".
Here is the minimal code snippet:

from collections import defaultdict
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictAccumParam(AccumulatorParam):
    def zero(self, value):
        value.clear()
    def addInPlace(self, val1, val2):
        return val1

sc = SparkContext("local", "Dict Accumulator Bug")
va = sc.accumulator(defaultdict(int), DictAccumParam())

def foo(x):
    global va
    print "va is:", va

rdd = sc.parallelize([1,2,3]).foreach(foo)

When run, the code snippet produced the following results:
...
va is: None
va is: None
va is: None
...
I have verified that global variables are visible inside foo() called by 
foreach only when they are scalars or lists, as in the API doc at 
http://spark.apache.org/docs/latest/api/python/
The problem exists with standard dictionaries and collections.






[jira] [Closed] (SPARK-3935) Unused variable in PairRDDFunctions.scala

2014-10-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3935.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Unused variable in PairRDDFunctions.scala
> -
>
> Key: SPARK-3935
> URL: https://issues.apache.org/jira/browse/SPARK-3935
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: wangfei
>Priority: Minor
> Fix For: 1.2.0
>
>
> There is an unused variable (count) in the saveAsHadoopDataset function in 
> PairRDDFunctions.scala. 
> It would be better to add a log statement to record the count of output records. 






[jira] [Commented] (SPARK-3817) BlockManagerMasterActor: Got two different block manager registrations with Mesos

2014-10-17 Thread Brenden Matthews (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175279#comment-14175279
 ] 

Brenden Matthews commented on SPARK-3817:
-

Appears to be a duplicate of SPARK-2445 and SPARK-3535.

> BlockManagerMasterActor: Got two different block manager registrations with 
> Mesos
> -
>
> Key: SPARK-3817
> URL: https://issues.apache.org/jira/browse/SPARK-3817
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Timothy Chen
>
> 14/10/06 09:34:40 ERROR BlockManagerMasterActor: Got two different block
> manager registrations on 20140711-081617-711206558-5050-2543-5
> Here is the log from the mesos-slave where this container was running.
> http://pastebin.com/Q1Cuzm6Q






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175167#comment-14175167
 ] 

Debasish Das commented on SPARK-2426:
-

1. [~mengxr] Our legal team was clear that the Stanford and Verizon copyrights should 
show up in the COPYRIGHT.txt file... I saw other companies' copyrights and did 
not think it would be a big issue...

2. For the new interface, we have two more requirements: a convex loss function 
(supporting huber loss / hinge loss, etc.) and no explicit AtA construction, since 
once we start scaling to 1 factors for LSA, AtA construction will start 
to choke... Can I work on your branch? 
https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala

3. I agree to refactor the core solver, including NNLS, into breeze. That was the 
initial plan, but since we wanted to test the features on our internal 
datasets, integrating with mllib was faster. I am testing NNLS's CG 
implementation, since as soon as explicit AtA construction is taken out, we need 
to rely on CG in place of direct solvers... But I will refactor the solver out 
to breeze, and that will move the copyright messages to breeze as well.

4. Let me add the Matlab scripts and point to the repository. ECOS and MOSEK 
will need Matlab to run. PDCO and Proximal variants will run fine on Octave. I 
am not sure if MOSEK is supported on Octave.

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Updated] (SPARK-3991) Not Serializable, Nullpointer Exceptions in SQL server mode

2014-10-17 Thread eblaas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

eblaas updated SPARK-3991:
--
Attachment: not_serializable_exception.patch

> Not Serializable, Nullpointer Exceptions in SQL server mode
> ---
>
> Key: SPARK-3991
> URL: https://issues.apache.org/jira/browse/SPARK-3991
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: eblaas
>Priority: Blocker
> Attachments: not_serializable_exception.patch
>
>
> I'm working on connecting Mondrian with Spark SQL via JDBC. Good news: it 
> works, but there are some bugs to fix.
> I customized the HiveThriftServer2 class to load, transform and register 
> tables (ETL) with the HiveContext. Data tables are generated from Cassandra 
> and from a relational database.
> * 1st problem: 
> hiveContext.registerRDDAsTable(treeSchema,"tree") does not register the 
> table in the Hive metastore ("show tables;" via JDBC does not list the table, 
> but I can query it, e.g. select * from tree). Dirty workaround: create a 
> table with the same name and schema; this was necessary because Mondrian 
> validates table existence:
> hiveContext.sql("CREATE TABLE tree (dp_id BIGINT, h1 STRING, h2 STRING, h3 
> STRING)")
> * 2nd problem:
> Mondrian creates complex joins, which results in serialization exceptions.
> Two classes in hiveUdfs.scala have to be serializable:
> - DeferredObjectAdapter and HiveGenericUdaf
> * 3rd problem:
> Nullpointer Exception in InMemoryRelation
> 42: override lazy val statistics =  Statistics(sizeInBytes = 
> child.sqlContext.defaultSizeInBytes)
> The sqlContext in child was null; quick fix: set a default value rather than 
> reading it from the SparkContext:
> override lazy val statistics = Statistics(sizeInBytes = 1)
> I'm not sure how to fix these bugs, but with the patch file it works at least. 






[jira] [Created] (SPARK-3991) Not Serializable, Nullpointer Exceptions in SQL server mode

2014-10-17 Thread eblaas (JIRA)
eblaas created SPARK-3991:
-

 Summary: Not Serializable, Nullpointer Exceptions in SQL server mode
 Key: SPARK-3991
 URL: https://issues.apache.org/jira/browse/SPARK-3991
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: eblaas
Priority: Blocker


I'm working on connecting Mondrian with Spark SQL via JDBC. Good news: it works, 
but there are some bugs to fix.

I customized the HiveThriftServer2 class to load, transform and register tables 
(ETL) with the HiveContext. Data tables are generated from Cassandra and from a 
relational database.

* 1st problem: 
hiveContext.registerRDDAsTable(treeSchema,"tree") does not register the table 
in the Hive metastore ("show tables;" via JDBC does not list the table, but I 
can query it, e.g. select * from tree). Dirty workaround: create a table with 
the same name and schema; this was necessary because Mondrian validates table 
existence:

hiveContext.sql("CREATE TABLE tree (dp_id BIGINT, h1 STRING, h2 STRING, h3 
STRING)")

* 2nd problem:
Mondrian creates complex joins, which results in serialization exceptions.
Two classes in hiveUdfs.scala have to be serializable:
- DeferredObjectAdapter and HiveGenericUdaf

* 3rd problem:
Nullpointer Exception in InMemoryRelation
42: override lazy val statistics =  Statistics(sizeInBytes = 
child.sqlContext.defaultSizeInBytes)

The sqlContext in child was null; quick fix: set a default value rather than 
reading it from the SparkContext:
override lazy val statistics = Statistics(sizeInBytes = 1)

I'm not sure how to fix these bugs, but with the patch file it works at least. 
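The third problem's quick fix is a plain null-guard: when the child's context was never wired up, fall back to a constant instead of dereferencing null. A Python sketch of the pattern; attribute names are illustrative, not Catalyst's actual fields:

```python
from types import SimpleNamespace

def statistics_size_in_bytes(child, fallback=1):
    """Mirror the quick fix: if the child's sql_context was never set,
    return a fallback constant instead of dereferencing None.
    Illustrative sketch, not the Catalyst code."""
    ctx = getattr(child, "sql_context", None)
    if ctx is None:
        return fallback
    return ctx.default_size_in_bytes

orphan = SimpleNamespace(sql_context=None)
bound = SimpleNamespace(
    sql_context=SimpleNamespace(default_size_in_bytes=10485760))
print(statistics_size_in_bytes(orphan))  # 1
print(statistics_size_in_bytes(bound))   # 10485760
```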







[jira] [Updated] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark

2014-10-17 Thread Gen TANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gen TANG updated SPARK-3990:


I think this problem is related to 
https://issues.apache.org/jira/browse/SPARK-1977

> kryo.KryoException caused by ALS.trainImplicit in pyspark
> -
>
> Key: SPARK-3990
> URL: https://issues.apache.org/jira/browse/SPARK-3990
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.1.0
> Environment: 5-slave cluster (m3.large) in AWS launched by spark-ec2
> Linux
> Python 2.6.8
>Reporter: Gen TANG
>  Labels: test
>
> When we tried ALS.trainImplicit() in a pyspark environment, it only worked for 
> iterations = 1. What is stranger is that if we try the same code in Scala, it 
> works very well. (I ran several tests; so far, ALS.trainImplicit works in 
> Scala.)
> For example, the following code:
> {code:title=test.py|borderStyle=solid}
>   r1 = (1, 1, 1.0) 
>   r2 = (1, 2, 2.0) 
>   r3 = (2, 1, 2.0) 
>   ratings = sc.parallelize([r1, r2, r3]) 
>   model = ALS.trainImplicit(ratings, 1) 
> '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)'''
> {code}
> It causes the stage at count at ALS.scala:314 to fail, with the following info:
> {code:title=error information provided by ganglia}
> Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 90.0 (TID 484, 
> ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: 
> java.lang.ArrayStoreException: scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> {code}
> In the log of slave which failed the task, it has:
> {code:title=error information in the log of slave}
> 

[jira] [Updated] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark

2014-10-17 Thread Gen TANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gen TANG updated SPARK-3990:

Description: 
When we tried ALS.trainImplicit() in a pyspark environment, it only worked for 
iterations = 1. What is stranger is that if we try the same code in Scala, it 
works very well. (I ran several tests; so far, ALS.trainImplicit works in 
Scala.)

For example, the following code:
{code:title=test.py|borderStyle=solid}
  r1 = (1, 1, 1.0) 
  r2 = (1, 2, 2.0) 
  r3 = (2, 1, 2.0) 
  ratings = sc.parallelize([r1, r2, r3]) 
  model = ALS.trainImplicit(ratings, 1) 
'''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)'''
{code}


It causes the stage at count at ALS.scala:314 to fail, with the following info:
{code:title=error information provided by ganglia}
Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 90.0 (TID 484, 
ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: 
java.lang.ArrayStoreException: scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)

com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)

com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)

org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
{code}
In the log of slave which failed the task, it has:

{code:title=error information in the log of slave}
14/10/17 13:20:54 ERROR executor.Executor: Exception in task 6.0 in stage 90.0 
(TID 465)
com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
at com.

[jira] [Created] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark

2014-10-17 Thread Gen TANG (JIRA)
Gen TANG created SPARK-3990:
---

 Summary: kryo.KryoException caused by ALS.trainImplicit in pyspark
 Key: SPARK-3990
 URL: https://issues.apache.org/jira/browse/SPARK-3990
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.1.0
 Environment: 5-slave cluster (m3.large) in AWS launched by spark-ec2
Linux
Python 2.6.8
Reporter: Gen TANG


When we tried ALS.trainImplicit() in a pyspark environment, it only worked for 
iterations = 1. For example, the following code:

r1 = (1, 1, 1.0) 
r2 = (1, 2, 2.0) 
r3 = (2, 1, 2.0) 
ratings = sc.parallelize([r1, r2, r3]) 
model = ALS.trainImplicit(ratings, 1) [by default iterations = 5] or model = 
ALS.trainImplicit(ratings, 1, 2)

It causes the stage at count at ALS.scala:314 to fail, with the following info:

Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 90.0 (TID 484, 
ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: 
java.lang.ArrayStoreException: scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:

In the log of the slave that failed the task, it has:
14/10/17 13:20:54 ERROR executor.Executor: Exception in task 6.0 in stage 90.0 (TID 465)
com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet
Serialization trace:
shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
at com.esotericsoftware.kryo.Kryo.re

[jira] [Commented] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge

2014-10-17 Thread Fred Cons (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175079#comment-14175079
 ] 

Fred Cons commented on SPARK-2649:
--

[~npanj] [~rxin] I ran into this issue, and the following diff worked for me : 
https://github.com/mesos/spark-ec2/pull/76


> EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
> ---
>
> Key: SPARK-2649
> URL: https://issues.apache.org/jira/browse/SPARK-2649
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.1.0
>Reporter: npanj
>Priority: Minor
>
> On EC2, the httpd daemon doesn't start (so Ganglia is not accessible) on HVM 
> machines like r3.4xlarge (deployed by the spark-ec2 script).
> Here is an example error (it seems to be an issue with the default AMI 
> "spark.ami.hvm.v14 (ami-35b1885c)"). Here is the error message:
> --
> Starting httpd: httpd: Syntax error on line 153 of 
> /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into 
> server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object 
> file: No such file or directory
> --






[jira] [Commented] (SPARK-576) Design and develop a more precise progress estimator

2014-10-17 Thread Dev Lakhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175023#comment-14175023
 ] 

Dev Lakhani commented on SPARK-576:
---

I've created a PR for this: https://github.com/apache/spark/pull/2837/

> Design and develop a more precise progress estimator
> 
>
> Key: SPARK-576
> URL: https://issues.apache.org/jira/browse/SPARK-576
> Project: Spark
>  Issue Type: Improvement
>Reporter: Mosharaf Chowdhury
>
> In addition to /, we need to have something that 
> says .






[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails

2014-10-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174999#comment-14174999
 ] 

Thomas Graves commented on SPARK-3877:
--

[~vanzin]  I agree. The user code should be exiting with a non-zero code or 
throwing on failure. If it isn't, then there is nothing we can do about it, 
other than tell users to change their code to properly exit if they want to 
see the failure status. Perhaps we should also better document what they should 
do on failure. It's basically the same thing I did for the exit codes in 
ApplicationMaster: it relies on user code exiting non-zero and throwing.

The only other option would be for us to actually look at the details in the 
scheduler ourselves to try to determine what happened, i.e. we see that Stage X 
failed or Y tasks failed, etc. I would say we do that later if it's needed. 
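
The contract described above can be sketched with a minimal driver-side example (class and method names are illustrative only, not Spark API): since the process exit code is all that an automation script can observe, the driver must translate job failure into a non-zero code before returning.

```java
// Hypothetical driver skeleton: a wrapper script can only see the process exit
// code, so a failed job must become a non-zero exit (or an uncaught exception).
public class DriverExitSketch {
    // Map the job outcome to an exit code. Returning 0 on failure is exactly
    // the silent-success pattern this issue complains about.
    static int exitCodeFor(boolean jobSucceeded) {
        return jobSucceeded ? 0 : 1;
    }

    public static void main(String[] args) {
        boolean jobSucceeded = false;  // pretend the YARN application failed
        System.out.println("exit code: " + exitCodeFor(jobSucceeded));
    }
}
```

A shell wrapper can then branch on `$?` after `spark-submit` returns.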



> The exit code of spark-submit is still 0 when an yarn application fails
> ---
>
> Key: SPARK-3877
> URL: https://issues.apache.org/jira/browse/SPARK-3877
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: yarn
>
> When a YARN application fails (yarn-cluster mode), the exit code of 
> spark-submit is still 0. It's hard for people to write automatic scripts 
> to run Spark jobs on YARN because the failure cannot be detected in these 
> scripts.






[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag

2014-10-17 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174981#comment-14174981
 ] 

Ilya Ganelin commented on SPARK-3694:
-

Hello. I would like to work on this. Can you please assign it to me? Thank you. 

> Allow printing object graph of tasks/RDD's with a debug flag
> 
>
> Key: SPARK-3694
> URL: https://issues.apache.org/jira/browse/SPARK-3694
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>  Labels: starter
>
> This would be useful for debugging extra references inside of RDD's
> Here is an example for inspiration:
> http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html
> We'd want to print this trace for both the RDD serialization inside of the 
> DAGScheduler and the task serialization in the TaskSetManager.
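
A minimal sketch of such a walker (in the spirit of the Ehcache ObjectGraphWalker linked above; all names below are made up for illustration): follow non-primitive instance fields reflectively from a root object and collect everything reachable, which is enough to spot accidental extra references captured by a closure or RDD.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class ObjectGraphSketch {
    // Collect every object reachable from root via instance reference fields.
    static List<Object> walk(Object root) {
        List<Object> seen = new ArrayList<>();
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Object> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            Object o = todo.pop();
            if (!visited.add(o)) continue;  // identity-based cycle guard
            seen.add(o);
            for (Field f : o.getClass().getDeclaredFields()) {
                if (f.getType().isPrimitive() || Modifier.isStatic(f.getModifiers())) continue;
                f.setAccessible(true);
                try {
                    Object child = f.get(o);
                    if (child != null) todo.push(child);  // follow the reference
                } catch (IllegalAccessException e) {
                    // inaccessible field: skipped in this sketch
                }
            }
        }
        return seen;
    }

    static class Node { Node next; }

    public static void main(String[] args) {
        Node head = new Node();
        head.next = new Node();
        System.out.println("reachable: " + walk(head).size());  // head and head.next
    }
}
```

A real implementation would also walk superclass fields and array elements, and print the path to each retained object.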






[jira] [Updated] (SPARK-3968) Use parquet-mr filter2 api in spark sql

2014-10-17 Thread Yash Datta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Datta updated SPARK-3968:
--
Description: 
The parquet-mr project has introduced a new filter api, along with several 
fixes, like filtering on OPTIONAL columns as well. It can also eliminate 
entire RowGroups depending on certain statistics like min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 api also introduces the ability to create custom filters. We can 
create a custom filter for the optimized In clause (InSet), so that elimination 
happens in the ParquetRecordReader itself (will create a separate ticket for that).

  was:
The parquet-mr project has introduced a new filter api , along with several 
fixes , like filtering on OPTIONAL columns as well. It can also eliminate 
entire RowGroups depending on certain statistics like min/max
We can leverage that to further improve performance of queries with filters.
Also filter2 api introduces ability to create custom filters. We can create a 
custom filter for the optimized In clause (InSet) , so that elimination happens 
in the ParquetRecordReader itself.

Summary: Use parquet-mr filter2 api in spark sql  (was: Using 
parquet-mr filter2 api in spark sql, add a custom filter for InSet clause)

> Use parquet-mr filter2 api in spark sql
> ---
>
> Key: SPARK-3968
> URL: https://issues.apache.org/jira/browse/SPARK-3968
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yash Datta
>Priority: Minor
> Fix For: 1.1.1
>
>
> The parquet-mr project has introduced a new filter api, along with several 
> fixes, like filtering on OPTIONAL columns as well. It can also eliminate 
> entire RowGroups depending on certain statistics like min/max.
> We can leverage that to further improve the performance of queries with filters.
> The filter2 api also introduces the ability to create custom filters. We can 
> create a custom filter for the optimized In clause (InSet), so that 
> elimination happens in the ParquetRecordReader itself (will create a separate 
> ticket for that).
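
The RowGroup elimination mentioned above can be illustrated independently of the parquet-mr API (the class below is a concept sketch, not real parquet code): an equality predicate `col = v` can skip any row group whose recorded [min, max] statistics cannot contain v, without reading a single record from it.

```java
public class RowGroupPruneSketch {
    // True if an equality predicate (col = value) can never match inside a
    // row group whose column statistics are [min, max]; such a group is
    // eliminated without decoding any of its records.
    static boolean canSkipEquals(long min, long max, long value) {
        return value < min || value > max;
    }

    public static void main(String[] args) {
        System.out.println(canSkipEquals(100, 200, 42));   // true: 42 outside [100, 200]
        System.out.println(canSkipEquals(100, 200, 150));  // false: might match, must read
    }
}
```

The same range test generalizes to <, >, and In/InSet predicates by checking whether any candidate value intersects [min, max].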






[jira] [Commented] (SPARK-3989) Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command

2014-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174879#comment-14174879
 ] 

Apache Spark commented on SPARK-3989:
-

User 'ziky90' has created a pull request for this issue:
https://github.com/apache/spark/pull/2836

> Added the possibility to install Python packages via pip for pyspark directly 
> from the ./spark_ec2 command 
> ---
>
> Key: SPARK-3989
> URL: https://issues.apache.org/jira/browse/SPARK-3989
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2, PySpark
>Reporter: Jan Zikeš
>Priority: Minor
>







[jira] [Commented] (SPARK-3989) Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command

2014-10-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174876#comment-14174876
 ] 

Jan Zikeš commented on SPARK-3989:
--

Implemented and sent pull request here: 
https://github.com/apache/spark/pull/2836

> Added the possibility to install Python packages via pip for pyspark directly 
> from the ./spark_ec2 command 
> ---
>
> Key: SPARK-3989
> URL: https://issues.apache.org/jira/browse/SPARK-3989
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2, PySpark
>Reporter: Jan Zikeš
>Priority: Minor
>







[jira] [Created] (SPARK-3989) Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command

2014-10-17 Thread JIRA
Jan Zikeš created SPARK-3989:


 Summary: Added the possibility to install Python packages via pip 
for pyspark directly from the ./spark_ec2 command 
 Key: SPARK-3989
 URL: https://issues.apache.org/jira/browse/SPARK-3989
 Project: Spark
  Issue Type: New Feature
  Components: EC2, PySpark
Reporter: Jan Zikeš
Priority: Minor









[jira] [Commented] (SPARK-3900) ApplicationMaster's shutdown hook fails and IllegalStateException is thrown.

2014-10-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174857#comment-14174857
 ] 

Apache Spark commented on SPARK-3900:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2755

> ApplicationMaster's shutdown hook fails and IllegalStateException is thrown.
> 
>
> Key: SPARK-3900
> URL: https://issues.apache.org/jira/browse/SPARK-3900
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
> Environment: Hadoop 0.23
>Reporter: Kousuke Saruta
>Priority: Critical
>
> ApplicationMaster registers a shutdown hook and it calls 
> ApplicationMaster#cleanupStagingDir.
> cleanupStagingDir invokes FileSystem.get(yarnConf) and it invokes 
> FileSystem.getInternal. FileSystem.getInternal also registers shutdown hook.
> In FileSystem of Hadoop 0.23, the shutdown hook registration does not 
> consider whether shutdown is in progress (in 2.2, it does).
> {code}
> // 0.23 
> if (map.isEmpty() ) {
>   ShutdownHookManager.get().addShutdownHook(clientFinalizer, 
> SHUTDOWN_HOOK_PRIORITY);
> }
> {code}
> {code}
> // 2.2
> if (map.isEmpty()
> && !ShutdownHookManager.get().isShutdownInProgress()) {
>ShutdownHookManager.get().addShutdownHook(clientFinalizer, 
> SHUTDOWN_HOOK_PRIORITY);
> }
> {code}
> Thus, in 0.23, another shutdown hook can be registered while the 
> ApplicationMaster's shutdown hook runs.
> This issue causes an IllegalStateException as follows.
> {code}
> java.lang.IllegalStateException: Shutdown in progress, cannot add a 
> shutdownHook
> at 
> org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}
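
The difference between the two versions can be reproduced with a tiny model (simplified names, not the actual Hadoop ShutdownHookManager): once hooks start running, the 2.2-style guard refuses further registrations, whereas the 0.23 path would throw the "Shutdown in progress" IllegalStateException when a hook registers another hook.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownGuardSketch {
    private final List<Runnable> hooks = new ArrayList<>();
    private volatile boolean shutdownInProgress = false;

    // 2.2-style registration: quietly a no-op once shutdown has begun.
    // The 0.23 code path lacks this check, so a hook registered from inside
    // another hook (cleanupStagingDir -> FileSystem.get) fails instead.
    boolean addShutdownHook(Runnable hook) {
        if (shutdownInProgress) return false;
        hooks.add(hook);
        return true;
    }

    void runHooks() {
        shutdownInProgress = true;  // from here on, addShutdownHook refuses
        for (Runnable h : new ArrayList<>(hooks)) h.run();
    }

    public static void main(String[] args) {
        ShutdownGuardSketch mgr = new ShutdownGuardSketch();
        mgr.addShutdownHook(() -> {
            // Models FileSystem.get registering its clientFinalizer hook
            // while the ApplicationMaster's own hook is already running.
            boolean accepted = mgr.addShutdownHook(() -> {});
            System.out.println("nested registration accepted: " + accepted);
        });
        mgr.runHooks();  // prints: nested registration accepted: false
    }
}
```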






[jira] [Closed] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3606.

   Resolution: Fixed
Fix Version/s: 1.1.1

> Spark-on-Yarn AmIpFilter does not work with Yarn HA.
> 
>
> Key: SPARK-3606
> URL: https://issues.apache.org/jira/browse/SPARK-3606
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.1.1, 1.2.0
>
>
> The current IP filter only considers one of the RMs in an HA setup. If the 
> active RM is not the configured one, you get a "connection refused" error 
> when clicking on the Spark AM links in the RM UI.
> Similar to YARN-1811, but for Spark.






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174824#comment-14174824
 ] 

Xiangrui Meng commented on SPARK-2426:
--

[~debasish83] Thanks for working on this feature! This is definitely a lot of 
work. We need to figure out a couple of high-level questions before looking 
into the code:

1. License. There are two files that require a special license: proximal, which 
ports cvxgrp/proximal (BSD), and QPMinimizer:

{code}
... distributed with Copyright (c) 2014, Debasish Das (Verizon), all rights 
reserved.
{code}

Code contribution to Apache follows ICLA: 
http://www.apache.org/licenses/icla.txt . I'm not familiar with the terms. I saw

{code}
Except for the license granted herein to the Foundation and recipients of
software distributed by the Foundation, You reserve all right, title,
and interest in and to Your Contributions.
{code}

My understanding is that if you want your code distributed under the Apache 
License, we don't need a special notice about your rights. Please check with 
Verizon's legal team to make sure they are okay with it. It would be really 
helpful if someone can explain this in more detail.

2. Interface. I'm doing a refactoring of ALS (SPARK-3541). I hope we can 
decouple the solvers (LS, QP) from ALS. In

https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala

The subproblem is wrapped in a NormalEquation, which stores AtA, Atb, and n. A 
Cholesky solver takes a NormalEquation instance, solves it, and returns the 
solution. We can plug in other solvers as long as NormalEquation provides all 
the information we need. Does this apply to your use cases?

For public APIs, we should restrict parameters to simple types, for example, 
constraint = "none" | "nonnegative" | "box". This is good for adding Python 
APIs. Those options should be sufficient for normal use cases. We can provide a 
developer API that allows advanced users to plug in their own solvers. You can 
check the current proposal of parameters at SPARK-3530.

3. Where to put the implementation? Including MLlib's NNLS, these solvers are 
for local problems. What sounds ideal to me is breeze.optimize, which already 
contains several optimization solvers; we use the LBFGS implemented there and 
may use OWLQN soon.

4. This PR definitely needs some time for testing. The feature freeze deadline 
for v1.2 is Oct 31. I cannot promise time for code review given my current 
bandwidth. It would be great if you can share your MATLAB code (hopefully 
Octave compatible) and some performance results, so more developers can help 
test.
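
The NormalEquation/Cholesky split described in point 2 can be sketched as follows (a hypothetical standalone class, not the code in the linked repo): the per-user/per-item ALS subproblem reduces to the small k x k system AtA x = Atb, which a Cholesky factorization solves directly.

```java
public class NormalEquationSketch {
    // Solve AtA x = Atb for a symmetric positive definite AtA via
    // Cholesky factorization AtA = L L^T, then two triangular solves.
    static double[] choleskySolve(double[][] ata, double[] atb) {
        int k = atb.length;
        double[][] l = new double[k][k];
        // Cholesky-Banachiewicz factorization
        for (int i = 0; i < k; i++) {
            for (int j = 0; j <= i; j++) {
                double s = ata[i][j];
                for (int m = 0; m < j; m++) s -= l[i][m] * l[j][m];
                l[i][j] = (i == j) ? Math.sqrt(s) : s / l[j][j];
            }
        }
        // Forward solve: L y = Atb
        double[] y = new double[k];
        for (int i = 0; i < k; i++) {
            double s = atb[i];
            for (int m = 0; m < i; m++) s -= l[i][m] * y[m];
            y[i] = s / l[i][i];
        }
        // Back solve: L^T x = y
        double[] x = new double[k];
        for (int i = k - 1; i >= 0; i--) {
            double s = y[i];
            for (int m = i + 1; m < k; m++) s -= l[m][i] * x[m];
            x[i] = s / l[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        // 2x2 SPD system: [[4,2],[2,3]] x = [10,8] has solution x = [1.75, 1.5]
        double[] x = choleskySolve(new double[][]{{4, 2}, {2, 3}}, new double[]{10, 8});
        System.out.printf("x = [%.2f, %.2f]%n", x[0], x[1]);
    }
}
```

An alternative solver (QP, proximal) would consume the same (AtA, Atb, n) triple, which is what makes the decoupling work.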

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Updated] (SPARK-3541) Improve ALS internal storage

2014-10-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3541:
-
Shepherd:   (was: Xiangrui Meng)

> Improve ALS internal storage
> 
>
> Key: SPARK-3541
> URL: https://issues.apache.org/jira/browse/SPARK-3541
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> The internal storage of ALS uses many small objects, which increases GC 
> pressure and makes it difficult to scale ALS to very large datasets, e.g., 50 
> billion ratings. In such cases, a full GC may take more than 10 minutes to 
> finish. That is longer than the default heartbeat timeout, and hence executors 
> will be removed under default settings.
> We can use primitive arrays to reduce the number of objects significantly. 
> This requires a big change to the ALS implementation.
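
The idea can be illustrated with a hypothetical block layout (not the actual ALS code): instead of allocating one small object per rating, a block keeps three parallel primitive arrays, turning millions of short-lived objects into three long-lived arrays per block that the GC barely has to trace.

```java
public class RatingBlockSketch {
    // One object per *block*, not per rating: parallel primitive arrays keep
    // the object graph tiny compared to an Object[] of per-rating instances.
    final int[] userIds;
    final int[] itemIds;
    final float[] ratings;

    RatingBlockSketch(int[] userIds, int[] itemIds, float[] ratings) {
        this.userIds = userIds;
        this.itemIds = itemIds;
        this.ratings = ratings;
    }

    int size() { return ratings.length; }

    public static void main(String[] args) {
        RatingBlockSketch block = new RatingBlockSketch(
            new int[]  {1, 1, 2},
            new int[]  {10, 11, 10},
            new float[]{4.0f, 3.5f, 5.0f});
        // Access pattern: index i selects the i-th rating across all arrays.
        System.out.println("user " + block.userIds[2] + " rated item "
            + block.itemIds[2] + " -> " + block.ratings[2]);
        System.out.println("ratings in block: " + block.size());
    }
}
```

With 50 billion ratings split across blocks, this layout also improves cache locality during the factor updates.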


