[jira] [Updated] (SPARK-3997) scalastyle does not output the error location
[ https://issues.apache.org/jira/browse/SPARK-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3997: --- Description: {{./dev/scalastyle}} => {noformat} Scalastyle checks failed at following occurrences: java.lang.RuntimeException: exists error at scala.sys.package$.error(package.scala:27) at scala.Predef$.error(Predef.scala:142) [error] (mllib/*:scalastyle) exists error {noformat} was: {{./dev/scalastyle }} => {noformat} Scalastyle checks failed at following occurrences: java.lang.RuntimeException: exists error at scala.sys.package$.error(package.scala:27) at scala.Predef$.error(Predef.scala:142) [error] (mllib/*:scalastyle) exists error {noformat} > scalastyle does not output the error location > - > > Key: SPARK-3997 > URL: https://issues.apache.org/jira/browse/SPARK-3997 > Project: Spark > Issue Type: Bug >Reporter: Guoqiang Li > > {{./dev/scalastyle}} => > {noformat} > Scalastyle checks failed at following occurrences: > java.lang.RuntimeException: exists error > at scala.sys.package$.error(package.scala:27) > at scala.Predef$.error(Predef.scala:142) > [error] (mllib/*:scalastyle) exists error > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3997) scalastyle does not output the error location
Guoqiang Li created SPARK-3997: -- Summary: scalastyle does not output the error location Key: SPARK-3997 URL: https://issues.apache.org/jira/browse/SPARK-3997 Project: Spark Issue Type: Bug Reporter: Guoqiang Li {noformat} Scalastyle checks failed at following occurrences: java.lang.RuntimeException: exists error at scala.sys.package$.error(package.scala:27) at scala.Predef$.error(Predef.scala:142) [error] (mllib/*:scalastyle) exists error {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3997) scalastyle does not output the error location
[ https://issues.apache.org/jira/browse/SPARK-3997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3997: --- Description: {{./dev/scalastyle }} => {noformat} Scalastyle checks failed at following occurrences: java.lang.RuntimeException: exists error at scala.sys.package$.error(package.scala:27) at scala.Predef$.error(Predef.scala:142) [error] (mllib/*:scalastyle) exists error {noformat} was: {noformat} Scalastyle checks failed at following occurrences: java.lang.RuntimeException: exists error at scala.sys.package$.error(package.scala:27) at scala.Predef$.error(Predef.scala:142) [error] (mllib/*:scalastyle) exists error {noformat} > scalastyle does not output the error location > - > > Key: SPARK-3997 > URL: https://issues.apache.org/jira/browse/SPARK-3997 > Project: Spark > Issue Type: Bug >Reporter: Guoqiang Li > > {{./dev/scalastyle }} => > {noformat} > Scalastyle checks failed at following occurrences: > java.lang.RuntimeException: exists error > at scala.sys.package$.error(package.scala:27) > at scala.Predef$.error(Predef.scala:142) > [error] (mllib/*:scalastyle) exists error > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175838#comment-14175838 ] Reynold Xin commented on SPARK-3135: Is the memory usage lower on the master branch vs 1.1? If yes, we can certainly backport. This is a medium size fix. > Avoid memory copy in TorrentBroadcast serialization > --- > > Key: SPARK-3135 > URL: https://issues.apache.org/jira/browse/SPARK-3135 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: starter > Fix For: 1.2.0 > > > TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize > broadcast object into a single giant byte array, and then separates it into > smaller chunks. We should implement a new OutputStream that writes > serialized bytes directly into chunks of byte arrays so we don't need the > extra memory copy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
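The idea in SPARK-3135 — serialize directly into fixed-size chunks instead of one giant byte array that is then re-split — can be sketched as follows. This is a hypothetical Python illustration of a chunk-writing output stream, not Spark's actual Scala implementation:

```python
import io

class ChunkedOutputStream(io.RawIOBase):
    """Accumulates written bytes directly into fixed-size chunks,
    avoiding one giant contiguous buffer plus a second copy to split it.
    (Illustrative sketch of the technique described in the issue.)"""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def writable(self):
        return True

    def write(self, data):
        data = memoryview(bytes(data))
        written = len(data)
        while data:
            last = self.chunks[-1]
            room = self.chunk_size - len(last)
            if room == 0:
                self.chunks.append(bytearray())  # start a fresh chunk
                continue
            last.extend(data[:room])  # fill the current chunk
            data = data[room:]
        return written

    def to_chunks(self):
        return [bytes(c) for c in self.chunks]
```

A serializer writing into this stream never materializes the whole payload contiguously; the chunk list can be handed to the broadcast layer as-is.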
[jira] [Commented] (SPARK-3822) Expose a mechanism for SparkContext to ask for / remove Yarn containers
[ https://issues.apache.org/jira/browse/SPARK-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175811#comment-14175811 ] Apache Spark commented on SPARK-3822: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2840 > Expose a mechanism for SparkContext to ask for / remove Yarn containers > --- > > Key: SPARK-3822 > URL: https://issues.apache.org/jira/browse/SPARK-3822 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is one of the core components for the umbrella issue SPARK-3174. > Currently, the only agent in Spark that communicates directly with the RM is > the AM. This means the only way for the SparkContext to ask for / remove > containers from the RM is through the AM. The communication link between the > SparkContext and the AM needs to be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3822) Expose a mechanism for SparkContext to ask for / remove Yarn containers
[ https://issues.apache.org/jira/browse/SPARK-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3822: - Affects Version/s: 1.1.0 > Expose a mechanism for SparkContext to ask for / remove Yarn containers > --- > > Key: SPARK-3822 > URL: https://issues.apache.org/jira/browse/SPARK-3822 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is one of the core components for the umbrella issue SPARK-3174. > Currently, the only agent in Spark that communicates directly with the RM is > the AM. This means the only way for the SparkContext to ask for / remove > containers from the RM is through the AM. The communication link between the > SparkContext and the AM needs to be added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3888) Limit the memory used by python worker
[ https://issues.apache.org/jira/browse/SPARK-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3888: - Affects Version/s: 1.1.0 > Limit the memory used by python worker > -- > > Key: SPARK-3888 > URL: https://issues.apache.org/jira/browse/SPARK-3888 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we do not limit the memory used by Python workers, so they may run > out of memory and freeze the OS. It would be safer to have a configurable hard > limit for it, which should be larger than spark.executor.python.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-3935) Unused variable in PairRDDFunctions.scala
[ https://issues.apache.org/jira/browse/SPARK-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reopened SPARK-3935: -- Assignee: wangfei > Unused variable in PairRDDFunctions.scala > - > > Key: SPARK-3935 > URL: https://issues.apache.org/jira/browse/SPARK-3935 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: wangfei >Assignee: wangfei >Priority: Minor > Fix For: 1.2.0 > > > There is an unused variable (count) in the saveAsHadoopDataset function in > PairRDDFunctions.scala. > It would be better to add a log statement to record the number of output > records. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3935) Unused variable in PairRDDFunctions.scala
[ https://issues.apache.org/jira/browse/SPARK-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3935. Resolution: Fixed > Unused variable in PairRDDFunctions.scala > - > > Key: SPARK-3935 > URL: https://issues.apache.org/jira/browse/SPARK-3935 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: wangfei >Assignee: wangfei >Priority: Minor > Fix For: 1.2.0 > > > There is an unused variable (count) in the saveAsHadoopDataset function in > PairRDDFunctions.scala. > It would be better to add a log statement to record the number of output > records. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark
[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175752#comment-14175752 ] Sung Chung commented on SPARK-1673: --- This is eventually going to be taken care of by the OWLQN addition. https://issues.apache.org/jira/browse/SPARK-1892 > GLMNET implementation in Spark > -- > > Key: SPARK-1673 > URL: https://issues.apache.org/jira/browse/SPARK-1673 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Sung Chung > > This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, > Rob Tibshirani. > http://www.jstatsoft.org/v33/i01/paper > It's a straightforward implementation of the Coordinate-Descent based L1/L2 > regularized linear models, including Linear/Logistic/Multinomial regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training
[ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3161: - Assignee: Sung Chung > Cache example-node map for DecisionTree training > > > Key: SPARK-3161 > URL: https://issues.apache.org/jira/browse/SPARK-3161 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Sung Chung > > Improvement: worker computation > When training each level of a DecisionTree, each example needs to be mapped > to a node in the current level (or to none if it does not reach that level). > This is currently done via the function predictNodeIndex(), which traces from > the current tree’s root node to the given level. > Proposal: Cache this mapping. > * Pro: O(1) lookup instead of O(level). > * Con: Extra RDD which must share the same partitioning as the training data. > Design: > * (option 1) This could be done as in [Sequoia Forests | > https://github.com/AlpineNow/SparkML2] where each instance is stored with an > array of node indices (1 node per tree). > * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, > Array\[TreePoint\]\]\]\], where each partition stores an array of maps from > node indices to an array of instances. This has more overhead in data > structures but could be more efficient: not all nodes are split on each > iteration, and this would allow each executor to ignore instances which are > not used for the current node set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
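Option 2 above stores, per partition, a map from node index to the instances that reach that node, so each executor can skip instances irrelevant to the current node set. A minimal sketch of that grouping step (names are hypothetical, not MLlib code):

```python
from collections import defaultdict

def group_by_node(instances, predict_node_index):
    """Group training instances by the tree node they currently reach,
    so later passes look each group up in O(1) instead of re-tracing
    the tree per instance. (Illustrative sketch; predict_node_index
    stands in for DecisionTree's predictNodeIndex().)"""
    node_to_instances = defaultdict(list)
    for inst in instances:
        node_to_instances[predict_node_index(inst)].append(inst)
    return dict(node_to_instances)
```

With this map, an iteration that splits only a subset of nodes touches only the corresponding instance lists.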
[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM
[ https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175750#comment-14175750 ] Andrew Ash commented on SPARK-2243: --- Found a couple potentially-related tickets; I'm going to link these as blocking this one: SPARK-534 SPARK-3148 > Support multiple SparkContexts in the same JVM > -- > > Key: SPARK-2243 > URL: https://issues.apache.org/jira/browse/SPARK-2243 > Project: Spark > Issue Type: New Feature > Components: Block Manager, Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Miguel Angel Fernandez Diaz > > We're developing a platform where we create several Spark contexts for > carrying out different calculations. Is there any restriction when using > several Spark contexts? We have two contexts, one for Spark calculations and > another one for Spark Streaming jobs. The next error arises when we first > execute a Spark calculation and, once the execution is finished, a Spark > Streaming job is launched: > {code} > 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0 > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63) > at > org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139) > at > java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0) > 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to > java.io.FileNotFoundException > java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0 > at > 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624) > at > org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156) > at > org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInp
[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175748#comment-14175748 ] Sean Owen commented on SPARK-3967: -- You guys should make PRs for these. I am also not sure it's necessary to download the file into a temp directory and move it... it may cause a copy instead of a rename (and in fact does here), so it's not as if the file appears in the target dir atomically anyway. I'm not sure the code here cleans up the partially downloaded file in case of error, and that could leave a broken file in the target dir instead of just a temp dir. The change to not copy the file when identical looks sound; I bet you can avoid checking whether it exists twice. > Spark applications fail in yarn-cluster mode when the directories configured > in yarn.nodemanager.local-dirs are located on different disks/partitions > - > > Key: SPARK-3967 > URL: https://issues.apache.org/jira/browse/SPARK-3967 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Christophe PRÉAUD > Attachments: spark-1.1.0-utils-fetch.patch, > spark-1.1.0-yarn_cluster_tmpdir.patch > > > Spark applications fail from time to time in yarn-cluster mode (but not in > yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is > set to a comma-separated list of directories which are located on different > disks/partitions. > Steps to reproduce: > 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of > directories located on different partitions (the more you set, the more > likely it will be to reproduce the bug): > (...) > > yarn.nodemanager.local-dirs > > file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir > > (...) > 2.
Launch (several times) an application in yarn-cluster mode, it will fail > (apparently randomly) from time to time -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
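On the download-then-move concern in the comment above: a rename is only atomic (and cheap) when the temp file lives on the same filesystem as the target; a temp dir on a different disk or partition — exactly this bug's setup — degrades it to a copy. A hedged Python sketch of the safer pattern (not Spark's actual `Utils.fetchFile`):

```python
import os
import tempfile

def fetch_to(target_path, data):
    """Hypothetical helper: create the temp file *in the target's own
    directory*, so the final rename stays on one filesystem and is
    atomic on POSIX. A temp dir on another partition would instead
    force a non-atomic copy."""
    dir_name = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, target_path)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp_path)  # don't leave a partial file behind on error
        raise
```

This also addresses the cleanup concern: on any failure the partial file is removed, so a broken file never lands in the target dir.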
[jira] [Created] (SPARK-3996) Shade Jetty in Spark deliverables
Mingyu Kim created SPARK-3996: - Summary: Shade Jetty in Spark deliverables Key: SPARK-3996 URL: https://issues.apache.org/jira/browse/SPARK-3996 Project: Spark Issue Type: Task Components: Spark Core Reporter: Mingyu Kim We'd like to use Spark in a Jetty 9 server, and it's causing a version conflict. Given that Spark's dependency on Jetty is light, it'd be a good idea to shade this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-534) Make SparkContext thread-safe
[ https://issues.apache.org/jira/browse/SPARK-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-534: - Description: SparkEnv (used by SparkContext) is not thread-safe and it causes issues with Scala's Futures and parallel collections. For example, this will not work: {code} val f = Futures.future({ sc.textFile("hdfs://") }) f.apply() {code} Workaround for now: {code} val f = Futures.future({ SparkEnv.set(sc.env) sc.textFile("hdfs://") }) f.apply() {code} was: SparkEnv (used by SparkContext) is not thread-safe and it causes issues with Scala's Futures and parallel collections. For example, this will not work: val f = Futures.future({ sc.textFile("hdfs://") }) f.apply() Workaround for now: val f = Futures.future({ SparkEnv.set(sc.env) sc.textFile("hdfs://") }) f.apply() > Make SparkContext thread-safe > - > > Key: SPARK-534 > URL: https://issues.apache.org/jira/browse/SPARK-534 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.5.0, 0.5.1, 0.6.0, 0.6.1, 0.7.0, 0.6.2, 0.5.2, 0.7.1, > 0.7.2, 0.7.3 >Reporter: tjhunter >Priority: Blocker > > SparkEnv (used by SparkContext) is not thread-safe and it causes issues with > Scala's Futures and parallel collections. > For example, this will not work: > {code} > val f = Futures.future({ > sc.textFile("hdfs://") > }) > f.apply() > {code} > Workaround for now: > {code} > val f = Futures.future({ > SparkEnv.set(sc.env) > sc.textFile("hdfs://") > }) > f.apply() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
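The workaround in SPARK-534 propagates the environment by hand into each new thread before use. The same pattern, sketched with Python thread-local storage (purely illustrative; SparkEnv's real mechanism is Scala/JVM thread-local state):

```python
import threading

_env = threading.local()  # one slot per thread, analogous to a ThreadLocal

def set_env(env):
    """Install the environment on the *current* thread, mirroring the
    SparkEnv.set(sc.env) call at the top of the future in the issue."""
    _env.value = env

def get_env():
    env = getattr(_env, "value", None)
    if env is None:
        raise RuntimeError("environment not set on this thread")
    return env
```

Each thread sees only the value it installed itself, which is why forgetting the `set_env` call inside a future fails.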
[jira] [Updated] (SPARK-3164) Store DecisionTree Split.categories as Set
[ https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3164: - Description: Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log n lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists. was: Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log(n) lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists. > Store DecisionTree Split.categories as Set > -- > > Key: SPARK-3164 > URL: https://issues.apache.org/jira/browse/SPARK-3164 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Trivial > > Improvement: computation > For categorical features with many categories, it could be more efficient to > store Split.categories as a Set, not a List. (It is currently a List.) A > Set might be more scalable (for log n lookups), though tests would need to be > done to ensure that Sets do not incur too much more overhead than Lists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
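The trade-off weighed in SPARK-3164 — a Set's constant-time membership test versus a List's linear scan — in a toy Python form (hypothetical names, not MLlib's Split class):

```python
# A categorical feature with many categories: split evaluation is
# dominated by "is this category on the left side?" membership checks.
categories_list = list(range(10000))
categories_set = set(categories_list)   # same contents, hashed

def matches_split_list(category):
    return category in categories_list  # O(n): scans the whole list

def matches_split_set(category):
    return category in categories_set   # O(1) average: hash lookup
```

Both return identical answers; the difference is only lookup cost versus the Set's extra per-element overhead, which is what the proposed tests would measure.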
[jira] [Commented] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark
[ https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175671#comment-14175671 ] Xiangrui Meng commented on SPARK-3990: -- In PySpark 1.1, we switched to Kryo serialization by default. However, the ALS code requires special registration with Kryo in order to work. The error happens when there is not enough memory and ALS needs to store ratings or in/out blocks to disk. I will work on this issue. For now, the workaround is to use a cluster with enough memory. > kryo.KryoException caused by ALS.trainImplicit in pyspark > - > > Key: SPARK-3990 > URL: https://issues.apache.org/jira/browse/SPARK-3990 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.1.0 > Environment: 5-slave cluster (m3.large) in AWS launched by spark-ec2 > Linux > Python 2.6.8 >Reporter: Gen TANG > Labels: test > > When we tried ALS.trainImplicit() in a PySpark environment, it only works for > iterations = 1. Stranger still, if we try the same code in > Scala, it works very well. (I did several tests; so far, > ALS.trainImplicit works in Scala.) > For example, the following code: > {code:title=test.py|borderStyle=solid} > r1 = (1, 1, 1.0) > r2 = (1, 2, 2.0) > r3 = (2, 1, 2.0) > ratings = sc.parallelize([r1, r2, r3]) > model = ALS.trainImplicit(ratings, 1) > '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)''' > {code} > It causes a failed stage at count at ALS.scala:314, with info such as: > {code:title=error information provided by ganglia} > Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 90.0 (TID 484, > ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: > java.lang.ArrayStoreException: scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > >
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > 
org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > java.util.concurrent.ThreadPoolExecut
[jira] [Updated] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark
[ https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3990: - Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0) > kryo.KryoException caused by ALS.trainImplicit in pyspark > - > > Key: SPARK-3990 > URL: https://issues.apache.org/jira/browse/SPARK-3990 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.1.0 > Environment: 5-slave cluster (m3.large) in AWS launched by spark-ec2 > Linux > Python 2.6.8 >Reporter: Gen TANG > Labels: test > > When we tried ALS.trainImplicit() in a PySpark environment, it only works for > iterations = 1. Stranger still, if we try the same code in > Scala, it works very well. (I did several tests; so far, > ALS.trainImplicit works in Scala.) > For example, the following code: > {code:title=test.py|borderStyle=solid} > r1 = (1, 1, 1.0) > r2 = (1, 2, 2.0) > r3 = (2, 1, 2.0) > ratings = sc.parallelize([r1, r2, r3]) > model = ALS.trainImplicit(ratings, 1) > '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)''' > {code} > It causes a failed stage at count at ALS.scala:314, with info such as: > {code:title=error information provided by ganglia} > Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 90.0 (TID 484, > ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: > java.lang.ArrayStoreException: scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) >
com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > 
org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > {code} > In the log of slave which failed the task, it has: > {code:title=error information in the log of slave} > 14/10/17 13:20:54 ERROR exe
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code:python} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code:python} self._seed = seed if seed is not 
None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark (unless the seed is set manually). I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code:python} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in 
mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code:python} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark. I am putting a PR together now (the fix is very simple!). > [PYSPARK] Py
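The shape of the fix can be sketched as follows. This is only a minimal illustration of the idea, not the actual PR: the helper name {{pick_seed}} is hypothetical, and {{sys.maxsize}} stands in for Python 2's {{sys.maxint}}. The point is to keep the default seed inside NumPy's accepted range of [0, 2 ** 32 - 1]:

```python
import random
import sys

def pick_seed(seed=None):
    # Hypothetical sketch of the fix: NumPy's RandomState only accepts
    # seeds in [0, 2**32 - 1], so mask the seed down to 32 bits instead
    # of passing along a value drawn from the full [0, sys.maxint] range.
    if seed is None:
        seed = random.randint(0, sys.maxsize)
    return seed & 0xFFFFFFFF  # always within RandomState's valid range
```

Any value returned this way is safe to hand to {{numpy.random.RandomState}} regardless of NumPy version, since masking replaces the silent truncation that older NumPy releases performed.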
[jira] [Updated] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-3967: - Attachment: spark-1.1.0-utils-fetch.patch Don't redundantly copy executor dependency files in {{Utils.fetchFile}}. > Spark applications fail in yarn-cluster mode when the directories configured > in yarn.nodemanager.local-dirs are located on different disks/partitions > - > > Key: SPARK-3967 > URL: https://issues.apache.org/jira/browse/SPARK-3967 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Christophe PRÉAUD > Attachments: spark-1.1.0-utils-fetch.patch, > spark-1.1.0-yarn_cluster_tmpdir.patch > > > Spark applications fail from time to time in yarn-cluster mode (but not in > yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is > set to a comma-separated list of directories located on different > disks/partitions. > Steps to reproduce: > 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of > directories located on different partitions (the more you set, the more > likely the bug is to reproduce): > (...) > > yarn.nodemanager.local-dirs > > file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir > > (...) > 2. Launch an application in yarn-cluster mode several times; it will fail > (apparently randomly) from time to time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175647#comment-14175647 ] Ryan Williams commented on SPARK-3967: -- I've been debugging this issue as well, and I believe an issue in {{org.apache.spark.util.Utils}} is contributing to (or causing) the problem: {{Files.move}} on [line 390|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L390] is called even when {{targetFile}} exists and {{tempFile}} and {{targetFile}} have equal contents. The check on [line 379|https://github.com/apache/spark/blob/v1.1.0/core/src/main/scala/org/apache/spark/util/Utils.scala#L379] appears intended to skip a redundant overwrite when the file is already in place with the contents it should have. Gating the {{Files.move}} call on a further {{if (!targetFile.exists)}} fixes the issue for me; a patch of the change is attached. In practice, every one of my executors that hits this code path finds each dependency JAR already present and exactly equal to what it needs, meaning they were all needlessly overwriting all of their dependency JARs and now essentially no-op in {{Utils.fetchFile}}. I've not determined who or what is putting the JARs there, or why the issue only crops up in {{yarn-cluster}} mode (i.e. {{--master yarn --deploy-mode cluster}}), but either way this patch seems desirable. 
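The gist of the attached patch, skipping the move when the target already exists with identical contents, can be sketched in Python. This is a hypothetical rendering of the Scala change to {{Utils.fetchFile}}, not the patch itself; the function name and boolean return value are illustrative:

```python
import filecmp
import os
import shutil

def move_if_needed(temp_path, target_path):
    # Sketch of the patched logic: only move the freshly fetched temp
    # file into place when the target is absent or has different
    # contents, rather than unconditionally overwriting it.
    if os.path.exists(target_path) and filecmp.cmp(temp_path, target_path, shallow=False):
        os.remove(temp_path)  # identical copy already in place: no-op
        return False
    shutil.move(temp_path, target_path)
    return True
```

With this guard, executors that find every dependency JAR already present and byte-identical simply discard their temp copy instead of racing to overwrite a file that other tasks may be reading.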
[jira] [Commented] (SPARK-3994) countByKey / countByValue do not go through Aggregator
[ https://issues.apache.org/jira/browse/SPARK-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175635#comment-14175635 ] Apache Spark commented on SPARK-3994: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/2839 > countByKey / countByValue do not go through Aggregator > -- > > Key: SPARK-3994 > URL: https://issues.apache.org/jira/browse/SPARK-3994 > Project: Spark > Issue Type: Bug >Reporter: Aaron Davidson >Assignee: Aaron Davidson > > The implementations of these methods are historical remnants of Spark from a > time when the shuffle may have been nonexistent. Now, they can be simplified > by plugging into reduceByKey(), potentially seeing performance and stability > improvements.
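The simplification the PR proposes can be illustrated with a small local sketch. Plain Python stands in for the RDD API here, and {{count_by_value}} is a hypothetical helper, not Spark code: countByValue becomes "map each element to (element, 1), then reduce by key with addition", the same aggregation reduceByKey performs:

```python
from collections import defaultdict

def count_by_value(items):
    # Local sketch of the reduceByKey-based formulation: emit (item, 1)
    # pairs and sum the 1s per key, instead of a bespoke counting path.
    counts = defaultdict(int)
    for item in items:        # map side: emit (item, 1)
        counts[item] += 1     # reduce side: sum the counts per key
    return dict(counts)
```

Expressing the count as a key-based reduction is what lets the real implementation inherit the shuffle's spilling and aggregation machinery rather than maintaining its own path.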
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Summary: [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 (was: pyspark's sample methods do not work with NumPy 1.9) > [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 > - > > Key: SPARK-3995 > URL: https://issues.apache.org/jira/browse/SPARK-3995 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Jeremy Freeman >Priority: Critical > > There is a breaking bug in PySpark's sampling methods when run with NumPy > v1.9. This is the version of NumPy included with the current Anaconda > distribution (v2.1); this is a popular distribution, and is likely to affect > many users. > Steps to reproduce are: > {code} > foo = sc.parallelize(range(1000),5) > foo.takeSample(False, 10) > {code} > Returns: > {code} > PythonException: Traceback (most recent call last): > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", > line 79, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 196, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 127, in dump_stream > for obj in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 185, in _batched > for item in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 116, in func > if self.getUniformSample(split) <= self._fraction: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 58, in getUniformSample > self.initRandomGenerator(split) > File > 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 44, in initRandomGenerator > self._random = numpy.random.RandomState(self._seed) > File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ > (numpy/random/mtrand/mtrand.c:7397) > File "mtrand.pyx", line 646, in mtrand.RandomState.seed > (numpy/random/mtrand/mtrand.c:7697) > ValueError: Seed must be between 0 and 4294967295 > {code} > In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use: > {code} > self._seed = seed if seed is not None else random.randint(0, sys.maxint) > {code} > In previous versions of NumPy a random seed larger than 2 ** 32 would > silently get truncated to 2 ** 32. This was fixed in a recent patch > (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). > But it means that PySpark’s code now causes an error. That line often yields > ints larger than 2 ** 32, which will reliably break any sampling operation in > PySpark. > I am putting a PR together now (the fix is very simple!). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
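[Editor's note] The report says the fix is very simple. A minimal sketch of the kind of change implied, not the actual PySpark patch (the helper name make_seed is hypothetical): fold any generated seed into the [0, 2 ** 32 - 1] range that numpy.random.RandomState accepts.

```python
import random

# numpy.random.RandomState raises ValueError for seeds outside this range
NUMPY_SEED_MAX = 2 ** 32 - 1


def make_seed(seed=None):
    """Return a seed guaranteed to fit NumPy's [0, 2**32 - 1] constraint.

    Hypothetical helper: PySpark's real fix may differ in detail.
    """
    if seed is None:
        seed = random.randint(0, NUMPY_SEED_MAX)
    # Fold larger values (e.g. from randint(0, sys.maxint)) into range
    # instead of letting RandomState reject them.
    return seed % (NUMPY_SEED_MAX + 1)
```

With this fold in place, an upstream value larger than 2 ** 32 no longer reaches RandomState unmodified.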
[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use: {code} self._seed = seed if seed is not None else 
random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error, because in the RDDSamplerBase class from pyspark.rddsampler, we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} And this often yields ints larger than 2 ** 32. Effectively, this reliably breaks any sampling operation in PySpark with this NumPy version. I am putting a PR together now (the fix is very simple!).
[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32.
This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error, because in the RDDSamplerBase class from pyspark.rddsampler, we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} And this often yields ints larger than 2 ** 32. Effectively, this reliably breaks any sampling operation in PySpark with this NumPy version. I am putting a PR together now (the fix is very simple!). > pyspark's sample methods do not work with NumPy 1.9 > --- > > Key: SPARK-3995 > URL: https://issues.apache.org/jira/browse/SPARK-3995 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Jeremy Freeman >Priority: Critical > > There is a breaking bug in PySpark's sampling methods when run with NumPy > v1.9. This is the version of NumPy included with the current Anaconda > distribution (v2.1); this is a popular distribution, and is likely to affect > many users. 
> Steps to reproduce are: > {code} > foo = sc.parallelize(range(1000),5) > foo.takeSample(False, 10) > {code} > Returns: > {code} > PythonException: Traceback (most recent call last): > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", > line 79, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 196, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 127, in dump_stream > for obj in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 185, in _batched > for item in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 116, in func > if self.getUniformSample(split) <= self._fraction: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 58, in getUniformSample > self.initRandomGenerator(split) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 44, in initRandomGenerator > self._random = numpy.random.RandomState(self._seed) > File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ > (numpy/random/mtrand/mtrand.c:7397) > File "mtrand.pyx", line 646, in mtrand.RandomState.seed > (numpy/random/mtrand/mtrand.c:7697) > ValueError: Seed must be between 0 and 4294967295 > {code} > In previous versions of NumPy a random seed larger than 2 ** 32 would silently get > truncated to 2 ** 32
[jira] [Resolved] (SPARK-3934) RandomForest bug in sanity check in DTStatsAggregator
[ https://issues.apache.org/jira/browse/SPARK-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3934. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2785 [https://github.com/apache/spark/pull/2785] > RandomForest bug in sanity check in DTStatsAggregator > - > > Key: SPARK-3934 > URL: https://issues.apache.org/jira/browse/SPARK-3934 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 1.2.0 > > > When run with a mix of unordered categorical and continuous features, on > multiclass classification, RandomForest fails. The bug is in the sanity > checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the > wrong indices for checking whether features are unordered. > Proposal: Remove the sanity checks since they are not really needed, and > since they would require DTStatsAggregator to keep track of an extra set of > indices (for the feature subset). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
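[Editor's note] The bug class described above, a sanity check that uses positions within a feature subset as if they were global feature indices, can be illustrated with a small pure-Python sketch. All names here are hypothetical; this is not the DTStatsAggregator code itself.

```python
# Global indices of unordered categorical features (hypothetical).
unordered_global = {5}
# Global indices of the features sampled for this tree/node (hypothetical).
feature_subset = [2, 5, 7]


def is_unordered_wrong(subset_pos):
    # BUG (the pattern the ticket describes): treats the position within
    # the subset as if it were a global feature index.
    return subset_pos in unordered_global


def is_unordered_right(subset_pos):
    # Correct: translate the subset position to a global index first.
    return feature_subset[subset_pos] in unordered_global
```

Position 1 in the subset is global feature 5, which is unordered; the buggy check looks up index 1 instead and answers wrongly, which is why the ticket proposes dropping the checks rather than threading an extra index mapping through DTStatsAggregator.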
[jira] [Created] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
Jeremy Freeman created SPARK-3995: - Summary: pyspark's sample methods do not work with NumPy 1.9 Key: SPARK-3995 URL: https://issues.apache.org/jira/browse/SPARK-3995 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175608#comment-14175608 ] Andrew Ash commented on SPARK-2546: --- We tested Josh's patch, confirming the fix and measuring the perf regression at ~8% > Configuration object thread safety issue > > > Key: SPARK-2546 > URL: https://issues.apache.org/jira/browse/SPARK-2546 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.1 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > > // observed in 0.9.1 but expected to exist in 1.0.1 as well > This ticket is copy-pasted from a thread on the dev@ list: > {quote} > We discovered a very interesting bug in Spark at work last week in Spark > 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to > thread safety issues. I believe it still applies in Spark 1.0.1 as well. > Let me explain: > Observations > - Was running a relatively simple job (read from Avro files, do a map, do > another map, write back to Avro files) > - 412 of 413 tasks completed, but the last task was hung in RUNNING state > - The 412 successful tasks completed in median time 3.4s > - The last hung task didn't finish even in 20 hours > - The executor with the hung task was responsible for 100% of one core of > CPU usage > - Jstack of the executor attached (relevant thread pasted below) > Diagnosis > After doing some code spelunking, we determined the issue was concurrent use > of a Configuration object for each task on an executor. In Hadoop each task > runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so > the single-threaded access assumptions of the Configuration object no longer > hold in Spark. > The specific issue is that the AvroRecordReader actually _modifies_ the > JobConf it's given when it's instantiated! It adds a key for the RPC > protocol engine in the process of connecting to the Hadoop FileSystem. 
When > many tasks start at the same time (like at the start of a job), many tasks > are adding this configuration item to the one Configuration object at once. > Internally Configuration uses a java.lang.HashMap, which isn't threadsafe… > The below post is an excellent explanation of what happens in the situation > where multiple threads insert into a HashMap at the same time. > http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html > The gist is that you have a thread following a cycle of linked list nodes > indefinitely. This exactly matches our observations of the 100% CPU core and > also the final location in the stack trace. > So it seems the way Spark shares a Configuration object between task threads > in an executor is incorrect. We need some way to prevent concurrent access > to a single Configuration object. > Proposed fix > We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets > its own JobConf object (and thus Configuration object). The optimization of > broadcasting the Configuration object across the cluster can remain, but on > the other side I think it needs to be cloned for each task to allow for > concurrent access. I'm not sure the performance implications, but the > comments suggest that the Configuration object is ~10KB so I would expect a > clone on the object to be relatively speedy. > Has this been observed before? Does my suggested fix make sense? I'd be > happy to file a Jira ticket and continue discussion there for the right way > to fix. > Thanks! > Andrew > P.S. For others seeing this issue, our temporary workaround is to enable > spark.speculation, which retries failed (or hung) tasks on other machines. 
> {noformat} > "Executor task launch worker-6" daemon prio=10 tid=0x7f91f01fe000 > nid=0x54b1 runnable [0x7f92d74f1000] >java.lang.Thread.State: RUNNABLE > at java.util.HashMap.transfer(HashMap.java:601) > at java.util.HashMap.resize(HashMap.java:581) > at java.util.HashMap.addEntry(HashMap.java:879) > at java.util.HashMap.put(HashMap.java:505) > at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) > at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) > at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) > at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:436) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:403)
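[Editor's note] The proposed fix, cloning the shared configuration per task so concurrent mutation cannot corrupt it, is a general pattern. A hedged Python sketch of "copy before concurrent use" (plain dicts and threads stand in for Hadoop's Configuration and Spark's task threads; key names are hypothetical):

```python
import copy
import threading

# Stands in for the broadcast Configuration shared by all tasks in an executor.
shared_conf = {"fs.defaultFS": "hdfs://nn:8020"}


def run_task(task_id, results):
    # Clone the shared config so per-task mutation (like AvroRecordReader
    # adding an RPC-engine key on connect) cannot race with other tasks.
    conf = copy.copy(shared_conf)
    conf["rpc.engine"] = "task-%d" % task_id  # hypothetical mutated key
    results[task_id] = conf


results = {}
threads = [threading.Thread(target=run_task, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each task mutates only its own copy, so the shared object is never written concurrently, which is exactly the property the infinite-loop HashMap race violates.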
[jira] [Created] (SPARK-3994) countByKey / countByValue do not go through Aggregator
Aaron Davidson created SPARK-3994: - Summary: countByKey / countByValue do not go through Aggregator Key: SPARK-3994 URL: https://issues.apache.org/jira/browse/SPARK-3994 Project: Spark Issue Type: Bug Reporter: Aaron Davidson The implementations of these methods are historical remnants of Spark from a time when the shuffle may have been nonexistent. Now, they can be simplified by plugging into reduceByKey(), potentially seeing performance and stability improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
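[Editor's note] The simplification proposed above, expressing countByKey as a map to (key, 1) followed by reduceByKey with addition, can be sketched locally in plain Python (no shuffle or partitioning; function names are illustrative only):

```python
from operator import add


def reduce_by_key(pairs, func):
    """Minimal local stand-in for RDD.reduceByKey."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out


def count_by_key(pairs):
    # countByKey as map-to-(k, 1) + reduceByKey(add), mirroring
    # the simplification the ticket proposes.
    return reduce_by_key(((k, 1) for k, _ in pairs), add)
```

countByValue follows the same shape with the value itself as the key.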
[jira] [Resolved] (SPARK-3985) json file path is not right
[ https://issues.apache.org/jira/browse/SPARK-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3985. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2834 [https://github.com/apache/spark/pull/2834] > json file path is not right > --- > > Key: SPARK-3985 > URL: https://issues.apache.org/jira/browse/SPARK-3985 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 1.2.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.2.0 > > > In examples/src/main/python/sql.py, we just add SPARK_HOME and "examples/..." > together instead of using "os.path.join", which would cause a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
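[Editor's note] The problem described is generic: plain string concatenation drops the path separator when SPARK_HOME has no trailing slash, while a join function inserts it. A small sketch (paths are illustrative; posixpath is used so the behavior is deterministic across platforms, where sql.py itself would use os.path.join):

```python
import posixpath

spark_home = "/opt/spark"  # hypothetical SPARK_HOME without a trailing slash
rel_path = "examples/src/main/resources/people.json"  # illustrative file

broken = spark_home + rel_path               # separator silently lost
fixed = posixpath.join(spark_home, rel_path)  # separator inserted as needed
```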
[jira] [Assigned] (SPARK-3994) countByKey / countByValue do not go through Aggregator
[ https://issues.apache.org/jira/browse/SPARK-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson reassigned SPARK-3994: - Assignee: Aaron Davidson > countByKey / countByValue do not go through Aggregator > -- > > Key: SPARK-3994 > URL: https://issues.apache.org/jira/browse/SPARK-3994 > Project: Spark > Issue Type: Bug >Reporter: Aaron Davidson >Assignee: Aaron Davidson > > The implementations of these methods are historical remnants of Spark from a > time when the shuffle may have been nonexistent. Now, they can be simplified > by plugging into reduceByKey(), potentially seeing performance and stability > improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3985) json file path is not right
[ https://issues.apache.org/jira/browse/SPARK-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3985: -- Affects Version/s: 1.2.0 > json file path is not right > --- > > Key: SPARK-3985 > URL: https://issues.apache.org/jira/browse/SPARK-3985 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 1.2.0 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.2.0 > > > In examples/src/main/python/sql.py, we just add SPARK_HOME and "examples/..." > together instead of using "os.path.join", which would cause a problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175557#comment-14175557 ] Reza Zadeh commented on SPARK-3434: --- Thanks Shivaram! As discussed over the phone, we will use your design and build upon it, so that you can focus on the linear algebraic operations such as TSQR. > Distributed block matrix > > > Key: SPARK-3434 > URL: https://issues.apache.org/jira/browse/SPARK-3434 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Shivaram Venkataraman > > This JIRA is for discussing distributed matrices stored in block > sub-matrices. The main challenge is the partitioning scheme to allow adding > linear algebra operations in the future, e.g.: > 1. matrix multiplication > 2. matrix factorization (QR, LU, ...) > Let's discuss the partitioning and storage and how they fit into the above > use cases. > Questions: > 1. Should it be backed by a single RDD that contains all of the sub-matrices > or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
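[Editor's note] One possible answer to question 1 above is a single collection of ((block_row, block_col), sub-matrix) pairs. A hedged local sketch of that keying scheme (a dict of lists stands in for an RDD of blocks; this is not the design the assignee ultimately chose):

```python
def to_blocks(matrix, block_size):
    """Split a dense matrix (list of rows) into ((block_row, block_col), block)
    entries - the keying a single-RDD block-matrix layout might use."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    blocks = {}
    for i in range(0, n_rows, block_size):
        for j in range(0, n_cols, block_size):
            block = [row[j:j + block_size] for row in matrix[i:i + block_size]]
            blocks[(i // block_size, j // block_size)] = block
    return blocks
```

Keying by block coordinates is what lets later operations like block matrix multiply co-locate the (i, k) and (k, j) blocks they need.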
[jira] [Resolved] (SPARK-3855) Binding Exception when running PythonUDFs
[ https://issues.apache.org/jira/browse/SPARK-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3855. Resolution: Fixed Fix Version/s: 1.2.0 > Binding Exception when running PythonUDFs > - > > Key: SPARK-3855 > URL: https://issues.apache.org/jira/browse/SPARK-3855 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.2.0 > > > {code} > from pyspark import * > from pyspark.sql import * > sc = SparkContext() > sqlContext = SQLContext(sc) > sqlContext.registerFunction("strlen", lambda string: len(string)) > sqlContext.inferSchema(sc.parallelize([Row(a="test")])).registerTempTable("test") > srdd = sqlContext.sql("SELECT strlen(a) FROM test WHERE strlen(a) > 1") > print srdd._jschema_rdd.baseSchemaRDD().queryExecution().toString() > print srdd.collect() > {code} > output: > {code} > == Parsed Logical Plan == > Project ['strlen('a) AS c0#1] > Filter ('strlen('a) > 1) > UnresolvedRelation None, test, None > == Analyzed Logical Plan == > Project [c0#1] > Project [pythonUDF#2 AS c0#1] > EvaluatePython PythonUDF#strlen(a#0) >Project [a#0] > Filter (CAST(pythonUDF#3, DoubleType) > CAST(1, DoubleType)) > EvaluatePython PythonUDF#strlen(a#0) > SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at > mapPartitions at SQLContext.scala:525) > == Optimized Logical Plan == > Project [pythonUDF#2 AS c0#1] > EvaluatePython PythonUDF#strlen(a#0) > Project [a#0] >Filter (CAST(pythonUDF#3, DoubleType) > 1.0) > EvaluatePython PythonUDF#strlen(a#0) > SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at > mapPartitions at SQLContext.scala:525) > == Physical Plan == > Project [pythonUDF#2 AS c0#1] > BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#5] > Project [a#0] >Filter (CAST(pythonUDF#3, DoubleType) > 1.0) > BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#3] > ExistingRdd [a#0], MapPartitionsRDD[7] 
at mapPartitions at > SQLContext.scala:525 > Code Generation: false > == RDD == > 14/10/08 15:03:00 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 9) > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: pythonUDF#2 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) > at > 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLi
[jira] [Created] (SPARK-3993) python worker may hang after reused from take()
Davies Liu created SPARK-3993: - Summary: python worker may hang after reused from take() Key: SPARK-3993 URL: https://issues.apache.org/jira/browse/SPARK-3993 Project: Spark Issue Type: Bug Components: PySpark Reporter: Davies Liu Priority: Blocker After take(), some garbage may be left in the socket, and the next task assigned to this worker will hang because of corrupted data. We should make sure the socket is clean before reusing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3993) python worker may hang after reused from take()
[ https://issues.apache.org/jira/browse/SPARK-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175536#comment-14175536 ] Apache Spark commented on SPARK-3993: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2838 > python worker may hang after reused from take() > --- > > Key: SPARK-3993 > URL: https://issues.apache.org/jira/browse/SPARK-3993 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > After take(), some garbage may be left in the socket, and the next task > assigned to this worker will hang because of corrupted data. > We should make sure the socket is clean before reusing it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175377#comment-14175377 ] Sean Owen commented on SPARK-2426: -- Regarding licensing, if the code is BSD licensed then it does not require an entry in NOTICE file (it's a "Category A" license), and entries shouldn't be added to NOTICE unless required. I believe that in this case we will need to reproduce the text of the license in LICENSE since it will not be included otherwise from a Maven artifact. So I suggest: don't change NOTICE, and move the license in LICENSE up to the section where other licenses are reproduced in full. It's a complex issue but this is my best understanding of the right thing to do. > Quadratic Minimization for MLlib ALS > > > Key: SPARK-2426 > URL: https://issues.apache.org/jira/browse/SPARK-2426 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Debasish Das >Assignee: Debasish Das > Original Estimate: 504h > Remaining Estimate: 504h > > Current ALS supports least squares and nonnegative least squares. > I presented ADMM and IPM based Quadratic Minimization solvers to be used for > the following ALS problems: > 1. ALS with bounds > 2. ALS with L1 regularization > 3. ALS with Equality constraint and bounds > Initial runtime comparisons are presented at Spark Summit. > http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark > Based on Xiangrui's feedback I am currently comparing the ADMM based > Quadratic Minimization solvers with IPM based QpSolvers and the default > ALS/NNLS. I will keep updating the runtime comparison results. > For integration the detailed plan is as follows: > 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization > 2. 
Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3979) Yarn backend's default file replication should match HDFS's default one
[ https://issues.apache.org/jira/browse/SPARK-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3979. -- Resolution: Fixed Fix Version/s: 1.2.0 > Yarn backend's default file replication should match HDFS's default one > --- > > Key: SPARK-3979 > URL: https://issues.apache.org/jira/browse/SPARK-3979 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 1.2.0 > > > This code in ClientBase.scala sets the replication used for files uploaded to > HDFS: > {code} > val replication = sparkConf.getInt("spark.yarn.submit.file.replication", > 3).toShort > {code} > Instead of a hardcoded "3" (which is the default value for HDFS), it should > be using the default value from the HDFS conf ("dfs.replication"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
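[Editor's note] The resolution logic the ticket proposes, prefer an explicit Spark setting and otherwise fall back to HDFS's own "dfs.replication" rather than a hardcoded 3, can be sketched with plain dicts standing in for SparkConf and the Hadoop Configuration (the function name is hypothetical):

```python
def submit_file_replication(spark_conf, hdfs_conf):
    """Replication factor for files uploaded at submit time.

    Explicit Spark setting wins; else HDFS's configured default; else 3
    (HDFS's own shipped default) as a last resort.
    """
    default = int(hdfs_conf.get("dfs.replication", 3))
    return int(spark_conf.get("spark.yarn.submit.file.replication", default))
```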
[jira] [Updated] (SPARK-3992) Spark 1.1.0 python binding cannot use any collections but list as Accumulators
[ https://issues.apache.org/jira/browse/SPARK-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Bogorad updated SPARK-3992: -- Description: A dictionary accumulator defined as a global variable is not visible inside a function called by "foreach()". Here is the minimal code snippet: from collections import defaultdict from pyspark import SparkContext from pyspark.accumulators import AccumulatorParam class DictAccumParam(AccumulatorParam): def zero(self, value): value.clear() def addInPlace(self, val1, val2): return val1 sc = SparkContext("local", "Dict Accumulator Bug") va = sc.accumulator(defaultdict(int), DictAccumParam()) def foo(x): global va print "va is:", va rdd = sc.parallelize([1,2,3]).foreach(foo) When run, the code snippet produced the following results: ... va is: None va is: None va is: None ... I have verified that the global variables are visible inside foo() called by foreach only if they are for scalars or lists like in the API doc at http://spark.apache.org/docs/latest/api/python/ The problem exists with standard dictionaries and collections. I also verified that if foo() is called directly, i.e. outside foreach, then the global variables are visible OK. was: A dictionary accumulator defined as a global variable is not visible inside a function called by "foreach()". Here is the minimal code snippet: from collections import defaultdict from pyspark import SparkContext from pyspark.accumulators import AccumulatorParam class DictAccumParam(AccumulatorParam): def zero(self, value): value.clear() def addInPlace(self, val1, val2): return val1 sc = SparkContext("local", "Dict Accumulator Bug") va = sc.accumulator(defaultdict(int), DictAccumParam()) def foo(x): global va print "va is:", va rdd = sc.parallelize([1,2,3]).foreach(foo) When run, the code snippet produced the following results: ... va is: None va is: None va is: None ...
I have verified that the global variables are visible inside foo() called by foreach only if they are for scalars or lists like in the API doc at http://spark.apache.org/docs/latest/api/python/ The problem exists with standard dictionaries and collections. > Spark 1.1.0 python binding cannot use any collections but list as Accumulators > -- > > Key: SPARK-3992 > URL: https://issues.apache.org/jira/browse/SPARK-3992 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC > 2014 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Walter Bogorad > > A dictionary accumulator defined as a global variable is not visible inside a > function called by "foreach()". > Here is the minimal code snippet: > from collections import defaultdict > from pyspark import SparkContext > from pyspark.accumulators import AccumulatorParam > class DictAccumParam(AccumulatorParam): > def zero(self, value): > value.clear() > def addInPlace(self, val1, val2): > return val1 > sc = SparkContext("local", "Dict Accumulator Bug") > va = sc.accumulator(defaultdict(int), DictAccumParam()) > def foo(x): > global va > print "va is:", va > rdd = sc.parallelize([1,2,3]).foreach(foo) > When ran the code snippet produced the following results: > ... > va is: None > va is: None > va is: None > ... > I have verified that the global variables are visible inside foo() called by > foreach only if they are for scalars or lists like in the API doc at > http://spark.apache.org/docs/latest/api/python/ > The problem exists with standard dictionaries and collections. > I also verified that if foo() is called directly, i.e. outside foreach, then > the global variables are visible OK. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3992) Spark 1.1.0 python binding cannot use any collections but list as Accumulators
Walter Bogorad created SPARK-3992: - Summary: Spark 1.1.0 python binding cannot use any collections but list as Accumulators Key: SPARK-3992 URL: https://issues.apache.org/jira/browse/SPARK-3992 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Reporter: Walter Bogorad

A dictionary accumulator defined as a global variable is not visible inside a function called by "foreach()". Here is the minimal code snippet:

from collections import defaultdict
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictAccumParam(AccumulatorParam):
    def zero(self, value):
        value.clear()
    def addInPlace(self, val1, val2):
        return val1

sc = SparkContext("local", "Dict Accumulator Bug")
va = sc.accumulator(defaultdict(int), DictAccumParam())

def foo(x):
    global va
    print "va is:", va

rdd = sc.parallelize([1,2,3]).foreach(foo)

When run, the code snippet produced the following results:
...
va is: None
va is: None
va is: None
...
I have verified that the global variables are visible inside foo() called by foreach only when they hold scalars or lists, as in the API doc at http://spark.apache.org/docs/latest/api/python/ The problem exists with standard dictionaries and collections.
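A likely factor here (an assumption, not confirmed in this ticket) is that the reported DictAccumParam.zero() returns None, because dict.clear() returns None; PySpark's AccumulatorParam contract expects zero() to return a fresh zero value and addInPlace() to merge and return. The sketch below illustrates that contract in pure Python, with no Spark dependency; the class and variable names are illustrative, not Spark API.

```python
# Pure-Python sketch (no Spark required) of the merge contract that
# pyspark.accumulators.AccumulatorParam expects. Note that zero() must
# RETURN a zero value; the snippet in this report returns None (because
# dict.clear() returns None), which would explain workers printing
# "va is: None". This is an illustration, not a confirmed fix.
from collections import defaultdict

class DictAccumParam(object):
    def zero(self, value):
        # Return a fresh empty dict instead of mutating in place.
        return defaultdict(int)

    def addInPlace(self, d1, d2):
        # Merge the right-hand dict into the left-hand one and return it.
        for k, v in d2.items():
            d1[k] += v
        return d1

if __name__ == "__main__":
    param = DictAccumParam()
    acc = param.zero(None)
    acc = param.addInPlace(acc, {"a": 1, "b": 2})
    acc = param.addInPlace(acc, {"a": 3})
    print(dict(acc))  # {'a': 4, 'b': 2}
```

In real use the class would subclass pyspark.accumulators.AccumulatorParam and be passed to sc.accumulator() as in the report above.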
[jira] [Closed] (SPARK-3935) Unused variable in PairRDDFunctions.scala
[ https://issues.apache.org/jira/browse/SPARK-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3935. Resolution: Fixed Fix Version/s: 1.2.0 > Unused variable in PairRDDFunctions.scala > - > > Key: SPARK-3935 > URL: https://issues.apache.org/jira/browse/SPARK-3935 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.1.0 > Reporter: wangfei > Priority: Minor > Fix For: 1.2.0 > > > There is an unused variable (count) in the saveAsHadoopDataset function in > PairRDDFunctions.scala. > It would be better to use it in a log statement recording the number of output records.
[jira] [Commented] (SPARK-3817) BlockManagerMasterActor: Got two different block manager registrations with Mesos
[ https://issues.apache.org/jira/browse/SPARK-3817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175279#comment-14175279 ] Brenden Matthews commented on SPARK-3817: - Appears to be a duplicate of SPARK-2445 and SPARK-3535. > BlockManagerMasterActor: Got two different block manager registrations with > Mesos > - > > Key: SPARK-3817 > URL: https://issues.apache.org/jira/browse/SPARK-3817 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Timothy Chen > > 14/10/06 09:34:40 ERROR BlockManagerMasterActor: Got two different block > manager registrations on 20140711-081617-711206558-5050-2543-5 > Here is the log from the mesos-slave where this container was running. > http://pastebin.com/Q1Cuzm6Q -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175167#comment-14175167 ] Debasish Das commented on SPARK-2426: - 1. [~mengxr] Our legal was clear that Stanford and Verizon copyright should show up on the COPYRIGHT.txt file...I saw other company's copyrights and I did not think it will be a big issue... 2. For the new interface, we have two more requirements: Convex loss function (supporting huber loss / hinge loss etc) and no explicit AtA construction since once we start scaling to 1 factors for LSA then AtA construction will start to choke...Can I work on your branch ? https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala 3. I agree to refactor the core solver including NNLS to breeze. That was the initial plan but since we wanted to test out the features in our internal datasets, integrating with mllib was faster. I am testing NNLS's CG implementation since as soon as explicit AtA construction is taken out, we need to rely on CG in-place of direct solvers...But I will refactor the solver out to breeze and that will take the copyright msgs to breeze as well. 4. Let me add the Matlab scripts and point to the repository. ECOS and MOSEK will need Matlab to run. PDCO and Proximal variants will run fine on Octave. I am not sure if MOSEK is supported on Octave. > Quadratic Minimization for MLlib ALS > > > Key: SPARK-2426 > URL: https://issues.apache.org/jira/browse/SPARK-2426 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Debasish Das >Assignee: Debasish Das > Original Estimate: 504h > Remaining Estimate: 504h > > Current ALS supports least squares and nonnegative least squares. > I presented ADMM and IPM based Quadratic Minimization solvers to be used for > the following ALS problems: > 1. ALS with bounds > 2. ALS with L1 regularization > 3. 
ALS with Equality constraint and bounds > Initial runtime comparisons are presented at Spark Summit. > http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark > Based on Xiangrui's feedback I am currently comparing the ADMM based > Quadratic Minimization solvers with IPM based QpSolvers and the default > ALS/NNLS. I will keep updating the runtime comparison results. > For integration the detailed plan is as follows: > 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization > 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
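For intuition on item 1 above ("ALS with bounds"), each ALS update solves a small box-constrained least-squares subproblem per factor. The sketch below is a plain projected-gradient loop for min ||Ax - b||^2 subject to lo <= x <= hi; it is NOT the ADMM/IPM solvers discussed in this ticket, and the fixed step size and all names are illustrative assumptions.

```python
# Minimal sketch of the per-factor subproblem shape in "ALS with bounds":
# minimize ||A x - b||^2 subject to lo <= x <= hi, via projected gradient.
# This is a didactic stand-in for the ADMM/IPM solvers discussed above.
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def clip(x, lo, hi):
    return [min(max(v, lo), hi) for v in x]

def bounded_least_squares(A, b, lo, hi, step=0.5, iters=200):
    # step must be below 2 / ||A^T A||; 0.5 suffices for the small demo below.
    At = transpose(A)
    x = clip([0.0] * len(A[0]), lo, hi)
    for _ in range(iters):
        r = [ax - bv for ax, bv in zip(matvec(A, x), b)]  # residual A x - b
        g = matvec(At, r)                                 # gradient A^T (A x - b)
        x = clip([xv - step * gv for xv, gv in zip(x, g)], lo, hi)
    return x

if __name__ == "__main__":
    A = [[1.0, 0.2], [0.2, 1.0]]
    b = [1.5, -0.5]
    # Unconstrained optimum lies outside [0, 1]^2; the constrained
    # optimum pins both coordinates at the bounds.
    print(bounded_least_squares(A, b, 0.0, 1.0))
```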
[jira] [Updated] (SPARK-3991) Not Serializable, NullPointer Exceptions in SQL server mode
[ https://issues.apache.org/jira/browse/SPARK-3991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] eblaas updated SPARK-3991: -- Attachment: not_serializable_exception.patch > Not Serializable, NullPointer Exceptions in SQL server mode > --- > > Key: SPARK-3991 > URL: https://issues.apache.org/jira/browse/SPARK-3991 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.1.0 > Reporter: eblaas > Priority: Blocker > Attachments: not_serializable_exception.patch > > > I'm working on connecting Mondrian with Spark SQL via JDBC. Good news: it works, but there are some bugs to fix. > I customized the HiveThriftServer2 class to load, transform and register tables (ETL) with the HiveContext. Data tables are generated from Cassandra and from a relational database. > * 1st problem: > hiveContext.registerRDDAsTable(treeSchema,"tree") does not register the table in the Hive metastore ("show tables;" via JDBC does not list the table, but I can query it, e.g. select * from tree). Dirty workaround: create a table with the same name and schema; this was necessary because Mondrian validates table existence. > hiveContext.sql("CREATE TABLE tree (dp_id BIGINT, h1 STRING, h2 STRING, h3 STRING)") > * 2nd problem: > Mondrian creates complex joins, which result in serialization exceptions. Two classes in hiveUdfs.scala have to be made serializable: > - DeferredObjectAdapter and HiveGenericUdaf > * 3rd problem: > NullPointerException in InMemoryRelation: > 42: override lazy val statistics = Statistics(sizeInBytes = child.sqlContext.defaultSizeInBytes) > The sqlContext in child was null; quick fix: use a constant default value instead: > override lazy val statistics = Statistics(sizeInBytes = 1) > I'm not sure how to fix these bugs properly, but with the patch file it works at least.
[jira] [Created] (SPARK-3991) Not Serializable, NullPointer Exceptions in SQL server mode
eblaas created SPARK-3991: - Summary: Not Serializable, NullPointer Exceptions in SQL server mode Key: SPARK-3991 URL: https://issues.apache.org/jira/browse/SPARK-3991 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: eblaas Priority: Blocker

I'm working on connecting Mondrian with Spark SQL via JDBC. Good news: it works, but there are some bugs to fix. I customized the HiveThriftServer2 class to load, transform and register tables (ETL) with the HiveContext. Data tables are generated from Cassandra and from a relational database.

* 1st problem:
hiveContext.registerRDDAsTable(treeSchema,"tree") does not register the table in the Hive metastore ("show tables;" via JDBC does not list the table, but I can query it, e.g. select * from tree). Dirty workaround: create a table with the same name and schema; this was necessary because Mondrian validates table existence.
hiveContext.sql("CREATE TABLE tree (dp_id BIGINT, h1 STRING, h2 STRING, h3 STRING)")

* 2nd problem:
Mondrian creates complex joins, which result in serialization exceptions. Two classes in hiveUdfs.scala have to be made serializable:
- DeferredObjectAdapter and HiveGenericUdaf

* 3rd problem:
NullPointerException in InMemoryRelation:
42: override lazy val statistics = Statistics(sizeInBytes = child.sqlContext.defaultSizeInBytes)
The sqlContext in child was null; quick fix: use a constant default value instead:
override lazy val statistics = Statistics(sizeInBytes = 1)

I'm not sure how to fix these bugs properly, but with the patch file it works at least.
[jira] [Updated] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark
[ https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gen TANG updated SPARK-3990: I think this problem is related to https://issues.apache.org/jira/browse/SPARK-1977 > kryo.KryoException caused by ALS.trainImplicit in pyspark > - > > Key: SPARK-3990 > URL: https://issues.apache.org/jira/browse/SPARK-3990 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.1.0 > Environment: 5 slaves cluster(m3.large) in AWS launched by spark-ec2 > Linux > Python 2.6.8 >Reporter: Gen TANG > Labels: test > > When we tried ALS.trainImplicit() in pyspark environment, it only works for > iterations = 1. What is more strange, it is that if we try the same code in > Scala, it works very well.(I did several test, by now, in Scala > ALS.trainImplicit works) > For example, the following code: > {code:title=test.py|borderStyle=solid} > r1 = (1, 1, 1.0) > r2 = (1, 2, 2.0) > r3 = (2, 1, 2.0) > ratings = sc.parallelize([r1, r2, r3]) > model = ALS.trainImplicit(ratings, 1) > '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)''' > {code} > It will cause the failed stage at count at ALS.scala:314 Info as: > {code:title=error information provided by ganglia} > Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most > recent failure: Lost task 6.3 in stage 90.0 (TID 484, > ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: > java.lang.ArrayStoreException: scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > 
com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) > org.apache.spark.scheduler.Task.run(Task.scala:54) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > {code} > In the log of slave which failed the task, it has: > {code:title=error information in the log of slave} >
[jira] [Updated] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark
[ https://issues.apache.org/jira/browse/SPARK-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gen TANG updated SPARK-3990: Description: When we tried ALS.trainImplicit() in pyspark environment, it only works for iterations = 1. What is more strange, it is that if we try the same code in Scala, it works very well.(I did several test, by now, in Scala ALS.trainImplicit works) For example, the following code: {code:title=test.py|borderStyle=solid} r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model = ALS.trainImplicit(ratings, 1) '''by default iterations = 5 or model = ALS.trainImplicit(ratings, 1, 2)''' {code} It will cause the failed stage at count at ALS.scala:314 Info as: {code:title=error information provided by ganglia} Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most recent failure: Lost task 6.3 in stage 90.0 (TID 484, ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: {code} In the log of slave which failed the task, it has: {code:title=error information in the log of slave} 14/10/17 13:20:54 ERROR executor.Executor: Exception in task 6.0 in stage 90.0 (TID 465) 
com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) at com.
[jira] [Created] (SPARK-3990) kryo.KryoException caused by ALS.trainImplicit in pyspark
Gen TANG created SPARK-3990: --- Summary: kryo.KryoException caused by ALS.trainImplicit in pyspark Key: SPARK-3990 URL: https://issues.apache.org/jira/browse/SPARK-3990 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.1.0 Environment: 5 slaves cluster(m3.large) in AWS launched by spark-ec2 Linux Python 2.6.8 Reporter: Gen TANG When we tried ALS.trainImplicit() in pyspark environment, it only works for iterations = 1. For example, the following code: r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model = ALS.trainImplicit(ratings, 1) [by default iterations = 5] or model = ALS.trainImplicit(ratings, 1, 2) It will cause the failed stage at count at ALS.scala:314 Info as: Job aborted due to stage failure: Task 6 in stage 90.0 failed 4 times, most recent failure: Lost task 6.3 in stage 90.0 (TID 484, ip-172-31-35-238.ec2.internal): com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: In the log of slave which failed the task, it has: 14/10/17 13:20:54 ERROR executor.Executor: Exception in task 6.0 in stage 90.0 (TID 465) com.esotericsoftware.kryo.KryoException: 
java.lang.ArrayStoreException: scala.collection.mutable.HashSet Serialization trace: shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) at com.esotericsoftware.kryo.Kryo.re
[jira] [Commented] (SPARK-2649) EC2: Ganglia-httpd broken on HVM based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175079#comment-14175079 ] Fred Cons commented on SPARK-2649: -- [~npanj] [~rxin] I ran into this issue, and the following diff worked for me: https://github.com/mesos/spark-ec2/pull/76 > EC2: Ganglia-httpd broken on HVM based machines like r3.4xlarge > --- > > Key: SPARK-2649 > URL: https://issues.apache.org/jira/browse/SPARK-2649 > Project: Spark > Issue Type: Bug > Components: EC2 > Affects Versions: 1.1.0 > Reporter: npanj > Priority: Minor > > On EC2 the httpd daemon doesn't start (so Ganglia is not accessible) on HVM machines like r3.4xlarge deployed by the spark-ec2 script. > Here is an example error (it seems to be an issue with the default AMI "spark.ami.hvm.v14 (ami-35b1885c)"). Here is the error message: > -- > Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory > --
[jira] [Commented] (SPARK-576) Design and develop a more precise progress estimator
[ https://issues.apache.org/jira/browse/SPARK-576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175023#comment-14175023 ] Dev Lakhani commented on SPARK-576: --- I've created a PR for this: https://github.com/apache/spark/pull/2837/ > Design and develop a more precise progress estimator > > > Key: SPARK-576 > URL: https://issues.apache.org/jira/browse/SPARK-576 > Project: Spark > Issue Type: Improvement >Reporter: Mosharaf Chowdhury > > In addition to /, we need to have something that > says . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
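The issue text above lost its angle-bracketed placeholders in extraction, but the title makes the intent clear: go beyond a completed/total task count toward an estimate of time remaining. As a point of comparison for what a "more precise" estimator must beat, here is the naive constant-rate estimate; all names are hypothetical, not Spark API.

```python
# Naive rate-based progress estimate, purely illustrative of the baseline a
# "more precise progress estimator" would improve on: assume the observed
# task completion rate stays constant for the rest of the job.
def estimate_remaining_seconds(tasks_done, tasks_total, elapsed_seconds):
    if tasks_done <= 0:
        return None  # no rate information yet
    rate = tasks_done / float(elapsed_seconds)      # tasks per second so far
    return (tasks_total - tasks_done) / rate        # seconds left at that rate

if __name__ == "__main__":
    # 40 of 100 tasks finished in 20 s -> 2 tasks/s -> 30 s remaining
    print(estimate_remaining_seconds(40, 100, 20.0))  # 30.0
```

A more precise estimator would account for skewed task durations and stage boundaries rather than assuming a constant rate.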
[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a YARN application fails
[ https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174999#comment-14174999 ] Thomas Graves commented on SPARK-3877: -- [~vanzin] I agree. The user code should exit with a non-zero status or throw on failure. If it doesn't, there is nothing we can do about it, other than tell users to change their code to exit properly if they want to see a failure status. Perhaps we should also better document what they should do on failure. It's basically the same as what I did for the exit codes in ApplicationMaster: it relies on user code exiting non-zero or throwing. The only other option would be for us to actually look at the details in the scheduler ourselves to try to determine what happened, i.e. we see stage X failed or Y tasks failed, etc. I would say we do that later if it's needed. > The exit code of spark-submit is still 0 when a YARN application fails > --- > > Key: SPARK-3877 > URL: https://issues.apache.org/jira/browse/SPARK-3877 > Project: Spark > Issue Type: Bug > Components: YARN > Reporter: Shixiong Zhu > Priority: Minor > Labels: yarn > > When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. It's hard to write automatic scripts to run Spark jobs on YARN because the failure cannot be detected in these scripts.
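The pattern the comment above asks of application authors can be sketched as follows: let failures surface as a non-zero exit status so wrapper scripts can detect them. The job function and names here are stand-ins, not Spark API.

```python
# Sketch of a driver-side pattern for propagating job failure to the
# process exit status, so shell wrappers around spark-submit can detect
# it. run_with_exit_status and `job` are hypothetical names.
import traceback

def run_with_exit_status(job):
    try:
        job()        # the user's Spark job would run here
        return 0
    except Exception:
        traceback.print_exc()
        return 1     # non-zero so the launching shell sees the failure

if __name__ == "__main__":
    # A real driver would pass this to sys.exit(...); printed here instead.
    print(run_with_exit_status(lambda: None))  # 0
```

In a real driver the return value would be passed to sys.exit(), which is what makes the status visible to the calling script.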
[jira] [Commented] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174981#comment-14174981 ] Ilya Ganelin commented on SPARK-3694: - Hello. I would like to work on this. Can you please assign it to me? Thank you. > Allow printing object graph of tasks/RDD's with a debug flag > > > Key: SPARK-3694 > URL: https://issues.apache.org/jira/browse/SPARK-3694 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell > Labels: starter > > This would be useful for debugging extra references inside of RDD's > Here is an example for inspiration: > http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html > We'd want to print this trace for both the RDD serialization inside of the > DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3968) Use parquet-mr filter2 api in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-3968: -- Description: The parquet-mr project has introduced a new filter API (filter2), along with several fixes, such as filtering on OPTIONAL columns. It can also eliminate entire RowGroups based on column statistics such as min/max. We can leverage that to further improve the performance of queries with filters. The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself (will create a separate ticket for that). was: The parquet-mr project has introduced a new filter API (filter2), along with several fixes, such as filtering on OPTIONAL columns. It can also eliminate entire RowGroups based on column statistics such as min/max. We can leverage that to further improve the performance of queries with filters. The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself. Summary: Use parquet-mr filter2 api in spark sql (was: Using parquet-mr filter2 api in spark sql, add a custom filter for InSet clause) > Use parquet-mr filter2 api in spark sql > --- > > Key: SPARK-3968 > URL: https://issues.apache.org/jira/browse/SPARK-3968 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yash Datta >Priority: Minor > Fix For: 1.1.1 > > > The parquet-mr project has introduced a new filter API (filter2), along with several > fixes, such as filtering on OPTIONAL columns. It can also eliminate > entire RowGroups based on column statistics such as min/max. > We can leverage that to further improve the performance of queries with filters. > The filter2 API also introduces the ability to create custom filters.
We can create a > custom filter for the optimized In clause (InSet), so that elimination > happens in the ParquetRecordReader itself (will create a separate ticket for > that). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
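The RowGroup elimination described above can be sketched as follows. This is an illustrative model only, not the actual parquet-mr filter2 API: the class and method names (`RowGroupStats`, `canDrop`, `groupsToRead`) are hypothetical, but the idea is the same one filter2 implements, namely that a predicate such as `x > 100` can skip any RowGroup whose max statistic rules out a match.

```java
// Simplified model of min/max-statistics-based RowGroup skipping,
// in the spirit of the parquet-mr filter2 API. All names here are
// hypothetical stand-ins, not real parquet-mr types.
import java.util.ArrayList;
import java.util.List;

public class RowGroupSkipDemo {
    // Hypothetical stand-in for one RowGroup's column statistics.
    static final class RowGroupStats {
        final long min, max;
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // A "greater than" predicate can drop any RowGroup whose max <= threshold,
    // because no row inside that group can possibly match.
    static boolean canDrop(RowGroupStats stats, long gtThreshold) {
        return stats.max <= gtThreshold;
    }

    static List<Integer> groupsToRead(List<RowGroupStats> groups, long gtThreshold) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < groups.size(); i++) {
            if (!canDrop(groups.get(i), gtThreshold)) keep.add(i);
        }
        return keep;
    }

    public static void main(String[] args) {
        List<RowGroupStats> groups = List.of(
            new RowGroupStats(0, 50),    // entirely below threshold: skipped
            new RowGroupStats(40, 200),  // straddles threshold: must be read
            new RowGroupStats(300, 900)  // entirely above: must be read
        );
        // For the predicate x > 100, only groups 1 and 2 need to be read.
        System.out.println(groupsToRead(groups, 100));
    }
}
```

The win is that skipped groups are never decompressed or decoded at all, which is why pushing filters into the reader beats filtering rows afterward.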
[jira] [Commented] (SPARK-3989) Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command
[ https://issues.apache.org/jira/browse/SPARK-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174879#comment-14174879 ] Apache Spark commented on SPARK-3989: - User 'ziky90' has created a pull request for this issue: https://github.com/apache/spark/pull/2836 > Added the possibility to install Python packages via pip for pyspark directly > from the ./spark_ec2 command > --- > > Key: SPARK-3989 > URL: https://issues.apache.org/jira/browse/SPARK-3989 > Project: Spark > Issue Type: New Feature > Components: EC2, PySpark >Reporter: Jan Zikeš >Priority: Minor >
[jira] [Commented] (SPARK-3989) Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command
[ https://issues.apache.org/jira/browse/SPARK-3989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174876#comment-14174876 ] Jan Zikeš commented on SPARK-3989: -- Implemented and sent a pull request here: https://github.com/apache/spark/pull/2836 > Added the possibility to install Python packages via pip for pyspark directly > from the ./spark_ec2 command > --- > > Key: SPARK-3989 > URL: https://issues.apache.org/jira/browse/SPARK-3989 > Project: Spark > Issue Type: New Feature > Components: EC2, PySpark >Reporter: Jan Zikeš >Priority: Minor >
[jira] [Created] (SPARK-3989) Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command
Jan Zikeš created SPARK-3989: Summary: Added the possibility to install Python packages via pip for pyspark directly from the ./spark_ec2 command Key: SPARK-3989 URL: https://issues.apache.org/jira/browse/SPARK-3989 Project: Spark Issue Type: New Feature Components: EC2, PySpark Reporter: Jan Zikeš Priority: Minor
[jira] [Commented] (SPARK-3900) ApplicationMaster's shutdown hook fails and IllegalStateException is thrown.
[ https://issues.apache.org/jira/browse/SPARK-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174857#comment-14174857 ] Apache Spark commented on SPARK-3900: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2755 > ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. > > > Key: SPARK-3900 > URL: https://issues.apache.org/jira/browse/SPARK-3900 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 > Environment: Hadoop 0.23 >Reporter: Kousuke Saruta >Priority: Critical > > ApplicationMaster registers a shutdown hook that calls > ApplicationMaster#cleanupStagingDir. > cleanupStagingDir invokes FileSystem.get(yarnConf), which invokes > FileSystem.getInternal. FileSystem.getInternal also registers a shutdown hook. > In the FileSystem of Hadoop 0.23, the shutdown hook registration does not > check whether shutdown is already in progress (in 2.2, it does): > {code} > // 0.23 > if (map.isEmpty() ) { > ShutdownHookManager.get().addShutdownHook(clientFinalizer, > SHUTDOWN_HOOK_PRIORITY); > } > {code} > {code} > // 2.2 > if (map.isEmpty() > && !ShutdownHookManager.get().isShutdownInProgress()) { >ShutdownHookManager.get().addShutdownHook(clientFinalizer, > SHUTDOWN_HOOK_PRIORITY); > } > {code} > Thus, in 0.23, another shutdown hook can be registered while > ApplicationMaster's shutdown hook runs. > This causes an IllegalStateException as follows.
> {code} > java.lang.IllegalStateException: Shutdown in progress, cannot add a > shutdownHook > at > org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162) > at > org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code}
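The difference between the two Hadoop versions quoted in the issue can be sketched with a toy hook manager. This is an illustrative stand-in only, not the real `org.apache.hadoop.util.ShutdownHookManager`: the 0.23-style method registers unconditionally and so throws when called from within a running shutdown hook, while the 2.2-style method simply skips registration once shutdown has begun.

```java
// Toy model contrasting Hadoop 0.23-style and 2.2-style shutdown-hook
// registration. Hypothetical class; not the real Hadoop ShutdownHookManager.
public class HookManagerDemo {
    private boolean shutdownInProgress = false;
    private int hooks = 0;

    // Simulates the JVM entering shutdown and running existing hooks.
    void beginShutdown() { shutdownInProgress = true; }

    // 0.23-style: no guard, so registering from inside a shutdown hook throws.
    void addHook023(Runnable hook) {
        if (shutdownInProgress) {
            throw new IllegalStateException("Shutdown in progress, cannot add a shutdownHook");
        }
        hooks++;
    }

    // 2.2-style: guarded by isShutdownInProgress(), registration is skipped.
    void addHook22(Runnable hook) {
        if (!shutdownInProgress) {
            hooks++;
        }
    }

    int hookCount() { return hooks; }
}
```

This mirrors the failure path in the stack trace: `cleanupStagingDir` runs inside the ApplicationMaster's shutdown hook, `FileSystem.getInternal` then tries to register the `clientFinalizer` hook, and on 0.23 that registration has no guard.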
[jira] [Closed] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3606. Resolution: Fixed Fix Version/s: 1.1.1 > Spark-on-Yarn AmIpFilter does not work with Yarn HA. > > > Key: SPARK-3606 > URL: https://issues.apache.org/jira/browse/SPARK-3606 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.1.1, 1.2.0 > > > The current IP filter only considers one of the RMs in an HA setup. If the > active RM is not the configured one, you get a "connection refused" error > when clicking on the Spark AM links in the RM UI. > Similar to YARN-1811, but for Spark.
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174824#comment-14174824 ] Xiangrui Meng commented on SPARK-2426: -- [~debasish83] Thanks for working on this feature! This is definitely lots of work. We need to figure out a couple of high-level questions before looking into the code: 1. License. There are two files that require a special license: proximal, which ports cvxgrp/proximal (BSD), and QPMinimizer: {code} ... distributed with Copyright (c) 2014, Debasish Das (Verizon), all rights reserved. {code} Code contribution to Apache follows the ICLA: http://www.apache.org/licenses/icla.txt . I'm not familiar with the terms. I saw {code} Except for the license granted herein to the Foundation and recipients of software distributed by the Foundation, You reserve all right, title, and interest in and to Your Contributions. {code} My understanding is that if you want your code distributed under the Apache License, we don't need a special notice about your rights. Please check with Verizon's legal team to make sure they are okay with it. It would be really helpful if someone could explain this in more detail. 2. Interface. I'm doing a refactoring of ALS (SPARK-3541). I hope we can decouple the solvers (LS, QP) from ALS. In https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala the subproblem is wrapped in a NormalEquation, which stores AtA, Atb, and n. A Cholesky solver takes a NormalEquation instance, solves it, and returns the solution. We can plug in other solvers as long as NormalEquation provides all the information we need. Does this apply to your use cases? For public APIs, we should restrict parameters to simple types, for example, constraint = "none" | "nonnegative" | "box". This is good for adding Python APIs. Those options should be sufficient for normal use cases. We can provide a developer API that allows advanced users to plug in their own solvers.
You can check the current proposal of parameters at SPARK-3530. 3. Where to put the implementation? Including MLlib's NNLS, those solvers are for local problems. What sounds ideal to me is breeze.optimize, which already contains several optimization solvers; we use the LBFGS implemented there and maybe OWLQN soon. 4. This PR definitely needs some time for testing. The feature freeze deadline for v1.2 is Oct 31. I cannot promise time for code review given my current bandwidth. It would be great if you could share your MATLAB code (hopefully Octave compatible) and some performance results, so more developers can help test. > Quadratic Minimization for MLlib ALS > > > Key: SPARK-2426 > URL: https://issues.apache.org/jira/browse/SPARK-2426 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Debasish Das >Assignee: Debasish Das > Original Estimate: 504h > Remaining Estimate: 504h > > Current ALS supports least squares and nonnegative least squares. > I presented ADMM and IPM based Quadratic Minimization solvers to be used for > the following ALS problems: > 1. ALS with bounds > 2. ALS with L1 regularization > 3. ALS with Equality constraint and bounds > Initial runtime comparisons are presented at Spark Summit. > http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark > Based on Xiangrui's feedback I am currently comparing the ADMM based > Quadratic Minimization solvers with IPM based QpSolvers and the default > ALS/NNLS. > I will keep updating the runtime comparison results. > For integration the detailed plan is as follows: > 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization > 2. Integrate QuadraticMinimizer in mllib ALS
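The NormalEquation-plus-Cholesky design mentioned in point 2 can be sketched as follows. This is a minimal illustrative version, not the actual SimpleALS code: it assumes the decoupled-solver shape described above, where `add()` folds each rating's feature row `a` and target `b` into AtA and Atb, and a solver then works only on that accumulated state (so a QP or constrained solver could be swapped in at the same seam).

```java
// Illustrative NormalEquation accumulator with a plain Cholesky solve,
// sketching the decoupled-solver interface described in the comment.
// Hypothetical class names; the real SPARK-3541 code may differ.
public class NormalEquationDemo {
    final int k;
    final double[][] ata; // accumulated A^T A, k x k
    final double[] atb;   // accumulated A^T b, length k

    NormalEquationDemo(int k) {
        this.k = k;
        this.ata = new double[k][k];
        this.atb = new double[k];
    }

    // Fold one observation (feature row a, target b) into the normal equation.
    void add(double[] a, double b) {
        for (int i = 0; i < k; i++) {
            atb[i] += a[i] * b;
            for (int j = 0; j < k; j++) ata[i][j] += a[i] * a[j];
        }
    }

    // Solve (AtA) x = Atb by Cholesky factorization AtA = L L^T.
    // Assumes AtA is positive definite (in ALS, regularization lambda*I ensures this).
    double[] solve() {
        double[][] l = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j <= i; j++) {
                double s = ata[i][j];
                for (int m = 0; m < j; m++) s -= l[i][m] * l[j][m];
                l[i][j] = (i == j) ? Math.sqrt(s) : s / l[j][j];
            }
        }
        double[] y = new double[k]; // forward substitution: L y = Atb
        for (int i = 0; i < k; i++) {
            double s = atb[i];
            for (int m = 0; m < i; m++) s -= l[i][m] * y[m];
            y[i] = s / l[i][i];
        }
        double[] x = new double[k]; // back substitution: L^T x = y
        for (int i = k - 1; i >= 0; i--) {
            double s = y[i];
            for (int m = i + 1; m < k; m++) s -= l[m][i] * x[m];
            x[i] = s / l[i][i];
        }
        return x;
    }
}
```

The point of the seam is that AtA, Atb, and n fully describe the least-squares subproblem, so a box-constrained or QP solver only needs to accept the same accumulated state.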
[jira] [Updated] (SPARK-3541) Improve ALS internal storage
[ https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3541: - Shepherd: (was: Xiangrui Meng) > Improve ALS internal storage > > > Key: SPARK-3541 > URL: https://issues.apache.org/jira/browse/SPARK-3541 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Original Estimate: 96h > Remaining Estimate: 96h > > The internal storage of ALS uses many small objects, which increases GC > pressure and makes it difficult to scale ALS to very large data, e.g., 50 > billion ratings. In such cases, a full GC may take more than 10 minutes to > finish. That is longer than the default heartbeat timeout, and hence executors > will be removed under default settings. > We can use primitive arrays to reduce the number of objects significantly. > This requires a big change to the ALS implementation.
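The primitive-array storage change described in the issue can be sketched as follows. This is an illustrative contrast only (hypothetical class names, not the actual refactored ALS code): one object per rating means billions of small heap objects for the GC to trace, whereas parallel primitive arrays hold the same fields in a handful of allocations.

```java
// Illustrative contrast between object-per-rating storage and a columnar
// primitive-array layout, as proposed in the issue. Hypothetical classes.
public class RatingStorageDemo {
    // Object-per-rating layout: one heap object per rating (what the issue
    // wants to move away from at 50-billion-rating scale).
    static final class Rating {
        final int user, item;
        final float value;
        Rating(int user, int item, float value) {
            this.user = user; this.item = item; this.value = value;
        }
    }

    // Columnar layout: the same information in three primitive arrays,
    // i.e. three objects total regardless of the number of ratings.
    static final class RatingBlock {
        final int[] users;
        final int[] items;
        final float[] values;
        RatingBlock(int n) {
            users = new int[n];
            items = new int[n];
            values = new float[n];
        }
        void set(int i, int user, int item, float value) {
            users[i] = user; items[i] = item; values[i] = value;
        }
        double sum() {
            double s = 0;
            for (float v : values) s += v;
            return s;
        }
    }

    public static void main(String[] args) {
        RatingBlock block = new RatingBlock(3);
        block.set(0, 1, 10, 4.0f);
        block.set(1, 1, 11, 3.5f);
        block.set(2, 2, 10, 5.0f);
        System.out.println(block.sum());
    }
}
```

Beyond the object count, the columnar layout also gives contiguous, cache-friendly scans, which is why this kind of restructuring touches most of the ALS implementation.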