[jira] [Created] (SPARK-16688) OpenHashSet.MAX_CAPACITY is always based on Int even when using Long
Ben McCann created SPARK-16688:
----------------------------------

             Summary: OpenHashSet.MAX_CAPACITY is always based on Int even when using Long
                 Key: SPARK-16688
                 URL: https://issues.apache.org/jira/browse/SPARK-16688
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.6.2, 2.0.0
            Reporter: Ben McCann

MAX_CAPACITY is hardcoded to a value of 1073741824:
{code}
val MAX_CAPACITY = 1 << 30

class LongHasher extends Hasher[Long] {
  override def hash(o: Long): Int = (o ^ (o >>> 32)).toInt
}
{code}
I'd like to stick more than 1B items in my hashmap. Spark's all about big data, right?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
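For readers following SPARK-16688, here is a minimal sketch of where the 1 << 30 figure comes from. It assumes the set keeps its slots in a single JVM array with a power-of-two capacity; the report does not state this. The names CapacitySketch, MaxCapacity and hashLong are hypothetical and are not part of Spark. The sketch shows that 1 << 30 is the largest power of two that fits in a positive Int, and that the quoted hash folds a 64-bit key into 32 bits regardless of the key type.

```scala
object CapacitySketch {
  // Same value as the constant quoted in the report.
  val MaxCapacity: Int = 1 << 30

  // Same folding as the quoted LongHasher: XOR the high and low 32 bits.
  def hashLong(o: Long): Int = (o ^ (o >>> 32)).toInt

  def main(args: Array[String]): Unit = {
    println(MaxCapacity)        // 1073741824
    // The next power of two overflows Int and becomes negative.
    println(1 << 31)            // -2147483648
    // Two distinct Long keys that fold to the same 32-bit hash.
    println(hashLong(1L))       // 1
    println(hashLong(1L << 32)) // 1
  }
}
```

Under that assumption, changing the key type to Long does not raise the limit, because the limit comes from Int-based array indexing rather than from the element type.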
[jira] [Commented] (SPARK-16658) Add EdgePartition.withVertexAttributes
    [ https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387291#comment-15387291 ]

Ben McCann commented on SPARK-16658:
------------------------------------

Also, regarding whether GraphX is still being updated, it appears so as the last commit was only two days ago: https://github.com/apache/spark/commit/5d92326be76cb15edc6e18e94a373e197f696803

> Add EdgePartition.withVertexAttributes
> --------------------------------------
>
>                 Key: SPARK-16658
>                 URL: https://issues.apache.org/jira/browse/SPARK-16658
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Ben McCann
>
> I'm using cloudml/zen, which has forked graphx. I'd like to see their changes
> upstreamed, so that they can go back to using the upstream graphx instead of
> having a fork.
> Their implementation of withVertexAttributes:
> https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala
> Their usage of that method:
> https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala
[jira] [Commented] (SPARK-16658) Add EdgePartition.withVertexAttributes
    [ https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387289#comment-15387289 ]

Ben McCann commented on SPARK-16658:
------------------------------------

The code is licensed under an Apache 2 license, so it's fine to include. I've informed the authors and they responded positively (see here: https://github.com/cloudml/zen/issues/58). Let me know if you have any remaining concerns.
[jira] [Created] (SPARK-16658) Add EdgePartition.withVertexAttributes
Ben McCann created SPARK-16658:
----------------------------------

             Summary: Add EdgePartition.withVertexAttributes
                 Key: SPARK-16658
                 URL: https://issues.apache.org/jira/browse/SPARK-16658
             Project: Spark
          Issue Type: Improvement
            Reporter: Ben McCann

I'm using cloudml/zen, which has forked graphx. I'd like to see their changes upstreamed, so that they can go back to using the upstream graphx instead of having a fork.

Their implementation of withVertexAttributes:
https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala

Their usage of that method:
https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala
[jira] [Created] (SPARK-16617) Upgrade to Avro 1.8.x
Ben McCann created SPARK-16617:
----------------------------------

             Summary: Upgrade to Avro 1.8.x
                 Key: SPARK-16617
                 URL: https://issues.apache.org/jira/browse/SPARK-16617
             Project: Spark
          Issue Type: Improvement
            Reporter: Ben McCann

Avro 1.8 makes Avro objects serializable so that you can easily have an RDD containing Avro objects.

See https://issues.apache.org/jira/browse/AVRO-1502
[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
    [ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371816#comment-15371816 ]

Ben McCann commented on SPARK-6567:
-----------------------------------

[~hucheng] can you share your code for this?

> Large linear model parallelism via a join and reduceByKey
> ---------------------------------------------------------
>
>                 Key: SPARK-6567
>                 URL: https://issues.apache.org/jira/browse/SPARK-6567
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Reza Zadeh
>         Attachments: model-parallelism.pptx
>
>
> To train a linear model, each training point in the training set needs its
> dot product computed against the model, per iteration. If the model is large
> (too large to fit in memory on a single machine) then SPARK-4590 proposes
> using parameter server.
> There is an easier way to achieve this without parameter servers. In
> particular, if the data is held as a BlockMatrix and the model as an RDD,
> then each block can be joined with the relevant part of the model, followed
> by a reduceByKey to compute the dot products.
> This obviates the need for a parameter server, at least for linear models.
> However, it's unclear how it compares performance-wise to parameter servers.
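The join-then-reduceByKey idea quoted in SPARK-6567 can be sketched with plain Scala collections standing in for RDDs. This is not the code requested in the comment above and not Spark's API: Entry, dotProducts and JoinReduceSketch are hypothetical names, and the sketch joins individual matrix entries rather than BlockMatrix blocks to keep it short.

```scala
object JoinReduceSketch {
  // One nonzero entry of the data matrix: (row id, column id, value).
  final case class Entry(row: Int, col: Int, value: Double)

  // Dot product of every data row with the model.
  def dotProducts(data: Seq[Entry], model: Map[Int, Double]): Map[Int, Double] = {
    // "Join" step: pair each entry with the model weight for its column
    // and emit a partial product keyed by row id.
    val joined: Seq[(Int, Double)] = data.flatMap { e =>
      model.get(e.col).map(w => (e.row, e.value * w))
    }
    // "reduceByKey" step: sum the partial products for each row id.
    joined.groupBy(_._1).map { case (row, parts) => row -> parts.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val data  = Seq(Entry(0, 0, 1.0), Entry(0, 1, 2.0), Entry(1, 1, 3.0))
    val model = Map(0 -> 0.5, 1 -> 2.0)
    // Row 0: 1.0 * 0.5 + 2.0 * 2.0 = 4.5; row 1: 3.0 * 2.0 = 6.0
    println(dotProducts(data, model))
  }
}
```

In a distributed setting the same two steps would presumably become a join keyed on column (or block column) followed by a reduceByKey keyed on row, so that no single machine has to hold the whole model.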
[jira] [Comment Edited] (SPARK-7008) An implementation of Factorization Machine (LibFM)
    [ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255122#comment-15255122 ]

Ben McCann edited comment on SPARK-7008 at 4/26/16 10:09 PM:
-------------------------------------------------------------

I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/cloudml/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation

was (Author: chengas123):
I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation

> An implementation of Factorization Machine (LibFM)
> --------------------------------------------------
>
>                 Key: SPARK-7008
>                 URL: https://issues.apache.org/jira/browse/SPARK-7008
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: zhengruifeng
>              Labels: features
>         Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png,
> QQ20150421-2.png
>
>
> An implementation of Factorization Machines based on Scala and Spark MLlib.
> FM is a kind of machine learning algorithm for multi-linear regression, and
> is widely used for recommendation.
> FM works well in recent years' recommendation competitions.
> Ref:
> http://libfm.org/
> http://doi.acm.org/10.1145/2168752.2168771
> http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
[jira] [Created] (SPARK-14882) Programming Guide Improvements
Ben McCann created SPARK-14882:
----------------------------------

             Summary: Programming Guide Improvements
                 Key: SPARK-14882
                 URL: https://issues.apache.org/jira/browse/SPARK-14882
             Project: Spark
          Issue Type: Improvement
          Components: Documentation
            Reporter: Ben McCann

I'm reading http://spark.apache.org/docs/latest/programming-guide.html

It says "Spark 1.6.1 uses Scala 2.10. To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.10.X)." However, it doesn't seem to me that Scala 2.10 is required because I see versions compiled for both 2.10 and 2.11 in Maven Central.

There are a few references to Tachyon that look like they should be changed to Alluxio.
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
    [ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255122#comment-15255122 ]

Ben McCann commented on SPARK-7008:
-----------------------------------

I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation