[jira] [Created] (SPARK-16688) OpenHashSet.MAX_CAPACITY is always based on Int even when using Long

2016-07-22 Thread Ben McCann (JIRA)
Ben McCann created SPARK-16688:
--

         Summary: OpenHashSet.MAX_CAPACITY is always based on Int even when using Long
             Key: SPARK-16688
             URL: https://issues.apache.org/jira/browse/SPARK-16688
         Project: Spark
      Issue Type: Bug
      Components: Spark Core
Affects Versions: 1.6.2, 2.0.0
        Reporter: Ben McCann


MAX_CAPACITY is hardcoded to a value of 1073741824:

{code}
val MAX_CAPACITY = 1 << 30

class LongHasher extends Hasher[Long] {
  override def hash(o: Long): Int = (o ^ (o >>> 32)).toInt
}
{code}

I'd like to stick more than 1B items in my hashmap. Spark's all about big data, 
right?
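
To make the limit concrete, here is a minimal standalone sketch. It is not Spark's OpenHashSet code beyond the two expressions quoted in the report; the object name `CapacityDemo` and the helper `hashLong` are hypothetical names used only for illustration. It shows that `1 << 30` is an Int expression equal to 1073741824, that doubling it once more overflows Int, and that the hasher folds a 64-bit key into a 32-bit hash.

```scala
object CapacityDemo {
  // Same expression as in the report: an Int shift, so the result is an Int.
  val MaxCapacity: Int = 1 << 30

  // Doubling the cap once more no longer fits in a positive Int.
  val doubledAsInt: Int = 1 << 31     // wraps around to Int.MinValue
  val doubledAsLong: Long = 1L << 31  // 2147483648 when computed as a Long

  // Same fold as LongHasher.hash: XOR the high 32 bits into the low 32 bits.
  // (hashLong is a hypothetical standalone copy, not a Spark method.)
  def hashLong(o: Long): Int = (o ^ (o >>> 32)).toInt

  def main(args: Array[String]): Unit = {
    println(MaxCapacity)    // 1073741824
    println(doubledAsInt)   // -2147483648
    println(doubledAsLong)  // 2147483648
    // Two distinct Long keys that fold to the same Int hash.
    println(hashLong(1L) == hashLong(1L << 32)) // true
  }
}
```

JVM arrays are indexed by Int, so a single backing array cannot hold 2^31 slots. That is presumably why the cap is an Int value regardless of the element type, but this is an inference and is not stated in the report.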



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-21 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387291#comment-15387291
 ] 

Ben McCann commented on SPARK-16658:


Also, GraphX does still appear to be maintained; the last commit was only two 
days ago: 
https://github.com/apache/spark/commit/5d92326be76cb15edc6e18e94a373e197f696803

> Add EdgePartition.withVertexAttributes
> --
>
>              Key: SPARK-16658
>              URL: https://issues.apache.org/jira/browse/SPARK-16658
>          Project: Spark
>       Issue Type: Improvement
>         Reporter: Ben McCann
>
> I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
> upstreamed, so that they can go back to using the upstream graphx instead of 
> having a fork.
> Their implementation of withVertexAttributes: 
> https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala
> Their usage of that method: 
> https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala






[jira] [Commented] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-21 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15387289#comment-15387289
 ] 

Ben McCann commented on SPARK-16658:


The code is licensed under the Apache 2 license, so it's fine to include. I've 
informed the authors and they responded positively (see here: 
https://github.com/cloudml/zen/issues/58). Let me know if you have any 
remaining concerns.







[jira] [Created] (SPARK-16658) Add EdgePartition.withVertexAttributes

2016-07-20 Thread Ben McCann (JIRA)
Ben McCann created SPARK-16658:
--

         Summary: Add EdgePartition.withVertexAttributes
             Key: SPARK-16658
             URL: https://issues.apache.org/jira/browse/SPARK-16658
         Project: Spark
      Issue Type: Improvement
        Reporter: Ben McCann


I'm using cloudml/zen, which has forked graphx. I'd like to see their changes 
upstreamed, so that they can go back to using the upstream graphx instead of 
having a fork.

Their implementation of withVertexAttributes: 
https://github.com/cloudml/zen/blob/94ba7d7f216feb2bff910eec7285dd7caf9440f0/ml/src/main/scala/org/apache/spark/graphx2/impl/EdgePartition.scala

Their usage of that method: 
https://github.com/cloudml/zen/blob/8a64a141685d6637a993c3cc6d1788f414d6c3cf/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDADefines.scala






[jira] [Created] (SPARK-16617) Upgrade to Avro 1.8.x

2016-07-18 Thread Ben McCann (JIRA)
Ben McCann created SPARK-16617:
--

         Summary: Upgrade to Avro 1.8.x
             Key: SPARK-16617
             URL: https://issues.apache.org/jira/browse/SPARK-16617
         Project: Spark
      Issue Type: Improvement
        Reporter: Ben McCann


Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
containing Avro objects.

See https://issues.apache.org/jira/browse/AVRO-1502






[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2016-07-11 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371816#comment-15371816
 ] 

Ben McCann commented on SPARK-6567:
---

[~hucheng] can you share your code for this?

> Large linear model parallelism via a join and reduceByKey
> -
>
>              Key: SPARK-6567
>              URL: https://issues.apache.org/jira/browse/SPARK-6567
>          Project: Spark
>       Issue Type: Improvement
>       Components: ML, MLlib
>         Reporter: Reza Zadeh
>      Attachments: model-parallelism.pptx
>
>
> To train a linear model, each training point in the training set needs its 
> dot product computed against the model, per iteration. If the model is large 
> (too large to fit in memory on a single machine), then SPARK-4590 proposes 
> using a parameter server.
> There is an easier way to achieve this without parameter servers. In 
> particular, if the data is held as a BlockMatrix and the model as an RDD, 
> then each block can be joined with the relevant part of the model, followed 
> by a reduceByKey to compute the dot products.
> This obviates the need for a parameter server, at least for linear models. 
> However, it's unclear how it compares performance-wise to parameter servers.
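
The join-then-reduceByKey idea described in the quoted issue can be sketched with plain Scala collections standing in for RDDs. This is an illustration only, not code from the issue or from Spark: sparse (row, column, value) entries replace the blocks of a BlockMatrix, a map lookup stands in for the join, `groupBy` plus a sum stands in for reduceByKey, and the names `DotProductSketch` and `dotProducts` are hypothetical.

```scala
object DotProductSketch {
  // Data as sparse entries (rowId, colId, value); stands in for BlockMatrix blocks.
  type Entry = (Int, Int, Double)

  // Model as (colId, weight) pairs; stands in for an RDD holding the model.
  def dotProducts(data: Seq[Entry], model: Seq[(Int, Double)]): Map[Int, Double] = {
    val weights = model.toMap
    // "Join": pair each data entry with the model weight for its column.
    // Entries whose column has no weight are dropped, as in an inner join.
    val joined: Seq[(Int, Double)] = for {
      (row, col, value) <- data
      w <- weights.get(col).toList
    } yield (row, value * w)
    // "reduceByKey": sum the partial products belonging to each row.
    joined.groupBy(_._1).map { case (row, parts) => row -> parts.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0))
    val model = Seq(0 -> 0.5, 1 -> 2.0)
    // Row 0: 1.0 * 0.5 + 2.0 * 2.0; row 1: 3.0 * 2.0
    println(dotProducts(data, model))
  }
}
```

In a distributed setting the join is what moves the relevant slice of the model to each block of data, so no single machine has to hold the whole model. The sketch only shows the data flow and says nothing about how this compares with a parameter server.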






[jira] [Comment Edited] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2016-04-26 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255122#comment-15255122
 ] 

Ben McCann edited comment on SPARK-7008 at 4/26/16 10:09 PM:
-

I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/cloudml/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation


was (Author: chengas123):
I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation

> An implementation of Factorization Machine (LibFM)
> --
>
>              Key: SPARK-7008
>              URL: https://issues.apache.org/jira/browse/SPARK-7008
>          Project: Spark
>       Issue Type: New Feature
>       Components: MLlib
>         Reporter: zhengruifeng
>           Labels: features
>      Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, QQ20150421-2.png
>
>
> An implementation of Factorization Machines based on Scala and Spark MLlib.
> FM is a machine learning algorithm for multi-linear regression and is widely 
> used for recommendation.
> FM has performed well in recent recommendation competitions.
> Ref:
> http://libfm.org/
> http://doi.acm.org/10.1145/2168752.2168771
> http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf






[jira] [Created] (SPARK-14882) Programming Guide Improvements

2016-04-23 Thread Ben McCann (JIRA)
Ben McCann created SPARK-14882:
--

         Summary: Programming Guide Improvements
             Key: SPARK-14882
             URL: https://issues.apache.org/jira/browse/SPARK-14882
         Project: Spark
      Issue Type: Improvement
      Components: Documentation
        Reporter: Ben McCann


I'm reading http://spark.apache.org/docs/latest/programming-guide.html

It says "Spark 1.6.1 uses Scala 2.10. To write applications in Scala, you will 
need to use a compatible Scala version (e.g. 2.10.X)." However, it doesn't seem 
to me that Scala 2.10 is required because I see versions compiled for both 2.10 
and 2.11 in Maven Central.

There are a few references to Tachyon that look like they should be changed to 
Alluxio.
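
As an illustration of the Scala version point, a build can select the 2.11 artifacts that the report mentions seeing in Maven Central. This is a sketch of a hypothetical `build.sbt`, not something taken from the programming guide; the exact Scala patch version and the `provided` scope are assumptions.

```scala
// build.sbt (sketch): hypothetical project settings
scalaVersion := "2.11.8"

// %% appends the Scala binary version, so this resolves to spark-core_2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```

The application's Scala version would still need to match the one the deployed Spark build was compiled against, which may be why the guide names 2.10.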






[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2016-04-22 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255122#comment-15255122
 ] 

Ben McCann commented on SPARK-7008:
---

I've found a number of implementations:
https://github.com/zhengruifeng/spark-libFM
https://github.com/skrusche63/spark-fm
https://github.com/blebreton/spark-FM-parallelSGD
https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/recommendation



