Re: Spark Matrix Factorization

2014-06-27 Thread Krakna H
Hi all,

Just found this thread -- is there an update on including DSGD in Spark? We
have a project that entails topic modeling on a document-term matrix using
matrix factorization, and were wondering if we should use ALS or attempt
writing our own matrix factorization implementation on top of Spark.

Thanks.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Matrix-Factorization-tp55p7097.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: Spark Matrix Factorization

2014-06-27 Thread Debasish Das
Hi,

In my experiments with Jellyfish, I did not see any substantial RMSE loss
over DSGD on the Netflix dataset...

So we decided to stick with ALS and implemented a family of Quadratic
Minimization solvers that stay in the ALS realm but can solve interesting
constraints (positivity, bounds, L1, equality-constrained bounds, etc.)...We
are going to show it at the Spark Summit...Also, the ALS structure is favorable
for matrix factorization use cases where missing entries mean zero and you
want to compute a global Gram matrix using broadcast and use it for each
Quadratic Minimization across all users/products...
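The broadcast-Gram-matrix structure described above can be sketched outside Spark: precompute Y^T Y once for the item factors, then solve the L2-regularized normal equations per user. The following pure-Python sketch is illustrative only (all names are hypothetical; this is not Spark's actual ALS code, and in Spark the Gram matrix would be broadcast to executors rather than computed locally):

```python
def gram(Y):
    """Y^T Y for a matrix Y given as a list of rows (k = len(Y[0]) factors)."""
    k = len(Y[0])
    return [[sum(row[i] * row[j] for row in Y) for j in range(k)]
            for i in range(k)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def als_user_step(Y, ratings, lam):
    """One user's factor update: (Y^T Y + lam*I) x = Y^T r.

    Y    -- item factor matrix (list of rows)
    ratings -- this user's ratings for every item (missing treated as zero
               upstream, per the email's point about the ALS structure)
    lam  -- L2 regularization weight
    """
    k = len(Y[0])
    G = gram(Y)  # in Spark: computed once and broadcast
    A = [[G[i][j] + (lam if i == j else 0.0) for j in range(k)] for i in range(k)]
    b = [sum(Y[m][i] * ratings[m] for m in range(len(Y))) for i in range(k)]
    return solve(A, b)
```

With lam = 0 and orthonormal item factors this reduces to an exact least-squares solve; the same normal-equation form is what the "L2-regularized Quadratic Loss" comparison below refers to.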

Implementing DSGD with the data partitioning that Spark's ALS uses would be
straightforward, but I would be more keen to see a dataset where DSGD gives
you better RMSE than ALS.

If you have a dataset where DSGD produces much better results, could you
please point us to it?

Also, you can use Jellyfish to run DSGD benchmarks to compare against
ALS...It is multithreaded, and if you have enough RAM, you should be able to
run fairly large datasets...

Be careful about the default Jellyfish...it has been tuned for the Netflix
dataset (regularization, rating normalization, etc.)...So before you compare
RMSE, make sure ALS and Jellyfish are running the same algorithm (L2-regularized
Quadratic Loss).

Thanks.
Deb


On Fri, Jun 27, 2014 at 3:40 AM, Krakna H  wrote:

> Hi all,
>
> Just found this thread -- is there an update on including DSGD in Spark? We
> have a project that entails topic modeling on a document-term matrix using
> matrix factorization, and were wondering if we should use ALS or attempt
> writing our own matrix factorization implementation on top of Spark.
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Matrix-Factorization-tp55p7097.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


RE: IntelliJ IDEA cannot compile TreeNode.scala

2014-06-27 Thread Ron Chung Hu (Ron Hu, ARC)
Thanks, Reynold, for the advice.

Ron

-Original Message-
From: Reynold Xin [mailto:r...@databricks.com] 
Sent: Thursday, June 26, 2014 8:57 PM
To: dev@spark.apache.org
Subject: Re: IntelliJ IDEA cannot compile TreeNode.scala

IntelliJ's parser/analyzer/compiler behaves differently from the Scala compiler,
and that sometimes leads to inconsistent behavior. This is one of those cases.

In general, while we use IntelliJ, we don't use it to build anything. I
personally always build from the command line with sbt or Maven.



On Thu, Jun 26, 2014 at 7:43 PM, Ron Chung Hu (Ron Hu, ARC) <
ron...@huawei.com> wrote:

> Hi,
>
> I am a Spark newbie.  I just downloaded Spark 1.0.0 and the latest IntelliJ
> version 13.1 with the Scala plug-in.  At the spark-1.0.0 top level, I executed the
> following SBT commands and they ran successfully.
>
>
> -  ./sbt/sbt assembly
>
> -  ./sbt/sbt update gen-idea
>
> After opening IntelliJ IDEA, I tried to compile
> ./sql/catalyst/trees/TreeNode.scala inside IntelliJ.  I got many
> compile errors such as "cannot resolve symbol children", "cannot resolve
> symbol id".  Actually both symbols are defined in the same file.   As Spark
> was built successfully with "sbt/sbt assembly" command, I wondered what
> went wrong in compiling TreeNode.scala.  Any pointer will be appreciated.
>
> Thanks.
>
> Best,
> Ron Hu
>
>


Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Matei Zaharia
+1

Tested it out on Mac OS X and Windows, looked through docs.

Matei

On Jun 26, 2014, at 7:06 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version 
> 1.0.1!
> 
> The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.1-rc1/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1020/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/
> 
> Please vote on releasing this package as Apache Spark 1.0.1!
> 
> The vote is open until Monday, June 30, at 03:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.0.1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 
> === About this release ===
> This release fixes a few high-priority bugs in 1.0 and has a variety
> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
> of the more visible patches are:
> 
> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
> SPARK-1790: Support r3 instance types on EC2.
> 
> This is the first maintenance release on the 1.0 line. We plan to make
> additional maintenance releases as new fixes come in.
> 
> - Patrick



Linear CG solver

2014-06-27 Thread Debasish Das
Hi,

I am looking for an efficient linear CG to be put inside the Quadratic
Minimization algorithms we added for Spark mllib.

With a good linear CG, we should be able to solve kernel SVMs with this
solver in mllib...

I currently use direct solves via Cholesky decomposition, which has higher
complexity as matrix sizes become large...

I found some jblas example code:

https://github.com/mikiobraun/jblas-examples/blob/master/src/CG.java

I was wondering if mllib developers have any experience using this solver,
and whether it is better than the Apache Commons linear CG?
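For reference, the vanilla linear conjugate gradient under discussion is short to state. Below is a pure-Python sketch for a dense symmetric positive-definite system (illustrative only, not tuned code, and not the jblas or Commons implementation):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, x) for row in A]

def cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient for A x = b, A symmetric positive definite."""
    n = len(b)
    x = [0.0] * n
    r = b[:]              # residual b - A x, with x = 0 initially
    p = r[:]              # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        # Fletcher-Reeves update for the next conjugate direction
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

In exact arithmetic this converges in at most n iterations for an n x n system; in practice the win over a direct Cholesky solve comes from only needing matrix-vector products, which is what makes it attractive for large or implicitly defined matrices such as kernel systems.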

Thanks.
Deb


Re: Linear CG solver

2014-06-27 Thread David Hall
I don't have any benchmark numbers, but Breeze has a CG solver:
https://github.com/scalanlp/breeze/tree/master/math/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala

https://github.com/scalanlp/breeze/blob/e2adad3b885736baf890b306806a56abc77a3ed3/math/src/test/scala/breeze/optimize/linear/ConjugateGradientTest.scala

It's based on the code from TRON, so I think it's more targeted at
norm-constrained solutions of the CG problem.

On Fri, Jun 27, 2014 at 5:54 PM, Debasish Das 
wrote:

> Hi,
>
> I am looking for an efficient linear CG to be put inside the Quadratic
> Minimization algorithms we added for Spark mllib.
>
> With a good linear CG, we should be able to solve kernel SVMs with this
> solver in mllib...
>
> I use direct solves right now using Cholesky decomposition which has higher
> complexity as matrix sizes become large...
>
> I found out some jblas example code:
>
> https://github.com/mikiobraun/jblas-examples/blob/master/src/CG.java
>
> I was wondering if mllib developers have any experience using this solver
> and if this is better than the Apache Commons linear CG?
>
> Thanks.
> Deb
>


Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Andrew Or
There is an issue with the SparkUI: the storage page continues to display
RDDs that have been dropped from memory. This is fixed in
https://github.com/apache/spark/commit/21e0f77b6321590ed86223a60cdb8ae08ea4057f
but is not part of this RC.


2014-06-27 11:18 GMT-07:00 Matei Zaharia :

> +1
>
> Tested it out on Mac OS X and Windows, looked through docs.
>
> Matei
>
> On Jun 26, 2014, at 7:06 PM, Patrick Wendell  wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> 1.0.1!
> >
> > The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.1-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1020/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.0.1!
> >
> > The vote is open until Monday, June 30, at 03:00 UTC and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.0.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > === About this release ===
> > This release fixes a few high-priority bugs in 1.0 and has a variety
> > of smaller fixes. The full list is here: http://s.apache.org/b45. Some
> > of the more visible patches are:
> >
> > SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> > SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
> size.
> > SPARK-1790: Support r3 instance types on EC2.
> >
> > This is the first maintenance release on the 1.0 line. We plan to make
> > additional maintenance releases as new fixes come in.
> >
> > - Patrick
>
>


Re: Linear CG solver

2014-06-27 Thread Debasish Das
Thanks, David...Let me try it...I am keen to see the results first, and
will look into runtime optimizations later...

Deb

On Fri, Jun 27, 2014 at 3:12 PM, David Hall  wrote:

> I don't have any benchmark numbers, but Breeze has a CG solver:
>
> https://github.com/scalanlp/breeze/tree/master/math/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala
>
>
> https://github.com/scalanlp/breeze/blob/e2adad3b885736baf890b306806a56abc77a3ed3/math/src/test/scala/breeze/optimize/linear/ConjugateGradientTest.scala
>
> It's based on the code from TRON, so I think it's more targeted at
> norm-constrained solutions of the CG problem.
>
> On Fri, Jun 27, 2014 at 5:54 PM, Debasish Das 
> wrote:
>
> > Hi,
> >
> > I am looking for an efficient linear CG to be put inside the Quadratic
> > Minimization algorithms we added for Spark mllib.
> >
> > With a good linear CG, we should be able to solve kernel SVMs with this
> > solver in mllib...
> >
> > I use direct solves right now using Cholesky decomposition which has
> higher
> > complexity as matrix sizes become large...
> >
> > I found out some jblas example code:
> >
> > https://github.com/mikiobraun/jblas-examples/blob/master/src/CG.java
> >
> > I was wondering if mllib developers have any experience using this solver
> > and if this is better than the Apache Commons linear CG?
> >
> > Thanks.
> > Deb
> >
>


RE: IntelliJ IDEA cannot compile TreeNode.scala

2014-06-27 Thread Yan Zhou.sc
One question, then, is what to use to debug Spark if IntelliJ can only be used 
for code browsing, given the unresolved symbols Ron mentioned. More specifically, 
if one builds from the command line but would like to debug a running Spark from 
an IDE such as IntelliJ, what could one do?

Another note is that the problem seems to appear with Spark 1.0.0 and not with 
Spark 0.8.0, at least. Any light to shed on this difference between the versions?

Thanks,

Yan

-Original Message-
From: Reynold Xin [mailto:r...@databricks.com] 
Sent: Thursday, June 26, 2014 8:57 PM
To: dev@spark.apache.org
Subject: Re: IntelliJ IDEA cannot compile TreeNode.scala

IntelliJ's parser/analyzer/compiler behaves differently from the Scala compiler, and 
that sometimes leads to inconsistent behavior. This is one of those cases.

In general, while we use IntelliJ, we don't use it to build anything. I personally 
always build from the command line with sbt or Maven.



On Thu, Jun 26, 2014 at 7:43 PM, Ron Chung Hu (Ron Hu, ARC) < 
ron...@huawei.com> wrote:

> Hi,
>
> I am a Spark newbie.  I just downloaded Spark 1.0.0 and the latest IntelliJ 
> version 13.1 with the Scala plug-in.  At the spark-1.0.0 top level, I executed 
> the following SBT commands and they ran successfully.
>
>
> -  ./sbt/sbt assembly
>
> -  ./sbt/sbt update gen-idea
>
> After opening IntelliJ IDEA, I tried to compile 
> ./sql/catalyst/trees/TreeNode.scala inside IntelliJ.  I got many 
> compile errors such as "cannot resolve symbol children", "cannot resolve
> symbol id".  Actually both symbols are defined in the same file.   As Spark
> was built successfully with "sbt/sbt assembly" command, I wondered 
> what went wrong in compiling TreeNode.scala.  Any pointer will be appreciated.
>
> Thanks.
>
> Best,
> Ron Hu
>
>


Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Andrew Or
(Forgot to mention, that UI bug is not in Spark 1.0.0, so it is technically
a regression)


2014-06-27 15:42 GMT-07:00 Andrew Or :

> There is an issue with the SparkUI: the storage page continues to display
> RDDs that have been dropped from memory. This is fixed in
> https://github.com/apache/spark/commit/21e0f77b6321590ed86223a60cdb8ae08ea4057f
> but is not part of this RC.
>
>
> 2014-06-27 11:18 GMT-07:00 Matei Zaharia :
>
> +1
>>
>> Tested it out on Mac OS X and Windows, looked through docs.
>>
>> Matei
>>
>> On Jun 26, 2014, at 7:06 PM, Patrick Wendell  wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark
>> version 1.0.1!
>> >
>> > The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
>> >
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.1-rc1/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1020/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.0.1!
>> >
>> > The vote is open until Monday, June 30, at 03:00 UTC and passes if
>> > a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.0.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>> > === About this release ===
>> > This release fixes a few high-priority bugs in 1.0 and has a variety
>> > of smaller fixes. The full list is here: http://s.apache.org/b45. Some
>> > of the more visible patches are:
>> >
>> > SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
>> > SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame
>> size.
>> > SPARK-1790: Support r3 instance types on EC2.
>> >
>> > This is the first maintenance release on the 1.0 line. We plan to make
>> > additional maintenance releases as new fixes come in.
>> >
>> > - Patrick
>>
>>
>


Re: Contributing to MLlib on GLM

2014-06-27 Thread 白刚
Hi Xiaokai,

My bad. I didn't notice this before I created another PR for Poisson 
regression. The mails were buried in the junk folder by the corporate mail 
server. Also, thanks for considering my comments and advice in your PR.

Adding my two cents here:

* PoissonRegressionModel and GammaRegressionModel have the same fields and 
prediction method. Shall we use one class instead of two redundant ones? Say, a 
LogLinearModel.
* The LBFGS optimizer takes fewer iterations and converges better than SGD. I 
implemented two GeneralizedLinearAlgorithm classes using LBFGS and SGD, 
respectively. You may want to take a look. If it's OK with you, I'd be happy 
to send a PR to your branch.
* In addition to the generated test data, we may use some real-world data for 
testing. In my implementation, I added the test data from 
https://onlinecourses.science.psu.edu/stat504/node/223. Please check my test 
suite.
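For context on the Poisson case discussed in the bullets above: with a log-linear mean mu = exp(w . x), the per-example gradient of the Poisson negative log-likelihood is (mu - y) * x. A minimal pure-Python sketch of batch gradient descent on that loss follows (all names are hypothetical; this is a sketch of the idea, not the actual PR code, which uses Spark's Gradient/optimizer interfaces):

```python
import math

def poisson_grad(w, x, y):
    """Gradient of the Poisson negative log-likelihood for one example.

    NLL (up to a constant in w) is exp(w.x) - y * (w.x),
    so the gradient is (exp(w.x) - y) * x.
    """
    mu = math.exp(sum(wi * xi for wi, xi in zip(w, x)))
    return [(mu - y) * xi for xi in x]

def fit_poisson(data, k, lr=0.1, iters=500):
    """Batch gradient descent; data is a list of (features, count) pairs."""
    w = [0.0] * k
    for _ in range(iters):
        g = [0.0] * k
        for x, y in data:
            gx = poisson_grad(w, x, y)
            g = [a + b for a, b in zip(g, gx)]
        # average gradient step
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, g)]
    return w
```

On intercept-only data with mean count 2, the fitted weight converges to log(2), the maximum-likelihood log-rate, which is a convenient sanity check for any implementation (SGD or LBFGS alike).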

-Gang
Sent from my iPad

> On June 27, 2014, at 6:03 PM, "xwei"  wrote:
> 
> 
> Yes, that's what we did: adding two gradient functions to Gradient.scala and
> create PoissonRegression and GammaRegression using these gradients. We made
> a PR on this.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-on-GLM-tp7033p7088.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Krishna Sankar
+1
Compiled for CentOS 6.5, deployed on our 4-node cluster (Hadoop 2.2, YARN).
Smoke tests (SparkPi, spark-shell, web UI) successful.

Cheers



On Thu, Jun 26, 2014 at 7:06 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.0.1!
>
> The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.1-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1020/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.1!
>
> The vote is open until Monday, June 30, at 03:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> === About this release ===
> This release fixes a few high-priority bugs in 1.0 and has a variety
> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
> of the more visible patches are:
>
> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
> SPARK-1790: Support r3 instance types on EC2.
>
> This is the first maintenance release on the 1.0 line. We plan to make
> additional maintenance releases as new fixes come in.
>
> - Patrick
>